About Friday's GroupWise Outage (23 February 2007)
At 9:22am on Friday, Drew's GroupWise system experienced a failure resulting in a complete disruption of email service. All services were restored by 9:50am, just under 30 minutes after service was interrupted.
We apologize for the outage and wish to assure the Drew community of our continued commitment to maintaining high levels of service. We recognize the value that these services have to the community and consider any disruption of a major service during the business day to be unacceptable. CNS is committed to maintaining 24x7 service for email and other enterprise services within reasonable constraints. To this end, we are taking specific action in response to the outage this morning. The purpose of this message is to provide, for those who are interested, additional technical details about what occurred and details about the further actions that we are taking.
The GroupWise system has had an excellent availability record since it was installed. Barring environmental and other issues that have affected all of Drew's systems simultaneously, GroupWise has been continuously available since most users at Drew began using it last spring. GroupWise has been implemented at Drew in such a way that it can automatically recover from most routine problems. Like many other services on campus, the GroupWise system uses a technology known as server clustering to ensure that services can continue to operate even if individual servers within the cluster fail. This, along with other high-availability technology and proactive monitoring, helps to ensure a high degree of reliability.
What happened
For the past several weeks we have been tracking a minor issue with the individual GroupWise servers, for which we have an open support call with the vendor. While we are still unsure of all the circumstances surrounding the Friday morning failure, the steps we were taking to troubleshoot this issue were in part responsible.
In a rare case where the clustering technology actually works against the availability of the system, a communications issue, which we believe was caused by the troubleshooting we were doing, tripped a safety mechanism built into the clustering software that is designed to prevent data corruption. This resulted in all servers being forcibly evicted from the GroupWise cluster simultaneously. Since the entire cluster was shut down, additional time was required to restore services because we had to perform what is known as a "cold startup" of the cluster.
What we are doing
We are taking several actions as a result of the Friday morning GroupWise failure:
Once again, thank you for your patience and understanding.
Have a question or a comment? Please contact us at cns@drew.edu or respond to this message in the Drew Community Forums.