|UPDATE 5:00 AM EDT: RESOLVED: Login / Connection Problems On A Few Database Servers|
UPDATE 5:00 AM EDT:
The full cluster is back up and running. We're seeing traffic to all parts of the cluster and performance looks to have stabilized. We'll be keeping a close eye on things.
UPDATE 4:38 AM EDT:
All of the databases are back up and online. One of the servers is not back online yet, so two of the database instances are running off of a single piece of hardware. While the databases are starting up, and until we get the last server back online, things might be a little sluggish. However, all of the databases look to be starting up properly.
We appreciate your patience during this matter and apologize for the inconvenience. If you care to read more about the technical details, feel free to read on.
Here are some more details on what happened:
VCNSQL81, VCNSQL82, and VCNSQL83 are nodes in a Microsoft SQL Server cluster. A number of servers in this cluster each handle a SQL Server instance, and they all mount storage from a local NAS device (networked storage) that allows for easy backups, quick growth, and the ability to move volumes from one instance to another.
The storage device the servers are attached to had an issue that caused it to go down momentarily (and fail over to its partner temporarily). This happens from time to time and is usually not a problem.
In this case, however, when it failed over, the SQL Server Cluster didn't properly release and remount its volumes, and it worked itself into a state where it could no longer see the volumes, release them, remount them, or basically do anything at all.
With a good bit of help from our vendors, we were able to manually force the volumes back into a mountable state, let them run their disk checks, and bring them back online. Doing so, however, required taking one of the servers out of the cluster (part of the trickery needed to remount the volumes).
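For the curious, here's a rough sketch of what this kind of manual recovery looks like on a Windows failover cluster. The resource, node, and drive names below are illustrative stand-ins, not our actual configuration, and the exact sequence on our side involved more back-and-forth with the vendors:

```shell
:: Hypothetical sketch only -- "Disk E:", VCNSQL82, and the instance
:: name are illustrative, not our real resource names.

:: See which cluster resources (including the disks) are stuck
cluster res

:: Take the wedged disk resource offline so ownership can be re-seized
cluster res "Disk E:" /offline

:: Temporarily remove one server from the cluster so the remaining
:: nodes can re-arbitrate ownership of the volume
cluster node VCNSQL82 /evict

:: Check the volume for consistency before trusting it again
chkdsk E: /F

:: Bring the disk, then the SQL Server instance, back online
cluster res "Disk E:" /online
cluster res "SQL Server (INSTANCE81)" /online
```

The evicted server is what we're now working to rejoin to the cluster.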
Right now, we're in the process of getting that hardware back into the cluster so that things will be back at full capacity.
In the interim, though, all databases should be back up and running, albeit sharing hardware more than they normally would. We don't expect it will take very long to get the additional hardware back into the cluster.
UPDATE 4:10 AM EDT:
We've been able to bring one of the volumes back online, and hope to have the second volume online shortly. There are some disk scans that need to run to make sure things are in good shape, but they're coming along slowly.
UPDATE 1:10 AM EDT:
Our team continues to debug the issue, alongside our vendors. We think we may have uncovered what the issue is, and we are waiting on a fix. We hope to have more info soon.
UPDATE 11:00 PM EDT:
Our Network Operations team continues to work with our vendors to debug this issue. Again, we appreciate your patience and will continue to keep this alert as up-to-date as possible.
UPDATE 9:20 PM EDT:
Our Network Operations team continues to work on debugging this issue. We have enlisted our hardware and software vendors to help debug this storage issue. The data is accessible on the storage device, but does not seem to "mount" onto the affected MSSQL servers.
We're currently experiencing an issue with a small number of our MSSQL database servers. Two of our servers (VCNSQL81 and VCNSQL83) are connected to a device which experienced a short outage. Unfortunately, when it came back up (a few seconds later), the two servers could no longer mount the storage.
|- 04/19/10 at 15:11 ET|