March 22, 2004

Routine 4am Visit to Data Center Turns into Short Nightmare

Went to the data center at 4am to take down a machine to add a network card. Should have been a piece of cake, but when the machine started up, the A1000 RAID array (with all critical data for the site) was unavailable. It was starting to look like the same situation as a year ago, when the controller in our A1000 got hosed. After a little work I got the machine up without the data and called Sun. We did a bunch of looking around, trying different options. With their help we discovered that it wasn't the A1000 controller; instead, 75% of the disks were failing, which caused the array to report a critical error.

My grey hair count had doubled in one hour.

It was a huge relief when the Sun support person pointed me to the "revive" option in the RAID manager. We were able to revive all the disks and be back online in 30 minutes (as opposed to having to get new disks and restore from backup, which can take up to 12 hours).

I'd like to see the "revive" option available more often. In fact, it would be nice if all hardware had a revive button that would run a self-repair and bring it back to a good-as-new state. In some cases companies may also want to provide a "resuscitate" option for pulling something back from near death, for when "revive" might not be enough to do the trick. ;)

