March 31, 2003
Weekend at the Datacenter
Just completed a 38-hour saga restoring services to our machines, including hours and more hours with Sun (on the phone and in person).
Saturday afternoon I got a page from one of our machines: HTTP service was down. I looked at the machine, which was completely seized up. Some of the typical commands used to determine what was going on were hanging (df, ls). Our console server couldn't get a prompt, so I tried to shut down the machine. Shutdown hung as well. Time for a trip to the datacenter . . .
Once I got to the datacenter I rebooted the machine, and it hung after finding the boot disk. Tried several boot commands without success and finally decided to call Sun.
3 hours and dozens of reboots later they decided it's a hardware issue. By this time it's midnight and we've been working on the issue for 8 hours. Told the engineer to meet me at 6am and went home to sleep for a bit.
6am on Sunday the engineer shows up with a main board, CPU, memory and a SCSI backplane. Within 5 minutes he's figured out it's the controller in the A1000 RAID array and the hardware he brought is of no use.
For the next 14 hours we work with the Sun guy while he orders a controller, installs it, realizes it's got old firmware, orders another controller, installs it, spends several hours trying to set it up, gets stuck, orders a third controller and new hard drives, and at last finishes the project using a GUI tool running on our laptop via X11. During the process our data gets wiped.
The Sun guy had warned us midday that we might need a restore from backups. The data center operator said he wasn't sure how to do a restore on our box, but showed us how he did it on another box. We gasped when he logged into an NT box and showed us where on the Start menu he needed to click to restore data to the box.
So, I worked with the data center operator and we (I) figured out how to push a restore from the backup server (which runs Solaris) out to our box. We were delighted to find that all the tapes were on-site. The operator got all the tapes mounted into the backup machine (backups and restores are done with Legato over the network) long before our A1000 was fixed. It was nice to know that the second the drives were ready we could start feeding the data onto them.
38 hours after the service went down the restore had completed and we were fully running. This has been our first major failure and the first time we've called on Sun to exercise our service contract. It took much longer than we had thought to get the A1000 up and running, and we didn't expect to have our data wiped. That may make it sound like Sun really sucked, but we enjoyed working with the engineer and felt he had technical expertise and made good decisions. Most of the waiting was because hardware wasn't quite right or some instructions the engineer had weren't quite right (and he always kicked himself for following the Sun-recommended path instead of the one he knew).
It was a good thing it was over the weekend when few students needed the system. Even though we spent a lot of time there, it was really the Sun guy who was sweating bullets. It sucked to be at the datacenter all weekend, but then again that means I've got good reason to have not gone in today (still debating about tomorrow) and be taking a nap in a few minutes . . .
Posted by mike at March 31, 2003 1:50 PM