Last week I got in to work on Tuesday morning after the long weekend, to find one of our servers was down. We have a half-rack co-lo in the Primus data centre in Melbourne, and since the server wasn't responding to anything, and wasn't coming back up when the NOC guys rebooted it, I had to go down to check it. This server was a fairly major part of our business, running a database which powers our company website, as well as a legacy application supporting a number of clients.
So I went down to the DC and found that the hard drive running the OS had failed. I've only had limited experience with this particular server and how it's set up, and unfortunately the guy who knows about it is skiing in Canada. So I poked around and found that the OS drive had no backup, but the database was on a RAID 0+1 setup which was still working fine. Not an ideal situation, but not the end of the world. I managed to get the guy in Canada on the phone, and we worked out a solution: copy the data files onto a backup server, reconfigure the website/application, and then worry about fixing the server. As it turned out, the backup server only boots about once in every ten attempts, so any reboot required me to go down to the DC.
If you're not from Melbourne you may not know that we had some pretty extreme weather last week, peaking at 45°C on Friday. Not the best weather for trudging back and forth between the office and the DC several times a day. Normally I'd have my bike to make the trip quicker, but it was being serviced so I had to walk. Boo hoo.
So after a few hiccups and a lot of downtime we managed to get the backup server running. By the end of the week everything seemed to be running as expected, and we were looking at getting the original server back to working order. If only it were that simple. On Sunday, the DC had a complete power failure. This is a major data centre, supplying routing for a number of ISPs in Victoria and Tasmania, meaning that when the lights went out, so did a large percentage of south-eastern Australia's internet connections (including my home connection). I haven't yet heard an official explanation of the cause, although I'm sure they'll attribute it to the weather conditions, which had already caused blackouts across the state, even though the weather had eased considerably by Sunday.
Power was restored within a few hours, but of course our server didn't boot up, so this morning I went down and got it going again. Since the database hadn't been shut down cleanly, we needed to fix up the uncommitted transactions, which meant more calls to Canada and more downtime. In the meantime we had organised a new hard drive for the original server, so later today I went back to start setting it up. The plan was to install a fresh OS image, set up the database and support software, then reconnect the database. I popped the new hard drive in the server, and wouldn't you know it, it told me the new drive was failing. This was a brand new SCSI drive, so I was a little sceptical. I moved the drive to another bay and was able to install the OS without problems. Aha, so maybe the problem wasn't the drive to begin with; maybe it was the drive bay. I popped the original drive into the new bay, and of course, it booted up without a problem.
So it seems if I'd thought of moving the drive to another bay to start with, the whole problem could have been sorted out last Tuesday, saving a whole lot of hassle. Now the plan is to monitor the original server for a day or so to make sure it's stable, then transfer the database files back from the backup server. Of course, that assumes that nothing else goes wrong. Ha.