It’s 3am, and you’re woken by a flood of calls from angry customers. You stumble over to the computer to see what’s going on, and sure enough, you can’t reach your server. You send your hosting company a trouble ticket, thinking the server must have crashed again. It’s a nuisance, but no big deal: you’ll get back to bed once they reboot it. After waiting for what feels like an eternity, you finally get back a terse reply: “I’m sorry, but that server was cancelled by mistake. None of the data can be recovered.” Your day just went from bad to worse.
As someone whose business revolves around their hosting, the above scenario may be one of the nightmares that keeps you up at night. We all understand that mistakes happen, but what can we do to protect ourselves from situations like this? After hearing about stories like this playing out over and over again at other datacenters, I asked myself exactly that: how can we prevent this from happening to our customers? That is the subject of today’s Hosting Best Practices Part 1: Server cancellations.
Mistakes like the one outlined above stem from two causes: uncertainty, and severe consequences for failure. Whenever you combine those two things, disaster will happen eventually. Since a cancellation is a critical moment when things can go wrong for any number of reasons, we set out to reduce both the uncertainty and the consequences of mistakes with the following cancellation policy:
1) If a server is set to be cancelled, on the cancellation day you power off the server. You do not wipe it.
2) You then schedule the server to be wiped 3 days later.
3) When a server is scheduled to be wiped, you first verify that it has already been powered off before wiping the hard drives.
It’s a deceptively simple policy that dramatically reduces the chances of anything terrible happening. First, you eliminate the uncertainty: if the server you are about to wipe is not already powered off, you’re working on the wrong server. Second, you eliminate the consequences of mistakes: if you power off the wrong server, you can recover by powering it back on. Any hosting provider could easily adopt this policy. Yet based on the feedback and horror stories we hear, best practices like these don’t seem to be put to much use in the hosting industry, which is a shame.
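For readers who think in code, the three steps and their two safety checks can be sketched roughly as follows. This is only an illustration, not our actual tooling: the `Server` class, the `cancel`, `restore` and `wipe` functions, and the `WipeRefused` exception are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Grace period between power-off and wipe, taken from step 2 of the policy.
WIPE_DELAY = timedelta(days=3)


class WipeRefused(Exception):
    """Raised when a wipe request fails one of the safety checks."""


@dataclass
class Server:
    # Hypothetical record for one server; a real system would track more.
    name: str
    powered_on: bool = True
    wipe_due: Optional[date] = None
    wiped: bool = False


def cancel(server: Server, today: date) -> None:
    """Steps 1 and 2: power off only, and schedule the wipe for later."""
    server.powered_on = False
    server.wipe_due = today + WIPE_DELAY


def restore(server: Server) -> None:
    """Undo a mistaken cancellation. Only possible before the wipe."""
    if server.wiped:
        raise ValueError(f"{server.name} has already been wiped")
    server.powered_on = True
    server.wipe_due = None


def wipe(server: Server, today: date) -> None:
    """Step 3: refuse to wipe unless the server is off and the wipe is due."""
    if server.powered_on:
        # A running server at wipe time means we are on the wrong server.
        raise WipeRefused(f"{server.name} is still powered on")
    if server.wipe_due is None or today < server.wipe_due:
        raise WipeRefused(f"{server.name} is not due to be wiped yet")
    server.wiped = True
```

In this sketch, calling `cancel` on the wrong server costs some downtime and a call to `restore`, while calling `wipe` on the wrong server raises `WipeRefused` before any data is touched.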
This policy works regardless of the reason for the mistake. Maybe you accidentally asked for the wrong server to be cancelled. Maybe someone hacked into your support account and asked for your servers to be cancelled. Maybe your server went past due and the overdue notices landed in your spam folder by mistake. Whatever the reason, a simple policy like this provides an important safety check: so long as you notice your server is down within 3 days, it is unlikely to be wiped by mistake.
Imagine what would have happened in our opening scenario if your hosting company had this policy. You still would have been rudely awakened (sorry), but you would have been told that your server had been powered off by mistake and has now been powered back on. A minor inconvenience, but life would have gone on.
It would be easy enough to tell the customer that “mistakes happen and you should keep backups to protect against them”. It would be equally easy to just assume that “we have great staff, they won’t do things like that”, or to throw up your hands and say “we’re doing the best we can, there’s no way to eliminate mistakes 100%”. I see all of these excuses made all the time on webhostingtalk, where hosting providers defend other hosting providers for making mistakes like the one in our opening scenario. To a customer whose data has been lost through no fault of their own, that’s hardly any comfort. We have to do better.
These are the kinds of problems that keep me up at night. What would I do if something like this happened to my customer? How can I make sure it never happens? Putting ourselves in our customers’ shoes, and caring what happens to them, is really all that’s required to come up with solutions like this. At I/O Flood, we couldn’t imagine doing business any other way.