Failing Safe – Drive Wipe Policy

Some of you may recall that we recently published a blog post about using checklists to help avoid workplace complacency and customer inconvenience. If you missed it, you can go back and read it here:

No matter how comprehensive your checklists may be, there will always be weak points that can cause confusion, concern, or complications that can ultimately lead to serious issues for the end user, up to and including data loss. To prevent this, we embrace the idea of “failing safe”. In today’s blog post, we detail how we apply the idea of “failing safe” to situations where a customer might normally lose their data.

The hosting industry can have more layers than your average onion, so it is not unusual for communication between an end user and the bare metal host to be delayed or garbled along the way. Miscommunication commonly causes data loss during the cancellation process. Asking to cancel the wrong server, or paying a bill for the wrong server, are two common problems that put a customer’s data at risk.

With that in mind, here at IOFlood we decided to implement an additional step to help prevent data loss during server cancellation, which we refer to as a ‘Three Day Wipe Hold’.

To add an extra layer of protection, we have separated the processes of service suspension (shutting off a server) and service termination (wiping data). Once a service has been cancelled and the server is powered off, we leave the server shut off for a minimum of three days before proceeding with wiping the drives. This gives a customer time to notice that a server they expected to remain online has become unavailable.

When it does come time to wipe the server, we also verify that the server is still powered off. We power off a server before we wipe it, so checking that it is still off helps us avoid two problems. First, it helps us avoid wiping the wrong server! If the server is powered on, it is possible that the server we are logging into is not the one we originally suspended. Second, it confirms that the customer has had some notice before their data is wiped. A server will sometimes “power itself back on”, so making sure the server is actually off before wiping it gives us confidence that the customer has had at least three days to notice the server is off.
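The two checks described above can be sketched in a few lines of Python. This is only an illustration of the idea, not our actual tooling: the `SuspendedServer` record, the `may_wipe` function, and the field names are all hypothetical, and the power state is assumed to have been read from the server just before the check.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Minimum time a server must stay powered off before its drives are wiped.
WIPE_HOLD = timedelta(days=3)


@dataclass
class SuspendedServer:
    """Hypothetical record for a cancelled server awaiting a wipe."""
    server_id: str
    suspended_at: datetime  # when the server was powered off
    powered_on: bool        # power state observed just before wiping


def may_wipe(server: SuspendedServer, now: datetime) -> tuple[bool, str]:
    """Return (allowed, reason). Any failed check blocks the wipe."""
    # A powered-on server may be the wrong machine, or may have turned
    # itself back on, so the customer might not have noticed an outage.
    if server.powered_on:
        return False, "server is powered on; stop and investigate"
    # The customer must have had the full hold period to notice.
    if now - server.suspended_at < WIPE_HOLD:
        return False, "three day wipe hold has not elapsed"
    return True, "all checks passed"
```

Note that the function only ever answers "no" when something looks wrong. A failed check leaves the data untouched, which is the safe direction to fail in.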

This is important because it allows us to catch any problems in the cancellation process before any data is deleted. Although that level of caution may not always be necessary, we do whatever we can to ensure that our processes avoid even small risks when customer data is involved. All of these sanity checks mean that no single mistake is enough to cause data loss. Multiple things would need to go wrong simultaneously to cause data loss during the cancellation process.

Sometimes, a drive has to be removed from a server because of an upgrade, drive failure, or server cancellation. In these cases, we take a similar level of care and attention to prevent data loss. When a drive is removed from a server, it is individually labelled with the server ID, drive serial number, date of removal, and reason for removal. The drive is then set aside for a minimum of three days before being wiped.
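As a rough illustration, the label on a removed drive could be modelled as the small record below. The four fields come straight from the description above, but the class name, field names, and helper methods are hypothetical and are not taken from our internal systems.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Minimum time a removed drive is set aside before it may be wiped.
DRIVE_HOLD = timedelta(days=3)


@dataclass(frozen=True)
class RemovedDriveLabel:
    """Hypothetical model of the label attached to a removed drive."""
    server_id: str     # server the drive was removed from
    drive_serial: str  # serial number printed on the drive
    removed_on: date   # date of removal
    reason: str        # e.g. "upgrade", "drive failure", "cancellation"

    def earliest_wipe_date(self) -> date:
        """First day on which the drive may be wiped."""
        return self.removed_on + DRIVE_HOLD

    def wipe_allowed(self, today: date) -> bool:
        """True only once the full hold period has passed."""
        return today >= self.earliest_wipe_date()
```

With this sketch, a drive removed on 1 January would not be eligible for wiping until 4 January, and the label carries enough information to return it to the right server in the meantime.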

This way, if the drive is needed again for any reason, whether because of a mistake, a miscommunication, or some technical issue, there is an opportunity to notice the problem and identify the removed drive before the data becomes irretrievable.

While it may be a small step, and one that is not necessary in the majority of cases, the small delay has proved more than worthwhile in those moments when a customer has come back after cancellation to restore the service or because they’ve realised they’re missing something.

Our process for wiping drives is part of a philosophy at IOFlood: mistakes are inevitable; consequences are not. For any process where a mistake can cause a catastrophic result, we make an extra effort to ensure not only that mistakes are less likely, but also that several mistakes would need to occur simultaneously before anything bad might happen. This ensures that our processes “fail safe”: a mistake at one part of the process can be noticed and corrected, rather than causing a catastrophic outcome. In this example, the consequences of powering off the wrong server are much milder than the consequences of wiping the wrong hard drive.