Air gaps and backups
WIRED recently published an article about the NotPetya cyberattack. In addition to being an interesting story in its own right, the article is a handy reminder of the importance of “air gaps” in designing a backup system to survive disaster. An air gap is a deliberate break in the network connectivity of a complex system – something that requires manual intervention to bridge. In a world of hyper-connected systems and unpredictably subverted software, such low-tech barriers are crucial.
As the article recounts, part of the disaster for shipping giant A.P. Møller-Maersk was the loss of their Microsoft domain controllers (DCs). Those elements were effectively the directory of everyone and everything – at least in an IT sense – at the company. They were known to be a critical component, so you might expect that there would be a defense strategy aligned with their importance.
The DCs were set up to sync their data with each other. For a small-scale problem, that meant that the DCs operated as each other’s backups. If the data of one DC was lost, no problem – it could get the data from another nearby DC after the failed DC recovered. But although that approach worked well most of the time, it couldn’t handle a near-simultaneous failure of all the DCs. The very aggressive and fast-spreading NotPetya attack meant that all of the DCs on the network were equally messed up.
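The difference between those two failure cases can be shown with a toy model. This is only a sketch: the function name `restore_from_peers` and the replica names are invented for illustration, and real domain controller replication is far more involved.

```python
def restore_from_peers(replicas):
    """Refill lost replicas from any surviving peer.

    `replicas` maps a replica name to its data, or to None if the data is lost.
    Returns a new mapping; raises RuntimeError if every replica is lost.
    """
    survivors = [data for data in replicas.values() if data is not None]
    if not survivors:
        # The NotPetya-style case: every peer failed at once.
        raise RuntimeError("all replicas lost; nothing to restore from")
    source = survivors[0]
    return {
        name: (data if data is not None else dict(source))
        for name, data in replicas.items()
    }


# One replica lost: the surviving peer acts as its backup.
print(restore_from_peers({"dc1": {"users": 3}, "dc2": None}))

# Every replica lost: peer-to-peer sync has nothing left to copy.
try:
    restore_from_peers({"dc1": None, "dc2": None})
except RuntimeError as err:
    print("unrecoverable:", err)
```

Peer sync protects against independent failures, but it offers nothing when the failures are correlated.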
If Maersk had used a conventional backup strategy, they would have been OK. In such an approach they would have periodically copied the contents of each DC onto a magnetic tape, which would then have been sent to some sort of a warehouse for safekeeping. In this sort of situation, any one of those tapes would have saved the day. It didn’t even need to be especially current – they just needed something that was better than starting from nothing at all. But there was no such backup at Maersk. Was this an instance of IT negligence?
You can argue it both ways. The conventional backup strategy has been losing favor for a long time. It involves both an investment in capital equipment (the tape drives for writing and reading) near the servers being backed up and ongoing operating costs for transporting and storing tapes. In addition to those expenses, the track record for tape backup is actually not very good in terms of reliability. All too often, there are breakdowns in the process of recording, transporting, and storing the tape that are not discovered until it’s too late – when people are working to restore data from the backup and find that it’s empty or corrupted. So it’s not surprising that it would be appealing for Maersk to avoid those expenses while also getting a backup process that would be more robust. It must have seemed like a win-win: lower costs and better performance.
That said, backups and disaster recovery do need to include an air gap somewhere. The problem with a fully connected system, as NotPetya vividly demonstrated, is that any computer and any network can be subverted. In the case of Maersk, they didn’t adequately ensure that there was an air gap somewhere in the system when they started using the DCs as backups for each other. That oversight was nearly disastrous.
However, Maersk got lucky: it turned out that there was an accidental air gap in the system, caused by communication failures. A DC in Ghana had been knocked offline by a power outage, and had remained out of contact with its peers ever since. Since that DC was isolated from the attack, it served as the (accidentally) clean backup that Maersk needed to restore its DCs.
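One way to picture why the isolated DC survived is as a reachability question: an attack that spreads over the network can only touch machines that are connected, directly or indirectly, to the first victim. The sketch below is a toy model under that assumption. The node names and the `links` table are hypothetical and do not describe Maersk's real network.

```python
from collections import deque


def reachable_from(links, start):
    """Return every node an attack starting at `start` can reach over `links`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in links.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append(peer)
    return seen


# Hypothetical topology: three DCs that sync with each other, plus one that
# has been cut off from its peers and so has no links at all.
links = {
    "dc-1": ["dc-2", "dc-3"],
    "dc-2": ["dc-1", "dc-3"],
    "dc-3": ["dc-1", "dc-2"],
    "dc-offline": [],
}

compromised = reachable_from(links, "dc-1")
survivors = set(links) - compromised
print(sorted(survivors))
```

In this model the only survivor is the node with no links, which is the accidental air gap in miniature.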
What lessons should we draw from this story? To protect yourself against unpredictable and aggressive subversion of software systems, you need to depend on physics. You literally can’t depend on software – any software! – in the worst-case scenario. You also can’t depend on anything that’s controlled by software or produced by using software. (You can read about Thompson’s Hack in chapter 22 of Bits to Bitcoin if you’d like to understand more about untrustworthy software.) So what can you trust? You can be pretty sure that an unplugged appliance won’t suddenly start working: the electricity won’t leap from the wall socket to the plug if there’s a physical separation. Similarly, you can be pretty confident that messages won’t spontaneously cross from one machine to another unless some kind of network connects them.
To have a useful air gap, we have to construct a situation where the automated and connected operational system is literally unable to reach the backup to corrupt or destroy it. That kind of isolation requires some thought.
What are some possibilities?
- Have the backup recorded on a medium that is not rewritable, like a write-once CD-R. (Note that even when the data is physically protected from modification, we still have to be concerned about software-controlled processes that might cause the destruction or discarding of relevant media.)
- Have the backup rewritable but not actually accessible to any machine, like a magnetic tape that is sitting on a shelf far away from any computer.
- Have the backup rewritable on a running computer, but isolate that computer from all networks.
Notice that in all of these cases, there is no way for software attacking the running system to change data in the backup. Also note that there must be a physical separation, not just a logical one or one enforced by software. It’s important not to make the mistake of having a computer connected to a network that is only isolated by software rules. If you can’t ensure a physical break between the backup data and the rest of the world, then you can’t trust the air gap!
For example, although virtual private networks (VPNs) and routing are common ways of creating logically separate networks, you can’t trust those techniques to maintain an air gap. Likewise, since you can’t visually inspect a wireless network to ensure an absence of connections, you can’t really use wireless networking at all on the “safe” side of the air gap. Even if you put filters or rules in place on the wireless device, those are all software and accordingly subvertible.
One subtlety is the possibility of an air gap that requires manual intervention, but where the manual intervention is prompted by software. For example, there might be a tape clerk who is responsible for fetching a tape from storage and inserting it into a tape drive. (Aside: I was a tape clerk during the summer after my freshman year of college.) If the tape clerk always does exactly what is requested, this is not really an air gap – it’s just a “meatware” implementation of a link in a software-controlled system. It’s still relatively easy for an attack to simply request all of the tapes and wipe them all out. Instead, there must be some element of the backup/restore process that is based on human-to-human communication or human judgment, and not readily replaced by rogue computation.
We might be able to see how this kind of heavy-duty disaster planning applies to a global corporation, but is this relevant to an individual or a smaller organization? I think so. Anyone who has data they don’t want to lose should be keeping backups. Beyond that, anyone who doesn’t want to be exposed to loss from cyberattacks like NotPetya should incorporate an air gap into their backups.
An air gap isn’t hard to implement, but it does necessarily involve an element of inconvenience. My personal solution is to have a pair of small (“pocket”) hard drives, as well as a small storage unit in a facility about a mile away from my house. At any time, at least one of the hard drives is in the storage unit. Each week, I update the backup on the hard drive at my house and take it to the storage unit, to swap with the one that’s there.
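The weekly routine is simple enough to model in a few lines. This is just a sketch of the schedule described above, assuming the swap happens exactly once a week. The drive labels "A" and "B" and the function name `simulate_rotation` are mine, purely for illustration.

```python
def simulate_rotation(weeks):
    """Model the weekly swap of two drives between home and an offsite unit.

    Returns one tuple per week, describing the state right after the swap:
    (week, offsite_drive, week_of_offsite_backup, home_drive).
    """
    # Week in which each drive last received a backup (0 = initial backup).
    last_backup = {"A": 0, "B": 0}
    home, offsite = "A", "B"
    states = []
    for week in range(1, weeks + 1):
        last_backup[home] = week        # update the drive that is at home
        home, offsite = offsite, home   # carry it offsite, bring the other back
        states.append((week, offsite, last_backup[offsite], home))
    return states


for state in simulate_rotation(4):
    print(state)
```

Right after each swap, the drive in the storage unit holds that week's backup, so in this model the offline copy is never more than about a week stale.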
I wouldn’t recommend this approach if you were starting from scratch – the hard drives would require a couple of hundred dollars to purchase, and the storage unit has an ongoing rental cost. But I already had a couple of hard drives among the hardware that I had accumulated, as well as a storage unit being rented for other purposes – so the scheme is nearly free for me. My only costs are in the weekly routine of updating the local drive’s backup, and then swapping the drives.
It’s worth underscoring that maintaining an air gap is an area where cloud services probably can’t help you. Even if the cloud service offers you versioning – so that you can go back some number of changes, or some number of days, or even back out of every change you’ve ever made – you are still vulnerable to a software/system failure at the service. Whatever the implementation at the cloud provider, it is possible to subvert it. Because of the scale of cloud providers, they necessarily automate all of their functions – and so it’s essentially impossible for them to provide this kind of software-proof air gap.