Prevent and survive – why time matters in network outages

2016 saw high-profile network outages hit even the largest organisations. Delta Air Lines, for example, experienced major system outages across the business following a power failure at its main datacentre in Atlanta, which resulted in 740 flights being cancelled and thousands more delayed. When storms hit Sydney, Australia, in June, Amazon Web Services in the region went down for around 10 hours, disrupting a range of services from banking to pizza deliveries.

Outages like these aren’t cheap, with costs ranging from lost sales and employee hours, to communication and PR outlay, to the technical costs of isolating and repairing the problem. Analysts including Gartner and the Ponemon Institute have produced a range of estimates for the average cost of application and datacentre downtime, and even the lowest comes in at $100k lost for every hour that a critical system is down.
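To put that hourly figure in context, here is a minimal sketch of the arithmetic in Python. The function name `estimated_outage_cost` is made up for illustration, and the default rate is simply the lowest analyst estimate quoted above, so treat the result as a floor for a rough calculation rather than a forecast for any particular business.

```python
# Illustrative only: the default rate is the lowest hourly estimate quoted
# in this post, and the function name is an assumption, not a standard tool.
LOWEST_HOURLY_ESTIMATE_USD = 100_000


def estimated_outage_cost(hours_down: float,
                          cost_per_hour: float = LOWEST_HOURLY_ESTIMATE_USD) -> float:
    """Return a rough lower-bound cost for an outage of the given length."""
    if hours_down < 0:
        raise ValueError("hours_down must not be negative")
    return hours_down * cost_per_hour


# A hypothetical ten-hour outage of a critical system at the lowest rate.
print(f"${estimated_outage_cost(10):,.0f}")  # prints $1,000,000
```

Even at the most conservative rate, every hour shaved off recovery time is worth six figures.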

Unfortunately, in today’s dynamic IT landscape, it’s impossible to fully prevent network outages from happening – for a number of reasons. First, humans are prone to making mistakes; one unintentional error, such as a single comma in the wrong place in a network configuration, could take you offline. Second, hardware devices such as firewalls, switches and routers will always fail eventually, and even where you have invested in redundancy, there is no guarantee that your device will simply fail over. Third, external breaches by cybercriminals and DDoS attacks are ever more sophisticated, with network devices being targeted and weaknesses in configurations being exploited.

It’s a matter of time

With the cost of the average network outage ultimately being time-sensitive, it’s vital that organisations take steps to ensure that not only is their risk of suffering an outage as low as possible but also that they are fully prepared and able to quickly restore the network when an outage does occur. But what does network outage prevention mean in practice?

There are two key components. First, you need to take control of change management practices, ensuring you back up your network devices not only before planned changes take place, but also on a regular schedule, so that you can be alerted to unplanned or even suspicious changes. Understanding what’s changed is crucial to restoring normal operations; it’s no good having a configuration backup if it’s not current.
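As a rough illustration of what "being alerted to unplanned changes" can look like, the sketch below compares the configuration captured by the latest scheduled backup with the previous one and reports what changed. It uses only the Python standard library; the function names and the sample firewall configuration are hypothetical, and how you actually pull a configuration off a device is vendor-specific and left out.

```python
import difflib
import hashlib
from typing import List


def fingerprint(config_text: str) -> str:
    """Return a SHA-256 digest of a configuration, ignoring trailing whitespace."""
    normalised = "\n".join(line.rstrip() for line in config_text.splitlines())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def detect_change(previous: str, current: str) -> List[str]:
    """Return unified-diff lines if the configuration changed, else an empty list."""
    if fingerprint(previous) == fingerprint(current):
        return []
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="last_backup",
            tofile="current",
            lineterm="",
        )
    )


# Hypothetical configurations from two consecutive scheduled backups.
last_backup = "hostname edge-fw-01\nallow tcp 443\n"
current = "hostname edge-fw-01\nallow tcp 443\nallow tcp 23\n"

for line in detect_change(last_backup, current):
    print(line)
```

If nobody raised a change request for that extra rule, the diff is your alert, and the previous backup is your known-good restore point.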

Second, centralisation is key. If individual network engineers take their own backups and store them on local PCs, memory sticks or a file share you don’t know about, not only is your recovery slowed down, but you’re also at greater risk of configurations falling into the wrong hands. Network configurations often contain passwords and other network information that can help attackers breach your network. For many organisations, having network configuration backup, change management and compliance auditing processes in place isn’t just good practice – it’s essential for them to comply with regulatory frameworks like PCI DSS, ISO 27001, HIPAA and NIST.

However, the challenge that most businesses face is that these processes can be very time-consuming and complicated to understand. In the architectures that most organisations deploy, the IT focus regarding backup has historically been on preserving and recovering data and applications, rather than on the network and security devices that make access to data and applications possible. IT managers now have to somehow consolidate the backups from servers and storage with those from a wide range of network and security devices that deliver applications and data to users. Little surprise, then, that in a Restorepoint survey, 82% of organisations said that managing these activities takes too long and is overly complicated, meaning that network device backup, recovery testing and configuration analysis happen only periodically.

As a result, businesses often find themselves in a no-win situation when a network outage occurs. Whilst network engineers do their best to restore service, often with outdated configurations that force them to re-implement recent changes on the fly, the clock is ticking, productivity is being lost and outage costs are rising.

Bringing speed to the recovery process

No matter how many skilled engineers you have, the time it takes for your organisation to recover from the failure of a network device or a bad change will depend on whether you had planned for the outage and tested your recovery. Because every network consists of dozens of network devices from different vendors (e.g. Cisco, Juniper, Check Point, F5 and Palo Alto), a reliable backup and recovery method is needed. And since organisations don’t have expensive spares of every router, switch or firewall sitting waiting on a shelf, they depend on the network engineers and backup processes they have.

This is likely to be a major weakness in most organisations’ DR plans, because network administrators typically don’t have the time to ensure every device is backed up every day. Each vendor needs its own backup method, and whether a backup is performed manually or is semi-automated using scripts, this is a fragmented and often untested process.
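One simple check that catches this weakness early is to list every device whose most recent backup is older than a day. The sketch below assumes you already have an inventory mapping device names to the time of their last successful backup; the device names, timestamps and the `stale_devices` function are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def stale_devices(last_backup: Dict[str, datetime],
                  now: datetime,
                  max_age: timedelta = timedelta(days=1)) -> List[str]:
    """Return the names of devices whose latest backup is older than max_age."""
    return sorted(
        name for name, taken_at in last_backup.items()
        if now - taken_at > max_age
    )


# Hypothetical inventory: device name -> time of last successful backup.
inventory = {
    "edge-fw-01": datetime(2018, 4, 20, 2, 0),
    "core-sw-02": datetime(2018, 4, 17, 2, 0),
    "branch-rtr-03": datetime(2018, 4, 19, 8, 0),
}

print(stale_devices(inventory, now=datetime(2018, 4, 20, 9, 0)))
# prints ['branch-rtr-03', 'core-sw-02']
```

A report like this only tells you a backup exists and how old it is; it says nothing about whether that backup can actually be restored, which is why recovery testing still matters.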

As such, having a fully tested, vendor-neutral backup, recovery and compliance solution is crucial not only to reducing outage times to seconds, but also to ensuring a common compliance approach and reporting mechanism for all network, security and storage devices. Ultimately, the business continuity that automation enables in the event of a network outage, coupled with the critical time it saves, ensures that any error or exploited weakness is just that – an error or correctable weakness rather than a lengthy impediment to business-as-usual IT.


April 20, 2018
