2016 has been a landmark year when it comes to IT service disruptions and the stark implications these have for business operations. Everybody is aware of the growing reliance organisations place on IT services, yet with each incremental modernisation, the potential for IT service disruption, and the associated costs, increases.
But it’s not just the potential risk that’s increasing: the actual number of unplanned disruptions, their duration and the costs they incur have all risen year-on-year, according to the 2016 Veeam Availability Report. The report summarises:
- The average number of unplanned downtime events reported has increased from 13 events in 2014 to 15 events in 2015.
- The average length of each downtime event has also increased from less than 1.5 hours to almost 2 hours for mission-critical applications.
- And as a result of these increases, the average annual cost of downtime to organisations can be up to US$16 million, which is US$6 million higher than that recorded in Veeam’s 2014 study.
These findings suggest a major ongoing trend, but what is the real world impact these disruptions are having on organisations and their users? To find out, we take a look at some of 2016’s major IT service disruptions in APAC and examine the reasons behind them:
19 January 2016: Twitter Users in Asia Hit By Worldwide Disruption
Twitter experienced one of its most extensive global disruptions, preventing many of its 300 million users from staying connected or from logging on to the social network. The problem began at 8:20am GMT, with error messages warning that the network was both “over capacity” and suffering an “internal error”. The disruption continued until 12:10pm GMT.
Cause: Technical problems in recent code change.
Downtime: 4 hours.
Significance: Impacted Twitter’s effectiveness as a marketing platform for users and companies.
5 June 2016: Amazon Web Services Sydney Power Outage Affects Major Companies
Amazon Web Services in Sydney experienced an extended power outage, causing many websites and applications to go down and affecting major local companies including Domino’s, Channel Nine, Foxtel Play, and Domain. Even though 80 percent of the impacted customer instances and volumes were back online and operational within a few hours, a bug in AWS’s instance management software meant that instances not recovered by that evening experienced slower than expected restoration times.
Cause: Power problems and a latent bug in instance management software.
Downtime: 2 hours.
Significance: Caused a nationwide outage of websites and online services spanning F&B, banking, news and streaming services.
6 June 2016: Nine Polyclinics in Singapore Suffer IT Disruption
Patients at nine polyclinics in Singapore were turned away or had their appointments rescheduled on the morning of 6 June 2016. This was caused by a disruption to the IT systems, which prevented medical staff from accessing patient records. The IT system was unavailable between 8:20am and 9:20am, after which it was running again and the clinics could return to their usual operations.
Speaking on the day of the disruption, Dr Christopher Chong, Head of the Ang Mo Kio Polyclinic, explained to local newspaper The Straits Times that the clinic was “caught off-guard” as it only started operations at 8am. “There were some backup IT systems in place but they were slower; hence we resorted to manual means to register patients.” In dealing with the backlog following the disruption, patients in pain or with serious conditions were prioritised, whilst others had their appointments rescheduled.
Cause: Not publicly reported, compounded by slow backup systems.
Downtime: 1 hour.
Significance: Patients were turned away and appointments were rescheduled.
14 July 2016: Singapore Exchange Suffers Its Longest Trading Outage
Trading of securities on the Singapore Exchange (SGX) was halted early on 14 July after a suspension was imposed at 11:38am due to a hardware issue. Failing to follow through on two pledges to reopen, the market remained closed for the rest of the day and resumed trading the following morning. Following the outage, Singapore Exchange CEO Loh Boon Chye stated: “Our recovery time has to be better and we must minimise downtime for market participants”.
Cause: An unreported hardware issue.
Downtime: Over 5 hours.
Significance: Disrupted trading to Southeast Asia’s largest stock market.
9 August 2016: Australia’s eCensus Hit by Major DDoS Attack
Thousands of Australians were prevented from taking part in the census due to a major DDoS attack, which led to a hardware failure, the overload of a router and a false alarm about the attack.
Cause: Lack of DDoS protection.
Downtime: 43 hours.
Significance: The collapse of the website caused embarrassment for the Australian government and caused outcries from citizens and the opposition.
22 and 24 October 2016: StarHub’s Broadband Service Hit by DDoS Attacks
Following the massive internet disruption in the U.S. on 21 October, Singapore telco StarHub reported two DDoS attacks which forced some of its home broadband users offline. StarHub reported that the attacks were “unprecedented in scale, nature and complexity”. It went on to add: “On both occasions, we mitigated the attacks by filtering unwanted traffic and increasing our DNS capacity, and restored service within two hours. No impact was observed on the rest of our services, and the security of our customers’ information was not compromised.” This was the first cyberattack of this nature to affect Singapore’s telco infrastructure.
Cause: DDoS attacks from malware-infected routers and webcams.
Downtime: 2 hours.
Significance: Temporary downtime and brand damage.
While these incidents represent just a handful of examples throughout 2016, it’s worth asking whether there’s an underlying factor leading to these incidents.
The first point to mention is (as we all know), due to human nature and the ongoing battle between cyber-attacks and defences, nothing is 100 percent secure or reliable. So whether an IT disruption is caused by a malicious attack, a power outage or a software failure, it’s not just a matter of trying to avoid it in the first place, but how quickly you can recover from it. In each of the cases above, the organisations were able to recover following the outage, but not fast enough to avoid backlash and reputation damage due to the downtime.
To learn more about this issue, I met with Peter McKay, President and Chief Operating Officer of Veeam Software. He explained the core issue is that senior business leaders are usually more driven to innovate their products, services and operations as a way to stay competitive. But as with cybersecurity, it’s only when something goes wrong that backup and recovery solutions become front of mind. “Companies are starting to hear more about major outages, downtime and the associated costs, so it’s becoming a bigger issue.”
Image: Peter McKay, President, Chief Operating Officer and Board Director, Veeam Software
McKay argues that incorporating robust backup and recovery solutions into digital strategies needs to be front of mind to avoid IT disruption, costs and loss of business. “The technology has changed dramatically from three to four years ago, when it was expensive and complex. Now companies like Veeam have come up with much more scalable solutions that, if built in right at the beginning, become much more cost effective.”
But it’s not just a matter of cost, says McKay; it’s also about making backup and recovery an integral part of IT planning, in the same way that discussions about security have become critical over recent years. “For people in IT, who have been battling with security and compliance issues, it should not be that hard to have these kinds of discussions internally because it’s very similar; it’s the exposure, the brand and the cost to your business.”
According to McKay, while companies are increasingly becoming aware of these factors, especially for new applications, a bigger issue exists around legacy software and infrastructure. “It’s more about the older apps that were originally built behind the scenes, and now all of a sudden you have millions of people using them; that’s where a lot of the exposure is. However, companies are paying less attention to this side of the business because of the cost and complexity involved. As a result, business leaders are more inclined to invest in newer IT services that will drive the business, rather than in the legacy systems that underpin it.”
Ultimately, when it comes to avoiding IT service disruptions, it’s not just about an organisation’s cybersecurity defences, but also about its ability to recover and continue operating with minimal downtime. Considerations around backup and recovery solutions will always differ for each organisation, though – depending on the sector it operates in, how heavily regulated that sector is, and how much it relies on IT for its products or services. IT decision makers need to reassess the backup and recovery capabilities of their existing IT systems, and factor in the same considerations when creating new plans. Doing so will not just avoid disruption, costs and loss of business, but also support their organisation’s wider plans to innovate and compete in the modern digital economy.
This article was originally published on CIO Asia.