Lessons Learned from the Telstra Network Outages
Would your business survive a highly publicized service outage?
How about an outage every couple of months... or weeks?
Telstra, Australia’s largest communications provider, has had at least five outages in the past six months. Thousands of customers have lost phone and Internet access for anywhere from hours to days at a time. To make amends, Telstra has provided several “free data days” to its customers. Telstra’s CEO has also pledged $250 million over the next six to 12 months to improve Telstra’s mobile, ADSL, and core networks, including installing real-time monitoring solutions.
In the wake of Telstra’s unfortunate network incidents, let’s talk about what happened, what went wrong, and how your business can avoid such a devastating, costly situation.
What caused the Telstra network outages?
A mix of hardware and human issues has contributed to Telstra’s troubles. Here’s a quick overview of the issues over the past six months:
The first major outage occurred February 8. Telstra blamed an “embarrassing human error,” explaining that when a node malfunctioned, the company took it down to fix it, but an incorrect procedure resulted in a loss of service for customers.
On March 17, about eight million customers (roughly half of Telstra’s 16.9 million customer base) lost the ability to make calls, send texts, and access the Internet for about four hours. The outage was reportedly caused by a failed international cable.
The following day, a software failure caused smaller service outages for Telstra customers in Victoria.
Over 370,000 Telstra customers experienced a major outage that began May 20. A week later, thousands of customers still didn’t have Internet service. What’s worse, the company announced the issue had been fixed twice before it actually was. Chief Operations Officer Kate McKenzie attributed the outage to “a software update to one of our domain name servers” that “caused the server to go down.”
On June 11, 75,000 broadband customers across Australia were affected by an outage, provoking outrage from customers across social media platforms. Then on June 30, another outage affected enterprise customers in Victoria, including banks, retailers, airlines, schools, and hospitals, for up to six hours. The outage was attributed to a “device behaving in a way that wasn’t expected.”
How have these outages impacted Telstra?
Providing free data to customers and spending millions of dollars on infrastructure improvements has already cost Telstra plenty financially. But the embarrassment of repeated outages will have long-lasting repercussions, and the damage to Telstra’s reputation may end up costing more than the direct expenses.
Business owners have said free data is no compensation for the business they have lost because of the Telstra outages. Customers have called Telstra’s service quality “third-world.” And hordes of fuming customers have taken to social media with the hashtag #TelstraOutage.
So what went wrong?
It seems to me that by not taking appropriate precautions, Telstra overestimated the resilience of its own network and underestimated how heavily its end users depend on it.
A number of factors, both in and out of our control, can contribute to outages. For that reason, some downtime is inevitable. What’s always within our control is our response. What makes downtime embarrassing is not having a process or solution in place when it does happen. Part of IT operations is ensuring that you have all the necessary processes and technology in place to mitigate risks.
As we’ve said before when writing about the disastrous impact of IT outages, a company this large, with such a large customer base, should have had redundant active and passive (standby) systems in place. A software update to a domain name server should not bring down an entire network. A disaster recovery solution that could be failed over to during emergencies would have helped in this situation. So would a test network running in parallel, where Telstra’s IT team could test software before deploying it to the production environment. They also could’ve run a bandwidth stress test on their wireless infrastructure to see how it was holding up under load.
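To make the active/passive idea concrete, here is a minimal sketch in Python. It is not how Telstra's systems work, and every name in it (Server, pick_server, the "dns-1" and "dns-2" hosts, the is_healthy callback) is hypothetical. It only illustrates the principle that traffic should move to a standby when the active system fails a health check, for example after a bad software update.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Server:
    name: str
    role: str  # "active" or "passive"; hypothetical labels for this sketch


def pick_server(
    servers: Sequence[Server],
    is_healthy: Callable[[Server], bool],
) -> Optional[Server]:
    """Return the first healthy server, preferring active ones over standbys."""
    # sorted() is stable, so servers keep their listed order within each role.
    ordered = sorted(servers, key=lambda s: s.role != "active")
    for server in ordered:
        if is_healthy(server):
            return server
    return None  # nothing is healthy: time to page a human


if __name__ == "__main__":
    servers = [Server("dns-1", "active"), Server("dns-2", "passive")]
    down = {"dns-1"}  # pretend an update took the active server offline
    chosen = pick_server(servers, lambda s: s.name not in down)
    print(chosen.name if chosen else "no healthy server")
```

In a real deployment the health check would be an actual probe (a DNS query, an HTTP request, a heartbeat) rather than a set lookup, and the failover decision would usually live in a load balancer or clustering tool rather than in application code.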
Customers are more receptive to planned downtime than unplanned downtime. Part of Telstra’s IT forecasting should have been to identify needed upgrades and announce to the customer base that improvements were coming… before they were hit with a major outage. Most companies would rather not devote a big budget to IT expenditures as a precautionary measure. But we’ve seen what can happen when you don’t.
Time will tell what the future holds for Telstra in the telecommunications space. For now, you can learn from their mistakes and take action to prevent the same thing from happening to you. Read this article to find out how network monitoring software can help you prevent costly outages.