What Happened in the Delta and Southwest Data Centers?
Domestic and international travel was recently interrupted when the computer systems went down at two major airlines—and frequent flyers felt the impact. Now, the average traveler will probably never know the exact details of these major outages, but word on the street—and in the sky—says that there may have been a production failure followed by a failover switch to the backup box that didn’t go according to plan.
Let’s take a look at these incidents through the lens of high availability, where downtime reduction and disaster recovery really come into play.
Is Zero Downtime Realistic?
Being major players in the commercial airline space, it’s fair to assume that both Delta and Southwest planned their HA strategy around zero downtime, which is certainly realistic for planned maintenance, upgrades, and role swaps.
On IBM i, data is not always written directly to disk, so you have the problem of data in transition that may not be reflected by your HA solution. In a planned role swap, you can force the data to the target server and achieve zero downtime without impacting production.
However, if the role swap needs to occur in an emergency—like the unplanned outage that Delta and Southwest experienced—what happens to that data? Did it arrive at the target server? Did the target server just switch over and start allowing users to connect to the new production server (target) without you verifying that it was good to go?
If they were using hardware-based replication, Delta and Southwest would probably have had to rebuild the data on the target server following an unplanned role swap. How long does that take? Well, IBM and EMC could tell you that there’s no such thing as zero downtime for one of these bad boys.
With software-based replication, Delta and Southwest would have experienced downtime as they switched to the target system. They may have needed an exit program to kick in and change system values to make the target the new production server. There would be subsystems and applications to start, but even all of this would only take a few minutes. In a matter of five to 30 minutes, the target system would be the new production server.
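The difference between a planned swap (force in-flight data to the target, then switch) and an unplanned one (the source is gone, so verify the target before letting users on) can be modelled in a few lines. This is an added example, not code from the original post or from any HA product; the server objects, journal sequence numbers, subsystem names and the "exit program" step are constructed stand-ins for the real IBM i mechanisms.

```python
"""Model of a software-replication role swap: planned vs. unplanned (constructed example)."""
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    role: str                                 # "production" or "target"
    applied_seq: int = 0                      # last journal entry applied on this box
    subsystems: set = field(default_factory=set)


@dataclass
class Replication:
    """Entries captured on the source but not yet applied on the target = data in transition."""
    source: Server
    target: Server
    in_transit: list = field(default_factory=list)

    def write(self, seq: int) -> None:        # production writes; target lags until applied
        self.source.applied_seq = seq
        self.in_transit.append(seq)

    def flush(self) -> None:                  # planned swap: push every in-flight entry across
        while self.in_transit:
            self.target.applied_seq = self.in_transit.pop(0)

    def lag(self) -> int:                     # entries the target has not yet received
        return self.source.applied_seq - self.target.applied_seq


def role_swap(rep: Replication, planned: bool, subsystems=("QINTER", "QBATCH")) -> dict:
    """Promote the target. Planned: flush first, so nothing is lost. Unplanned: the
    source is unavailable, so only verify; refuse to promote while the target is
    behind and report the gap instead of silently letting users connect."""
    if planned:
        rep.flush()
    gap = rep.lag()
    if gap != 0:
        return {"promoted": False, "missing_entries": gap,
                "action": "rebuild or accept data loss before allowing users"}
    # The 'exit program' step: swap roles (system values) and start subsystems.
    rep.source.role, rep.target.role = "target", "production"
    rep.target.subsystems.update(subsystems)
    return {"promoted": True, "missing_entries": 0, "new_production": rep.target.name}
```

Run `role_swap` twice against the same pair, once unplanned and once planned, to see the unverified swap refused and the flushed one succeed.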
Did Delta or Southwest Test?
Another aspect of software-based replication is that it is testable and repeatable, which raises the question: Did Delta or Southwest test their role swaps? If not, why not?
I’m confident that the business leaders were sold on the concept of having no downtime. With an immediate switchover to backup servers, end users and customers would only see a short delay during this process. Obviously, that’s not what happened.
I have a feeling that, for these enormous airlines, the prospect of testing the role swaps was so complicated and daunting that they never had a successful switchover—ever.
With so many travelers caught up in the jet stream, these are high-visibility businesses that can’t afford any downtime. So, achieving zero downtime during a role swap would be the desired outcome of any failure, but if you’ve never completed a successful role swap, what good is it?
When Should You Role Swap?
Luckily, there are some hard-won lessons you can take away from this terrible IT catastrophe by thinking about how your business would fare in the same situation.
You should test your role swap at least once a year. If you’re in a region that is high-risk for natural disasters, like Tornado Alley in the United States, you should time your test a month or two before the normal season. If you are in financial services, gaming, ecommerce, or any other industry where being down is expensive, you should test twice a year.
And here’s the big thing: if your role swap doesn’t work as expected, you need to fix it…or be honest with your team that the technology you’re using is so complicated you cannot test it. If it’s the latter, you had better find a better solution, and fast!
Remember, you want to minimize downtime but also give your IT team a realistic opportunity to be successful. High availability without testing is not HA; it’s just an expensive IT project that you crossed off the list but are afraid to use because no one has faith in the process. It could be an embarrassing and career-ending situation should you ever need to use it.
15 Minutes or Two Days?
What would Delta or Southwest choose today: the untested, zero-downtime ideal or a 15-minute, tested role swap? After these incidents—which no doubt resulted in billions of dollars of lost revenue and damage to their reputations—I believe they wouldn’t be happy about the 15 minutes, but they would take it if they knew it worked. Fifteen minutes of downtime is like a light snowstorm, while a two-day outage would be worse than any major weather event we’ve experienced to date.
So, we continue to strive for zero downtime, but let’s also manage expectations by understanding that zero downtime might not be a reality.
Implementing—and testing, we hope—a high availability solution is a top concern for IBM i shops in the near future. If you need help choosing, we're here for you.