As a global software company we hear a lot of stories, and not all of them are good. What these stories have in common is that they are based on real events and told by the managers and administrators working in the IBM i environments we seek to support. In many cases, these stories are the very reason a company contacts us—they want to make sure they never have to experience a high availability fail again! We know the professional and financial pain a serious lack of availability can cause an IBM i shop, so to ensure some good insights come from these undesirable situations, we’re recounting, anonymously of course, some of the top high availability fails from around the world. After all, one IT manager’s horror story is another’s cautionary tale.
Lost in Translation
The first story begins with a simple question: What does high availability mean to your organization? More precisely, what does it mean to the different groups of people who encounter it? This tale may sound familiar to many readers. Picture this: the IT manager in the data center insists that the systems are available, but the help desk is on the phone with angry users who insist that they aren’t. Who’s right and who’s wrong, and does it even matter? If the systems are available, strictly speaking, but a group of users is locked out of a vital application because a TCP port has stopped listening, the systems may as well be down for that group. Their direct experience is that the systems are not highly available to them.
The lesson here is that a common, universal definition of highly available systems can never be assumed. Defining availability, whether it is high, reduced, or completely lacking, is most meaningful in the context of those it affects. Managed service providers and companies that serve internal clients under service-level agreements (SLAs) know the value of pinpointing such definitions so there is no room for ambiguity or dispute. Whether the lack of availability stems from the previously mentioned inactive TCP port, an overtaxed system that is performing below its SLA target, or some other cause, if the fallout affects productivity, then that alone will determine accountability for the issue.
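The gap between "the system is up" and "the users can work" can be made concrete with a small check from the user's side of the network. The following is a minimal sketch in Python, not a description of any particular monitoring product: the function names, host, service names, and port numbers are illustrative assumptions. It simply tests whether each TCP port that a group of users depends on still accepts connections.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError (refused, timed out, unreachable)
        # when nothing is accepting connections on the port.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def availability_by_service(host: str, required_ports: dict) -> dict:
    """Map each service name to whether its port is reachable.

    required_ports is a hypothetical mapping such as
    {"order entry": 8470, "telnet": 23}; use the ports your users rely on.
    """
    return {
        name: port_is_listening(host, port)
        for name, port in required_ports.items()
    }
```

A result such as `{"order entry": False, "telnet": True}` would describe exactly the situation above: the machine is running, yet one group of users is effectively down.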
Know Your Enemy
The battle for high availability is a strategic one. Part of that strategy must include leveraging your knowledge of the system to safeguard against potential threats. Identifying the cause of historical instances of a lack of availability is a good starting point. Analyzing these past instances can also determine if any routine processes are risk factors for future instances of compromised availability, however severe.
Beyond treating past experiences as indicators of future instances, administrators should pay particular attention to key areas where threats to availability most commonly manifest. These include applications, objects, jobs, and communications elements. Where problems occur, evidence can be found in performance or status changes that go beyond what is expected in the normal course of processing. How soon these can be identified will have a direct impact on the user base. Unfortunately for those operating a manual or reactive systems environment, without the benefit of real-time monitoring, the fallout on the user base might be the first and only means of knowing of an issue. In other instances, the consequences can be worse still.
One such story that reached us from the real world illustrates this point. A large retail chain was coping with its annual peak processing period during the run-up to Christmas. During this time, a remote journal became inactive. This was not detected immediately, and the performance of the system degraded until it ultimately failed. The chain’s high availability (HA) system switched over to a new primary machine, just as it should in these instances, but a large number of transactions were lost in the switch. The result was chaos for daily accounting totals and restocking requirements, a breach in compliance, and a break in the audit trail, for which the company incurred penalties. Visibility into that one status change could have spared them the extensive impact that followed.
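Catching a single status change like the one in this story amounts to comparing what is observed against what is expected. The sketch below, in Python, shows that idea under stated assumptions: the element names and status strings are hypothetical placeholders, not output from any IBM i command or monitoring tool, and how the observed statuses are collected is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatusChange:
    name: str       # monitored element, e.g. a journal, job, or subsystem
    expected: str   # status it should have during normal processing
    observed: str   # status seen in the latest snapshot


def find_unexpected_statuses(expected: dict, observed: dict) -> list:
    """Return every element whose observed status differs from the expected one.

    Elements absent from the snapshot are reported as "MISSING" so that a
    vanished job or object is never silently ignored.
    """
    changes = []
    for name, want in expected.items():
        got = observed.get(name, "MISSING")
        if got != want:
            changes.append(StatusChange(name, want, got))
    return changes


# Hypothetical example: the remote journal has gone inactive.
expected = {"remote journal": "ACTIVE", "order subsystem": "ACTIVE"}
observed = {"remote journal": "INACTIVE", "order subsystem": "ACTIVE"}
for change in find_unexpected_statuses(expected, observed):
    print(f"{change.name}: expected {change.expected}, found {change.observed}")
```

Run on a schedule, a comparison like this would flag the inactive journal long before degraded performance made the problem visible to users.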
Too Little, Too Late
Knowing about a potential threat to your availability is one thing, but without sufficient time to react, the value of such information is greatly diminished. In our retailer story above, the critical change in remote journal status was not detected immediately. Real-time identification also requires unimpeded escalation paths so that, no matter what time or day a potential threat occurs, the important message is fast-tracked to someone who will resolve it.
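An escalation path can be thought of as a list of contacts, each with a delay after which they are brought in if nobody has acknowledged the alert. The following Python sketch illustrates that idea only; the roles and minute thresholds are assumptions chosen for the example, and delivering the actual notifications is out of scope.

```python
# Hypothetical escalation chain: (minutes without acknowledgement, contact).
ESCALATION_CHAIN = [
    (0, "on-call operator"),
    (10, "senior administrator"),
    (30, "IT manager"),
]


def contacts_to_notify(minutes_unacknowledged: int, chain=ESCALATION_CHAIN) -> list:
    """Return everyone who should have been notified by now.

    A contact is included once the alert has gone unacknowledged for at
    least that contact's threshold, so the list grows as time passes.
    """
    return [
        contact
        for threshold, contact in chain
        if minutes_unacknowledged >= threshold
    ]


print(contacts_to_notify(0))    # only the first contact in the chain
print(contacts_to_notify(15))   # the first two contacts
```

Because the chain depends only on elapsed time, it behaves the same at 3 a.m. on a holiday as it does during office hours, which is the point of an unimpeded escalation path.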
Automation is one of the most powerful aids to achieving this objective, ensuring that critical threats to availability can be resolved as soon as possible. Having the right tools to resolve the issue quickly is the other element of timeliness that can make or break availability.
Consider a subsystem that becomes inactive. An administrator with immediate insight into which job within that subsystem caused the problem can avert a potential crisis. Faced with the same scenario but no means of knowing which job was responsible, an administrator is forced to conduct a lengthy and exhaustive investigation of each job.
For everyone who has experienced their own availability horror story—or for those just wishing to avoid them in the future—Robot Monitor performance and application monitoring software can help you take a proactive approach to system and application availability.