High Availability and Your Enterprise Job Streams

August 16, 2016

High availability has been on the minds of IT decision makers for the last several years. As the recent IBM Top Concerns survey revealed, comprehensive high availability is the second highest concern among IBM i users, with only keeping skill sets current ranking higher. It's not surprising that high availability has become a common expectation. One reason why: Evolving technology has made highly resilient architecture affordable for many businesses rather than a select few. Even in cases where high availability solutions represent a significant expenditure, organizations are weighing the costs differently. More businesses realize that a lack of resiliency can itself be expensive, particularly for systems that manage a large share of the enterprise's workload.

What is High Availability?

High availability (HA) refers to critical IT infrastructure systems or components that are accessible virtually all of the time. This typically involves some sort of redundancy such as backup or failover. The industry standard for an HA system is referred to as the "five 9s" rule, which means it is 99.999 percent available. IBM i 7.1 comes with a few of its own features to build resiliency:

  • Advanced node failure detection
  • Asynchronous delivery mode for geographic mirroring
  • Logical unit-level switching

These features are designed to overcome some of the common challenges associated with downtime. For example, node failure detection uses the Hardware Management Console to detect when a node fails. Together, they automate as much of the recovery process as possible so that operations resume quickly.
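To put the "five 9s" figure mentioned above in concrete terms, the short Python sketch below converts an availability percentage into the maximum downtime it allows per year. The function name and the list of targets are illustrative choices, not part of any IBM i tooling, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the maximum yearly downtime, in minutes, for a given availability."""
    unavailable_fraction = 1 - availability_percent / 100
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare a few common availability targets, ending with "five 9s".
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% available -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, a five 9s system can be down for only a little more than five minutes in an entire year, which is why manual recovery steps rarely fit within that budget.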

If companies had unlimited resources, every system would be highly resilient and have a high-performance backup solution. However, decision makers live in a world of limited budgets and other resource constraints, making risk prioritization a key part of maintaining any system. It is easy to overlook features like resiliency when implementing workload automation solutions, but that can be a costly oversight.

Include Your Scheduler in HA Plans

There are many roads to high availability, but they all ultimately lead to one thing: minimizing the risk of downtime. When an internal system goes down, the outage translates to lost employee productivity. This is one of the primary expenses identified by IDC's Reducing Downtime and Business Loss: Addressing Business Risk with Effective Technology, which estimated that, for midsize companies, downtime costs an average of $70,000 per hour.

It may be tempting to think that reputational damage only happens when a consumer-facing system goes down, but a corporate-facing outage can still make its way to the customer. When work lags due to an outage, customers want to know why. As IDC pointed out, this is a potential risk even for internal systems because consumers who hear of a company's technological problems may lose confidence in the business. 

The risk may not be significant for non-essential assets, but a job scheduling system presents a unique challenge: one of its primary benefits is also the reason high availability is essential. Resiliency is often built by eliminating reliance on a single point of failure, but workload automation packages streamline operations by centralizing job management, which can turn the scheduler itself into that single point of failure.

Facing limited budgets, many IT departments have turned to workload automation as a way to gain visibility over their networks while streamlining many processes. These solutions offer a great deal of efficiency, but they can also present a risk if they aren't resilient. If the automation architecture goes down and there is no backup, jobs can't be scheduled and users lose access to work histories. For systems like workload automation, an outage can cripple operations, which makes high availability a crucial feature to consider for job stream software.

Redundancy and Simplicity

The Ponemon Institute's 2011 report Understanding the Cost of Data Center Downtime: An Analysis of the Financial Impact of Infrastructure Vulnerability confirmed the significant consequences of downtime. Researchers estimated that a complete data center outage costs approximately $5,600 per minute and that downtime averaged 90 minutes among the data centers surveyed. Even a partial outage of IT systems could cost as much as $258,000 per incident. In addition, respondents reported that partial outages affecting a single server occurred most frequently, with an average of more than 11 incidents over a two-year period.
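As a rough illustration of how these figures combine, the sketch below multiplies the per-minute cost by the average outage length quoted above. The function and constant names are hypothetical, and the result is simple arithmetic on the two quoted numbers rather than a figure taken from the report.

```python
def outage_cost(cost_per_minute: float, duration_minutes: float) -> float:
    """Estimate the cost of an outage as a flat per-minute rate times its length."""
    return cost_per_minute * duration_minutes


# Figures quoted from the Ponemon report in the paragraph above.
COST_PER_MINUTE = 5_600        # dollars per minute of complete outage
AVERAGE_DURATION_MINUTES = 90  # average length of downtime

estimate = outage_cost(COST_PER_MINUTE, AVERAGE_DURATION_MINUTES)
print(f"Estimated cost of an average complete outage: ${estimate:,.0f}")
# prints "Estimated cost of an average complete outage: $504,000"
```

The same function can be reused with an organization's own per-minute estimate to compare a recovery measured in hours against one measured in minutes.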

Most businesses have some kind of backup, but the difference between a conventional backup and an HA solution is time to recovery. The delay is worse when the backup process is complicated. For example, if IT has to replace the hardware, reinstall all the software, and restore the backup of the job scheduler, downtime could last more than a day rather than a couple of hours.

A true high availability solution features redundancy. This means that there is a standby system ready to take the place of the primary as soon as there is a failure. The recovery process should also be as automated as possible, because automation reduces the risk of configuration and documentation errors. Features such as continuous data replication allow the backup job scheduler to maintain almost the exact same history as the primary system. This ultimately means less of a headache for IT administrators, who can simply send a command to switch to the backup when the system notifies them of a failure.
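The pattern described above (a standby that receives replicated history and takes over on command) can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the behavior of any particular scheduler product: the SchedulerNode and HaPair classes, their methods, and the job names are all hypothetical, and replication is simulated by writing each entry to both nodes in the same process.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SchedulerNode:
    """Hypothetical stand-in for one job scheduler instance."""
    name: str
    healthy: bool = True
    job_history: List[str] = field(default_factory=list)


class HaPair:
    """Toy primary/standby pair with replicated job history."""

    def __init__(self, primary: SchedulerNode, standby: SchedulerNode) -> None:
        self.active = primary
        self.standby = standby

    def record_job(self, entry: str) -> None:
        # Simulated continuous replication: every entry goes to both nodes.
        self.active.job_history.append(entry)
        self.standby.job_history.append(entry)

    def needs_failover(self) -> bool:
        # A real system would rely on failure detection, not a simple flag.
        return not self.active.healthy

    def switch(self) -> None:
        # The administrator's "switch" command: swap the two roles.
        self.active, self.standby = self.standby, self.active


pair = HaPair(SchedulerNode("primary"), SchedulerNode("standby"))
pair.record_job("nightly-backup: completed")
pair.record_job("payroll-run: completed")

pair.active.healthy = False  # simulate a failure of the primary
if pair.needs_failover():
    pair.switch()

print(pair.active.name, pair.active.job_history)
```

Because the standby already holds the same history, the switch is a single step with nothing to reinstall or restore, which is the main difference from recovering out of a conventional backup.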

Make HA a Priority

Job schedule automation can streamline many of the manual processes that can bog down the IT administrator's workday, but it is important to ensure that the system works and will continue working efficiently. High availability should be a prioritized feature on any critical system, particularly when it represents a single point of failure and gathers input from numerous other systems throughout the organization.

