Guide

High Availability in Workload Automation

High availability is an important component of enterprise job scheduling software

 

When designing a software system, most of the work focuses on the functional areas of the product, such as the user interface, system storage, and processes running in the background. The possibility of long-term downtime, whether from hardware, operating system, or software failure, often gets neglected.

IT management has been aware of unplanned downtime for many years. IBM estimated that in 1996, $4.54 billion in productivity and revenues were lost due to systems being unavailable.

High availability is a key way to prevent such downtime. It is a design and implementation approach that keeps a software system available to its users even when hardware or software problems occur.

 

 

Imperfect Backups and the Redundancy Solution

For years, the standard approach was to back up a system at some accepted interval and, if the system failed, to restore the backup. While this approach works well for non-critical systems, it can spell disaster for systems that are integral to an organization’s operations.

For example, if Kay in accounting discovers her PC is fried after a power surge, restoring a backup during the next business day is probably fine. But that schedule would not be “fine” for Amazon.com if their credit card processing server failed at 6:00 p.m. the week before Christmas. Amazon could lose millions of dollars in 30 minutes of downtime.

To prevent this problem, we turn to the engineering technique known as redundancy. Put simply, redundancy means designing a system with backup components ready to replace the main component should it fail.

Safety-critical systems are frequently designed with multiple layers of redundancy: aircraft have fly-by-wire and hydraulic controls, large ships have twin propellers and motors, and parachutes always have a backup. They, like all redundant systems, are designed to eliminate single points of failure.

In the case of software systems, this generally means installing and running the same software on at least two different computers and sharing the user’s setup and data between them. That way, if either computer fails, the other can be configured to continue operating with minimum human intervention.

 

 

High Availability in Workload Automation

For workload automation software, offering a high availability solution is very important. Having a single point of control for the scheduling and setup of your enterprise workload is a distinct advantage. It allows users to schedule, prioritize, and coordinate work across a multitude of different computer architectures, operating systems, and platforms. They also can submit work to multiple target agents, schedule in a platform-agnostic way, and view the results of the work in a standardized fashion.

However, this architecture has a drawback. By centralizing control into a single enterprise scheduler, IT runs the risk of creating a single point of failure for all their business processes. If the scheduling server is not operational, jobs will not be submitted to agents, users cannot edit setup or view history in the user interface, and there is no way to control the agents. For this reason, workload automation vendors are starting to include various high availability features in their products. Most of these features are very complex to set up, which limits a user’s ability to properly test them before purchasing.

 

 

High Availability with Automate Schedule

Automate Schedule’s high availability feature (with version 3.0 and higher) introduces the ability to create a redundant Automate Schedule server on another computer. This system, known as the standby server, mirrors the setup and history information of a single master server, and with a simple command can step into the role of the master server.

Recovery after a system failure is significantly easier with this setup. Consider first the traditional approach: a system administrator backs up the scheduling server daily, and it fails overnight due to a faulty disk. When the failure is discovered, the admin must:

  • Replace the disk and reinstall the OS and scheduler software
  • Recover the backup of the enterprise scheduler and apply it to the new hardware
  • Correct any software licensing issues
  • Direct all the agents to connect to the new server
  • Go through the schedule to figure out which jobs haven't run since the last backup

The last step is by far the most complex and error-prone since the administrator has to manually check the job forecast, job history, and systems that run the schedule to see what work has been done.

The last step also requires an understanding of the business side of the schedule. Re-running an inventory job or point-of-sale process that completed successfully could cause confusion across other departments. It may be equally devastating if the jobs don’t run at all.

Contrast this with the recovery steps for the master and standby server in Automate Schedule:

  • Standby server detects the absence of master server and notifies admin.
  • Admin activates standby server by running a single command via remote.
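The detection step above can be pictured as a heartbeat timeout. The following is a minimal sketch under stated assumptions, not Automate Schedule's actual mechanism: the class name HeartbeatMonitor, the timeout_seconds parameter, and the alert text are all hypothetical. It only raises a notification; activation is still left to the admin, as in the steps above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Hypothetical standby-side tracker for heartbeats sent by the master."""
    timeout_seconds: float
    last_seen: Optional[float] = None
    alerted: bool = False

    def record_heartbeat(self, now: float) -> None:
        # The master checked in, so clear any earlier alert state.
        self.last_seen = now
        self.alerted = False

    def check(self, now: float) -> Optional[str]:
        """Return an alert message the first time the master looks absent."""
        if self.last_seen is None:
            return None  # no heartbeat received yet, nothing to compare against
        silent_for = now - self.last_seen
        if silent_for > self.timeout_seconds and not self.alerted:
            self.alerted = True  # notify the admin once, not on every check
            return f"master silent for {silent_for:.0f}s; standby awaiting activation"
        return None


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.record_heartbeat(now=0.0)
print(monitor.check(now=10.0))  # within the timeout
print(monitor.check(now=45.0))  # timeout exceeded, alert raised
print(monitor.check(now=60.0))  # already alerted, no repeat
```

Timestamps are passed in explicitly rather than read from the clock so the logic is easy to test; a real monitor would also need to distinguish a failed master from a broken network link.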

The admin's work is complete, but Automate Schedule on the standby server will be busy:

  • All missed jobs are automatically calculated and managed according to rules provided by the admin (which can include simply placing them in a "bucket" for the admin to look at later).
  • Scheduling and submitting of jobs pick up where they left off on the master.
  • All agents automatically switch to the standby server and report the status of jobs in progress during master failure.
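The missed-job handling listed above can be illustrated with a short sketch. It assumes a simple model that is not taken from the product: a forecast of scheduled runs, a set of runs known to have completed, and per-job rules with the hypothetical actions "run", "skip", and "hold" (the "bucket" for later review). All names here are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class ScheduledRun:
    job: str
    due: datetime


def find_missed_runs(
    forecast: List[ScheduledRun],
    completed: Set[Tuple[str, datetime]],
    outage_start: datetime,
    outage_end: datetime,
) -> List[ScheduledRun]:
    """Runs that were due during the outage and have no completion record."""
    return [
        run for run in forecast
        if outage_start <= run.due < outage_end
        and (run.job, run.due) not in completed
    ]


def apply_rules(
    missed: List[ScheduledRun],
    rules: Dict[str, str],
    default: str = "hold",
) -> Dict[str, List[ScheduledRun]]:
    """Group missed runs by the admin-supplied action for each job."""
    plan: Dict[str, List[ScheduledRun]] = {"run": [], "skip": [], "hold": []}
    for run in missed:
        plan.setdefault(rules.get(run.job, default), []).append(run)
    return plan


t0 = datetime(2024, 1, 1, 22, 0)
forecast = [
    ScheduledRun("inventory", t0),
    ScheduledRun("pos_close", t0 + timedelta(hours=1)),
    ScheduledRun("report", t0 + timedelta(hours=2)),
    ScheduledRun("report", t0 + timedelta(hours=10)),  # due after the outage
]
completed = {("inventory", t0)}  # finished before the master failed
missed = find_missed_runs(forecast, completed, t0, t0 + timedelta(hours=6))
plan = apply_rules(missed, {"pos_close": "run"})
for action, runs in plan.items():
    print(action, [r.job for r in runs])
```

Defaulting to "hold" reflects the concern raised earlier: re-running a job that already completed can be as damaging as not running it at all, so unknown cases are parked for a person to decide.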

Since Automate Schedule uses streaming database replication under the covers, the product setup and history on the standby should be very close, if not identical, to the master's at the moment of failure.

 

 

Conclusion

Automate Schedule keeps your mission-critical processes protected from long-term, unscheduled downtime with a replicating database and a simple recovery procedure. Beyond system failure and data protection, workload automation solutions like Automate Schedule minimize the risks that arise from less obvious sources, such as process chains, manual input, or undocumented dependencies.

Finally, workload automation provides greater reliability and control over your entire network of processes, job streams, and tasks across multiple servers and platforms.

Visit Automate Schedule for more information or call 1 877-506-4786 for a live demonstration of Automate Schedule’s high availability feature.

 

 
