IT Centralization and Robot

Introduction

In recent years, CIOs have been called on to deliver increasingly broad and ambitious goals with smaller teams and shrinking budgets. Many feel that relying on existing IT models has stretched them too thin. Many CIOs now look toward the centralized IT model, which promises to make it easier to deliver more ambitious results and higher efficiency.

Business units, development, quality assurance, and operations teams that traditionally kept a safe distance from each other are now finding that meeting these new goals requires unprecedented levels of communication and collaboration, as well as sophisticated systems management tools that can add efficiency and help bridge the gap.

This paper puts forward a case for IT centralization with IBM i and explains the important role automated systems management technologies like Robot have to play in centralized IT environments.

How Centralization Impacts IT Teams

Service Metrics

In centralized IT environments, business units are increasingly dependent on consistently high levels of service from central IT to achieve their goals. Their budgets often support central IT so it’s common to agree to formal Service Level Agreements (SLAs) between both parties. SLAs are service contracts that define particular aspects of the service, for example scope, responsibilities, and quality.

Service metrics or key performance indicators (KPIs) are used to continuously measure performance against SLAs. A significant challenge in centralized IT environments is consistency in delivery on KPIs. For example, a trouble ticket with the potential to impact KPIs generated at 10:00 a.m. is generally quicker and easier to resolve than one generated at 10:00 p.m. This is because staff required to take action on the ticket are more likely to be at work at 10:00 a.m. Even with shift-working, there tend to be skeleton or on-call crews working at certain periods.
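To make the KPI idea concrete, here is a minimal Python sketch of one possible metric: the fraction of tickets resolved within an agreed target time. The function name, the one-hour target, and the two sample tickets are all hypothetical and are not taken from any real SLA or tool.

```python
from datetime import datetime, timedelta

def sla_compliance(tickets, target):
    """Return the fraction of tickets resolved within the target duration.

    tickets: list of (opened, resolved) datetime pairs
    target:  maximum allowed resolution time as a timedelta
    """
    if not tickets:
        return 1.0
    met = sum(1 for opened, resolved in tickets if resolved - opened <= target)
    return met / len(tickets)

# Hypothetical sample: one daytime ticket and one late-evening ticket.
tickets = [
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 10, 30)),  # 30 minutes
    (datetime(2024, 1, 8, 22, 0), datetime(2024, 1, 9, 1, 0)),    # 3 hours
]
compliance = sla_compliance(tickets, timedelta(hours=1))
```

In this illustrative data, the late-evening ticket misses the target, which is the kind of inconsistency the paragraph above describes.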

Staffing Pressures

Meanwhile, “Do more with less!” is a common senior management battle cry within many organizations, particularly during periods of economic downturn and recovery. A significant cost within infrastructure and operations (I&O) is staffing. Business pressures to reduce permanent IT staff to a minimum can lead to an overworked and frustrated IT workforce where the most experienced staff get the most work, much of it routine.

Part of the solution may be to use temporary expert staff to bridge the gap, but they are typically only available at pre-agreed times. Another important part of the solution is automation.

Automation

Automation is an inevitable response to the rigors of SLAs and pressure on IT staffing, coupled with the growing trend toward centralized IT. Centralizing IT has the consequence of reducing the number of systems within an organization. Fewer systems running greater workloads introduce more opportunities for human error. Automation helps remove the risks associated with human interaction and helps ensure 24/7 consistency.

Gartner has identified that there is no single automation technology to cover all areas of centralized IT. Instead, specific, targeted, best-of-breed automation technologies provide the best results. Gartner identifies job scheduling, workload automation, and event management as areas of I&O where automation is most frequently deployed.

Traditional automation is task- and workflow-based, aimed at routine tasks that are deterministic in nature. These automation tools have helped IT teams keep the need for human intervention to a minimum. However, task and workflow automation tools have limited flexibility and intelligence. When unpredictable conditions arise, they still introduce operational delays and still need frequent human intervention.

A new breed of heuristic automation technologies is being developed. The first stage of heuristic automation is the ability to proactively monitor systems and apply intelligent steps (initially predefined by humans) when specific events happen (or don’t happen), providing a corrective plan B or plan C. This event-driven approach greatly reduces the need for human intervention and allows for improved I&O efficiency.
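Reacting to events that do not happen is the less obvious half of this idea. The following Python sketch shows one way to detect expected events that have passed their deadline and to look up predefined corrective steps for them. The event names, deadlines, and plan steps are hypothetical and are not drawn from any product.

```python
from datetime import datetime

def overdue_events(expected, seen, now):
    """Return names of expected events that are past their deadline and unseen.

    expected: dict mapping event name -> deadline (datetime)
    seen:     set of event names that have already occurred
    """
    return sorted(
        name for name, deadline in expected.items()
        if name not in seen and now > deadline
    )

def plan_for(event, plans):
    """Look up the predefined corrective steps for an event.

    Falls back to notifying an operator when no plan has been defined.
    """
    return plans.get(event, ["notify operator"])

# Hypothetical expectations for one night.
expected = {
    "FILE_RECEIVED": datetime(2024, 3, 1, 2, 0),
    "BACKUP_DONE": datetime(2024, 3, 1, 4, 0),
}
plans = {"FILE_RECEIVED": ["retry transfer", "use previous file"]}
late = overdue_events(expected, {"BACKUP_DONE"}, datetime(2024, 3, 1, 3, 0))
```

Here the steps in `plans` play the role of plan B and plan C, tried only after the primary expectation has failed.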

In the future, automation technologies are expected to evolve to further embrace heuristics. Future technologies are expected to use mathematical algorithms to “learn”. For example, they will be able to generate modular, re-usable processes to resolve an incident and then store this new corrective process in a knowledge base to be re-used to handle future unforeseen or unknown incidents of a similar nature.

The strength of future heuristic automation technologies lies in their knowledge base architecture, making them more similar to a human administrator. Over time, all standard incidents could be prevented proactively.

Automated Job Scheduling on IBM i

Good

The traditional approach to IBM i job schedule automation is to use a built-in task- and workflow-based scheduler such as the IBM Job Scheduler. Operators would configure jobs in the scheduler in the order they wish them to run or at the time they wish them to run.

This approach is preferable to running jobs manually, but it requires the operators to keep eyes on glass, monitoring around the clock just in case a job fails and blocks the job queue resulting in negative business impact.

With this approach, job streams can take a long time to complete. This is because, wherever there are dependencies on jobs in other streams or other systems, either manual intervention is required or a job must be sufficiently delayed to ensure prerequisite jobs of unknown run duration complete.

Better

An improvement over the traditional approach is to develop scripts that programmatically manage related jobs within the schedule. For example, the jobs associated with an end-of-day procedure for an accounting system can be programmatically managed as one unit. Although groups of jobs are being programmatically managed within a single job stream, there is still little interconnectivity with other job streams, other systems, or interfaces to other platforms.

With this approach, there are many scripts (CL programs) to maintain and there is a significant potential for bugs to introduce problems. It is rare for operators to get completely away from the 24/7 eyes-on-glass approach of monitoring. Plus, if you add something new, you have to modify a script, which might break the process.

Best

Robot Schedule provides a heuristic, event-driven approach to job scheduling. Each IBM i job is defined within the knowledge base, along with a set of conditions that help ensure each job runs successfully every time, for example the parameters that each job must receive and any prerequisite conditions.

Rather than specifying a specific date and time to run the job, we specify a broader definition. For example, run this job every Friday except public holidays after 1:00 a.m. but do not start the job if it’s after 7:00 a.m. Some of these rules and conditions can be specified using a flexible operator assistance language (OPAL) specifically designed for the purpose.
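The rule in this example can be written down as a small predicate. The sketch below is plain Python, not OPAL and not Robot Schedule syntax; the function name and the holiday date are assumptions made for illustration.

```python
from datetime import date, datetime, time

def may_start(now, holidays, earliest=time(1, 0), latest=time(7, 0)):
    """Return True if a Friday-only job is allowed to start at `now`.

    now:      the current datetime
    holidays: set of dates on which the job must not run
    """
    if now.weekday() != 4:        # Monday is 0, so Friday is 4
        return False
    if now.date() in holidays:
        return False
    # Start only inside the allowed window (1:00 a.m. to 7:00 a.m. by default).
    return earliest <= now.time() <= latest

# Hypothetical holiday calendar containing a single Friday.
holidays = {date(2024, 3, 29)}
```

A broader definition like this lets the scheduler choose any moment inside the window, rather than failing when one fixed start time is missed.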

Robot Schedule uses the job definitions from its knowledge base to maintain dynamic job schedules continuously. It has learned from previous runs how long specific jobs take and allows for this in the dynamically maintained schedule. If jobs overrun, underrun, or start late, it can adapt the schedule accordingly. Dynamic parameters are used for every job to eliminate all hard coding and this significantly reduces operator workloads.
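As a rough illustration of how run history can feed a dynamic schedule, the Python sketch below estimates each job's duration from past runs and projects start times for a simple sequential stream. It is not how Robot Schedule is implemented; the averaging method, the function names, and the job names are all assumptions.

```python
from datetime import datetime, timedelta
from statistics import mean

def estimate_minutes(history, default_minutes):
    """Estimate a job's run time as the mean of its past run times."""
    return mean(history) if history else default_minutes

def project_schedule(start, jobs, history, default_minutes=30):
    """Project start times for jobs that run one after another.

    start:   datetime at which the first job starts
    jobs:    job names in execution order
    history: dict mapping job name -> list of past durations in minutes
    """
    schedule = []
    cursor = start
    for job in jobs:
        schedule.append((job, cursor))
        cursor += timedelta(minutes=estimate_minutes(history.get(job, []), default_minutes))
    return schedule

# Hypothetical stream: REPORT has no history, so the default applies to it.
plan = project_schedule(
    datetime(2024, 3, 1, 1, 0),
    ["EXTRACT", "LOAD", "REPORT"],
    {"EXTRACT": [10, 20], "LOAD": [5]},
)
```

If a job overruns or starts late, calling `project_schedule` again from the actual time gives an updated projection for the remaining jobs.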

Some jobs must only run when particular events occur. For example, running the ERP data import job only if a specific data file has been successfully received. The heuristic, event-driven capabilities of Robot Schedule allow these unpredictable events to initiate new jobs. Again, the schedule is dynamically adjusted to accommodate this previously unscheduled event while still respecting the rules of all pre-existing jobs.

This heuristic, event-driven job scheduling approach makes more efficient use of system resources by eliminating idle time, gaps in job schedules leading to inefficiency, and eyes-on-glass monitoring. It also eliminates human error, which is a common cause of application downtime.

For centralized IT environments that can have thousands of jobs to process every day and rigorous SLAs to meet with minimal staff available for repetitive routine tasks, Robot Schedule is the best option. Its highly mature and robust technology is used by hundreds of organizations in centralized IT environments.

Multi-Platform Job Scheduling

Good

Most IBM i systems in centralized IT environments have interfaces and dependencies on Windows, UNIX, or Linux systems. The traditional approach to solve this is to manually use FTP (or other means) to share data and objects across platforms. Once data is generated on the source system and is shared with the target system, it can be imported or processed on the target platform.

Traditionally, operators keep eyes on glass to see when files are ready for transfer and then take manual steps to transfer, check status, and initiate import jobs. This process can be effective but requires a lot of human input, often introducing idle time, and is vulnerable to human error. It’s not uncommon to accidentally delete a file before it is sent.

Better

An improvement over the traditional approach is to develop scripts on source and target platforms that communicate with each other. The source scripts typically send a flag to the target scripts when a file has been transferred, which triggers a job on the target.
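A common way to implement such a flag is a marker file written only after the data file is complete. The Python sketch below shows both sides of that handshake on a shared directory. The `.ready` suffix and the function names are assumptions, and a real setup would also need the transfer step itself.

```python
from pathlib import Path
import tempfile

def publish(directory, name, payload):
    """Source side: write the data file, then a flag file marking it complete."""
    data = Path(directory) / name
    data.write_text(payload)
    # The flag is created only after the data file is fully written.
    (Path(directory) / (name + ".ready")).touch()
    return data

def ready_files(directory):
    """Target side: list data files whose flag file is present."""
    found = []
    for flag in Path(directory).glob("*.ready"):
        data = flag.with_suffix("")   # strip the trailing ".ready"
        if data.exists():
            found.append(data)
    return sorted(found)
```

The target script would call `ready_files` periodically and start its import job for each file returned, ignoring files that are still being written.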

Although this cuts down on human errors, human workload, and idle time, it still suffers from several limitations. Namely, the scripts need to be maintained. In addition, some of the scripts do not have sufficiently robust error handling and can fail if unpredictable events occur. What happens if the target system is unexpectedly busy and needs to delay the import job? Most scripts can’t react to that without human intervention such as manually holding jobs. In these scenarios, it is difficult for operators to get away from the 24/7 eyes-on-glass monitoring approach.

Best

Robot Schedule Enterprise is a seamless extension of Robot Schedule that uses agents on Windows, UNIX, and Linux systems to allow sophisticated and flexible multi-platform schedules controlled from IBM i. Windows, UNIX, and Linux jobs can be defined to the knowledge base in the same way that IBM i jobs are, with no platform-specific technical knowledge required.

Job flows can be created spanning multiple platforms and viewed and changed graphically with the Robot Schedule Enterprise blueprint function. Heuristic, event-driven technology is now seamlessly extended to all the platforms with which your IBM i needs to interact.

This is the best option for centralized IT environments because dynamic schedules respect overruns, underruns, late starts, and other unpredictable events across all platforms. There is little or no need for 24/7 eyes-on-glass monitoring, human errors are eliminated, and this leads to more efficient, centralized I&O.

Automated Message Management

Good

Centralized IBM i systems generate a lot of messages. The IBM i platform is very chatty when it comes to telling operators what is going on via messages. It’s not uncommon on larger centralized systems to receive hundreds, sometimes thousands, of messages every day. Managing messages correctly is essential for keeping IBM i systems running efficiently and avoiding downtime and errors.

The traditional approach to message management is for operators to manually monitor the system operator’s message queue plus other important message queues around the clock. Their level of experience would determine which messages are important. Their experience would also determine corrective actions. For example, if an important end-of-day accounting job sends a message with a message identifier code of CPF5272 and the text description “Records not added to member”, it means that an IBM i file or table has reached its maximum currently allowed size and this job has been paused pending permission to increase the file size.

The corrective action requires some knowledge. For example, the business volumes may be growing and the file therefore may need to be larger. In this scenario it is usually acceptable to give permission to increase the file size. However, if this file should never be so large under normal circumstances, it could indicate a programmatic problem, in which case permission might not be given.

Although the traditional approach can work well, it requires a lot of operator time. Having relatively expensive operators stare at messages is hardly an efficient use of their time. Much of the I&O budget can be consumed with overtime payments to operators to work the night shift, mostly to manually monitor messages.

The higher the volume of messages, the higher the chance that human errors can be introduced. Important messages can be missed at busy times. Sometimes accidentally, possibly due to fatigue, important messages may be given an incorrect response, introducing errors that can cause application downtime.

Better

A natural evolution is for operators to create scripts and use auto-reply lists to automate some of the handling of messages. This reduces the number of messages that operators must look at and respond to. However, there are limitations to this approach.

Scripts must be frequently maintained and are often prone to bugs. They generally do not go through the same stringent quality assurance processes that independent software vendor tools go through.

Automated reply lists are also very simplistic. For example, whenever a reply list sees a specific message, it always responds in the same way. As discussed earlier, the same message might require different corrective actions depending on the circumstances. This approach still requires eyes on glass and it does not eliminate the opportunity for human errors.
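The difference between a fixed reply and a context-dependent one can be shown in a few lines of Python. This is an illustrative sketch only: the reply names `ALLOW_GROWTH` and `ESCALATE`, the file names, and the record limits are hypothetical and are not actual IBM i reply values.

```python
def fixed_reply(message_id, reply_list):
    """Static auto-reply: the same message ID always gets the same answer."""
    return reply_list.get(message_id)

def contextual_reply(message_id, file_name, record_count, expected_max):
    """Context-aware rule: the answer depends on whether growth looks normal.

    expected_max maps a file name to the largest record count considered
    normal for that file (a hypothetical, operator-maintained table).
    """
    if message_id != "CPF5272":
        return None
    limit = expected_max.get(file_name)
    if limit is not None and record_count <= limit:
        return "ALLOW_GROWTH"     # growth is within the expected range
    return "ESCALATE"             # unexpected size, possibly a program fault

expected_max = {"ORDERS": 1_000_000, "CONFIG": 100}
```

With a fixed list, a runaway program filling the small file would receive the same permission to grow as a legitimately busy order file.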

Best

Robot Console is the most sophisticated solution for message management in centralized IT environments. It uses heuristic, event-driven rules coupled with a knowledge base to manage all messages reliably for an IBM i system. Robot Console learns which messages are generated most frequently. For example, it is typical that the 20 most frequently received messages represent 80 percent of total message volume. Complex rules can be set up that leverage OPAL to provide corrective actions. Messages of low priority can be filtered out and messages important to mission-critical applications can be highlighted.
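To check how concentrated message volume is on a given system, a simple frequency count is enough. The Python sketch below computes the share of total volume covered by the most frequent message IDs. It assumes the message IDs have already been exported to a list, and the function name is hypothetical.

```python
from collections import Counter

def top_message_share(message_ids, top_n=20):
    """Return the fraction of total volume covered by the top_n message IDs."""
    counts = Counter(message_ids)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counts.most_common(top_n))
    return top / total
```

A high share for a small `top_n` suggests that automating the handling of a handful of message IDs would remove most of the manual workload.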

Not all conditions that require operator attention generate a message. For example, we don’t get a message if a job doesn’t start or a subsystem isn’t active. Robot Console has a proactive resource monitoring capability that continuously monitors resources such as lines, controllers, subsystems, jobs, and job queues to ensure they are in a correct state. For example, we may decide that between 9:00 a.m. and 6:00 p.m. job queue QBATCH should not have more than five jobs waiting to run. If it does, then Robot Console can run an OPAL routine to take corrective actions such as automatically moving some of the jobs to other queues.
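The QBATCH rule above can be expressed as a small check. The following Python sketch is not OPAL and does not call any Robot Console interface; it only illustrates the logic, and the function name and job names are assumptions.

```python
from datetime import time

def jobs_to_move(now, waiting, limit=5, start=time(9, 0), end=time(18, 0)):
    """Return the waiting jobs beyond `limit` when `now` is inside the window.

    now:     current time of day
    waiting: job names waiting in the queue, in queue order
    """
    if not (start <= now <= end):
        return []                 # the rule applies only during the window
    return waiting[limit:]        # everything past the first `limit` jobs

waiting = ["J1", "J2", "J3", "J4", "J5", "J6", "J7"]
overflow = jobs_to_move(time(10, 0), waiting)
```

The jobs returned are the candidates for a corrective action such as moving them to another queue.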

Operators can access an optimized and centralized view of all important messages and the status of monitored resources if they wish; however, the goal is to automate as much as possible. This approach eliminates the eyes on glass issue as well as human error.

Real-Time Notification and Escalation

Good

Some events require human intervention. Let’s say the nightly tape save has run out of tape cartridges to complete the save operation. This requires a human to put in a new tape cartridge. Sometimes, if many unexpected jobs have caused schedules to be extended such that jobs are running late and might miss a morning deadline, a human needs to be informed.

The traditional approach to this is, you guessed it, to have eyes on glass 24/7. Operators are manually monitoring messages and taking appropriate actions. This works, but has limitations in that it consumes a lot of operator time and is prone to delays and human error. It’s not uncommon to hear “Oops, I didn’t see that message until 7:00 a.m.”

Better

An improvement to the traditional approach is to implement a stand-alone tool that converts high priority messages into emails and sends them to an operator’s email address. This has the benefit of allowing operators to move about and do more important things than monitor around the clock.

This approach, while a step in the right direction, has some limitations. A typical stand-alone email tool is passive and relies too heavily on individual message IDs appearing in a queue and too little on heuristic, event-driven automation. As a result, operators may be bombarded with too many repetitive emails, many of which could and should be handled automatically.

Worse than this, operators may not receive emails when certain important conditions occur. In addition, operators sometimes miss emails. Sometimes they may be off duty. Many stand-alone email tools do not understand who is on call and don't even know if an email has been read by the recipient.

Best

Robot Alert is the most sophisticated solution for real-time notification and escalation in centralized IT environments. It is tightly integrated with all other Robot technologies, such as Robot Schedule and Robot Console, so it takes advantage of heuristic, event-driven automation.

It doesn’t just passively monitor for message IDs. Let’s say Robot Console receives a message during a nightly tape save that indicates an unreadable tape. The corrective action would typically start with attempts to solve the problem without escalation. First, try to re-initialize the tape. If that doesn’t work, skip the tape and move to the next in the stack. Only if that doesn’t work do we need to escalate.
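The tape example follows a general pattern: try corrective steps in order and escalate only when all of them fail. A minimal Python sketch of that pattern is shown below. The step names and the stand-in actions are hypothetical; real steps would call system commands.

```python
def resolve(steps):
    """Run corrective steps in order and stop at the first that succeeds.

    steps: list of (name, action) pairs; each action returns True on success.
    Returns (name of the successful step or None, list of steps attempted).
    A result of None means every step failed and escalation is needed.
    """
    attempted = []
    for name, action in steps:
        attempted.append(name)
        if action():
            return name, attempted
    return None, attempted

# Stand-in actions: re-initializing fails, skipping to the next tape works.
steps = [
    ("re-initialize tape", lambda: False),
    ("skip to next tape", lambda: True),
]
fixed_by, attempted = resolve(steps)
```

Only when `fixed_by` is `None` would an alert need to go to a person, which keeps routine recoveries out of the operator's inbox.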

Email alerts can be triggered from multiple Robot products, from OPAL, and from customer programs. Robot Alert can be told whether a read acknowledgement is required. Then, the operator receiving the email has to respond to show they have read the email.

Robot Alert also keeps a knowledge base of who is on call and whom to escalate to if there is no acknowledgement. Emails allow operators to respond remotely and can contain attachments to give remote operators more information.

If your organization uses an information technology service management (ITSM) solution, then a ticket can be automatically generated via SNMP messages. The result is a highly automated and effective system of alerting and escalation that saves time, improves I&O efficiency, and reduces the potential for error in centralized IT environments.

Multi-System Environments

Good

Large centralized IBM i systems typically consist of multiple logical partitions (LPARs) within multiple individual IBM i servers.

Traditional approaches to job scheduling and message management are manual and treat each LPAR as a standalone system. Job schedules and messages cannot span more than one LPAR. Coordinating workloads and messages manually between LPARs is labor-intensive and can introduce idle time and human error. Some organizations may have high numbers of LPARs in their environments, so the manual approach quickly becomes impractical.

Better

An improvement over the traditional approach is to create custom scripts that send flags between LPARs to trigger jobs or take other actions. Although this is an improvement over the manual approach, it’s also clumsy. Scripts need to be maintained. They can have bugs and they often run into snags.

Best

Robot Network is the most sophisticated solution for multi-system environments. Robot Network is tightly integrated with other Robot products, such as Robot Schedule and Robot Console. It allows them to virtualize multiple LPARs and multiple systems.

For example, when setting up a job flow in Robot Schedule, we can now identify the LPAR on which each job should run. A job on one LPAR can wait for the completion of jobs on another LPAR before it runs a job on yet another LPAR and so on.
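A cross-LPAR job flow is essentially a dependency graph whose nodes carry a system name. The Python sketch below derives a valid run order from such a graph using the standard library's `graphlib` module (Python 3.9 or later). The LPAR and job names are hypothetical, and this is not how Robot Network represents job flows.

```python
from graphlib import TopologicalSorter

def run_order(dependencies):
    """Return an execution order in which every prerequisite runs first.

    dependencies: dict mapping (lpar, job) -> set of (lpar, job) prerequisites
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical flow: extract on LPAR1, load on LPAR2, report on LPAR3.
flow = {
    ("LPAR2", "LOAD"): {("LPAR1", "EXTRACT")},
    ("LPAR3", "REPORT"): {("LPAR2", "LOAD")},
}
order = run_order(flow)
```

Because each node names its LPAR, the same ordering logic covers jobs on one system or many.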

Robot Console could wait for the arrival of certain messages on multiple LPARs before triggering an event on another LPAR. Large centralized IT environments could utilize a map view to see the status of all LPARs as nodes on a single graphical display and drill down to see more detail.

In such environments, a lot of operator time can be saved and errors eliminated by maintaining a centralized set of rules in a knowledge base and distributing these to remote systems.

Robot Network also allows the management of IBM i systems from anywhere via a browser-based interface. It contains high-level status displays to view multiple IBM i systems on a single screen in terms of performance, message status, and Robot product metrics, which indicate activity levels in a granular way.

Conclusions

DevOps, I&O, cloud computing, big data, analytics, mobile applications, and enterprise application software are driving the trend toward IT centralization. As CIOs increasingly pursue IT centralization as a solution for delivering more ambitious results with higher efficiency, delivering consistently on SLAs can be a challenge without automated tools. Meanwhile, reductions in IT staff lead to an overworked and frustrated workforce.

Luckily, IBM i offers an excellent platform for IT centralization. There are sophisticated options available for the heuristic automation essential to efficient I&O, especially in the most common areas of job scheduling, workload automation, and event management.

Robot systems management solutions provide heuristic, event-driven, automated job scheduling across IBM i, Windows, UNIX, and Linux, as well as best-of-breed message management for centralized IBM i environments. These enterprise-class solutions for real-time notification and escalation, along with the ability to automate all your IBM i servers and LPARs as a single environment, make successful management of your centralized IT environment easily within reach.

Let’s Get Started

Call us at 800-328-1000 or email [email protected] to set up a personal consultation. We'll review your current setup and see how the Robot products can help you achieve your automation goals.