Runaway jobs and processes, including database server jobs, are infamous for consuming resources and degrading system performance right under your nose.
With all the user-submitted queries, external applications, and batch and interactive jobs hammering away at your system, how do you stay a step ahead of these never-ending CPU gobblers?
That’s where this guide comes in!
These five articles are fast reads full of insights that will illustrate the value—in terms of both time and money—of proactive job monitoring and automated notification. Read on to learn:
- How Licensed Internal Code tasks impact CPU behind the scenes
- How to stop looping QZDASOINIT jobs in their tracks
- How a handful of job monitoring issues can cost you over $1,000,000
Don’t let rogue jobs put a stranglehold on system resources any longer. This guide will show you how to get visibility into job performance issues and resolve them faster using sophisticated monitoring tools purpose-built to handle even the most taxing workloads.
Where Does CPU Go When It’s Missing?
Have you ever noticed that the CPU used by individual jobs does not add up to 100 percent? As it turns out, your jobs and subsystems are only part of the story. For the stunning conclusion, we return you to the Licensed Internal Code (LIC).
Fans of science fiction—and science fact, for that matter—are certainly familiar with the concept of parallel universes or worlds that exist next to ours but are separate from it. Most often, the people, places, and events in these parallel worlds are similar to our own world but have many fundamental differences.
While parallel universes are a staple in science fiction and advanced physics, you don’t need to probe into either to see them in action. In fact, look no further than your trusty IBM i to witness a case of parallel universes.
On your IBM i, two universes coexist:
- The world of the IBM i operating system
- The world of the Licensed Internal Code (LIC)
While these “universes” share the same physical foundations (Power Systems servers, etc.), the world of the operating system—which we can communicate with—and the world of Licensed Internal Code are quite different.
Work Management with Jobs and Threads
A great example of this is work management. Work management in the IBM i operating system involves jobs, subsystems, memory pools, job queues, routing entries, and so on. If you are an administrator or operator, you deal with this world all the time.
In this universe, jobs use CPU. If you want to find out where all that good CPU is being used, you turn to your preferred tool for displaying jobs. Each job consists of one or multiple threads. Threads are pieces of computer work that can execute independently from each other but are also united under a common job. This enables the threads inside a job to share resources more efficiently than they could if they were individual jobs. Think of them as subdivisions within a job.
Jobs and threads: that is the familiar world of the operating system.
Work Management with Tasks
Now let’s take a look at the parallel universe of Licensed Internal Code (LIC). You can’t access this world unless you are equipped with IBM service software or in-depth knowledge of the System Service Tools.
In the strange and unusual world of LIC, the main work management concept is not the job but the task. In terms of work management, it really doesn’t get any more atomic than this, unless you get down to the actual processor cores. There are no subtasks.
Tasks come in four flavors:
Initial tasks
The initial task is the LIC counterpart to the SCPF job in IBM i. During an IPL, one initial task is started, which then starts up the other tasks. It plays the same role as the init process on Linux and AIX or the session manager process (smss.exe) on Windows.
Resident tasks and non-resident tasks
Resident and non-resident tasks are used for LIC-specific functions. These tasks don’t have a direct counterpart in the IBM i world. They are used to manage things like database operations, communications, and paging. (Resident tasks are memory-resident; they remain in main storage as long as they are active.)
Lightweight process tasks
Lightweight process tasks are the LIC’s counterpart to IBM i’s threads. Each lightweight process task corresponds to an IBM i thread and vice versa. Calling them “lightweight” is a bit of a misnomer, since they represent all the work that applications perform on the system.
This double distinction into operating system and LIC work management, and of LIC tasks into four different kinds, will of course lead you to the one, inescapable conclusion…
The LIC is the place where your missing CPU goes!
CPU Usage and System Tasks
Have you ever noticed that the CPU used by individual jobs does not add up to 100 percent? It’s not a matter of rounding—that only contributes a few percent difference maximum. No, the CPU that your jobs use—or really the LIC tasks that embody the threads that make up those jobs—is only one part of what uses up CPU on your system.
The other CPU consumers are the resident and non-resident LIC tasks (the initial task does not really account for CPU usage after the IPL). These tasks, together called system tasks, represent under-the-hood work, the mechanics of the LIC universe.
In some cases, the system tasks consume a larger portion of CPU than the jobs themselves. A classic example of this is the restore of a large number of objects. While the restore command is run inside a job, the OS offloads the brunt of the actual restore work (think unzipping) to specialized LIC assist tasks.
As a result, when a large RSTOBJ or RSTLIB is performed, you see the total system CPU go up—sometimes to over 50 percent—but you won’t see any noticeable CPU increase for any job. On the contrary, your application jobs may well be using less CPU because they are temporarily being displaced by the assist tasks!
If you ever notice what seems like an unusually high amount of CPU in comparison to the CPU used by your jobs, then the best place to look is system tasks.
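The arithmetic behind this gap can be sketched in a few lines of Python. This is only an illustration: the job names and percentages below are made-up sample values, not measurements, and the function name is hypothetical. The idea is that whatever share of total system CPU is not accounted for by jobs is the share to look for among system tasks.

```python
def cpu_not_in_jobs(total_system_cpu, job_cpu_by_name):
    """Return the CPU percentage that is not accounted for by any job."""
    accounted = sum(job_cpu_by_name.values())
    # Clamp at zero: rounding in per-job figures can push the sum
    # slightly above the reported system total.
    return max(0.0, total_system_cpu - accounted)


# Hypothetical snapshot taken during a large restore.
jobs = {"PAYROLL": 6.0, "QZDASOINIT": 9.5, "RESTOREJOB": 1.5}
gap = cpu_not_in_jobs(62.0, jobs)
print(f"CPU not attributable to jobs: {gap:.1f}%")
```

A large gap like the one in this sample is the signal to inspect system tasks rather than keep searching the job list.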
Some System Tasks and Their Functions
It’s important to know which system tasks are using your CPU because it opens a window into a world (the LIC) that is normally closed to you, but which can very concretely impact the way that applications work in the IBM i universe. Luckily, Robot Monitor ships with a display for exactly that: CPU usage of the top CPU-using LIC tasks.
What sort of tasks would you see? Unfortunately, documentation from IBM on this topic is spotty at best. One source is an old IBM Redbook containing a partial list of these tasks in an appendix. Some of the most important system tasks are contained in this list, but many are still missing. Still, it’s better than no documentation at all.
Thanks to our good friends at IBM, we were able to find out the function behind some additional system tasks. These are listed below.
|Task Name||Task Function|
|DBIO01, DBIO02, ...||Database storage management, one per DASD unit|
|DbpmServer||Database; pre-brings variable-length fields into main storage|
|LDFX01, LDFX02, ...||Load/Dump (i.e., Save/Restore) tasks|
|RM*||Related to Collection Services or extending storage/byte stream files|
|RMTMSAFETA||A timer task that services LIC timer function requests that are not handled by an interrupt handler task already in main storage; usually related to communications activity|
|SMMIRRORMC||Storage management; ASP mirroring management|
|SMPOLnnn||Storage management, page-out task|
|SMXCAGER01, SMXCAGER02, ...||Expert cache task, one per shared storage pool that is set to *CALC|
|SMXCSPRVSR||Expert cache supervisor|
|VIO-WORKER||VIOS and HEA (virtual Ethernet card) worker tasks, servicing virtual SCSI, virtual fibre channel, and virtual Ethernet activity|
This information should come in handy the next time you have to determine why your CPU is increasing.
Getting to Grips with QZDASOINIT Jobs
When QZDASOINIT jobs make CPU spike, it’s usually the poorly-written SQL code running in them that causes the issue. But how do you know which database server job or which SQL statement is to blame?
When an IBM i environment experiences rapid CPU consumption or other conditions that start to impact memory, there are some jobs that will make the list of usual suspects more often than not. While QZDASOINIT jobs (JDBC/ODBC connections) almost always make the lineup, they are rarely solely to blame.
Often, they are simply guilty of keeping the wrong sort of company—it's usually the poorly-written SQL code running in them that causes these issues, provoking QZDASOINIT jobs to gorge themselves on as much CPU as they can until they are identified and stopped!
So, what is the true nature of the QZDASOINIT jobs and how can we keep them from running wild in the system? QZDASOINIT is the job name for the database SQL server jobs. These jobs are used to serve SQL to JDBC and ODBC client applications and normally run in subsystem QUSRWRK. System i Navigator jobs also use this job name when running a query through the SQL window.
When CPU spikes on the system, it can be very difficult to determine which job or series of jobs are contributing to the problem since they all share the same name. Potentially, there could be hundreds of QZDASOINIT jobs that are collectively creating a big impact on CPU rather than a lone runaway culprit.
Get Visibility into QZDASOINIT Jobs
Administrators need visibility on the issue as a starting point. This can be achieved by running the WRKACTJOB command, followed by manual batch investigation and resolution at the job level (repeating the process for each system). The information returned by this command still leaves important questions unanswered: Who is running these jobs? What proportion of overall CPU is being consumed?
Answering these questions requires a greater degree of insight to give more meaningful understanding and context to any issues for faster problem resolution. After all, the longer the investigation process takes, the more CPU is consumed. It’s clearly in everyone’s financial interests to resolve this type of problem quickly.
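To make those two questions concrete, here is a minimal Python sketch of the kind of aggregation involved. It assumes you already have a per-job snapshot in hand; the record layout (`name`, `current_user`, `cpu_pct`), the user names, and all figures are hypothetical, and this is not how any particular monitoring product works internally.

```python
from collections import defaultdict


def cpu_by_user(jobs, job_name="QZDASOINIT"):
    """Sum CPU per current user for jobs sharing one job name, highest first."""
    totals = defaultdict(float)
    for job in jobs:
        if job["name"] == job_name:
            totals[job["current_user"]] += job["cpu_pct"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical snapshot of active jobs.
sample_jobs = [
    {"name": "QZDASOINIT", "current_user": "BIREPORT", "cpu_pct": 12.5},
    {"name": "QZDASOINIT", "current_user": "WEBAPP", "cpu_pct": 3.0},
    {"name": "QZDASOINIT", "current_user": "BIREPORT", "cpu_pct": 7.5},
    {"name": "PAYROLL", "current_user": "FINANCE", "cpu_pct": 5.0},
]
total_system_cpu = 50.0  # hypothetical total system CPU percentage

ranking = cpu_by_user(sample_jobs)
group_total = sum(cpu for _, cpu in ranking)
for user, cpu in ranking:
    print(f"{user}: {cpu:.1f}% CPU")
print(f"QZDASOINIT share of system CPU: {group_total / total_system_cpu:.1%}")
```

Grouping by current user rather than by job is the point of the sketch: many identically named jobs collapse into a short list of who is actually driving the load.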
With the appropriate real-time monitoring solution in place, administrators will have the ability to answer these questions. Consider visibility into JDBC/ODBC activity, especially those jobs running SQL commands. Robot Monitor provides drill-down access for administrators dealing with QZDASOINIT job issues.
In this case, administrators have real-time visibility of a dedicated QZDASOINIT job's CPU and also immediate access to offending jobs for resolution. To accommodate the particulars of their environment and resources, they can also set threshold levels for early detection, effectively forewarning them of a potentially escalating situation before it ever has an opportunity to take hold.
Administrators set up this type of monitor by first creating a data definition that will be qualified by any or all systems, and by a particular job name. They can then choose to add custom thresholds to each monitor and issue proactive alerts when jobs exceed these thresholds. Within the data definitions, administrators can also select groups to add to these monitoring parameters.
An example of this would be for a group of business data analysts. This level of granular monitoring can be extended to include subsystem, accounting code, user, current user, job, and function. With this monitor in place, you gain proactive visibility into this group and its threshold in the context of total CPU being consumed by QZDASOINIT jobs and total system CPU consumption.
QZDASOINIT and Memory Issues
QZDASOINIT jobs can also be problematic where memory issues are concerned. In a typical example of this type of scenario, a batch job’s memory is flushed by interactive jobs, leaving the batch process to perpetually try to access pages that are no longer in memory. This troublesome process is known as “thrashing.” The key challenge in resolving it lies in identification, as the runaway process is most visible through its symptom: an increase in page faults.
As with our previous CPU example, the necessary investigation to determine which subsystem or jobs are being impacted by non-database page faults could be both lengthy and expensive without a real-time monitor. Administrators could first access the System Status screen to show the number of page faults in each memory pool but would still be left wondering which jobs are responsible for causing problems in these memory pools, and which subsystem(s) are using these memory pools.
In tackling the issue, Robot Monitor employs the same data definition qualifications to create an appropriate monitor. It provides real-time visibility via a dedicated NDB bar showing the overall system faults/second and gives immediate access to the offending job for resolution. This monitor has all the same threshold and alert capabilities that will keep proactive administrators one step ahead of escalating resource issues.
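The investigation described above (which pools are faulting, which jobs run in them, and which subsystems use them) can be sketched as a simple grouping step. The Python below is an illustration only: the record fields, job names, pool numbers, fault rates, and the limit are all hypothetical, and a real investigation would draw these values from system data rather than a hard-coded list.

```python
def pools_over_limit(jobs, faults_per_sec_limit):
    """Return memory pools whose combined fault rate exceeds the limit."""
    pools = {}
    for job in jobs:
        info = pools.setdefault(
            job["pool"], {"faults_per_sec": 0.0, "jobs": [], "subsystems": set()}
        )
        info["faults_per_sec"] += job["faults_per_sec"]
        info["jobs"].append(job["name"])
        info["subsystems"].add(job["subsystem"])
    return {
        pool: info
        for pool, info in pools.items()
        if info["faults_per_sec"] > faults_per_sec_limit
    }


# Hypothetical per-job fault rates.
sample_jobs = [
    {"name": "NIGHTBATCH", "subsystem": "QBATCH", "pool": 2, "faults_per_sec": 140.0},
    {"name": "QZDASOINIT", "subsystem": "QUSRWRK", "pool": 2, "faults_per_sec": 95.0},
    {"name": "QPADEV0001", "subsystem": "QINTER", "pool": 3, "faults_per_sec": 12.0},
]

flagged = pools_over_limit(sample_jobs, faults_per_sec_limit=200.0)
for pool, info in flagged.items():
    print(pool, info["faults_per_sec"], info["jobs"], sorted(info["subsystems"]))
```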
The practice of managing by exception means jobs that are guilty of misbehaving have no opportunity to hide under a generic name shared by thousands like it—and remain under the radar until they are discovered. Instead, at the first sign of trouble, an alert is sent directly to the administrator who has all the information needed to resolve the issue—and help their business avoid a million-dollar mistake.
How Much Could A Rogue Job Cost Me?
In a busy network that supports a large user community, a single, unruly job could easily cause productivity to plummet—among other issues—to the tune of $1,000,000, as this mini job monitoring case study shows.
Any given IBM i machine, partition, or network has the potential to suffer at the hands of a rogue job, regardless of how well that system or network is managed. When jobs do not perform as they should and instead spend their time looping, devouring CPU, or simply becoming inactive, the consequences are immediate and measurable—consider the unproductive user community, financial penalties, and additional resource expenditures, including man-hours wasted in searching the system.
Sudden changes in a job’s status or performance are excellent indications that there could be trouble if these conditions persist undetected and unattended. It’s not uncommon for a rogue job to set in motion a series of increasingly damaging and expensive issues that can leave IT managers racing to resolve multiple system problems, all of which stem from a single, unruly job.
In a busy network that supports a large user community, the cumulative effect of these problems and their impact on user productivity means that job monitoring—or more precisely, a lack of it—can quickly become a million-dollar problem.
A Mini Case Study for Job Monitoring
Let’s say that Company XYZ is a large financial services organization that is struggling with job issues on their centrally managed IBM i network. The network supports 21,000 users nationwide and the company generates $4.2 billion in revenue annually. In a review to outline the extent of this issue, they calculated the associated financial impact over the previous year’s operations. The total cost was far more than they originally thought.
|Job Performance||Consistently high CPU usage as several QZDASOINIT jobs take more CPU than they should||
|Job Performance||A looping job is consuming vast amounts of temporary storage||
|Job Performance||Poor performance of certain jobs due to a high rate of faulting in a specific pool indicating too many jobs or insufficient memory in the pool||
|Job Status||An important accounts-related job becomes inactive but is not detected immediately||
|Total Annual Cost of Job Performance and Status Issues||$1,018,700|
The chart illustrates how an effective job monitoring solution could have prevented this unfortunate company from paying over $1,000,000 as a result of job-related issues on the system. The company and circumstances are hypothetical, but the costs associated are all too real—as any IT manager who has struggled with these same issues will know.
Protecting IT Budgets from Problem Jobs
When faced with the challenge of identifying and resolving problematic jobs, you must be equipped with the resources that allow you to tackle the issue proactively, including:
- Real-time insight into which jobs are causing problems
- The means to reduce investigation time
- A system of automatic response for known or anticipated outcomes
- The ability to identify jobs in a lock-wait scenario
- Knowledge of jobs that have exceeded their maximum wait time in queues and jobs that have a historical track record of causing problems
Armed with these resources, you can catch job issues when they first appear on the system and radically reduce their financial impact. The trick is to isolate the issue at an early stage before it has the chance to create secondary problems like interfering with user productivity—one of the biggest sources of expense.
Similarly, real-time detection that provides detailed information about the questionable job eliminates the investigation time required to search for it across the system or network—a task which also incurs a substantial financial burden.
Finding (and Fixing) Rogue Jobs on IBM i
Resolving offending jobs on the system requires more than simply knowing which job is at fault; you must also know the nature of the problem. If you’re monitoring jobs for a break from typical patterns of behavior, it can be easy to establish the cause and resolve the issue quickly, especially if your monitoring tool allows you to set thresholds based on your preferred levels and automatically notifies you when a threshold is breached, as Robot Monitor does.
Proactive monitoring means jobs that are guilty of misbehaving—like QZDASOINIT database server jobs, for example—have no opportunity to hide under a generic name shared by thousands like it…and remain under the radar until they are discovered. Instead, at the first sign of trouble, you receive an alert right away and can resolve the issue before it impacts users.
Rogue QZDASOINIT jobs typically demonstrate exceptional conditions that affect either job performance or job status. Robot Monitor monitors all of the following parameters, and hundreds more, in real time to help you identify which jobs are problematic and why.
- Response time
- Transactions per hour
- CPU transactions
- CPU usage
- Interactive CPU usage
- Database CPU usage
- Faults per second
- Lock-wait time
- Temporary storage per job
- Thread count per job
- Disk I/O
- Job queue maximum wait time
- Job queue average wait time
- Interactive job count
- Job count
- Job queue active job count
- Job queue count
- Job status
Robot Monitor provides comprehensive, real-time job monitoring functionality, so you can drastically reduce the risk of rogue jobs impacting the broader system environment, as well as the subsequent expenses.
Robot Monitor also improves visibility and gives IT teams control over problem jobs as they occur with real-time alerts, escalations, and commands.
With centralized, proactive control for jobs running throughout the network, you can ensure that jobs are running correctly by keeping a watchful eye on various job components, creating user-definable CPU usage restrictions, and being informed of jobs that experience a change in status or those that breach their anticipated or scheduled run times.
Hinder Jobs to Make Life More Productive for Users
Did you know? Robot Monitor allows you to control the run priority of a job based on its job CPU usage, so that you can ensure your IBM i has the right jobs consuming the right resources at the right time.
Below the calm surface of a hard-working IBM i there is something of a scramble going on. More precisely, it’s a scramble for resources, and an inevitable one. In this job-eat-job scenario, there’s little room for sentimentality.
You see, the system doesn’t care that a payroll job is more important than a large but routine print job. It will happily give that large print job all the CPU it wants, leaving the payroll job to wait for available resources regardless of how long that might take.
Thankfully, Robot Monitor provides a command that can put this to rights: MONCHKJCP.
Running this command not only gives you valuable insights into the jobs using excessive CPU and disk I/O, it also allows you to hinder these jobs when set parameters around their CPU or disk I/O usage are exceeded, clearing the way for other jobs to run.
Control CPU Usage
Parameters can be set up in the Robot Monitor GUI on the user’s PC to govern when this action should occur. This is done by setting a percentage limit for the overall system CPU usage and for the selected jobs’ CPU usage over a defined number of consecutive samples. Action options include Hold, Lower Priority, or No Action. Warning messages that these conditions have occurred can be sent to the QSYSOPR message queue to keep operators informed, and these messages could in turn be intercepted by Robot Console and escalated.
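The "limit exceeded for a number of consecutive samples" rule can be illustrated with a short Python sketch. This is not the MONCHKJCP implementation: the class, its parameter names, and the sample values are hypothetical, and it assumes that both the system limit and the job limit must be exceeded in the same sample, which the description above does not spell out.

```python
from collections import deque


class CpuGovernor:
    """Return an action once limits are exceeded for N consecutive samples."""

    def __init__(self, system_limit, job_limit, samples_required,
                 action="LOWER_PRIORITY"):
        self.system_limit = system_limit
        self.job_limit = job_limit
        self.action = action
        # Holds only the most recent N pass/fail results.
        self.recent = deque(maxlen=samples_required)

    def observe(self, system_cpu, job_cpu):
        """Record one sample and return the configured action or 'NO_ACTION'."""
        # Assumption: both limits must be exceeded in the same sample.
        breached = system_cpu > self.system_limit and job_cpu > self.job_limit
        self.recent.append(breached)
        if len(self.recent) == self.recent.maxlen and all(self.recent):
            return self.action
        return "NO_ACTION"


governor = CpuGovernor(system_limit=80.0, job_limit=25.0, samples_required=3)
# Hypothetical (system CPU %, job CPU %) samples.
samples = [(85, 30), (90, 40), (70, 45), (88, 35), (92, 38), (95, 50)]
decisions = [governor.observe(system, job) for system, job in samples]
print(decisions)
```

Requiring consecutive breaches means a single spike does not trigger the action; in the sample data, the dip in the third sample resets the streak, so only the last sample triggers it.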
In addition to being able to set up the jobs on a PC, Robot Monitor also allows the user to show the details of the monitor when jobs violate the threshold and start to be hindered. In this example (run on defined QZD* jobs running in subsystem QSYSWRK), the central screen shows the top 10 CPU-consuming QZDASOINIT jobs. The run priority has been changed to 70 on the top consumer based on the parameters of the MONCHKJCP command.
To customize the command according to your needs, there are a number of definitions available in the setup options for Robot Monitor. In addition to the percentage of system CPU (and number of samples mentioned earlier) before the command is invoked, specifics can be assigned to these same conditions before a job is released from its held status. Other parameters can be customized around programs, users, subsystems, and jobs to include or exclude.
The power and precision of MONCHKJCP will save a substantial amount of time that would otherwise be required to investigate issues arising from these jobs. And that’s not all. MONCHKJCP will also save you money by helping you keep your users productive and happy.
Your Fast-Resolution Solution for QZDASOINIT Jobs
How are you currently set up to see which users or processes are impacting your IBM i environment? How quickly can your team identify potential runaway queries? What will it cost—in terms of time, lost productivity, and employment security—if a runaway database server job causes your system to grind to a halt due to insufficient resources?
For companies running IBM i, one recurring frustration IT teams must handle is runaway database jobs caused by external requests from business intelligence (BI) tools or users submitting and resubmitting queries. When these jobs are not caught in time, they pose a significant risk for total system availability loss. Even for the newest systems, runaway jobs can tax hardware resources to the point of failure.
QZDASOINIT Doesn’t Have to Be a Bad Word
With so many jobs sharing the same name, there could be hundreds of jobs contributing to escalating CPU, for example. The longer the process of identifying the offending job takes, the more CPU is consumed and the greater the risks and impact on resources.
Monitored groups in Robot Monitor show Total QZDASOINIT CPU and Total System CPU. System administrators benefit from proactive visibility into this group, its threshold, and the comparative context it provides. And if you’re not in front of the Robot Monitor display, the software will send you notification when these workloads misbehave.
One-click drilldown offers a detailed view of the jobs consuming the highest amounts of CPU to pinpoint the offending jobs instantly without running time-consuming manual searches in the QUSRWRK subsystem or running WRKACTJOB.
You can also click over to the SQL Statement tab to view the exact SQL statement causing system or application performance to plummet.
With Robot Monitor, you can:
- Proactively monitor for QZDASOINIT runaway jobs
- Drill down to identify the exact SQL statement
- Reduce investigation time and its burden on resources
- Protect resources from rapid and needless consumption
- Monitor QTEMP, temporary storage, CPU, and disk
- Set thresholds for early warning of escalating issues
So, how long did it take to locate that last runaway QZDASOINIT job on your IBM i? Seeing Robot Monitor in action will help you determine how proactively identifying these potential threats to system availability can save you time and money. Get started with Robot Monitor today.
HelpSystems aligns IT and business goals to help organizations build a competitive edge. Our software and services monitor and automate processes, encrypt and secure data, and provide easy access to the information people need. More than 13,000 organizations around the world rely on HelpSystems to make IT lives easier and keep business running smoothly. Learn more at www.helpsystems.com.