
What Really Matters in IBM i Monitoring?

IBM i
Posted: September 14, 2017

 

If IBM i monitoring is part of your day job (and, let’s face it, your nights-and-weekends job, too), there are some critical metrics that you won’t want to miss.

Maybe you’re trying to tackle them manually, or maybe some of them have fallen off your checklist or never made it on in the first place. Either way, it’s likely that these 15 metrics are taking up a lot of your time.

The monitoring process is very labor intensive if you’re doing it manually. If you aren’t monitoring these metrics, you might not have realized it, but you can probably trace most of your time spent troubleshooting back to either the triage or treatment of one of these IBM i metrics gone rogue. And that’s if you’re lucky.

Read on to see what else could go wrong—and how you can avoid it with ease.

 

WHAT SHOULD I MONITOR? WHY DOES IT MATTER? CAN I AUTOMATE MONITORING?
Cache Battery Life
Why does it matter? If not changed regularly, cache batteries can degrade system performance, which causes overall application slowdown. This severely impacts users’ ability to perform their jobs. In extreme cases, unchanged cache batteries can even lead to downtime.
Can I automate monitoring? Yes. However, if you want to take the time-consuming, manual approach, information on cache battery life can be obtained via System Service Tools (SST). You can also apply a program temporary fix (PTF) to run a program that will indicate battery life within a spooled file.
Mains Failure
Running on uninterruptible power supplies (UPS)
Why does it matter? Electricity is generally very reliable, but power failures and surges can damage any electrical equipment. A UPS only lasts for a limited time, depending on the number of batteries and physical battery life. If the UPS runs dry, you suddenly lose power to the data center, which can result in data loss and an extensive recovery period.
Can I automate monitoring? Yes. By monitoring when a system is running on UPS, you’ll be better equipped to plan a controlled system shutdown as opposed to a sudden failure.
Backups
Are they clean?
Why does it matter? Backups are often the only source for full or partial system recovery. An object not cleanly saved cannot be restored, resulting in data loss. You also won’t have visibility into damaged objects until you make an attempt to save them.
Can I automate monitoring? Yes. Being able to recover your system is mandatory. Real-time notifications that can indicate when a save is incomplete provide you with both time to investigate the underlying reason and an opportunity to reattempt the save.
High Availability (HA) Replication Nodes
Are they in sync?
Why does it matter? Software-based replication comprises multiple nodes, which should be kept synchronized at all times, because disaster can strike at any time. Having visibility into HA nodes, backlogs, and exceptions is necessary to keep the systems in sync.
Can I automate monitoring? Yes. Many HA replication issues can be short-lived. You can leverage automated monitoring tools to ensure that you are only alerted to real issues that are the most likely to impact replication.
Disk
Did organic disk growth suddenly spike?
Why does it matter? Looping jobs, files being extended exponentially, or spooled files being generated at an alarming rate can all contribute to sudden increases in disk space usage. Unchecked disk space usage spikes lead to a heavily degraded system, ultimately resulting in overall system failure.
Can I automate monitoring? Yes. Keeping track of this manually would involve very regular checks across each partition around the clock, which is a full-time job by itself. On the other hand, automated monitoring takes less than 2 minutes per partition in total. The time saving alone is immense.
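
To make the difference between organic growth and a sudden spike concrete, here is a minimal sketch. It assumes you already collect periodic disk usage readings for a partition; the `DiskSample` type, the `find_spike` function, and the thresholds are hypothetical names and values chosen for illustration, not part of IBM i or any monitoring product.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DiskSample:
    minute: int       # minutes since monitoring started (hypothetical clock)
    pct_used: float   # percentage of disk space in use


def find_spike(samples: List[DiskSample], window_minutes: int,
               max_rise: float) -> Optional[DiskSample]:
    """Return the first sample whose usage rose by more than max_rise
    percentage points compared with any earlier sample taken within
    window_minutes, or None if growth stayed within bounds."""
    for i, current in enumerate(samples):
        for earlier in samples[:i]:
            recent = current.minute - earlier.minute <= window_minutes
            if recent and current.pct_used - earlier.pct_used > max_rise:
                return current
    return None


# Hypothetical readings taken every 15 minutes.
readings = [DiskSample(0, 61.0), DiskSample(15, 61.2), DiskSample(30, 61.3),
            DiskSample(45, 66.9), DiskSample(60, 72.4)]
spike = find_spike(readings, window_minutes=30, max_rise=5.0)
```

With these readings, the sample at minute 45 is flagged because usage climbed more than five points in half an hour. Slow, steady growth never trips the check, which is the point: you are alerted to the looping job, not to business as usual.
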
Lock Wait
Why does it matter? Lock wait situations (LCKW) occur when a required system resource is exclusively being used by another user or process. If ignored, they’ll delay both system processes and users requiring access to resources. If the wait becomes excessive, a program failure or message wait (MSGW) situation can result.
Can I automate monitoring? Yes. If you can capture lock waits automatically, you’ll be able to deal with them in a timely fashion, preventing message wait hold-ups. Additionally, you can also establish the resource that holds the lock at the time, which is almost impossible to do after the event.
Message Wait
Why does it matter? Message wait (MSGW) indicates that something on the system requires human intervention in order to continue. This can range from a printer being out of paper to a business-critical file becoming full. Regardless, the longer it takes for said human to realize something is in MSGW, the more likely it will develop into a problem.
Can I automate monitoring? Yes. Many MSGWs have known responses that can be automated to reduce the time-lapse before they receive a reply. The System Reply List built into IBM i offers a way of doing this, but it’s extremely inflexible.

For those MSGWs without known responses, being alerted to them as soon as they appear enables more timely manual responses.
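
The split between "known response" and "needs a human" can be sketched as a simple lookup. This is only an illustration of the idea, not how the System Reply List or any particular tool works; the message IDs, job names, reply values, and the `triage` function are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical table of message IDs with an agreed, safe reply.
KNOWN_REPLIES: Dict[str, str] = {"APP0001": "R", "APP0002": "I"}


def triage(waiting: List[Tuple[str, str]], known: Dict[str, str]
           ) -> Tuple[List[Tuple[str, str, str]], List[Tuple[str, str]]]:
    """Split (job, message ID) pairs into automatic replies and alerts.

    Returns (auto, alerts): auto holds (job, message ID, reply) for
    messages with a known response; alerts holds the rest, which should
    be raised to an operator immediately."""
    auto, alerts = [], []
    for job, msg_id in waiting:
        if msg_id in known:
            auto.append((job, msg_id, known[msg_id]))
        else:
            alerts.append((job, msg_id))
    return auto, alerts


# Hypothetical jobs currently in MSGW.
jobs_in_msgw = [("NIGHTLY01", "APP0001"), ("INVOICE07", "APP9999")]
auto_replies, operator_alerts = triage(jobs_in_msgw, KNOWN_REPLIES)
```

Keeping the reply table as data rather than logic means adding a newly understood message is a one-line change, while anything unrecognized still reaches a person straight away.
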
CPU
How long has usage been too high?
Why does it matter? CPU spikes occur when a system or individual process is doing its job and has arrived at an intensive piece of work. No problem there. However, prolonged periods of high CPU may indicate a looping job, a badly written program, or possibly an interactive job that would be better run in batch at a lower priority.
Can I automate monitoring? Yes. Detecting anomalies in CPU usage allows you to make better use of system resources while also improving both system throughput and response times to end users. It may even prolong the life of the system, delaying expensive upgrades.
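
One way to tell a harmless spike from a prolonged problem is to count how many consecutive readings sit above a threshold. The sketch below assumes CPU utilization is sampled at a fixed interval; the function names, the 90% threshold, and the sample values are illustrative assumptions.

```python
from typing import List


def longest_high_run(samples: List[float], threshold: float) -> int:
    """Return the length of the longest run of consecutive samples
    at or above threshold."""
    longest = current = 0
    for pct in samples:
        current = current + 1 if pct >= threshold else 0
        longest = max(longest, current)
    return longest


def is_sustained(samples: List[float], threshold: float,
                 min_samples: int) -> bool:
    """True when usage stayed at or above threshold for at least
    min_samples consecutive readings."""
    return longest_high_run(samples, threshold) >= min_samples


# Hypothetical one-minute CPU readings (percent busy).
brief_spike = [35, 97, 40, 96, 30]
looping_job = [35, 97, 40, 92, 95, 96, 99, 50]
```

Here `brief_spike` touches the threshold twice but never for more than one reading in a row, so it would not raise an alert with `min_samples=3`, whereas `looping_job` stays high for four readings and would.
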
Journal Receivers
Why does it matter? Journal receivers can become huge and consume a lot of disk space, depending on what is being journaled. They are often a key part of software-based HA solutions and the first port of call when attempting to firefight high levels of disk space usage.
Can I automate monitoring? Yes. Due to application constraints or audit requirements, journal receivers sometimes need to remain on systems for a defined number of days. Automating the ageing of these, and deleting them when they are no longer required, saves many hours a month.
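
The ageing rule itself is simple to express. The sketch below assumes you have a list of receivers with the date each one was detached from its journal; the `Receiver` type, the receiver names, and the seven-day retention period are hypothetical. It looks at age only, so a real policy would likely add further checks, such as whether the receiver has been saved or is still needed by HA replication.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Receiver:
    name: str
    detached_on: Optional[date]  # None while the receiver is still attached


def eligible_for_deletion(receivers: List[Receiver], today: date,
                          retention_days: int) -> List[str]:
    """Return names of detached receivers that have been kept for at
    least retention_days. Attached receivers are never selected."""
    cutoff = today - timedelta(days=retention_days)
    return [r.name for r in receivers
            if r.detached_on is not None and r.detached_on <= cutoff]


# Hypothetical receivers for one journal.
receivers = [
    Receiver("RCV0001", date(2017, 9, 1)),
    Receiver("RCV0002", date(2017, 9, 10)),
    Receiver("RCV0003", None),
]
expired = eligible_for_deletion(receivers, today=date(2017, 9, 14),
                                retention_days=7)
```

With a seven-day retention period, only `RCV0001` qualifies: `RCV0002` is too recent and `RCV0003` is still attached.
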
Integrated File System (IFS)
Why does it matter? When tracking disk space usage, it is not uncommon to overlook or neglect the IFS. However, the IFS allows an unlimited number of hierarchy levels and many non-green-screen applications use it heavily. It should actually be one of the first places to check, not an afterthought.
Can I automate monitoring? Yes. The IFS is historically harder to monitor than the traditional member/object/library relationship and the built-in IBM i monitoring tools are limited. Automated monitoring tools make detecting IFS object values such as size, last used, and last saved a fast and easy task.
System Value Changes
Why does it matter? System values control how the system looks, feels, and responds to end users. Any change to a system value should be carefully considered and certainly not made on a whim, because system values alter system behavior, often immediately.
Can I automate monitoring? Yes. Being alerted to system value changes enables an administrator to react, potentially limiting the damage such a change could have.
System Service Tools (SST)
Who has access?
Why does it matter? SST is not an area that you need to access very often, if ever. The fundamentals of running the system can be changed from here, so access to SST should be limited to a set of trusted staff. If unwelcome access has occurred or unwanted functions have been performed and you don’t know about it, it could be too late.
Can I automate monitoring? Yes. IBM i logs when users access SST to the built-in audit journal, but it offers no automated way of notifying you, leaving you to query the audit journal after the event.
Objects, Libraries, and Applications
How big are they?
Why does it matter? The more objects, libraries, and applications grow, the more disk they consume. When left unchecked, growth can get carried away. Do you know the sizes of key objects on your system?
Can I automate monitoring? Yes. Unless you’re taking regular snapshots, you will only ever know the current sizes and will not be able to compare them with earlier sizes to track growth.
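
Once you have two snapshots, tracking growth is a matter of comparing them. The sketch below assumes each snapshot is a mapping from an object name to its size in bytes; the `growth_report` function, the object names, and the sizes are hypothetical.

```python
from typing import Dict, List, Tuple


def growth_report(previous: Dict[str, int],
                  current: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (object name, bytes grown) for every object present in
    both snapshots, with the largest growth first. Objects that appear
    in only one snapshot are ignored in this sketch."""
    changes = [(name, current[name] - previous[name])
               for name in current if name in previous]
    return sorted(changes, key=lambda change: change[1], reverse=True)


# Hypothetical snapshots taken a week apart (sizes in bytes).
last_week = {"APPLIB/ORDERS": 4_000_000, "APPLIB/AUDITLOG": 1_500_000}
this_week = {"APPLIB/ORDERS": 4_100_000, "APPLIB/AUDITLOG": 9_500_000,
             "APPLIB/NEWFILE": 20_000}
report = growth_report(last_week, this_week)
```

In this example the audit log, not the much larger orders file, is the one to investigate, which is something a single look at current sizes would not reveal.
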
Spooled File Backlogs
Why does it matter? Both applications and system functions are capable of generating spooled files. The more spooled files on the system, the longer it takes to find the ones that matter. Plus, each spooled file contributes to the “jobs in system” number.
Can I automate monitoring? Yes. Want to tackle this one manually? It’s not uncommon for systems to have hundreds of thousands of spooled files on hundreds of output queues. Sorting through all that is exactly the kind of housekeeping that gets put off again and again…until you automate it.
Job Queue Backlogs
Why does it matter? Traditional batch work routes through job queues before becoming active. A buildup of jobs on job queues can result from a number of scenarios, most commonly overrunning jobs or jobs in front of them in MSGW, and can impact application availability.
Can I automate monitoring? Yes. Automation tools that monitor job queues can alert you when there’s a bottleneck so you can attack the problem before it hits your organization on a larger scale. You can also move waiting jobs automatically.
Get Started

Most companies find that automated system and performance monitoring pays for itself within 18 months of implementation. Isn’t it time you took a look at an automated monitoring solution? If you need help justifying the investment, use our interactive return on investment (ROI) calculator to enter values specific to your project and see where automation will pay off.
