
Astronauts Rely on Checklists (and So Should You)

IBM i
Posted: August 30, 2016

 

Astronauts know a thing or two about what it means to be responsible for critical functionality. Their mastery of complex and demanding systems has enabled humankind to make our proverbial giant leaps. It’s not a job for the fainthearted, and it comes with huge risks and a constant need for radical problem solving. Astronauts know that the longer it takes them to respond to a change in critical functions, the more serious the consequences. That’s why they all rely on checklists, and so should you.

Checklists are nothing short of brilliant. Firstly, they act as a form of external memory. Instead of keeping complex procedures in their heads and having to follow them correctly every time, astronauts and IBM i administrators can let the humble checklist carry the burden of detail, so they only have to remember one thing: follow the checklist precisely!

The second major benefit that checklists bring to the table is consistency. If you perform your checks from memory, chances are you will leave out a check sooner or later. (We’re only human, after all!) Even the most experienced system administrators in the world, who have performed these checks hundreds of times, are vulnerable to this. It could boil down to having a bad day or being distracted by an urgent phone call, but the potential to forget an important check is all too real. And, following Murphy’s Law, the thing you forget to check will probably be the very thing that goes wrong that day, even if it’s been working perfectly well for ten years. With a checklist, however, there is no opportunity to forget! The checklist forces you to go through each item on it, and that simple discipline keeps your checking consistent.

Checklists give you peace of mind. They compensate for the limitations of our human memory and allow us to perform consistent checking of systems that, because they are complex, dynamic, and tied to vital functionality, are risky and need to be managed. In short, checklists reduce risk.

With such incredible benefits, you may be wondering what cons could possibly exist. Let’s demonstrate with an example. Your checklist might look something like this. Twice a day you have to:

  • Check the status of 5 subsystems
  • Check the status of 30 jobs, including whether they are active at all
  • Check how much of your auxiliary storage is used
  • Check the status of 5 MIMIX data groups
  • Check if 3 files are not exceeding a size limit
  • Check 3 job queues for jobs piling up in them
  • Check QSYSOPR for inquiry messages
  • Check QSYSMSG for inquiry messages

So far, so good. Seems feasible.

What if we increase the number of things to check? Instead of 5 subsystems, we check 30; instead of 30 jobs, we check 300; and so on. We also check whether QSECOFR is logged on via FTP.

Things are starting to get tricky. Now let’s crank this example up another notch. Let’s say you are not happy with getting this information only twice a day. Instead, you want to go through the checklist every 30 seconds, around the clock. Suddenly, Houston, we have a problem…
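To put rough numbers on that jump, here is a small back-of-the-envelope sketch in Python. It uses only the item counts from the short checklist above, treating each bullet's count as that many individual checks (an assumption made purely for illustration). The longer list with 30 subsystems and 300 jobs would push the totals higher still.

```python
# Item counts taken from the short example checklist (one entry per bullet).
# Treating each item as one individual check is an illustrative assumption.
CHECKLIST = {
    "subsystems": 5,
    "jobs": 30,
    "auxiliary storage": 1,
    "MIMIX data groups": 5,
    "file size limits": 3,
    "job queues": 3,
    "QSYSOPR inquiry messages": 1,
    "QSYSMSG inquiry messages": 1,
}

SECONDS_PER_DAY = 24 * 60 * 60


def checks_per_day(items_per_pass: int, passes_per_day: int) -> int:
    """Total individual checks performed in one day."""
    return items_per_pass * passes_per_day


items_per_pass = sum(CHECKLIST.values())
twice_daily = checks_per_day(items_per_pass, 2)
passes_every_30s = SECONDS_PER_DAY // 30
every_30s = checks_per_day(items_per_pass, passes_every_30s)

print(f"{items_per_pass} checks per pass")
print(f"{twice_daily} checks per day at two passes a day")
print(f"{passes_every_30s} passes per day at one pass every 30 seconds")
print(f"{every_30s} checks per day at one pass every 30 seconds")
```

Even with the short list, moving from two passes a day to one pass every 30 seconds takes you from under a hundred checks a day to well over a hundred thousand, which is not something anyone can do by hand.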

So, our first con is efficiency. Checklists mean manual checking, and manual checking is slow. Low efficiency leads to one of two things. First, you spend too much time: checking many things many times a day means spending hours on checking alone. Second, and more likely once you factor in what’s practical, you limit what you check to whatever fits in the time you have. In other words, you compromise on quality.

Another issue surrounding checklists is communication. Checklists are meant to help you remember things and to provide accountability by allowing an auditor to verify that you ticked all the checkboxes. They are not really meant to communicate the state of the system to anybody else, beyond a very general, top-level “OK” or “Not OK”. Checklists are meant to be brief. They are text-only, carry little context, and have no interactive dimension, so they score low as a communication tool. Finally, checking off checklists isn’t exactly anybody’s favorite activity, so there’s that too. (Boring!)

With these pros and cons to checklists tallied up, what’s the way forward that makes best use of their positive traits and overcomes the negative ones?

Here’s a screenshot of Robot Monitor. It’s what manual checklists aspire to be!

 

[Screenshot: Robot Monitor showing three status rectangles, among them “Applications OK” and “Storage OK”.]

These status items are actually summaries of whole trees of elements below them. These elements can be file size elements, Ethernet bandwidth, job status, or many other things. Each can be set up for a lot of different files, Ethernet lines, jobs, etc. There is no need to display them all on screen until you have to, so for now they are hidden behind those three rectangles.
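The roll-up idea can be sketched in a few lines of Python. This is only an illustration of a summary tree and not how Robot Monitor is implemented. The `Element` class, its methods, and the element names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """A node in a hypothetical status tree: a leaf measurement or a group."""
    name: str
    ok: bool = True  # only consulted for leaves (nodes without children)
    children: List["Element"] = field(default_factory=list)

    def status_ok(self) -> bool:
        """A group is OK only if every element beneath it is OK."""
        if not self.children:
            return self.ok
        return all(child.status_ok() for child in self.children)

    def failing_leaves(self) -> List[str]:
        """Drill down: names of the leaf elements that are not OK."""
        if not self.children:
            return [] if self.ok else [self.name]
        found: List[str] = []
        for child in self.children:
            found.extend(child.failing_leaves())
        return found


def summary(element: Element) -> str:
    """One-line, top-level status such as 'Applications OK'."""
    return f"{element.name} {'OK' if element.status_ok() else 'Not OK'}"


# Hypothetical elements, named only for this example.
applications = Element("Applications", children=[
    Element("Payroll job status"),
    Element("Order entry job status"),
])

print(summary(applications))
applications.children[0].ok = False  # simulate a problem in one leaf
print(summary(applications))
print(applications.failing_leaves())
```

The top-level rectangle stays a single line of status, while the detail underneath is only consulted when something turns “Not OK”.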

All of those things are being measured every 30 seconds on IBM i, in the background. If any of the measurements “underneath” finds a problem (say the main payroll job runs into an error and sits in Inquiry Message Wait, wasting precious processing time), this is what we see:

[Screenshot: the Robot Monitor display flagging the problem.]

Payroll is not being processed. Employees are close to meltdown. There will be bloodshed!

But here’s what happens: You notice there is a problem, and you notice right away. You don’t find out the next morning while going through your checklist. You don’t find out because of a frantic phone call from Accounting saying, “My screen is frozen!” You notice within 30 seconds.
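For readers who like to see the mechanics, here is a minimal Python sketch of what checking on an interval in the background amounts to. It is a generic polling loop written for this article, not Robot Monitor’s code or API. The `run_monitor` function, its parameters, and the simulated payroll readings are all assumptions.

```python
import time
from typing import Callable, Dict, List


def run_monitor(
    checks: Dict[str, Callable[[], bool]],
    interval_seconds: float = 30.0,
    cycles: int = 1,
    sleep: Callable[[float], None] = time.sleep,
    notify: Callable[[str], None] = print,
) -> List[str]:
    """Run every check once per cycle and report each change of state.

    Each check is a zero-argument callable that returns True when healthy.
    Returns the list of notifications that were sent.
    """
    last_state: Dict[str, bool] = {}
    sent: List[str] = []
    for cycle in range(cycles):
        for name, check in checks.items():
            healthy = check()
            # Notify only on a change, so a standing problem is not repeated.
            if last_state.get(name, True) != healthy:
                message = f"{name}: {'recovered' if healthy else 'PROBLEM'}"
                notify(message)
                sent.append(message)
            last_state[name] = healthy
        if cycle < cycles - 1:
            sleep(interval_seconds)
    return sent


# Simulated readings: payroll is fine on the first pass, then gets stuck.
payroll_readings = iter([True, False, False])
alerts = run_monitor(
    {"Payroll job": lambda: next(payroll_readings)},
    cycles=3,
    sleep=lambda seconds: None,  # skip real waiting in this demonstration
)
```

Every check runs on every pass, so nothing is forgotten, and the alert fires on the first pass after the problem appears rather than at the next scheduled walk through a paper checklist.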

 

So now you can drill down to investigate the issue source and context.

This is just one example, but Robot Monitor has over a hundred different kinds of measurements that you can use to keep track of your jobs, disk, and system, and you can set these up for as many of your jobs, communication lines, files, etc. as you like. Similarly, a message management solution like Robot Console provides equally powerful checking across multiple message sources on multiple systems, including QSYSOPR, the Security Audit Journal, and the History Log.

HelpSystems monitoring solutions can take your manual checklist requirements and automate them, giving you the best of all worlds: the benefits of memory and consistency, plus the volume, speed, and visual interpretation, both online and in reports, that you need to overcome the issues of efficiency and communication. It’s your own giant leap forward for IBM i systems management.

Get Started

Learn how Robot Monitor helps you take a proactive approach to system threats by resolving problems before they impact performance and productivity. Request a live demonstration and we’ll show you. 
