“It was a dark and stormy night…”
What a great setting for a scary story. You can tell right away from that tingle down your spine that something wicked this way comes.
But not all horror stories happen at night. In fact, many of the most disconcerting data center incidents happen in broad daylight.
The monitoring experts at HelpSystems have taken more than a few trips around the block in the ECTO-1. They’ve seen the good, the bad, and the gruesome results of poor monitoring, and they’re here to tell you that you can avoid your own monitoring horror story—if you can get out in front of it fast enough!
Paul’s Resource Poltergeist
I was working onsite at a casino that had just received a brand new IBM system. They'd been running with it for a couple of weeks and were experiencing heavy CPU usage on the new box. However, when they viewed the performance statistics, no single job or process stood out as the one consuming the CPU.
I loaded the HelpSystems monitoring solution and immediately saw heavy resource consumption. After a bit of investigation, we noticed that the job number on the job we were looking at was unusually high for a brand new system, something like 758,832, even though the system had only been active for a couple of weeks.
“How could nearly a million jobs have already run in such a short space of time?” I asked myself.
I decided to look into the audit journal to investigate the cause, and there we discovered the issue. A web application had not been installed correctly, resulting in authority failures. The PC application was constantly trying to start a job on IBM i, failing due to the incorrect authority, and then trying again. This process was going around and around and around.
On its own, each job was not using many resources, but collectively they were using huge amounts of CPU. This also explained the very high job number, as new jobs were starting every 10 seconds!
We reported back to the Windows administrator, who corrected the security on the installation, and CPU usage dropped drastically. It was a very simple solution to a very serious resource problem.
Chuck’s Dead-Ending Jobs
I was on the phone with a manufacturing company that had job scheduling under control but nothing in place for monitoring or notification. Meanwhile, their Robot jobs were terminating due to an authority issue.
As a result, their system was down for a full day! Employees were sent home, presumably because the manufacturing planning wasn’t ready for them—they didn’t know what to make or how much.
I showed them the HelpSystems monitoring and notification solutions. Even with notification alone, they would have been aware that jobs were abending.
This was eerily similar to the situation I was in with a large shoe manufacturing and retail company a few years ago. They missed the fact that a job was taking up an increasing amount of temporary storage. It maxed out disk space and the system crashed, causing an outage. Here again, they didn't have a monitoring and external notification tool in place.
Sara’s Spooky Spooled Files
I picked up the phone the other day and the man on the other end of the line sounded very tired. I asked how things were going. He replied that they were good, but he had had a long night. A user had submitted a job that went rogue and created thousands of very large spooled files.
Luckily, I had helped this man set up and configure the HelpSystems monitoring solution two weeks earlier and I remembered how he was especially excited about being able to monitor spooled files.
He had received notification that his threshold had been exceeded and, instead of coming into the office to find the system down, he was able to log in, find the job that was creating the spooled files, stop the job, and remove all the incorrect reports.
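For readers curious what a threshold check like the one in Sara's story might look like in principle, here is a minimal Python sketch. It is not how the HelpSystems monitoring solution works internally. The `SpooledFileStats` record, the function names, the sample job names, and the limits of 500 files and 50,000 pages are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class SpooledFileStats:
    """Hypothetical per-job snapshot; not a HelpSystems or IBM i structure."""
    job_name: str
    file_count: int
    total_pages: int


def jobs_over_threshold(stats, max_files=500, max_pages=50_000):
    """Return the names of jobs whose spooled output exceeds either limit."""
    return [
        s.job_name
        for s in stats
        if s.file_count > max_files or s.total_pages > max_pages
    ]


def build_alert(job_names):
    """Format a notification message, or return None if nothing is over."""
    if not job_names:
        return None
    return "Spooled file threshold exceeded by: " + ", ".join(job_names)


# Made-up sample data: one well-behaved job and one rogue job.
snapshot = [
    SpooledFileStats("NIGHTLY_RPT", 12, 340),
    SpooledFileStats("ROGUE_JOB", 4200, 910_000),
]
alert = build_alert(jobs_over_threshold(snapshot))
print(alert)
```

In practice, a check like this would run on a schedule and hand any message to whatever notification channel is in use, so that the operator hears about the rogue job before the system fills up.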