It’s The Exceptions That Matter
When you’re in charge of managing several servers or partitions, management by exception is what counts. Exceptions are the problems that you need to know about immediately so you can resolve them before your end users are aware of them. Management by exception should be the credo for all computing environments—the road to computer operations nirvana.
In real life, we’re often faced with the “needle in the haystack” problem: With so much information to wade through, how do we know which events are real problems? System administrators around the world share this issue.
Here, in Robotland, we conquered this problem years ago by using a rule-based approach: Define rules to act on events and match them to identifiers. For example, for message management on IBM i servers, we use the seven-character message ID to match a problem, or type of message, to a rule. A message that needs an answer is important, so we can use rules to alert the on-call staff, escalate the issue, or execute a series of commands (scripts).
Monitoring makes it easy to tell when you have a problem. Automation takes monitoring a step further. With automation, the goal is to apply rules to your exceptions so they are handled without involving the on-call staff or the operations center.
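To make the rule-based idea concrete, here is a minimal sketch in Python. It is not how any particular product implements rules; the names (Message, RULES, notify_on_call, run_script, handle) are hypothetical, and the message IDs serve only as illustrative seven-character keys. It assumes that a message with no matching rule should still reach the on-call staff if it is waiting for an answer, and is otherwise treated as “nice to know” and dropped.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Message:
    msg_id: str                # seven-character message ID (illustrative)
    text: str
    needs_reply: bool = False  # True if the message is waiting for an answer


# Stand-in for real side effects (paging, running commands); hypothetical.
actions_log: List[str] = []


def notify_on_call(msg: Message) -> None:
    """Alert the on-call staff about this message."""
    actions_log.append(f"notify:{msg.msg_id}")


def run_script(msg: Message) -> None:
    """Handle the message automatically, with no human involved."""
    actions_log.append(f"script:{msg.msg_id}")


def ignore(msg: Message) -> None:
    """Nice to know, but not an exception: do nothing."""


# The rule table: message ID -> action. Entries are assumptions for the sketch.
RULES: Dict[str, Callable[[Message], None]] = {
    "CPF0907": notify_on_call,  # a condition a person should see right away
    "CPA0701": run_script,      # a condition we trust automation to handle
}


def handle(msg: Message) -> str:
    """Match the message to a rule, run the action, return the action's name."""
    action = RULES.get(msg.msg_id)
    if action is None:
        # No rule defined: only unanswered messages count as exceptions.
        action = notify_on_call if msg.needs_reply else ignore
    action(msg)
    return action.__name__
```

For example, handle(Message("CPA0701", "job is waiting")) runs the scripted response, while a message with an unknown ID and no reply pending produces no notification at all. That last case is the one that keeps the pager quiet.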
Recently, I had lunch with a customer who complained that his cell phone/e-mail device was no longer a good choice for notification. It beeped constantly with new e-mail and notifications, and he honestly wished he had his pager back. As his environment had grown, he hadn’t weaned himself off the habit of being notified for everything, including things that are just nice to know. I told him it was time to lose the “nice to know” and start managing by exception.
That’s good advice for everyone.




