Capacity Planning and the Culture of Proof

Melding system monitoring and capacity planning in Power Systems environments
May 31, 2019


Installing a modern system to run enterprise applications is a pretty tough task, but the job is not done once the iron is in and the databases and applications are up and running. Sophisticated organizations, whether they are large or small, do sophisticated application and system monitoring to keep an eye on how things are running. They have job schedulers that synchronize work to optimize for performance as well as drive up system utilization, and they use all of the information gathered to better plan for the kinds of systems they will need in the future.

There are many aspects of the systems that need to be monitored to keep them humming along, and that is largely the job of system administrators, whose main purpose in life is to ensure that all of the value embodied in the capital assets (in this case, the IBM i platforms and the applications that companies either build or buy) can be extracted out of those systems as they, in turn, drive the business forward. With hundreds or thousands of applications running on systems, sometimes concurrently and other times serially, enterprises have a much more difficult job of keeping track of what is running and how well it is running, and of guessing how it might run at some point in the future based on revenue and transaction prognostications from upper management.

The hyperscalers like Google, Microsoft, Facebook, and Amazon are masters at workload management at scale, and they have built some of the most sophisticated automation tools ever created to watch how their vast fleets of systems are doing as they churn through their work. They also have tools to figure out what future application loads might look like so they buy the right number of machines for specific jobs. But in a way, monitoring and capacity planning at scale is simple, since these companies have relatively few workloads to run on their machines. In most cases, jobs are scheduled across thousands to tens of thousands of servers and they are not really running very many applications as such. The complexity is operating their systems at scale, not making a few applications scale across lots and lots of machines to serve hundreds of millions to billions of users.

To use an analogy, the hyperscalers are juggling five or six bowling balls of different sizes for an arena full of viewers, but the modern business of even modest size is juggling a few baseballs, a slightly larger number of tennis balls, handfuls of ball bearings, and boxes of BBs for a town hall. The mix of the round objects that businesses have to juggle is not static, either; it keeps changing. There are processing waves that go up and down on hourly, daily, weekly, monthly, and annual scales, and they can add up to very high peaks where performance can be adversely impacted without proper job scheduling and capacity planning. The hyperscalers are all rich, too, so they can just buy more iron; this is not the case with most enterprises.

It is not tough to see which type of company actually has the more difficult task. But a lot of enterprises grab themselves by the seat of their pants and tug firmly when it comes to capacity planning on their systems, and this is as true of many IBM i shops as it is for companies using other kinds of systems in their data centers. The good news is that they don’t have to fly by the seat of their pants. There are tools to help them do this correctly, helping to better ensure they have the right systems for their changing workloads.

Two Great Things That Taste Great Together

Monitoring and capacity planning are just two sides of the same coin. The main reason you monitor the workloads on a system and how it responds to changing conditions is to manage the performance of the applications so it meets the expectations of the end users and the needs of the business. Capacity planning is really just taking a long view on performance management, with a way to do what-if analysis on a system as different things change on it and then putting a budget together to meet the long-term performance goals.

No two companies have exactly the same workloads and the same business growth rates, and therefore no two companies have the same capacity planning challenges. Doug Mewmaw, director of education and analysis for the MPG division of HelpSystems and before that technical support manager at Boise Cascade Office Products, gives an example.

“I went to two sites last year. They were doing the exact same upgrade to POWER8 systems,” Mewmaw explains. “And both companies were doubling their workloads for special projects. Now, these were two, exactly the same Power Systems machines, and they didn’t even know it. The first customer got a 35 percent improvement, which was great. For the second customer, I was expecting the same results and it only got a 5 percent improvement. So what does that teach us? We can have all of the fancy hardware, flashy SSDs, and whatnot, and it still all depends on your workload. That is why IBM stopped giving us best-practice guidelines so many years ago because IBM figured out the applications and the loads put on them at any two companies is hard to generalize.”

Capacity planning is one of the most important things that a company will do, and yet there are still many companies that do it on the back of the envelope, making correlations between, say, revenues and workload growth that may not pan out. To do capacity planning right, you have to have the widest possible set of performance data extracted from the system, over the longest period of time possible, and that means constantly gathering data today so it can be used tomorrow. In one recent example of how the back-of-the-envelope approach doesn’t pan out, Mewmaw recalls a customer who was sizing up an IBM i system upgrade that was going to be installed to support 15 percent revenue growth for the next three years. They just did a simple ratio between revenue growth and Commercial Processing Workload (CPW) ratings for Power Systems processors and thought that was that, until Mewmaw pointed out that, in the machine they had installed, CPW consumption actually increased by 42 percent to drive the same amount of growth. At that rate, the new machine the customer was getting set to buy might only have enough performance to drive a year or a year and a half of revenue growth and would then run out of gas.
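The arithmetic behind that warning can be sketched in a few lines of Python. This is an illustration only, and it rests on two assumptions the example does not state: that both percentages are compound annual rates, and that the new machine was sized with exactly enough CPW headroom for three years of 15 percent growth. The function name and its parameters are hypothetical.

```python
import math


def years_of_runway(planned_rate, planned_years, actual_rate):
    """Years until the planned headroom is used up at the actual growth rate.

    Assumes compound annual growth; every name here is illustrative.
    """
    # Headroom the plan bought, as a multiple of today's CPW consumption.
    headroom = (1 + planned_rate) ** planned_years
    # Solve (1 + actual_rate) ** t == headroom for t.
    return math.log(headroom) / math.log(1 + actual_rate)


# Assumed reading of the example: sized for 3 years at 15 percent,
# while CPW consumption actually grows 42 percent per year.
runway = years_of_runway(0.15, 3, 0.42)
print(f"Headroom lasts about {runway:.1f} years")
```

Under those assumptions the headroom lasts a little over a year, which is in line with the year to year and a half that Mewmaw estimated.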

People are often not being sensible about gathering up information to do a proper capacity plan manually, much less to feed into a performance analysis and capacity planning tool such as the Performance Navigator from HelpSystems.

“I think there are probably 5 percent of the shops in the IBM i market that really understand capacity planning,” says Mewmaw. “I was lucky. I was taught by a lot of good folks when I was at Boise Cascade and that’s how I got into it back in the 1980s. We get new hardware all the time, and if you don’t know the hardware and don’t know all the little particulars, it is hard to do capacity planning. There are not many people that know it, and even at business partners, you would be shocked at how many times I have seen a capacity plan based on 30 days of data. Sometimes those 30 days may not even have an end-of-month run in it, so it misses one of the peaks of the workloads.”
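The 30-day pitfall Mewmaw describes is easy to check for before trusting a sample. The sketch below assumes, purely for illustration, that the end-of-month run falls on the last calendar day of the month; a real shop's close may run on a different schedule. The function name is hypothetical.

```python
import calendar
from datetime import date, timedelta


def includes_month_end(start, days):
    """Return True if the window of `days` days beginning at `start`
    contains the last calendar day of some month."""
    for offset in range(days):
        day = start + timedelta(days=offset)
        # monthrange returns (weekday of day 1, number of days in month).
        last_day = calendar.monthrange(day.year, day.month)[1]
        if day.day == last_day:
            return True
    return False


# A 30-day sample starting March 1 ends on March 30,
# one day short of the assumed month-end run.
print(includes_month_end(date(2019, 3, 1), 30))  # False
print(includes_month_end(date(2019, 3, 2), 30))  # True
```

A longer collection window removes the problem altogether, which is the argument for gathering data continuously rather than sampling just before an upgrade.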

With the growing complexity of Power Systems machines, which now have sophisticated I/O, a choice of internal or external disk arrays with various features, many different types of flash drives, a wide variety of disk drives, and a huge span of memory configurations and processor options—all of which can dramatically affect performance by themselves, or not as the case may be—it is hard to do proper capacity planning on a piece of paper or within a spreadsheet. There are just too many factors to weigh.

Just trying to figure out the effect of moving from internal disk drives to external arrays is a good example. It is no secret that Big Blue wants IBM i customers to embrace its Storwize V5000 and V7000 series arrays and sometimes its very high-end DS8800 SANs instead of relying on RAID controllers and disk and flash drives under the skins of the servers. According to the past four years of the IBM i Marketplace Survey performed by HelpSystems, there has been growing adoption of external SANs. Back in 2016, 24.1 percent of respondents to the survey had external arrays for storage, and in the 2019 survey, it had risen to 40 percent. Within a few years, it could cross the 50 percent threshold. But making such a move requires some planning—some capacity planning to be precise—and woe be unto those who do not do it.
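As a rough check on the "within a few years" estimate, the two survey figures quoted above can be extended with a straight line. This is only a sketch: it assumes adoption keeps growing at the same number of percentage points per year, which adoption curves often do not, and the function name is hypothetical.

```python
def year_share_reaches(target, year_a, share_a, year_b, share_b):
    """Linearly extrapolate two survey points to the year the share hits target."""
    # Percentage points gained per year between the two surveys.
    slope = (share_b - share_a) / (year_b - year_a)
    return year_b + (target - share_b) / slope


# Survey figures quoted above: 24.1 percent in 2016, 40 percent in 2019.
estimate = year_share_reaches(50.0, 2016, 24.1, 2019, 40.0)
print(f"Straight-line estimate for crossing 50 percent: {estimate:.1f}")
```

At about 5.3 points per year, the straight line crosses 50 percent roughly two years after the 2019 survey.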

The Culture Of Proof

System administrators and maybe even chief executive officers that want to keep their jobs might think hard about marrying system monitoring and capacity planning, what Mewmaw calls in his training classes the culture of proof.

“I have been through Doug’s class, and he taught us the culture of proof,” explains Chuck Losinski, director of technical solutions for the Robot scheduling and performance monitoring product line at HelpSystems. “It means implementing ongoing data collection so that one, two, or three years from now you have got solid evidence as to where you need to go with your upgrade if you do need one. In the larger environments, if you are going to spend half a million dollars to millions of dollars, that culture of proof is even more critical because you are putting your career on the line. What if it goes bad? You spent all that money on an upgrade, you load your application up there, you restore your data, you start operating in that new environment, maybe even running side by side in parallel or whatever, but you cut over and it goes south on you.”

This is a very bad thing, particularly if a system is in lockdown because of the impending holiday season or the end-of-year cycle. Under normal circumstances, at critical peak times of the year, no one is allowed to touch anything on the system simply because it is too risky to make changes during peak workload. But now you are stuck. You have to do an emergency change order, and if you are lucky, you have some latent CPU or memory capacity in the system that you can have IBM activate on demand to get you by. But maybe CPU or memory is not the issue at all. And because you were not monitoring the system for years on end (or you haven’t kept all of the information so a tool like Performance Navigator can chew on it to help you figure out what might be wrong), you are guessing and you can’t prove to people that the underlying infrastructure is configured correctly and is not the problem.

“The cool thing with a tool like Performance Navigator is that you have got all the factors built into it describing all the hardware components and you can make that what-if statement: What if the number of transactions goes up 25 percent a year for the next five years? That’s the growth that you’re going to plan on. So at least you’re being analytical about it. You are not throwing a dart and guessing,” says Mewmaw.
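The what-if in that quote can be worked through by hand to show what it implies. The sketch below is not how Performance Navigator models growth; it is a minimal compound-growth projection that assumes load scales directly with transaction volume. The 40 percent starting utilization and the 80 percent planning threshold are hypothetical values chosen for illustration, as are the function names.

```python
def project_load(current, annual_growth, years):
    """Projected load for year 0 through `years` under compound growth."""
    return [current * (1 + annual_growth) ** y for y in range(years + 1)]


def first_year_over(loads, threshold):
    """First year index whose projected load exceeds threshold, else None."""
    for year, load in enumerate(loads):
        if load > threshold:
            return year
    return None


# Hypothetical: 40 percent peak utilization today, 80 percent planning limit.
# The 25 percent rate and five-year horizon come from the quote.
loads = project_load(40.0, 0.25, 5)
for year, load in enumerate(loads):
    print(f"year {year}: {load:.1f} percent")
print("first year over the limit:", first_year_over(loads, 80.0))
```

Five years of 25 percent growth compounds to roughly three times today's load, so a machine that looks comfortable now can cross a planning limit well before the end of the horizon.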

There are a lot of darts being thrown around IBM i shops when it comes to capacity planning, apparently.

“Speaking as a former IBM i customer, when I did not have a tool like Performance Navigator, it was throwing a dart at a wall, and when I started using the tool and created that culture of proof, I went to the CIO and said we can actually do studies of the applications to know when programmers make changes and look for the impact of those changes,” explains Mewmaw. “In other words, we can be proactive; we are not waiting for things to break. We can see cause and effect and we will know what to expect when we go live. My CIO saw that it changed our culture forever. We started to measure everything, like Google, and there was no more guessing, no more having a feeling about what might be causing such and such. We knew when we were implementing a particular project that I/O went up 210 percent and then we knew we needed to go fix an I/O issue, upfront before the project even rolled out.”

Capacity planning and performance issues also crop up outside of the upgrade cycle. Take JDBC access to the Db2 for i database, for instance.

“I think our number one request, from a real-time performance monitoring standpoint, is monitoring the SQL requests via JDBC,” says Losinski. “You know customers have got websites that are being programmed by younger programmers, and I am not bashing them, but many of them do not write good, orderly, efficient SQL requests. And when orders ramp up, this messy SQL comes into the system and hammers the Db2 database. With Performance Navigator, with a strong set of historical data, we can, for instance, eliminate disk I/O and eliminate a processor constraint. And so, we at least eliminate the hardware side of capacity planning so that we can focus on something more like SQL analysis, adding indexes to the database, and so forth to get out of the performance rut.”

This article originally ran in IT Jungle's The Four Hundred enewsletter as sponsored content on April 24, 2019.

Get Started

The historical performance data on your servers is a treasure trove of information regarding actual usage over time. But you must access and interpret this data to get to the bottom of performance issues or inform future hardware investments. Performance Navigator can help! Request a live demonstration and we’ll show you how.
