A dialogue concerning two chief IT executives (possibly on Wall Street):
Simple CIO: Capacity planning? Pfft! You cannot sell prevention.
Sal Viati: Then, explain the multi-billion dollar diet-pill business?
A Bridge in Troubled Waters
Whether you’re an IT manager, an application developer, or a software engineer, you’re probably not involved in the bridge-building business, but you assume that the engineers who are would never compromise your safety by cutting corners on the architectural plan. That may be true for the engineers, but what about their management?
“I can understand people being worked up about safety and quality with the welds, ... but we’re concerned about being on schedule because we are racing against the next earthquake.”
This quote comes from a Caltrans executive manager who is responsible for the new $7 billion Bay Bridge under construction between Oakland and San Francisco. An upper-deck section of the existing Bay Bridge collapsed during the 1989 Loma Prieta earthquake. He is referring to his disagreement with and firing of the independent inspectors who found cracked welds in some sections of the new bridge. But why would he present such a crazy defense of his decision in public? He doesn’t think it is crazy, because the project is behind schedule.
Effectively, he is saying: let’s increase the risk that the new bridge will fall, in order to avoid the significantly lower risk that the current bridge might fall if and only if another Loma-Prieta-sized earthquake should occur before the new bridge is completed. In other words, risky scheduling is better than safe planning. Or, as the title above says: it’s ok if the new bridge falls, as long as it falls on time.
In the opening dialog, Simple CIO points out to Sal that it’s impossible to sell prevention. So, let’s look at the attitude of management after the $7 billion bridge falls, or since that thankfully hasn’t happened, when a similarly priced, $10 billion atom-smasher fails.
The latest director of Europe’s new atom-smasher—the Large Hadron Collider (LHC)—says he will be more cautious than his predecessor, following the very public and expensive failure last September when a section of superconducting magnets collapsed only days after the LHC was fired up.
“The LHC will be double checked by outside experts before any attempt is made to switch the machine back on, probably in July. I want to be sure that everything works, so I’ll also let an external group make additional checks on the accelerator.”
What a difference a good failure makes: more caution, double checking and independent inspectors are suddenly de rigueur. Unlike the pre-failure Caltrans director, the post-failure LHC director has clearly been chastened. But why does it so often take a catastrophe to force management to become wiser? And is it really their fault?
It’s the Planning, Stupid!
Perhaps such catastrophes only happen in the world of physical engineering and not in the virtual world of IT systems? Apparently not. Web 2.0 is the current rage in information technology. In 2008 we had a number of high-profile failures there also, each ostensibly due to a lack of capacity planning:
- Twitter.com (a major conduit for many online businesses)
- Amazon Elastic Cloud services
- Cuil.com (the putative Google killer. Remember them?)
- Apple iStore (downloads crashed when 1 million iPhones were sold in a single weekend)
- Google Gmail
So far, this year, we’ve had major outages at Twitter and Google, all of which has led IT pundit Larry Magid to think of Google as a single point of failure for Web 2.0, and Silicon Valley luminary Rob Enderle to think private clouds will dominate public clouds because of these capacity management and reliability issues.
My personal favorite, however, was the launch of the long-awaited semantic search-engine called WolframAlpha. After months of building up expectations, on launch day (May 15, 2009) Stephen Wolfram sheepishly confessed to the LA Times:
“We ran into a small snag (last night). One of our tests was to use one cluster to simulate traffic and run it against the other cluster. We found that the throughput degraded horribly.”
For me, this raises the question: Why were you only doing that level of load testing the night before launch day? Psst! It’s called capacity planning.
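To make the point concrete, here is a minimal sketch of what an earlier load ramp could have shown. It assumes, purely for illustration, that throughput follows the Universal Scalability Law; the coefficients `lam`, `sigma`, and `kappa` are made-up values, not measurements from WolframAlpha or any other site.

```python
def usl_throughput(n, lam=100.0, sigma=0.05, kappa=0.002):
    """Modeled throughput (requests/s) at n concurrent clients.

    Universal Scalability Law: lam is the single-client rate,
    sigma models contention, kappa models coherency delay.
    All three defaults are hypothetical, chosen only for illustration.
    """
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))


def ramp(loads):
    """Return (load, modeled throughput) pairs for a stepped load ramp."""
    return [(n, usl_throughput(n)) for n in loads]


def peak_load(loads):
    """Return the tested load level with the highest modeled throughput."""
    return max(loads, key=usl_throughput)


if __name__ == "__main__":
    loads = [1, 2, 4, 8, 16, 32, 64, 128]
    for n, x in ramp(loads):
        print(f"{n:4d} clients -> {x:8.1f} req/s")
    print("throughput peaks near", peak_load(loads), "clients")
```

With these assumed coefficients the modeled throughput rises, peaks, and then falls as load keeps growing. That is the kind of degradation a stepped load test is meant to expose well before launch day.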
Why is capacity planning still such an oxymoron in the IT industry? No doubt, the management at all these web sites were prepared to spend serious money on any number of servers to support their applications. So, the capacity part of capacity planning seems to be understood. It’s the planning part that remains unrecognized. Or, to paraphrase Bill Clinton’s immortal line from the 1992 presidential election: it’s the PLANNING, stupid!
Brisk Management Vs. Risk Management
Why has the planning part of capacity planning not been grokked in the IT world? The simple answer is the one Simple CIO gave in the opening dialog: You can’t sell prevention. But that’s too simple-minded. Obviously, you can sell prevention. As Sal Viati rejoined, just look at the fitness business or dietary-supplements business. They’re huge! This suggests that capacity planning would be an easier sell if there were a perceived personal benefit or reward. And there’s the rub.
It turns out that management is caught in a kind of catch-22 situation. Managers, almost by definition, don’t believe the risk of failure is high for their project. That, by the way, is the same strategy Wall Street adopted with credit default swaps and we know where that led.
This “won’t happen on my watch” attitude is pervasive because it contains a statement about risk: perceived risk. But there’s a big difference between perceived risk and managed risk. Managers are employed to look after projects or product schedules. Capacity planning is generally viewed as something additional that stretches schedules, thereby making projects take longer. Taking longer likely means missing the market window. As with the Caltrans director, if the schedule is allowed to stretch, a manager will be viewed as having let his project get away from him, and that means he will have failed as a manager. Most managers would therefore prefer to have their project be seen as a failure rather than have themselves be seen as a failure.
That’s the insane catch-22 logic we are dealing with.
Guerrilla mantra 1.7: Management will let a project fail, as long as it fails on time!
From another standpoint, this brisk approach to risk management appears justified because, as we all know, time is money. If the project or product is delayed, revenue will be lost and, what’s more, probably lost to a competitor! Although not a completely false statement, it is false economics. Even if your product reaches the marketplace on time, according to schedule, if it fails due to a lack of capacity planning, that failure will be exposed and it will be exposed in the public marketplace, not the test lab. That public failure, in turn, can lead to product aversion on the part of both existing and potential customers and ultimately those lost sales show up as lost revenue—the very thing adhering to the schedule was supposed to avoid! The management catch-22.
Back to the Future
Is there anything that can be done to remedy brisk management or bad risk management? It seems to me that corporate executives could start by being less deferential to the short-term demands of Wall Street. The deleterious impact of Wall Street on capacity planning arises from the required quarterly reporting period. That alone has tended to produce three-month planning horizons, a real oxymoron! Wall Street itself has proven that this strategy simply does not work. Just like a bridge completed under high-risk scheduling, that strategy has collapsed.
Moreover, in the ensuing economic recession, perhaps more than ever before, companies everywhere are going to have to become more globally competitive by providing goods and services that are more robust over the long haul. We’re already seeing the impact of the Wall Street collapse on the U.S. automobile industry. Their executive management became infatuated with short-term capital gains at the expense of preparing for the inevitable, long-term demand for fuel-efficient vehicles. There’s really no excuse based on U.S. customers not wanting to purchase fuel-efficient vehicles during the past decade. Toyota also knew that, but they still invested in their Prius, and now they’re ahead of the game. That’s called foresight. Something Wall Street punishes.
It’s clear that we need a new kind of corporate leadership. I’m reminded of the annoyingly redundant corporate phrase, “Going forward...” Who ever says: “Going backward, we will...”? But maybe they should. Going forward is only invoked after some corporate catastrophe becomes publicly known. The point of such a phrase, of course, is to avoid dealing with the failed consequences of bad risk management. The implicit verbal directive is to keep one’s eyes averted from the corpse of the catastrophe, a tactic long used by Wall Street. But it’s now clear to everyone that Wall Street risk mismanagement has not only failed, but failed globally. It doesn’t get any bigger than that. So, maybe it really is time to go backward; back to the sanity of doing things the right way instead of the expedient way.
The good news in the U.S.A. is that America has a fine tradition of doing things the right way, but we have to go back in time. Consider the Johnson & Johnson Credo of 1943:
Notice that stockholders are last on this list; certainly not the order Wall Street likes to see today (although, possibly it was acceptable back then). It’s corporate cultures like Johnson & Johnson that made American companies great in the past. TeamQuest itself is another American company that does it right because they are privately held and therefore not subject to the whims of Wall Street. Capacity planning, rightly viewed, helps to provide customers with a robust product, not simply an expedient product.
Just like the false economy of credit default swaps, the omission of capacity planning in IT has proven to be a false methodology. For example, the commonly held idea that it’s cheaper to over-engineer the hardware architecture to ensure adequate capacity is patently false. Here’s the simple counter-example. If performance testing is skipped in order to meet the release schedule (and who knows if that’s really valid?), and the deployed application ends up running single-threaded with lousy performance, a boat-load of the cheapest servers from China won’t improve that.
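One way to put numbers on that counter-example is Amdahl's law, which this article does not name but which captures the same idea. The sketch below assumes the workload is a single job whose serial portion cannot be spread across machines; the function name and the sample fractions are illustrative only.

```python
def amdahl_speedup(parallel_fraction, servers):
    """Best-case speedup from Amdahl's law.

    parallel_fraction: share of the work that can run concurrently (0.0 to 1.0).
    servers: number of servers (or cores) applied to the job.
    """
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / servers)


if __name__ == "__main__":
    # A fully single-threaded job: extra servers buy nothing.
    for servers in (1, 10, 100, 1000):
        print(servers, "servers ->", amdahl_speedup(0.0, servers), "x")

    # Even a 90% parallel job stays below 10x, however many servers are added.
    print("90% parallel, 100 servers ->", round(amdahl_speedup(0.9, 100), 2), "x")
```

Under these assumptions the single-threaded case stays at a speedup of 1.0 however many servers are bought, which is why only performance testing before release, not extra hardware after it, can catch the problem.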
This, and other simple-minded falsehoods, should be challenged and avoided as part of a revived corporate IT culture. Referring back to the opening conversation between our two whimsical CIOs, it is possible to sell prevention if it is considered important enough. Most people implicitly believe their health is important. They don’t need to be sold on that. The financial health of a company, on the other hand, is determined by the people running it, i.e., executive management, and capacity planning should be an explicit part of corporate IT fitness from the top down.
The bottom line is not really new. The sagacity of looking beyond the end of your nose is a truism, but incredibly that truth has been lost in the irrational exuberance of false Wall Street economics. A robust economy and IT customer satisfaction both come from foresight, not just eyesight. In fact, it’s the second word in capacity planning.