Comparing Different Methods for Calculating Health and Risk

How do you calculate IT health and risk?

There are different methods you can use, depending on your needs.

The most common methods for determining IT infrastructure health are:

Threshold comparison
Enhanced threshold comparison
Event detection
Variation from normal
Allocation comparison
Queuing theory for health

On the other hand, the most common methods for calculating IT infrastructure risk are:

Linear trending
Enhanced trending
Event projection
Allocation projection
Queuing theory for risk

There are pros and cons to each of these methods, and it’s hard to know which one to rely on. So consider this your guide to making health and risk calculations easy.

How to Calculate IT Infrastructure Health

1. Threshold Comparison

Threshold comparison is all about using measurement statistics to set thresholds.

These measurement statistics typically include things like:

CPU utilization
Memory utilization
Queue length

Once you determine your measurement statistics, your next step is setting up thresholds for these measurements. Your IT infrastructure health is determined by how many and which of these thresholds are exceeded.

For instance, a threshold for CPU utilization is often set at 70 percent. If utilization rises above that threshold, your IT infrastructure health is considered compromised.
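
Here’s a minimal sketch of that idea in Python. The 70 percent CPU limit comes from the example above. The memory and queue length limits, the metric names, and the `exceeded_thresholds` helper are assumptions made up for illustration.

```python
# Hypothetical thresholds: a metric counts as exceeded when its value is above the limit.
THRESHOLDS = {
    "cpu_utilization_pct": 70.0,     # the 70 percent example from the text
    "memory_utilization_pct": 85.0,  # assumed value
    "queue_length": 5.0,             # assumed value
}


def exceeded_thresholds(sample, thresholds=THRESHOLDS):
    """Return the sorted names of metrics whose measured value is above its threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if sample.get(name, 0.0) > limit
    )


sample = {"cpu_utilization_pct": 82.0, "memory_utilization_pct": 60.0, "queue_length": 7.0}
print(exceeded_thresholds(sample))  # ['cpu_utilization_pct', 'queue_length']
```

Health is then a judgment about how many thresholds are in that list and which ones they are.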

On the pro side, threshold comparisons are:

Relatively easy to set up
Customizable

The cons of threshold comparison are:

Assumptions that a threshold is a boundary between radically different states
Requirements of significant knowledge of the environment to set the right threshold
Conditions that frequently change within the environment will require thresholds to change
Inaccurate health representations due to weak or inaccurate thresholds

If you choose to monitor IT infrastructure health with thresholds, be wary. If you don’t set and monitor your thresholds, you could waste time on false positives of poor health—or not recognize truly unhealthy situations.

2. Enhanced Threshold Comparison

Enhanced threshold comparison is the next step up from threshold comparison.

These thresholds are usually more complicated. Multiple measurement statistics and formulas are used within a given threshold. And threshold severity is often introduced.

You’ll determine IT infrastructure health the same way you do for threshold comparison: by looking at the number and type of thresholds that are exceeded. These thresholds just happen to be more complicated.
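
As a rough sketch, an enhanced threshold might combine two statistics in one rule and return a severity instead of a simple yes or no. The rule, the cutoff values, and the `cpu_pressure_severity` name below are all assumptions for illustration, not a recommended rule set.

```python
def cpu_pressure_severity(cpu_pct, queue_length, cpu_count):
    """Classify CPU pressure using utilization and run-queue length per CPU together."""
    queue_per_cpu = queue_length / cpu_count
    # Assumed cutoffs: both conditions must hold, so a busy CPU with no queue is not flagged.
    if cpu_pct > 90 and queue_per_cpu > 2:
        return "critical"
    if cpu_pct > 70 and queue_per_cpu > 1:
        return "warning"
    return "ok"


print(cpu_pressure_severity(95, 10, 4))  # critical
print(cpu_pressure_severity(80, 6, 4))   # warning
print(cpu_pressure_severity(95, 2, 4))   # ok
```

The last call shows why the extra statistic helps: 95 percent CPU alone would trip a simple threshold, but with almost nothing waiting in the queue this rule doesn’t flag it.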

In the pro column, enhanced threshold comparisons are:

Moderately easy to set up
Customizable

The cons of enhanced threshold comparisons are:

Assumptions that a threshold is a boundary between radically different states
Requirements for even greater environmental knowledge
Conditions that frequently change within the environment will require thresholds to change
Inaccurate health representations (though fewer than standard threshold comparison)

Using enhanced thresholds can minimize some of the inaccuracy of standard thresholds. But you’re still taking a risk with your IT infrastructure health.

3. Event Detection

Event detection uses alarms, alerts, and other techniques to recognize that something noteworthy has occurred (i.e., an event). Thresholds, logs, and other types of data are typically used to recognize events.

You’ll determine IT infrastructure health based on which events have occurred—and how many.
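
One simple way to picture this is scanning log lines for known patterns and counting what turns up. The patterns, event names, and sample log lines below are hypothetical. Real ones depend on your own log formats and alerting tools.

```python
import re
from collections import Counter

# Hypothetical event patterns, one regular expression per event type.
EVENT_PATTERNS = {
    "disk_full": re.compile(r"No space left on device"),
    "oom_kill": re.compile(r"Out of memory"),
    "service_restart": re.compile(r"service .* restarted"),
}


def detect_events(log_lines):
    """Count how many times each known event pattern appears in the log lines."""
    counts = Counter()
    for line in log_lines:
        for name, pattern in EVENT_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


log_lines = [
    "10:01 kernel: Out of memory: killed process 4242",
    "10:02 app: service web restarted",
    "10:05 kernel: Out of memory: killed process 4250",
    "10:06 app: request served in 12 ms",
]
events = detect_events(log_lines)
print(events["oom_kill"], events["service_restart"])  # 2 1
```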

On the pro side, event detection is:

Relatively easy to set up
Customizable

The cons of event detection are:

Inaccurate health representations resulting from inaccurate or missed event detection
Requirements that events occur in order to detect health issues
Requirements for an understanding of the environment in order to set up alerts and alarms
Conditions changing in the environment will require you to change which events are detected
Significant resources might be required if big data is involved

Event detection is a great reactive measure for IT infrastructure health. But being proactive will require a better method.

4. Variation from Normal

Variation from normal is what it sounds like.

You define normal situations for IT infrastructure health. This is typically based on historical data and identification of “normal” behavior and events. So situations that vary from the set “normal” are considered unhealthy.

It’s a simple concept, but you will need some complex algorithms and the right resources to make these determinations.
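
To make the concept concrete, here’s a deliberately simple stand-in: treat the historical mean as “normal” and flag anything more than a few standard deviations away. Real tools use far more sophisticated algorithms. The three-sigma cutoff and the `is_abnormal` helper are assumptions for illustration.

```python
from statistics import mean, stdev


def is_abnormal(history, current, max_sigma=3.0):
    """Flag a value that sits more than max_sigma standard deviations from the historical mean."""
    baseline = mean(history)
    spread = stdev(history)  # sample standard deviation; needs at least two data points
    if spread == 0:
        # History never varied, so any different value is a variation from normal.
        return current != baseline
    return abs(current - baseline) / spread > max_sigma


cpu_history = [40, 42, 38, 41, 39, 43, 40, 41]  # made-up utilization samples
print(is_abnormal(cpu_history, 75))  # True
print(is_abnormal(cpu_history, 42))  # False
```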

On the pro side, variation from normal:

Works best for situations that do not experience significant variations in workload and configuration changes
Can adjust what is considered “normal” over longer periods of time

But the cons of the variation from normal method are that it:

Takes at least 90 days of data to get a very basic understanding of what is “normal.” This means the ROI takes longer than for most of the other methodologies
Results in inaccurate health representations in the beginning (or if the environment changes)
Requires significant resources for analysis

Variation from normal can take longer to hit ROI than other methods. But it can also be more precise.

5. Allocation Comparison

Allocation comparison determines health by looking at the available capacity versus the capacity that has been allocated. If your allocated capacity of resources gets close enough to the available capacity of resources, then your infrastructure’s unhealthy.
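
In code, the comparison is a ratio check. How close is “close enough” is up to you. The 90 percent cutoff and the `allocation_health` helper below are assumptions for illustration.

```python
def allocation_health(allocated, available, unhealthy_ratio=0.9):
    """Compare allocated capacity to available capacity and return (status, ratio)."""
    if available <= 0:
        raise ValueError("available capacity must be positive")
    ratio = allocated / available
    status = "unhealthy" if ratio >= unhealthy_ratio else "healthy"
    return status, ratio


# Example: disk space in GB (made-up numbers).
print(allocation_health(460, 500)[0])  # unhealthy
print(allocation_health(300, 500)[0])  # healthy
```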

On the pro side, allocation comparison:

Applies to virtualized environments and disk space resources
Adapts well when capacity is added or removed
Allows for customization

But the cons of allocation comparison are:

Inaccurate health representations because you’re counting what’s allocated—not what’s used
Difficulty determining available capacity
Knowledge requirements for your capacity needs

It might be a good idea to use allocation comparisons if you’re an expert in your capacity already. But this might not be the right method for you if you lack that expertise.

6. Queuing Theory for Health

Queuing theory involves analysis of system utilization, throughput, queue length, and response time. These are typically based on the amount of work running on systems and system configuration.

You can use queuing theory to determine IT infrastructure health by measuring these components against your optimal scenario.
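
As an illustration, the sketch below uses the textbook single-server M/M/1 model, where utilization is arrival rate times service time and response time is service time divided by one minus utilization. The choice of M/M/1 is an assumption made here. Real systems often need multi-server or more detailed models. The function names are hypothetical too.

```python
def mm1_metrics(arrival_rate, service_time):
    """Return (utilization, response_time, mean_number_in_system) for an M/M/1 queue."""
    utilization = arrival_rate * service_time
    if utilization >= 1.0:
        raise ValueError("system is saturated: utilization must be below 1")
    response_time = service_time / (1.0 - utilization)
    number_in_system = utilization / (1.0 - utilization)
    return utilization, response_time, number_in_system


def is_healthy(arrival_rate, service_time, target_response_time):
    """Healthy means the modeled response time stays within the target (the optimal scenario)."""
    if arrival_rate * service_time >= 1.0:
        return False  # a saturated system cannot meet any response time target
    _, response_time, _ = mm1_metrics(arrival_rate, service_time)
    return response_time <= target_response_time


# 8 requests per second at 0.1 s each: utilization is 0.8, response time is about 0.5 s.
print(is_healthy(8, 0.1, target_response_time=0.3))  # False
print(is_healthy(5, 0.1, target_response_time=0.3))  # True
```

Notice that response time climbs steeply as utilization approaches 100 percent. That non-linear behavior is what simple thresholds miss.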

On the pro side, queuing theory for health:

Applies to CPU and IO activity in both standalone and virtualized environments
Adapts well to workload and capacity changes
Allows for customization
Provides very accurate health results for system, CPU, and IO resources

But there are cons to queuing theory for health. It:

Isn’t applicable to memory or disk space health
Requires more resources for analytics than threshold methodologies (but less than variation from normal and event detection)

Queuing theory for health might be a smart choice if you have the right resources to do analysis. (And if you’re more focused on CPU, IO, and system health.)

How to Calculate IT Infrastructure Risk

1. Linear Trending

Linear trending involves using historical data to create a trend line.

In terms of IT infrastructure risk, this means using that line to project future values of your historical data. This creates a simulation for when your established thresholds will be surpassed.
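
Here’s a small sketch of that projection: fit a least-squares line to evenly spaced historical samples, then solve for where the line crosses the threshold. The sample data and function names are made up for illustration.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_threshold(values, threshold):
    """Periods after the last sample until the trend line reaches the threshold, or None."""
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None  # a flat or falling trend never reaches the threshold
    crossing_x = (threshold - intercept) / slope
    return max(0.0, crossing_x - (len(values) - 1))


weekly_cpu_pct = [40, 42, 44, 46, 48]  # made-up weekly averages
print(periods_until_threshold(weekly_cpu_pct, 70))  # 11.0
```

With CPU growing two points a week, the line reaches the 70 percent threshold 11 weeks after the last sample.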

On the pro side, linear trending:

Is relatively easy to set up, once appropriate thresholds are established
Provides fairly accurate projections for CPU and IO activity—at least for resources with moderate-to-low utilization and consistent growth
Allows for limited customization

But there are cons to linear trending, like:

Behavior projections are linear in nature, so they’re not useful for busier resources
Overprovisioning since you have to set lower thresholds to avoid inaccurate linear trends
Inaccurate risk representations with higher utilization
Inability to take configuration or workload changes into account

So using linear trending to predict IT infrastructure risk might be enough if your utilization is moderate-to-low.

2. Enhanced Trending

Enhanced trending uses basic algebraic quadratic functions. These usually take multiple statistics into account in one equation.

You can use enhanced trending to predict IT infrastructure risk. Here’s how: use a quadratic function to project future values from historical data and determine when your thresholds will be exceeded.
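
The sketch below shows one possible version: a least-squares quadratic fitted to a single statistic over time, solved for the point where it crosses a threshold. This is a simplification, since real enhanced trending may combine several statistics in one equation. The data and function names are assumptions, and the code assumes the latest value is still below the threshold.

```python
import math


def det3(q):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (q[0][0] * (q[1][1] * q[2][2] - q[1][2] * q[2][1])
            - q[0][1] * (q[1][0] * q[2][2] - q[1][2] * q[2][0])
            + q[0][2] * (q[1][0] * q[2][1] - q[1][1] * q[2][0]))


def fit_quadratic(values):
    """Least-squares fit of y = a*x^2 + b*x + c to (0, v0), (1, v1), ...; returns (a, b, c)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three data points")
    s = [sum(x ** k for x in range(n)) for k in range(5)]                   # sums of x^k
    t = [sum((x ** k) * y for x, y in enumerate(values)) for k in range(3)]  # sums of x^k * y
    # Normal equations m @ [c, b, a] = t, solved with Cramer's rule.
    m = [[s[0], s[1], s[2]], [s[1], s[2], s[3]], [s[2], s[3], s[4]]]
    d = det3(m)
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = t[r]
        solution.append(det3(replaced) / d)
    c, b, a = solution
    return a, b, c


def periods_until_threshold_quadratic(values, threshold):
    """Periods after the last sample until the fitted curve reaches the threshold, or None."""
    a, b, c = fit_quadratic(values)
    last_x = len(values) - 1
    if abs(a) < 1e-12:  # the curve is effectively a straight line
        roots = [] if abs(b) < 1e-12 else [(threshold - c) / b]
    else:
        disc = b * b - 4 * a * (c - threshold)
        if disc < 0:
            return None  # the curve never reaches the threshold
        root = math.sqrt(disc)
        roots = [(-b - root) / (2 * a), (-b + root) / (2 * a)]
    future = [x - last_x for x in roots if x > last_x]
    return min(future) if future else None


disk_used_pct = [50, 53, 58, 65, 74]  # made-up monthly samples with accelerating growth
print(round(periods_until_threshold_quadratic(disk_used_pct, 100), 2))  # 2.14
```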

In the pro column, enhanced trending:

Provides fairly accurate projections for CPU and IO activity, at least for resources with moderate-to-low utilization and consistent growth
Delivers the best risk prediction for memory and disk space, when correct functions are provided
Allows for customization

The cons of enhanced trending are that:

You need a very strong understanding of the environment to establish the quadratic functions (and that takes a strong math background)
The results will be more accurate than linear trending for resources with low-to-moderate utilization, but still aren’t accurate for higher utilization
Overprovisioning happens, since you have to set lower thresholds to avoid inaccurate trends
You get inaccurate risk representations
Configuration or workload changes aren’t taken into account

Enhanced trending can be a good choice—if you have the strong math resources in your IT department. Otherwise, it can be difficult to do well.

3. Event Projection

Event projection uses historical data about events to project when future events will occur.

You can use this to calculate IT infrastructure risk based on which events will occur and when. In this sense, event projection is similar to variation from normal.
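
A very basic way to sketch this is to assume events keep arriving at their average historical interval. That assumption, the sample dates, and the `project_next_event` helper are all made up for illustration. Real event projection would account for much more than a mean gap.

```python
from datetime import datetime


def project_next_event(timestamps):
    """Project the next occurrence by adding the mean gap between past events to the latest one."""
    if len(timestamps) < 2:
        raise ValueError("need at least two past events")
    ordered = sorted(timestamps)
    mean_gap = (ordered[-1] - ordered[0]) / (len(ordered) - 1)
    return ordered[-1] + mean_gap


past_disk_full_events = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 11),
    datetime(2024, 1, 21),
]
print(project_next_event(past_disk_full_events))  # 2024-01-31 00:00:00
```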

On the pro side, event projection:

Works best if you don’t have significant variations in workload or configuration changes
Can adjust what’s “normal” over longer periods of time
Allows for customization

The cons of event projection are:

Inaccurate risk representations because it takes time to determine “normal”
Events need to happen in order to calculate risk
Requirements for understanding the environment to determine events
Conditions changing in an environment will cause events to change
Significant resources are needed when big data is involved
90 days of data is required to understand “normal”
Analytics depend on the availability of your resources

Event projection can work—if you’re okay with being reactive, rather than proactive, with IT risk.

4. Allocation Projection

Allocation projection determines risk by looking at the total amount of capacity available versus the allocated capacity.

So when the allocated capacity gets close enough to the available capacity of resources, risk goes up.
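
For example, if you assume allocations grow by a steady amount each period, you can estimate how long until they get too close to what’s available. The steady growth assumption, the 90 percent cutoff, and the function name are illustrative choices, not part of the method itself.

```python
def periods_until_allocation_risk(allocated, available, growth_per_period, risk_ratio=0.9):
    """Periods until allocated capacity reaches risk_ratio of available capacity, or None."""
    limit = available * risk_ratio
    if allocated >= limit:
        return 0.0  # already at risk
    if growth_per_period <= 0:
        return None  # allocations are not growing, so the limit is never reached
    return (limit - allocated) / growth_per_period


# 600 GB allocated out of 1,000 GB, growing by 25 GB a month (made-up numbers).
months = periods_until_allocation_risk(600, 1000, 25)
print(round(months))  # 12
```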

On the pro side, allocation projection:

Applies to virtualized environment placement and disk space resources
Adapts when capacity and work growth are well understood
Allows for customization

The cons to allocation projection are:

Inaccurate risk representations because it uses what’s allocated instead of what’s used
Overprovisioning when the work being performed doesn’t equal the allocation

If you’re primarily concerned about risk in virtualized environments or disk space, allocation projections might make sense for you.

5. Queuing Theory for Risk

The same metrics used in queuing theory for health—system utilization, throughput, queue length, and response time—are used for risk.

You can determine IT infrastructure risk by comparing the predicted values to what you need to maintain service levels.
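
Here’s a hedged sketch of that comparison. It assumes a single-server M/M/1 queue (response time equals service time divided by one minus utilization) and a workload that grows by a fixed percentage each period. Both are simplifying assumptions chosen for illustration, as are the numbers and the function name.

```python
def first_period_at_risk(arrival_rate, service_time, growth_rate,
                         target_response_time, horizon=36):
    """Return the first future period where predicted response time breaks the target, or None."""
    for period in range(1, horizon + 1):
        projected_rate = arrival_rate * (1 + growth_rate) ** period
        utilization = projected_rate * service_time
        if utilization >= 1.0:
            return period  # saturated: no service level can be met
        response_time = service_time / (1.0 - utilization)
        if response_time > target_response_time:
            return period
    return None  # no risk predicted within the horizon


# 5 requests per second at 0.1 s each, growing 5 percent a month, with a 0.4 s target.
print(first_period_at_risk(5, 0.1, 0.05, 0.4))  # 9
```

Today’s response time in this example is well under the target, yet the service level is predicted to break in month nine because response time rises faster than the workload does.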

On the pro side, queuing theory for risk:

Adapts well to workload, configuration, and capacity changes
Allows for customization
Provides very accurate risk results for CPU and IO activity

The cons of queuing theory for risk are:

It doesn’t apply to memory or disk space risk determination
Analytics require more resources than trending methods (but usually less than event or allocation projection)

Queuing theory for risk might make sense if you have the right resources to do the calculations.

IT Health and Risk—Made Easy

There are many routes you can take to IT health and risk management. The right one will depend on your organization and your environment.

But it doesn’t need to be difficult to decide the right method.

Capacity management makes it easy to manage IT health and risk.


Interested in Automated Capacity Planning Made Easy?

Vityl Capacity Management may be what you need. Try it free for 30 days.
