Monitoring and Planning in a Solaris Environment

The challenge in our Solaris environment was to optimize hardware utilization, conserve datacenter floor space, power, and cooling, build a scalable architecture and keep up with growth and performance requirements. The solution entailed consolidation using Solaris virtualization by creating Logical Domains (LDOMs) and Solaris zones to provide a scalable environment with necessary isolation. Along the way, we discovered that we had too many standalone servers. But with the consolidation completed, 50% of those servers are gone.

The realized benefits include:

  • Reduced rack space through consolidation, and reclaimed datacenter floor space
  • Dramatically shorter server deployment time using LDOMs and zones
  • Greater flexibility in migration and capacity planning

That said, anyone using Solaris virtualization should be using TeamQuest (now a part of HelpSystems), says the team lead. After all, if you don’t manage the environment and how you add capacity, you will be a step behind. TeamQuest provides predictive insight into workloads, performance and resource consumption for all our Solaris virtualized servers. This helps us provide more in-depth and predictive analysis for variable workloads on the servers to avoid any anomalies during the lifecycle of a project.

Performance and Capacity

TeamQuest helps our operations team analyze the capacity and performance of servers so they can plan and predict application behavior more accurately by drilling down into Solaris process-level details. It assists us in anticipating the impact of new project rollouts, and has provided valuable data, for example from QA load tests, to predict the user ramp rate and plan the architecture appropriately.

As part of this, TeamQuest provides customized and multi-stage alerts for CPU, memory and disk space utilization. This assists our support teams in predicting any failure before it becomes an outage. Overall, we have a better idea of what is happening inside servers and in the environment as a whole. Note, though, that if you have too many alerts, it can be overwhelming. TeamQuest has intelligence built in to filter out the unnecessary alerts and make sense of them so you know where to look.
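
To make the idea of multi-stage alerts concrete, here is a minimal Python sketch. It is not TeamQuest's configuration format or API; the stage names, the threshold values and the `classify` and `evaluate` helpers are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    threshold_pct: float  # utilization at or above this value reaches the stage


# Hypothetical stages and thresholds; real values depend on the environment.
STAGES = (
    Stage("warning", 75.0),
    Stage("major", 85.0),
    Stage("critical", 95.0),
)


def classify(utilization_pct, stages=STAGES):
    """Return the name of the most severe stage reached, or "ok" if none."""
    # Check the highest threshold first so the most severe match wins.
    for stage in sorted(stages, key=lambda s: s.threshold_pct, reverse=True):
        if utilization_pct >= stage.threshold_pct:
            return stage.name
    return "ok"


def evaluate(metrics):
    """Classify each metric (for example cpu, memory, disk) independently."""
    return {name: classify(value) for name, value in metrics.items()}


if __name__ == "__main__":
    print(evaluate({"cpu": 91.0, "memory": 62.5, "disk": 97.2}))
```

Staging the thresholds this way lets a support team react to a warning well before the same metric reaches a critical level.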

In our environment, TeamQuest resources have been broken into sections based on environment for ease of reporting. Individual physical hosts have their own system groups created, and each group contains the VM inventory of the host (LDOMs). A heat map feature provides a single point of visibility for all our servers. We have it set up with customizable alerts, thresholds and colors for better control. For us, having everything in one pane of glass is important.

TeamQuest drills down to identify LDOM and zone-level relationships. It is the one tool that understands the complexity of Solaris virtualization. You can observe, for example, how many zones are affected by an LDOM. Menu-driven reporting also includes customizable reporting based on roles.
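
The host, LDOM and zone relationship described above can be pictured as a simple nested mapping. The sketch below is only an illustration in Python: the inventory, the host, LDOM and zone names, and the helper functions are hypothetical and do not reflect how TeamQuest stores this data.

```python
# Hypothetical inventory: physical host -> LDOM -> zones running in that LDOM.
INVENTORY = {
    "host-a": {
        "ldom-a1": ["zone-web1", "zone-web2"],
        "ldom-a2": ["zone-db1"],
    },
    "host-b": {
        "ldom-b1": ["zone-app1", "zone-app2", "zone-app3"],
    },
}


def zones_affected_by_ldom(inventory, ldom):
    """Return the zones that depend on the given LDOM."""
    for ldoms in inventory.values():
        if ldom in ldoms:
            return list(ldoms[ldom])
    raise KeyError(f"unknown LDOM: {ldom}")


def zones_affected_by_host(inventory, host):
    """Return every zone on every LDOM of the given physical host."""
    return [zone for zones in inventory[host].values() for zone in zones]


if __name__ == "__main__":
    print(len(zones_affected_by_ldom(INVENTORY, "ldom-b1")), "zones depend on ldom-b1")
    print(zones_affected_by_host(INVENTORY, "host-a"))
```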

Capacity planning at the CPU level has strengthened our overall planning capabilities. We also have the ability to look at historical data on a daily, weekly, monthly or even annual basis. In terms of kernel-level utilization, TeamQuest provides load average statistics, which are valuable in understanding and diagnosing application behavior. If the load average on a server is spiking, for example, these statistics help us see what is happening at the kernel level. We also set up alarms at the kernel level to deal with our complex environment.
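
As a rough illustration of what a load average check involves, the Python sketch below reads the system load averages with the standard library call `os.getloadavg()` and normalizes them by CPU count. The `is_spiking` rule and its limits are assumptions made for this example, not TeamQuest's alarm logic.

```python
import os


def load_per_cpu(load_avg, cpu_count):
    """Normalize a load average by the number of CPUs."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_avg / cpu_count


def is_spiking(one_min, fifteen_min, cpu_count, per_cpu_limit=1.0, ratio_limit=2.0):
    """Flag a spike when the 1-minute load exceeds the per-CPU limit
    and is also well above the 15-minute baseline.

    Both limits are hypothetical defaults and should be tuned per workload.
    """
    saturated = load_per_cpu(one_min, cpu_count) > per_cpu_limit
    rising = one_min > ratio_limit * max(fifteen_min, 0.01)
    return saturated and rising


if __name__ == "__main__":
    one, five, fifteen = os.getloadavg()  # available on Unix-like systems
    cpus = os.cpu_count() or 1
    print(f"load per CPU (1 min): {load_per_cpu(one, cpus):.2f}")
    print("spiking:", is_spiking(one, fifteen, cpus))
```

Comparing the short-term average with the long-term one separates a sudden spike from a server that is simply busy all the time.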

Additionally, TeamQuest provides file system utilization data. This helps when we are reviewing the behavior of servers and applications from a disk space utilization perspective.
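
For readers who want to see what file system utilization means in practice, the following Python sketch computes the used percentage for a set of paths with the standard library function `shutil.disk_usage()`. It only illustrates the metric; the helper names and the paths passed in are assumptions, and this is not how TeamQuest collects the data.

```python
import shutil


def used_percent(total, used):
    """Return used space as a percentage of total, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * used / total, 1)


def filesystem_report(paths):
    """Map each path to the used percentage of the file system holding it."""
    report = {}
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        report[path] = used_percent(usage.total, usage.used)
    return report


if __name__ == "__main__":
    # "." is a placeholder; pass the mount points you care about.
    for path, pct in filesystem_report(["."]).items():
        print(f"{path}: {pct}% used")
```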

Another useful feature is the capability to create customizable reports for specific server groups and parameters. Access can be limited so that reports are shared only with specific groups. This offers a one-stop shop to obtain current and historical data for critical applications and servers.

Effective Performance Analysis

TeamQuest provides the ability to pick a specific data point and drill down into it to find the process that caused a specific spike. This is invaluable when it comes to getting to the bottom of Solaris server utilization. We look at TeamQuest and can see a list of all the processes running at that specific data point on that server or group of servers. We also gain valuable feedback for application teams when they are looking into processes which don’t last for a long time.

The TeamQuest Administration Console provides a single interface for the installation, configuration, and updating of agents across the environment. We have the ability to create tags for the separation of servers in groups (such as Dev/QA/Production).

Updating these agents manually would be a labor-intensive, time-consuming process. When coupled with the TeamQuest Update Server, the TeamQuest Administration Console provides a simple interface for keeping all agents updated with the most recent version of the software. The TeamQuest Update Server can either securely (using SSL) download all updates from TeamQuest directly, or updates can be manually added to the update server.

Update policies are defined in the TeamQuest Administration Console and applied to groups of systems. These policies can be configured to automatically download and install the updates for every agent on the servers in the environment at a specific time. The TeamQuest Administration Console is configured to automatically update newly registered servers with a standard alarm policy using a rule. This prevents the UNIX team from having to code alarms from scratch on newly installed servers.
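
The rule that gives newly registered servers a standard alarm policy can be thought of as a simple tag match. The Python sketch below shows that idea only; the `Server` class, the tag names, the policy names and the `register` function are all hypothetical and are not part of the TeamQuest Administration Console.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    tags: frozenset = frozenset()
    alarm_policy: Optional[str] = None


# Hypothetical rules, checked in order: (tag, policy name). First match wins.
RULES = [
    ("Production", "prod-standard-alarms"),
    ("QA", "qa-standard-alarms"),
]
DEFAULT_POLICY = "standard-alarms"  # assumed fallback for untagged servers


def register(server, rules=RULES, default=DEFAULT_POLICY):
    """Assign an alarm policy to a newly registered server.

    A server that already has a policy is left untouched.
    """
    if server.alarm_policy is None:
        server.alarm_policy = next(
            (policy for tag, policy in rules if tag in server.tags), default
        )
    return server


if __name__ == "__main__":
    new_server = register(Server("sol-prod-01", frozenset({"Production"})))
    print(new_server.name, "->", new_server.alarm_policy)
```

Applying the policy at registration time is what spares the UNIX team from coding alarms by hand on each new server.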

Other configuration policies can also be set, including workload and derived statistics policies. Alarm policies, for instance, can be customized for different levels of thresholds and notifications.

Keep up with growth and performance requirements

Gain insight into your infrastructure, plan for future needs with accuracy, and prevent problems before they occur.