On-Demand Webinar

Right-Sizing Cloud Infrastructure with What-if Analysis

Solaris, Windows, UNIX, Linux, AIX


In order to have consistent IT service delivery, you have to know what your capacity requirements will be. Learn how to accurately capacity plan with what-if analysis in this hour-long webinar. 

  • Reduce or eliminate over-provisioning by identifying the least expensive way to accommodate service level requirements.
  • Get unbiased, vendor-neutral guidance that identifies what you really need as opposed to what your vendor says you need.
  • Accurately provision systems to avoid costly and time-consuming performance bottlenecks.
  • Guide business decisions with objective information.

Stacy Doughan:                 00:00                     Hello, everyone. My name is Stacy Doughan and I'd like to welcome you to our webinar titled, Accurately Capacity Plan With What-If Analysis.

                                                                                Our presenter today is Jeff Schultz. Jeff Schultz is a product specialist who has answered questions for customers and tire kickers for more than 20 years. He boasts of a background in helping IT organizations get up to speed on their IT environment for their capacity planning and performance management objectives.

                                                                                As a reminder, at the end of the webinar, we will have an opportunity to ask questions, but please feel free to submit those questions at any time via the chat box.

                                                                                At this time, I'd like to hand the presentation over to Jeff. Jeff, will you get us started?

Jeff Schultz:                        00:42                     I sure will. Thank you very much, Stacy. Today I'd like to focus on the perspective of right sizing your cloud images, and where we're going to start is using the what-if analysis.

                                                                                I guess if we step back for a second and look at some of the challenges around cloud initiatives: everybody's moving towards some sort of cloud strategy, whether it's a hybrid cloud or a fully public cloud, and there are different levels of visibility available to us to predict capacity. The key, to begin with, is to size your cloud instances properly. As we all know, if I've got an eight-core instance and I can really run on a four-core instance, that reduces the cost by half.

                                                                                It's critical from a cost perspective to size your cloud instances properly, and we'll talk about that briefly.

                                                                                The other part of the capacity planning process: when we talk about cloud, the thinking is that we can just expand into the cloud to ensure we've got scalability. We certainly have that flexibility, but when it comes to cost, we want to do some capacity planning so we understand what costs we're going to incur as our requirement for cloud instances begins to increase. That's the part I'm really going to focus on today.

                                                                                The third bullet there is around monitoring the service levels. That's really challenging because a lot of times when you're working with a cloud provider, arguing over your service levels and what you're really paying for is somewhat of a challenge. Instrumentation is something that organizations like ours can provide to help ensure that you're actually getting what you're paying for.

                                                                                The last bullet is about getting independent guidance. When you're looking at the different cloud providers, I guarantee you that, just like vendors of applications, they'll never suggest an environment that they're going to fail in, right? They'll always be positioning more capacity than what you're really looking for. I encourage you to ensure that you really understand what sort of requirements you have before you get into the cloud environment. Having somebody vendor-neutral or independent help you size those things properly is important.

                                                                                I won't spend a bunch of time on actually sizing what those images might look like, but just briefly talking about the number of CPUs or logical CPUs that you really need to support your applications as well as the memory, because those seem to be the two biggest things that will impact your performance.

                                                                                Obviously disk is there, but disk IO is almost like electricity: you turn it on and you pretty much have what you need. The disk requirement is certainly something that needs to be considered, as well as disk space and the network requirements. In a lot of cases, at least from a cloud perspective, those are just there and they're adequate, but being able to measure how much you need, or how much you're willing to pay for, is another part of that.

                                                                                When you're going through your test and development cycles, that's where a lot of organizations get their insight as to what size those VMs or cloud images need to be as you plan your requirements for cloud. I see a lot of organizations doing this locally in their on-prem environment, where you can do most of your testing, validation, and verification to minimize cost.

                                                                                Although a lot of organizations do their testing in the cloud, and that's fine as well, just be conscious of where and how you're doing your testing and the costs associated with it.

                                                                                What I'd like to do today is walk through a scenario here, and I'm going to call it the shopping cart application. In the current infrastructure, I've got four cloud instances for the web tier. My application tier is hosted as part of a VMware environment, with three VMs that support the application layer or tier, and the database tier is hosted by two AIX LPARs. The current load, the average the environment supports, is a hundred transactions per second. That's the current shopping cart application.

                                                                                What I want to do is use our TeamQuest Predictor to run some what-if scenarios, because the business is going to come and say, "I want to be able to increase the workload, or I want to change the environment, and I want to know what sort of infrastructure is required to do that."

                                                                                What TeamQuest Predictor provides is an interactive interface that will allow you to accomplish that or be able to answer those business questions.

                                                                                What I've done in this case is I've actually loaded the infrastructure into the TeamQuest model. Again, this is somewhat technical, but I want to, I just want to show you a little bit about what the interface looks like that a capacity planner would go through.

                                                                                I'm going to point out the different components that represent the tiers of servers in the shopping cart application. I've got a [inaudible 00:06:59] service represented here and you can see that it's an Amazon instance. The application layer is represented by the [inaudible 00:07:08] one system here and the AIX environment is represented by this AIX system here. Now, you can see the configuration of it off on the right-hand side.

                                                                                I'm going to stop here just for a second and I'll go to the next slide here because I think it'll articulate what I'm hoping to describe. The first thing I do from a modeling perspective is I have these three tiers that I've described. I've got my web tier, my app tier, and my database tier. What I'm doing in this dialog box is I'm building my workload that goes across those tiers.

                                                                                Now, I'm using a single system because generally I'm going to make the assumption that these tiers are [inaudible 00:07:58]. I don't have to have all of the systems in the model; I can pick one system that is representative of that tier, as I've done here.

                                                                                The web tier is represented by system one, which is the AWS system. There's actually four systems in that tier. The second tier is the app tier and there's three systems in that tier, and the AIX system is representative of two systems in that tier.

                                                                                If the systems aren't load balanced, then I would run multiple scenarios. I may take best case or worst case, and if the systems are not homogeneous I may pick best case and worst case when I build my models.

                                                                                Now, this is a three-tier model. Seven tiers is the most complicated one that I've worked with, although I'm sure there are applications in everybody's environment that are much more complicated.

                                                                                This next slide is after I've loaded the environment into the model and built my multi-tier workload. Now I need to calibrate the model. In other words, I've got my measured data, and what I'm doing is building a calibrated model, which is a mathematical representation of that measured data.

                                                                                In the first circle that I'm describing here, the current load of the web tier is running at 17.5%, and you can see that when I calibrated that tier, I got very close results. When I calibrated the application tier, you can see that the numbers are very close as well, and when I calibrated the database tier in the model, you can see that the results are very, very close.

                                                                                The rule of thumb, the industry standard, is that if I can get within 5%, I've got a very good model. You can see that what I've achieved here is a much better model. Again, as you do your projections, a model that has very accurate results in the beginning, from a foundation perspective, gives me a lot more confidence as I predict out into the future. This is kind of the sanity check when I build a model and calibrate it to ensure that I've got a solid [inaudible 00:10:34].
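
The 5% rule of thumb can be sketched as a simple check of modeled values against measured values. This is a minimal illustration, not how TeamQuest Predictor calibrates: the function names are hypothetical, and apart from the 17.5% web-tier measurement quoted in the talk, every number is a made-up placeholder.

```python
def relative_error(measured: float, modeled: float) -> float:
    """Return |modeled - measured| / |measured| as a fraction."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return abs(modeled - measured) / abs(measured)


def is_calibrated(tiers: dict[str, tuple[float, float]], tolerance: float = 0.05) -> bool:
    """True when every tier's modeled value is within tolerance of the measured one."""
    return all(relative_error(m, p) <= tolerance for m, p in tiers.values())


# (measured CPU %, modeled CPU %) per tier.
# Only the 17.5% web-tier measurement comes from the talk; the rest is assumed.
tiers = {
    "web": (17.5, 17.4),
    "app": (22.0, 22.3),
    "db": (31.0, 30.6),
}

for name, (measured, modeled) in tiers.items():
    print(f"{name}: modeled value is {relative_error(measured, modeled):.2%} off")
print("within 5% on every tier:", is_calibrated(tiers))
```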

                                                                                From a business perspective, the question that they asked was this: we've got our baseline of a hundred transactions per second. The scenario that I want to be able to model is that I want to grow by a hundred transactions per second per month, and I want to do that over four months. I'm going to start out at a hundred, then I'm going to go to 210, and through the third month, fourth month, and fifth month I'll be running up to 540 transactions per second. Do I have enough capacity to support that?

                                                                                In the model I go in, and I won't show you that part of it, but I just do an incremental growth of a hundred transactions per second for those four months. When I look at the output of the model, I can see that I've got a result in here. The rest of the infrastructure didn't change because I didn't add any growth to that, but my shopping cart application starts out at a hundred transactions per minute, per second, I'm sorry, and it continues to grow, and you can see the end result here is a value of 2.6.

                                                                                Now, I'll stop here for a second and we can just talk about what we're doing here. This is an analytical queuing model, which is advanced analytics for capacity planning. This takes into consideration the active resources, and the point I want to make here is that if this value gets above two, that's where I start to see my performance really start to degrade. The key with this risk score is to keep it under two, because again, once I get over two, I know I have a performance problem I need to resolve.

                                                                                The second thing I generally look at is what I describe as the components of response. As I talked about, I have this application that goes across the multiple tiers, and each tier does work on each one of the active resources. There was time spent actually doing work and then there was time spent potentially waiting. When I talk about queuing theory, I'm trying to identify the areas where I'm spending the most time waiting to get service, and that's the component that I need to resolve.

                                                                                When I look at this, you can see the green starts to grow a little. It starts to appear at the 320 transactions, and when it goes to 430 then to four, excuse me, to 540, you can start to see this really start to grow. This is a nonlinear behavior that you see with computer systems.
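
To make the idea of service time versus waiting time concrete, here is a minimal sketch using a textbook M/M/c queue (Erlang C). It is not the Predictor algorithm, which the talk does not describe. It assumes the risk value behaves like response time divided by service time, and the 25 ms service time and 16 logical CPUs are hypothetical.

```python
from math import factorial


def erlang_c(servers: int, offered_load: float) -> float:
    """Probability that an arriving request has to wait in an M/M/c queue."""
    rho = offered_load / servers
    if rho >= 1:
        raise ValueError("offered load exceeds capacity; the queue is unstable")
    tail = offered_load ** servers / factorial(servers) / (1 - rho)
    head = sum(offered_load ** k / factorial(k) for k in range(servers))
    return tail / (head + tail)


def response_components(rate: float, service_time: float, servers: int) -> tuple[float, float]:
    """Return (service_time, mean_wait_time) in seconds for one resource."""
    offered_load = rate * service_time
    p_wait = erlang_c(servers, offered_load)
    wait = p_wait * service_time / (servers - offered_load)
    return service_time, wait


def risk_ratio(rate: float, service_time: float, servers: int) -> float:
    """Assumed definition: (service + wait) / service, so 1.0 means no queuing."""
    service, wait = response_components(rate, service_time, servers)
    return (service + wait) / service


# Hypothetical resource: 25 ms of CPU per transaction, 16 logical CPUs.
for tps in (100, 210, 320, 430, 540):
    service, wait = response_components(tps, 0.025, 16)
    print(f"{tps:>3} tps: service={service * 1000:.1f} ms, "
          f"wait={wait * 1000:.2f} ms, ratio={risk_ratio(tps, 0.025, 16):.2f}")
```

The waiting component stays near zero at low load and then grows much faster than the load itself as utilization approaches saturation, which is the nonlinear behavior described above.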

                                                                                It's very easy for a capacity [inaudible 00:13:29] when they do the model to see specifically at what tier I have constraints, as well as specifically which resource is constrained. I can see that I've got a CPU bottleneck here that I need to resolve.

                                                                                As I go forward, one of the scenarios that I can do: I'm starting out at four systems in the web tier, which is part of my AWS environment. I can change that from four to five in a simple dialog. As soon as I make that change, I can generate the results in a couple of seconds.

                                                                                You can see that now, rather than my risk score being 2.6, my risk score is 1.1. In this case it was very easy to resolve the issue and ensure I've got enough capacity in the environment just by adding one more instance in my AWS tier, my web tier. I can go back and validate [inaudible 00:14:37] response where I start to see a little bit of green as I grow, but again, I've got plenty of capacity there to be able to support that growth.
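
The horizontal what-if, adding instances until the risk value drops back under two, can be sketched as follows. This is a simplified stand-in that treats each load-balanced instance as a single M/M/1 queue; the 4 ms service time is hypothetical, so the numbers will not match the 2.6 and 1.1 shown in the webinar.

```python
def risk_ratio(total_rate: float, service_time: float, instances: int) -> float:
    """Response/service ratio for one instance, assuming an even load split (M/M/1)."""
    utilization = (total_rate / instances) * service_time
    if utilization >= 1:
        return float("inf")  # saturated: the queue grows without bound
    return 1 / (1 - utilization)


def instances_needed(total_rate: float, service_time: float,
                     threshold: float = 2.0, start: int = 1, limit: int = 1000) -> int:
    """Smallest instance count at or above start that keeps the ratio under threshold."""
    for n in range(start, limit + 1):
        if risk_ratio(total_rate, service_time, n) < threshold:
            return n
    raise ValueError("no instance count within the limit satisfies the threshold")


# Hypothetical web tier: 4 ms of service time per transaction, 540 tps at peak.
for n in (4, 5):
    print(f"{n} instances: ratio={risk_ratio(540, 0.004, n):.2f}")
print("instances needed:", instances_needed(540, 0.004, start=4))
```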

                                                                                The other part of what I can do with the modeling technology is I can actually change the infrastructure. You can see that I've got this Xeon two gigahertz environment. Where I started, I've got one chip, eight cores, and two threads per core. This is information that we get from the data that we're actually collecting, so it represents the baseline of where I'm at.

                                                                                What I want to do from just a what-if perspective, again is rather than growing the system horizontally, which is adding more systems to the tier, I can grow it vertically. I can add more horsepower to each one of the systems representative in that tier. I can go from one chip to two chips and there's a, [inaudible 00:15:38] here.

                                                                                Okay, there's actually a speedup factor of 1.9766. What that represents is that this is not running quite twice as fast. As we all know, a lot of times if I double the capacity, I'm not going to get double the throughput, but I'm going to be able to get something close. These are the speeds and feeds that are part of the modeling technology. I don't want to get bogged down on that, but just understand that that's there.
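
As a quick worked example of what a 1.9766 speedup means when going from one chip to two: the sketch below computes the scaling efficiency and the resulting capacity. The function names and the 400 tps baseline are hypothetical; only the 1.9766 factor comes from the talk.

```python
def scaling_efficiency(speedup: float, hardware_multiple: float) -> float:
    """Fraction of the ideal (linear) gain actually delivered."""
    return speedup / hardware_multiple


def effective_capacity(base_capacity: float, speedup: float) -> float:
    """Throughput capacity after the upgrade, given the modeled speedup."""
    return base_capacity * speedup


speedup = 1.9766          # from the talk: one chip to two chips
base_capacity = 400.0     # hypothetical capacity of the one-chip system, in tps

print(f"efficiency: {scaling_efficiency(speedup, 2):.2%}")
print(f"capacity after upgrade: {effective_capacity(base_capacity, speedup):.1f} tps "
      f"(linear scaling would give {base_capacity * 2:.1f})")
```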

                                                                                If I go back to the perspective of horizontally growing, you can see that my risk score here was 1.12 when I just added another system in that tier. When I doubled the capacity in the tier, you can see that I'm at 1.01, which is basically no queuing at all. I can validate that based on the components of response as well. You can see that little slice in the bottom there represents the queuing of the CPUs in the web tier.

                                                                                I've been able to very quickly resolve the issue of CPU constraint in the web tier. As we all know, the business can come back and say, "Well, wait a second. That isn't really what we want. We wanted to grow at least 175 transactions per second per month, and we want to do that for the next four months. Do I have enough capacity to support 800 transactions per second in month four?"
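
The revised growth request is simple arithmetic: a baseline of 100 transactions per second plus 175 per month for four months. A minimal sketch, with a hypothetical helper name:

```python
def growth_schedule(baseline: int, monthly_increase: int, months: int) -> list[int]:
    """Load at the baseline and at the end of each month of linear growth."""
    return [baseline + monthly_increase * m for m in range(months + 1)]


schedule = growth_schedule(baseline=100, monthly_increase=175, months=4)
print(schedule)  # [100, 275, 450, 625, 800]
```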

                                                                                As I feed the information back into the model, you can see that I do fairly well until I get to 800 transactions per minute or per second, I'm sorry. I literally don't have enough capacity to support that, so what is the constraint?

                                                                                If we go back to the components of response, we can see that again, I've got a CPU issue in the web tier, which is part of my AWS environment. As we did before, I can take this from five systems in that tier to six and generate the results again. You can see here I'm right on the bubble, with our risk score being just a little above two.

                                                                                I guess I can go back to the business and say our performance is going to be right on the bubble for 800 transactions per second, but I can't really support more than that. As the business grows, or as the transaction volumes grow, I can start to see where my break point is.
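
One way to think about the break point is as the highest load at which the risk value stays at or under two. The sketch below solves that in closed form for the same simplified assumption of one M/M/1 queue per load-balanced instance, where the ratio is 1 / (1 - utilization). The 4 ms service time is hypothetical and the result is not meant to reproduce the 800 tps figure from the webinar.

```python
def break_point(instances: int, service_time: float, threshold: float = 2.0) -> float:
    """Highest total arrival rate (tps) that keeps response/service <= threshold.

    From 1 / (1 - u) <= threshold, the utilization limit is u <= 1 - 1 / threshold.
    """
    if threshold <= 1:
        raise ValueError("threshold must be greater than 1")
    max_utilization = 1 - 1 / threshold
    return instances * max_utilization / service_time


# Hypothetical web tier with 4 ms of service time per transaction.
for n in (5, 6):
    print(f"{n} instances: break point near {break_point(n, 0.004):.0f} tps")
```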

                                                                                Again, I can go back to the components of response and see specifically what that resource is. Now, I've shown you in this example how we can horizontally scale the AWS tier, which is my Amazon tier. I've also got infrastructure that's on-prem. A lot of times when I'm building models I may be CPU bound or IO bound there as well, and I can change those configurations and generate the results.

                                                                                There's a lot of flexibility in the what-if part of the modeling process. Again, this is a little more complicated than just looking at columns and rows and numbers, but the results here are easy to digest and really understand exactly what a person needs to do to solve the model.

                                                                                Again, the automation of a lot of those capabilities is something that we provide to our customers. We've got other webinars that describe the automation of capacity planning specifically. If you go to the resources section of our website, there are a number of webinars there that talk about the automation aspect of it.

                                                                                Just to give you a little bit of insight into some of the customers that we support across the globe, you can see that they're not industry specific, but go across industries. These are just examples of some of the customers that use our software to run their IT departments.

                                                                                The ability to do proper capacity planning will help you avoid those risks and control costs, because obviously if you don't size your VMs very well, your costs are going to be much higher. Then there's being able to work with the business to communicate value. It's all about making IT better.

                                                                                Stacy, did we get any questions at all along the way?

Stacy Doughan:                 20:43                     Sure Jeff. We actually do have a couple of questions here. If you have a question that you'd like to ask Jeff, go ahead and submit that via the chat box and we'll have him answer that for you right now. The first question that came in is; how much data do I need to build a model?

Jeff Schultz:                        20:59                     That's a great question, Stacy and that's one that is really fairly simple to understand.

                                                                                If I look at the performance of an environment, I generally, if I look at a 10 minute sample of the environment, if it's actually representative of the performance of the environment, I can take a 10 minute sample and build a model based on that. A lot of times I'll take an hour because it's a little bit wider view of what's happening in the environment, but I don't need a lot of data to build a very accurate model.

                                                                                Again, it comes down to a representative sample of what the application is doing. I take a snapshot of that and I can build a model, so it doesn't take a lot of data.

                                                                                I don't need 30 days' worth of data, but I might take 30 days of data to identify what the normal performance is that I want to model. I might want to model an average representation of the application or I might want to model worst case. If I look at a month's worth of data, I can pick out the points in time that I actually want to model.
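
Picking the points in time to model out of a longer history can be sketched like this: choose the interval closest to the average for a typical baseline and the peak interval for a worst-case baseline. The helper name and the sample values are hypothetical.

```python
from statistics import mean


def pick_baselines(samples: list[tuple[str, float]]) -> tuple[tuple[str, float], tuple[str, float]]:
    """Return (typical, worst) intervals from (label, transactions_per_second) samples.

    typical: the interval whose load is closest to the overall average
    worst:   the interval with the highest load
    """
    if not samples:
        raise ValueError("need at least one sample")
    average = mean(load for _, load in samples)
    typical = min(samples, key=lambda s: abs(s[1] - average))
    worst = max(samples, key=lambda s: s[1])
    return typical, worst


# Made-up hourly averages standing in for a month of collected data.
history = [
    ("Mon 10:00", 92.0),
    ("Tue 14:00", 101.0),
    ("Wed 09:00", 88.0),
    ("Fri 12:00", 134.0),
]

typical, worst = pick_baselines(history)
print("typical baseline:", typical)
print("worst-case baseline:", worst)
```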

Stacy Doughan:                 22:12                     Then we have a follow on question to that one; related to how much data, how do you get the data? Is it agent-based or can you leverage existing data sources or both?

Jeff Schultz:                        22:24                     We can actually do both. We've got a number of platforms that we can actually leverage. What I showed you today was data that TeamQuest has actually collected, but we do have some platforms we support as well.

Stacy Doughan:                 22:42                     All right, so how long did it take you to build and run the different scenarios in this model?

Jeff Schultz:                        22:49                     Oh Stacy, I'd love to answer that question. That's the beauty of TeamQuest Predictor. It took me longer to figure out what data that I actually wanted to use as my baseline, but I could very easily run through all of those modeling scenarios in something less than 30 minutes. It's very quick and easy to build these models and make the projections.

Stacy Doughan:                 23:14                     Then I have one more question here, so if anybody else has questions, go ahead and get those submitted before we end the webinar today. The last question here is; how do you collect data from your cloud instances?

Jeff Schultz:                        23:29                     That's a great question. The challenge with the cloud instances is that they're virtualized things up there, right? You don't have the advantage like you do with a VMware type of environment where I've got host-level metrics. I've only got data collection, I've only got visibility inside the VM, inside those cloud instances.

                                                                                Currently we're actually collecting data inside. We've got an agent that actually collects that data. We're moving towards a remote data collection center or strategy so that we can collect all of that data remotely. Today the way I did it was I actually had an agent installed in the cloud instance to collect the data.

Stacy Doughan:                 24:15                     Very good. I'm just going to wait a second here to make sure that we've got 'em all.

                                                                                All right, it looks like we've got all our questions in, so that brings us to the end of our question and answer period and to the end of today's presentation. Thank you so much Jeff, and a special thank you to all of you online for attending and we hope you have a really great day.

Jeff Schultz:        24:35     Thanks Stacy. Thanks everybody.

Try what-if analysis in intuitive workflows in Vityl Capacity Management

See how Vityl makes what-if analysis easy with a single workflow from problem identification to resolution. Start your free trial today. 