Capacity may be infinite in the cloud, but your budget isn’t. Since you pay for what you use in the public cloud, you should constantly be monitoring and optimizing your workloads.
How do you keep cloud costs down? Watch this webinar to learn how to improve your total cost of ownership.
You’ll learn how to:
- Find inefficiencies so you can decommission or right-size your cloud capacity
- Combine historical patterns and business forecasts to predict future demand
- Predict impact of forecasted demand
- Identify “opportunity cost” for different cloud provisioning models
Amanda Hendley: 00:00:06 Hi. Good morning. We are ready to get started. I want to welcome everyone to today's CMG webinar.
Amanda Hendley: 00:00:21 If you're not familiar with the organization, CMG is a not-for-profit worldwide organization of IT professionals, committed to sharing information and best practices on ensuring the efficiency and scalability of IT service delivery to drive digital transformations.
Amanda Hendley: 00:00:37 We offer in person and online events like this one which you can find on our events calendar at CMG.org/calendar. We also have a special event coming up on June 19th which is called CloudXchange. It is a free online all day event. You can register at CMG.org/cloudxchange.
Amanda Hendley: 00:01:00 Now I want to thank you for joining us for Optimize Your Cloud Capacity, hosted by TeamQuest.
Amanda Hendley: 00:01:06 Before we get started, I want to make sure you know that you can submit questions through the GoToWebinar panel, and we're collecting those throughout today's session. We will have Q&A at the end of the presentation, and if we do not get to your question, we will post all the questions and answers on CMG's blog, and we'll also make today's video available to you.
Amanda Hendley: 00:01:28 Now, I am pleased to introduce you to Per Bauer. He's the Director of International Solution Services for TeamQuest and he's responsible for the delivery of services to customers worldwide.
Amanda Hendley: 00:01:40 He conceived and is the author of TeamQuest's capacity management maturity model, and has presented papers on the subject in Europe and the United States. He has a combination of deep practical experience, with the understanding of the business drivers for capacity management, which contribute to his role as a thought leader in the industry.
Amanda Hendley: 00:02:02 With that, let me turn it over to Per. Welcome!
Per Bauer: 00:02:05 Thank you. Okay. I hope you all can see my presentation now. This session will be about how to optimize capacity in the cloud.
Per Bauer: 00:02:26 This is the third seminar we're doing in a series around capacity management for the cloud. The previous one focused on the migration over to the cloud and what the challenges were there. We will deal with that briefly in this presentation, but most of the focus will be on how to optimize your resources in a cloud environment on an ongoing basis.
Per Bauer: 00:02:58 For those of you who haven't attended one of these before or haven't met me in person, this is what I look like. Amanda gave me a brief introduction so you know who I am. I've been with TeamQuest for quite a while. I work out in the field, coordinating and delivering services around our solutions to the market.
Per Bauer: 00:03:22 Over the last two, three years, we've seen an enormous interest in solutions for hybrid cloud or hybrid IT, and that's what we're going to discuss mostly in this session today.
Per Bauer: 00:03:37 The main topics that we're going to deal with are these three. I'll start by discussing how you identify and how you address inefficiencies in the cloud, because that's a big part of what capacity management is about. Then we'll move over to the subject of getting predictability and managing costs. Before, it used to be about resources; now it becomes about cost, how you manage that in the best way, and some recommendations and advice around that. That's what we're going to spend this session on.
Per Bauer: 00:04:17 As Amanda said, there's room for questions at the end, so please submit them in the chat window, and I'll deal with as many as possible at the end of this session.
Per Bauer: 00:04:29 Before we get into this, just some brief terminology. I'll do this very quickly because I think the market is starting to mature and most people are aware of these terms, but there is still some confusion regarding some terminology in this space. I want to make sure that we're all on the same page regarding that.
Per Bauer: 00:04:47 There's three entities that we deal with.
Per Bauer: 00:04:50 There's traditional IT, which is what we've always known, so everything from mainframe up to the latest and greatest virtualized and software defined data centers. But still, things running under our control on prem or in a co-location, where we own the resources and we operate the resources.
Per Bauer: 00:05:13 Then you have the public cloud, which is infrastructure as a service in most cases, especially when it comes to capacity management, cloud service run by a third party vendor.
Per Bauer: 00:05:26 Then you have private cloud in between, which is your own infrastructure, but wrapped into a cloud service. You operate it as a cloud, and to the consumer it looks like a cloud, but you're still in charge of sweating the assets and making sure that you get the most out of the capital investments that you've made in the infrastructure underpinning it.
Per Bauer: 00:05:50 For the cloud, there's three different entities. There's infrastructure as a service, there's platform as a service, and software as a service.
Per Bauer: 00:05:58 We'll focus on the infrastructure as a service here. That's where you need to do any type of capacity management.
Per Bauer: 00:06:04 For software as a service, that is part of the service, so that's taken care of by the vendor.
Per Bauer: 00:06:11 For platform as a service it depends on what type of platform it is. Some types of platforms you still need to do some level of performance and capacity planning for, whereas others it's completely part of the service, the provider takes care of it.
Per Bauer: 00:06:30 We'll focus on the infrastructure as a service aspect in this presentation.
Per Bauer: 00:06:34 There used to be a lot of talk about hybrid cloud in the beginning of the cloud era, where it was... the scenario painted was that you would have a private cloud where you could run your workloads on prem or in a co-location, but completely protected from the public. Then you could freely move things between that private cloud and the public cloud.
Per Bauer: 00:07:02 The reality has turned out quite different. The pace of innovation in the public cloud has made it such that most private cloud alternatives are falling behind, and there's almost no comparability between the two anymore. Talking about hybrid cloud and managing an application that is living in a hybrid cloud becomes more and more difficult.
Per Bauer: 00:07:31 Another term that has started to show up more and more is multi-cloud. Basically having multiple different public cloud providers to make sure that you don't end up in a lock-in situation or that you always go for the best alternative.
Per Bauer: 00:07:47 You primarily see that for platform and software as a service, not so much for infrastructure as a service, even though it exists. For platform and software as a service, it's cherry picking: picking the vendor that can provide the best platform as a service or software as a service alternative. You can mix those, because there is typically not much interoperability between them, so it doesn't really matter if they are with different vendors.
Per Bauer: 00:08:17 Whereas in infrastructure as a service, it often becomes difficult to manage multiple vendors. You can perhaps partition your environment in such a way that you can run with multiple IaaS providers, but most organizations go for one IaaS provider, and that's what we're going to assume in this presentation as well.
Per Bauer: 00:08:41 Then we have the hybrid IT, which is everything you see on this picture pretty much. Your legacy workloads in traditional IT which is still 75, 80, 90% of what companies do typically. You have the private cloud if you have one. You have the public cloud services that you use for running some of your workloads.
Per Bauer: 00:09:02 Managing all this together, getting visibility and manageability for all of this, is the big challenge. Hybrid IT is reality for 99% of the market, and that's what you need to address, and that's what you need to manage. This is the outset of what we're going to talk about.
Per Bauer: 00:09:22 How does the role of capacity management change? If you compare traditional on-premise IT to IT hosted in the public cloud, it used to be very much about optimized use of limited data center resources. You had a certain amount of servers or network devices or storage devices that you controlled, and then you had to fit as much as possible into those and make sure you had some growth headroom and growth margins. You had to source new equipment as you grew.
Per Bauer: 00:10:01 Whereas in the public cloud, capacity is, at least in theory, infinite. The constraining factor becomes the budget. Basically you need to understand what the cost is of running this application over a quarter or over a year or over multiple years. That's what you need to optimize on rather than resources.
Per Bauer: 00:10:24 If you look at the driving factors for capacity management, what used to drive companies to do capacity management, most of those or all of those are still true. The only one that maybe goes away is the provisioning lead times.
Per Bauer: 00:10:37 In the past, with everything running on prem and in really large organizations with a lot of dependencies, a lot of red tape, a lot of silos, et cetera, provisioning lead times were sometimes six to nine months or even more than that.
Per Bauer: 00:10:55 Under those circumstances, you need to be able to plan ahead. Otherwise, it's impossible to have a reactive approach if you have to wait for several quarters for resources to be provisioned.
Per Bauer: 00:11:08 Most of that goes away in a public cloud. But the other ones, making sure that your business critical services can continue to grow the way they need to, or making sure that you fulfill the regulatory requirements, or making sure that you're running things as lean as possible and you optimize the cost for your IT operation, et cetera, all those things are still true in the cloud. You do it slightly differently, but at the heart, it's the same things that you need to do.
Per Bauer: 00:11:40 Then the last one is another aspect that changes. In traditional on-prem IT, you had capex investments that you had to live with over the depreciation period, and eventually you had to do another capital investment, et cetera. The efficiency gains that you could make by doing good capacity management were realized over time. They weren't necessarily immediate, because if you had made the wrong decision, you probably had to live with that for a while.
Per Bauer: 00:12:16 Sometimes you could reclaim and repurpose those resources and use it for other workloads, but that was not always the case. Especially before virtualization, that was a challenge to accomplish. In virtualization, it became easier, it was a given that you could do it.
Per Bauer: 00:12:34 Whereas in the cloud, anything that you do in terms of efficiency gains or efficiency improvements has immediate impact, because at the end of the next period, typically by the end of the month or by the end of the quarter, it doesn't show up on your bill anymore, so you save that money. It's much more attractive to increase efficiency in the cloud because it has immediate payback.
Per Bauer: 00:13:00 Those are some of the things that change, on a high level, the role of capacity management in the cloud.
Per Bauer: 00:13:09 There's two things you need to focus on when it comes to capacity management in the cloud.
Per Bauer: 00:13:13 The first one is optimization. Since the switch is from capex to pure opex, there's no sunk cost. As we said, optimization pays off immediately. It's about delivering the right capacity at the lowest possible cost without performance suffering or your ability to grow being limited. Basically, making sure that you're ahead of your business and that you have sufficient resources.
Per Bauer: 00:13:46 The other one is predictability, so making sure that you can predict the costs. Capacity is infinite as we said, but the budgets that you have running different applications and workloads are certainly not infinite. You need to be able to do predictions of costs, both based on organic growth, but also forecast the demand, very much like we did in the past, like we've always done in capacity management. You do it in a slightly different way as we'll come back to.
Per Bauer: 00:14:16 Those are the two main roles of capacity management in the cloud, optimization and predictability, offering predictability.
Per Bauer: 00:14:26 How do you do this then? We'll start with how you identify and address inefficiencies.
Per Bauer: 00:14:32 The best time to start this is already at the migration over to the cloud. Most of the workloads that are being deployed in the cloud have a legacy existence in your data center. Certainly some of the workloads in the cloud are brand new and were developed for that environment, but a majority of the workloads are legacy ones that come from your existing operation.
Per Bauer: 00:15:03 In order to get full effect of your cloud operation or your cloud implementation, you should refactor those applications to become cloud native. Cloud native means chopping your application or workload up into microservices that allows you to scale out, so you can run those different microservices on different instances in the cloud, and you can add and retract nodes or instances as you need in order to deal with fluctuations in demand, et cetera.
Per Bauer: 00:15:38 You need to build a level of automation into those cloud native applications as well, because the number of components involved in running a workload will be many times more. In order for that to be efficient, typically you integrate with some sort of automation software, like Kubernetes or Mesos or something that takes care of your application.
Per Bauer: 00:16:03 That is the application refactoring track. That is a front-loaded approach where you do a lot of the effort before you migrate, and then as you migrate over, you operate it that way. The costs of running those applications in the cloud, at least in theory, are much lower. They play well with this dynamic nature of the cloud and you can use the amount of resources that you really need in order to do this.
Per Bauer: 00:16:32 The challenge with doing this is that it's high risk. You're basically rewriting your software as you move over. Then you move over to a new environment that you don't really know that well. You're introducing multiple changes at once and you're rewriting your application.
Per Bauer: 00:16:51 Another challenge is that it's much longer time before you see any visible results of this. If you have an impatient business side that wants to see quick results, you may not be allowed or you may not have the time to do this. Your planning horizon is such that you need to move things and show results fairly quickly, rather than waiting for everything to be refactored.
Per Bauer: 00:17:21 The reality is that most workloads actually get moved without any refactoring. It's a simple lift and shift exercise where you move them over as is, probably with the intention to gradually refactor it after the migration.
Per Bauer: 00:17:39 The consequences of doing that, which is a sort of back-loaded approach, is that when those workloads end up in the cloud, they will be more expensive than you expected, or they will cost more than you expected.
Per Bauer: 00:17:51 If you did a TCO analysis based on what the cloud provider offers, those typically assume that you have done the refactoring first, so your application behaves in a cloud native fashion. If it doesn't, the cost for hosting it in the cloud will often be several times higher than you would have expected.
Per Bauer: 00:18:14 Those are the two choices that you have, but not really, there is some middle ground here that we're going to talk about.
Per Bauer: 00:18:25 The typical scenario with lift and shift is to do like-for-like, sort of move over to the cloud. You look at the outer characteristics of a VM and determine what size of an instance you need, and then you look through the instance types. In this case it's a database on AWS. You find the smallest possible but like-for-like type of instance that resembles your VM and move over to that.
Per Bauer: 00:18:54 As an alternative to that, if you spend some more time doing this and you put some more effort into it and you look at the utilization of resources and activity cycles, et cetera, and you evaluate different options like on demand versus reserved, et cetera, you can actually find a much better fit for that VM. You can do some of the work that you perhaps should have done before by doing rightsizing on prem when you do the migration.
Per Bauer: 00:19:21 Doing that, we've seen customers saving 35 to even up to 85% in terms of cost for their deployment. This depends on of course how good a job you've done before you made the migration. But in most cases, there's considerable savings to be made by doing that trimming exercise when you do the migration.
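The difference between a like-for-like move and a right-sized move can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the instance catalog, its names and prices, the 25% headroom, and the utilization figures are all hypothetical and not taken from any provider's price list or from the presentation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float
    hourly_usd: float

# Hypothetical catalog; names and prices are illustrative, not real offerings.
CATALOG = [
    InstanceType("small", 2, 8, 0.10),
    InstanceType("medium", 4, 16, 0.20),
    InstanceType("large", 8, 32, 0.40),
    InstanceType("xlarge", 16, 64, 0.80),
]

def like_for_like(vm_vcpus, vm_memory_gib):
    """Cheapest type that matches the VM's allocated (not used) resources."""
    fits = [t for t in CATALOG
            if t.vcpus >= vm_vcpus and t.memory_gib >= vm_memory_gib]
    return min(fits, key=lambda t: t.hourly_usd)

def right_sized(vm_vcpus, vm_memory_gib, peak_cpu_util, peak_mem_util,
                headroom=0.25):
    """Cheapest type that covers measured peak demand plus growth headroom."""
    need_cpu = vm_vcpus * peak_cpu_util * (1 + headroom)
    need_mem = vm_memory_gib * peak_mem_util * (1 + headroom)
    fits = [t for t in CATALOG
            if t.vcpus >= need_cpu and t.memory_gib >= need_mem]
    return min(fits, key=lambda t: t.hourly_usd)

# A 16 vCPU / 64 GiB VM whose observed peaks are 20% CPU and 35% memory.
before = like_for_like(16, 64)
after = right_sized(16, 64, peak_cpu_util=0.20, peak_mem_util=0.35)
saving = 1 - after.hourly_usd / before.hourly_usd
print(before.name, after.name, f"{saving:.0%}")
```

The peak figures should come from data that covers the full business activity cycle of the workload, not just a recent sample.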
Per Bauer: 00:19:53 It's also important to remember what happens if you don't do that at this point. We're used to a virtualized environment where you could easily change the hypervisor settings. If you allocated too much, say, CPU resources to a VM, it was fairly easy to pull back some of that by just changing the hypervisor settings, which would then allow you to put additional VMs on that cluster or that host.
Per Bauer: 00:20:24 That's not possible in the cloud. An instance is an instance. They are preconfigured. You can't really turn any knobs for a specific instance type. If you make a mistake at this point and you oversize here, it's hard to pull back those resources. You basically would have to scrap that instance and redo the migration to a new size of that instance.
Per Bauer: 00:20:48 If your application is stateful, of course you need to bring that data back and redo the whole thing. You can't just throw it away.
Per Bauer: 00:20:57 It's important to pay attention to this when you do the migration. It's a golden opportunity to increase efficiency and improve the sizing of your instances.
Per Bauer: 00:21:10 When you work with that, when you do that, another thing that is important to remember, cloud services are typically associated with a certain amount of self-service. Customers, tenants expect to have control over their own provisioning, et cetera. Which means that you will have to give up some of the control. You just have to accept that in the cloud, there's less you can do in terms of up-front analysis before things get deployed.
Per Bauer: 00:21:51 This means that you have to focus much more on reactive activities. That's sometimes hard to accept. As capacity managers, we're very used to proactivity being the axiom of what we should do. Everything we do should be proactive, or as much as possible, we should focus on proactive activities. That was the key to success in the past.
Per Bauer: 00:22:15 With losing some of that control, that becomes much more difficult. We basically have to accept the fact that reactive clean up and rightsizing procedures need to be part of what we do. A lot of the focus moves from up front vetting of requirements to reactive rightsizing or clean ups.
Per Bauer: 00:22:40 Those have to be relatively frequent, because the more time that passes between each of those exercises, the more money you spend or the more money you waste. Since you're billed by the month or by the quarter or even by the week, rust never sleeps. If you're not careful and you're not taking care of your environment, you're going to constantly burn money that's not really needed or that is unjustified.
Per Bauer: 00:23:14 What are the keys to this rightsizing? There's some general things that you need to think about.
Per Bauer: 00:23:20 The first one is to categorize the workload. You need to understand what the workload that you've migrated or that is running in the cloud, what is it, what's the basic type of workload? Is it a online transactional workload? Is it a batch-like workload? What is the level of sustained activity? Can you expect it to behave in a fairly static way? Or is it always active during a few hours in the night? Et cetera, et cetera.
Per Bauer: 00:23:50 What are the basic characteristics of that workload? Unless you understand that, it's really, really hard to make the right decisions when it comes to rightsizing.
Per Bauer: 00:23:58 The other one is to understand the business activity cycles. Does this workload, is it used all the time? Does it peak once a week, or even once a month? Or in some cases even at the end of the year. What is the peak that we need to provision for? What is the maximum that we need to manage? You need to have data that covers those business activity cycles in order to make the right decisions.
Per Bauer: 00:24:25 Just looking at the last week or couple of weeks of data is not enough, and that is many times all you get from the monitoring solutions that come with public cloud services, like CloudWatch and Azure Monitor, because they don't store that data for very long. Unless you tap that data off and record it somewhere, it's going to be very hard to understand the business activity cycles and plan for those.
Per Bauer: 00:24:55 The third one is you need to focus on performance impact. Some of the cloud monitoring solutions that are available for public cloud services, they only provide a handful of metrics.
Per Bauer: 00:25:10 Those metrics don't necessarily tell you how the instance is actually being used, or how the resources are being used. Just looking at utilization metrics can be very tricky, because some types of applications and workloads grab as many resources as they can if they are available, and if there is competition for those resources, they will be released and made available for other workloads.
Per Bauer: 00:25:46 Just having a high utilization level doesn't necessarily mean that the application is starved or can't perform. You need to focus on the impact. We always suggest building queuing models for the systems, because then you can truly understand if the application is subject to queuing, or if it just happens to allocate as many resources as it can and then lets go of those if needed.
Per Bauer: 00:26:12 So make sure that you understand the performance impact of the behavior rather than just looking at the raw utilization metrics, because those many times don't give you the full picture.
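As a rough illustration of why a queuing view says more than raw utilization, the sketch below uses the textbook single-server M/M/1 approximation, where mean response time is R = S / (1 - U). This is a deliberately simplified stand-in and not the modeling approach of any particular tool; the 20 ms service time is a hypothetical value, and real systems with multiple CPUs need a richer model.

```python
def mm1_response_time(service_time_s, utilization):
    """Mean response time for an M/M/1 queue: R = S / (1 - U)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)

def queue_share(utilization):
    """Fraction of response time spent waiting rather than being served.

    For M/M/1, (R - S) / R simplifies to U.
    """
    return utilization

# Hypothetical 20 ms service time at three utilization levels.
for u in (0.30, 0.60, 0.90):
    r = mm1_response_time(0.020, u)
    print(f"U={u:.0%}  R={r * 1000:.1f} ms  waiting={queue_share(u):.0%}")
```

The point of the exercise is to express a utilization number as time spent queuing, which is what the user of the application actually experiences.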
Per Bauer: 00:26:26 Then the fourth one which is also very important is that you compare apples to apples. If you look at the large public cloud providers, they have a myriad of different configurations. They each have at least three generations of CPUs and other technologies available for those configurations, and they perform very differently.
Per Bauer: 00:26:51 Moving from one type, even though they at face value have the same amount of resources or similar amount of resources available, they may have completely different performance numbers underneath.
Per Bauer: 00:27:05 Using tools that allow you to benchmark or adequately determine how those different types of instances would perform and what kind of performance they would deliver is key to this.
Per Bauer: 00:27:18 The price that you pay for those does not necessarily reflect the performance differences. It's important to have a tool that can allow you to make that fair assessment of performance of an instance and by that calculating the right price performance number for it.
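A price performance comparison of the kind described here could look like the following sketch. The instance names, prices, and benchmark scores are hypothetical placeholders; in practice the scores would come from whatever benchmarking or modeling tool you trust.

```python
def price_performance(hourly_usd, benchmark_score):
    """Cost in USD per hour for one unit of benchmark throughput."""
    return hourly_usd / benchmark_score

# Hypothetical instance types: same nominal size, different CPU generations.
candidates = {
    "gen3.large": {"hourly_usd": 0.40, "benchmark_score": 100.0},
    "gen4.large": {"hourly_usd": 0.42, "benchmark_score": 130.0},
    "gen5.large": {"hourly_usd": 0.45, "benchmark_score": 160.0},
}

# Rank from best (lowest cost per unit of performance) to worst.
ranked = sorted(candidates, key=lambda n: price_performance(**candidates[n]))
best = ranked[0]

for name in ranked:
    print(name, round(price_performance(**candidates[name]), 5))
```

In this made-up data the most expensive type per hour is also the cheapest per unit of work, which is the kind of result that a comparison on list price alone would miss.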
Per Bauer: 00:27:37 Those are some of the keys to rightsizing. If you don't do those and you don't understand those and are careful when you analyze your cloud workloads, the risk when you do rightsizing or what you think is rightsizing is that you actually hurt your workloads or you hurt the performance of your workloads. You need to be careful.
Per Bauer: 00:28:00 Another very important aspect when it comes to rightsizing and to do increased efficiency is the use of reserved instances. In cloud, I'll use Amazon Web services here as an example, there's actually three different types of instances you can have. You can have on demand, you can have reserved, and you can have spot instances.
Per Bauer: 00:28:22 Spot is basically reserved instances that are not being used so you can, on the spot market, you can purchase those for a shorter time period. There's no guarantee of availability over time. It's basically a type of on demand, but very short term. They're discounted compared to an on demand type of instance.
Per Bauer: 00:28:50 With reserved instances, you basically make a commitment to use the instance for a certain amount of time. In the case of Amazon, it's one or three years typically. If you do that, you get a discount, because it makes it easier for Amazon to understand what their utilization is going to be long term and they can do better planning, et cetera. They'll reward you with a discount for doing that.
Per Bauer: 00:29:16 The type of discount is also dependent on how you pay for it. You can do no up front payment, so it becomes part of your monthly bill, but at a slightly lower rate. You can do partial payment up front. And you can do all of it up front. All up front comes with the highest discount range, partial is in between, and none means slightly less discount range.
Per Bauer: 00:29:42 Then there are also different classes of reservations. You can do standard or convertible. Standard means that you can change nothing, or very little, after you've made the reservation. Whereas the convertible could be turned into something else if you need.
Per Bauer: 00:29:58 There's a lot of different parameters in play here. It's really hard to, unless you have a very, very accurate understanding of your different workloads, it's very tricky to find the right type of reservation to use.
Per Bauer: 00:30:17 There's also affinity involved here. If you have multiple different Amazon accounts that are consolidated into one bill, which is the typical case for most larger organizations, those reservations can float across those different instances or those different accounts as well. There's no complete affinity for a specific account. Everything that ends up on your consolidated bill could be part of the same reservation scheme.
Per Bauer: 00:30:52 To make things even more complicated, there are capacity reservations. Getting a committed amount of capacity in a specific availability zone associated with a reservation is not floating. Those have an affinity to your account and do not float across your consolidated bill. It's a bit of a jungle.
Per Bauer: 00:31:14 It's hard to make any specific recommendations, but there are a few that you can make.
Per Bauer: 00:31:22 The first one is that in order to do calculations and evaluations and comparisons around this, it's better to use a payback period where you assume 100% usage until there is a price benefit, rather than trying to normalize the usage numbers across your different instances. Otherwise it becomes very hard, and it is very easy to end up comparing apples and oranges. Stick with 100% usage when you do these payback period calculations.
Per Bauer: 00:31:53 As a rule of thumb, you can say that you get relatively moderate gains for reservations if your instance is used less than 65% of the time. In those cases, it's really hard to improve anything by using reserved instances.
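A payback calculation at 100% usage, and the break-even usage level it implies, can be sketched as follows. All prices are hypothetical and chosen only to make the arithmetic easy to follow; the sketch assumes a reservation with an upfront fee plus a reduced hourly rate that is billed for every hour of the term whether or not the instance runs.

```python
HOURS_PER_MONTH = 730  # average number of hours in a month

def payback_months(on_demand_hourly, upfront, reserved_hourly):
    """Months until the upfront fee is recovered, assuming 100% usage."""
    monthly_saving = (on_demand_hourly - reserved_hourly) * HOURS_PER_MONTH
    if monthly_saving <= 0:
        raise ValueError("reservation never pays back at these prices")
    return upfront / monthly_saving

def break_even_utilization(on_demand_hourly, upfront, reserved_hourly,
                           term_months):
    """Share of the term the instance must run for the reservation to win."""
    hours = HOURS_PER_MONTH * term_months
    reserved_total = upfront + reserved_hourly * hours
    on_demand_total = on_demand_hourly * hours
    return reserved_total / on_demand_total

# Hypothetical one-year reservation with a partial upfront payment.
months = payback_months(0.10, upfront=262.8, reserved_hourly=0.03)
threshold = break_even_utilization(0.10, 262.8, 0.03, term_months=12)
print(f"payback after {months:.1f} months, break-even at {threshold:.0%} usage")
```

With these invented prices the reservation only wins if the instance runs at least 60% of the time, which is in the same region as the rule of thumb mentioned above.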
Per Bauer: 00:32:18 Then another point goes back to the previous slide with all these different complications, especially that reservations can float across the accounts on your consolidated bill. Managing reserved instances per instance is really, really difficult. It very quickly becomes unmanageable.
Per Bauer: 00:32:44 Rather than doing that, you need to group your instances by application or function or service or department or something, and then use that for where you do your reserved instances. Otherwise it becomes really, really difficult.
Per Bauer: 00:33:05 The simplest way to use reserved instances is to assume the old thinking that used to be part of when cloud was introduced. Your sustained activity, that's what runs in your reserved instances, because that's fairly easy to identify and to understand and to manage.
Per Bauer: 00:33:27 Whereas the peaks and the bursts in activity that you normally have in those initial [IDs 00:33:35] would burst into a cloud solution, that's where you would use your on-demand instances. So using a mix of reserved and on-demand instances, where reserved is supposed to take care of the sustained day-to-day activity, and on demand is pulled in to deal with peaks and bursts.
Per Bauer: 00:33:51 If you use that rule of thumb when you do it, it's slightly simpler to get it done. It's still not simple. There are a lot of different pitfalls and a lot of different things that need to be considered. That's about ongoing optimization.
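The rule of thumb, reserved capacity for the sustained load and on demand for the bursts, can be checked against a demand profile with a small brute-force search. The hourly demand series and both prices below are hypothetical, and the reserved rate is treated as an effective hourly cost that is paid for every hour regardless of use.

```python
def mixed_cost(hourly_demand, reserved_count, reserved_hourly,
               on_demand_hourly):
    """Total cost when a fixed reserved pool absorbs the base load."""
    reserved = reserved_count * reserved_hourly * len(hourly_demand)
    burst_hours = sum(max(0, d - reserved_count) for d in hourly_demand)
    return reserved + burst_hours * on_demand_hourly

def best_reserved_count(hourly_demand, reserved_hourly, on_demand_hourly):
    """Try every pool size from zero to peak and keep the cheapest."""
    options = range(max(hourly_demand) + 1)
    return min(options, key=lambda n: mixed_cost(
        hourly_demand, n, reserved_hourly, on_demand_hourly))

# Hypothetical day: 4 instances sustained, bursting to 10 for three hours.
demand = [4] * 21 + [10] * 3
n = best_reserved_count(demand, reserved_hourly=0.06, on_demand_hourly=0.10)
print(n, round(mixed_cost(demand, n, 0.06, 0.10), 2))
```

For this profile the cheapest pool size equals the sustained level of four instances, with the three-hour peak served on demand.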
Per Bauer: 00:34:08 Then over to the subject of predictability and the cost of running a solution in the cloud. The first topic we're going to discuss is how you track charges versus budget.
Per Bauer: 00:34:22 For all public cloud solutions, and I'm going to use Amazon and Azure here as examples, you can get access to ongoing information about your charges. You can get day-by-day spend. You can see, rolled up by account, how much you're using. You can also get an estimated charge for the end of the month.
Per Bauer: 00:34:53 For Amazon, there is an estimated charge metric that, up until the very last day before billing, will give you an estimated charge. It's a simple linear extrapolation, so it's not a very sophisticated tool and it doesn't allow you to take any other things into consideration like new demands, et cetera. But still, at least it's a ready to use number that you can keep track of to see how it stands against your budget.
Per Bauer: 00:35:28 The same exists for Azure. There is a usage API where you can get most of those metrics, and there's also a rate card API which gives you the full price list for everything you can run in Azure. Combining those metrics together in a report, you can easily do a simple forecast of where you will end up at the end of the month. You can at least get notified if you're running above budget.
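A simple linear month-end projection of the kind described here can be sketched as below. It assumes you already have the month-to-date spend from your provider's billing data; how that number is fetched is deliberately left out, and the function names and figures are illustrative only.

```python
import calendar
from datetime import date

def projected_month_end(spend_to_date, today):
    """Linear extrapolation of month-to-date spend to the end of the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month

def over_budget(spend_to_date, today, budget):
    """True if the projected month-end charge exceeds the budget."""
    return projected_month_end(spend_to_date, today) > budget

# Hypothetical: 1,000 USD spent by April 10 against a 2,500 USD budget.
today = date(2024, 4, 10)
print(projected_month_end(1000.0, today), over_budget(1000.0, today, 2500.0))
```

Like the built-in estimates, this ignores new demand and uneven activity within the month, so it is best used as an early warning rather than a forecast.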
Per Bauer: 00:35:56 It's also very important to use tags in the cloud. Tagging is simple. When you create an instance, it depends on the vendor or the cloud service provider, but you can attach at least 15 different tags to an instance, and those should be used. It's an excellent opportunity to bring some order to what different instances are used for. Things like business unit, service, application, owner, et cetera, should be recorded every time you allocate an instance.
Per Bauer: 00:36:30 Of course, you need to have a global structure and naming convention that everyone agrees to and everyone that allocates uses, otherwise it becomes messy very quickly. This allows you to track and report on consumption and cost by those different peers, whatever they are, and then use that for charge out.
Per Bauer: 00:36:51 I strongly advise you to use tags when you're allocating instances in the cloud. From a capacity management perspective, it makes life so much easier.
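A tagging convention only helps if it is enforced. The sketch below checks instances against a required set of tag keys. The keys in `REQUIRED_TAGS`, the instance IDs, and the tag values are all hypothetical; a real check would read the tags from the provider's inventory rather than from a hard-coded dictionary.

```python
from typing import Dict, List, Set

# Hypothetical naming convention: every instance must carry these tag keys.
REQUIRED_TAGS: Set[str] = {"business_unit", "application", "owner"}

def missing_tags(instance_tags: Dict[str, str]) -> Set[str]:
    """Return the required tag keys that are absent or empty on an instance."""
    return {key for key in REQUIRED_TAGS if not instance_tags.get(key, "").strip()}

# Hypothetical inventory: instance ID -> tags.
instances: Dict[str, Dict[str, str]] = {
    "i-001": {"business_unit": "retail", "application": "checkout", "owner": "alice"},
    "i-002": {"business_unit": "retail", "owner": ""},
}

# Map each non-compliant instance to the sorted list of tags it is missing.
noncompliant: Dict[str, List[str]] = {
    instance_id: sorted(missing_tags(tags))
    for instance_id, tags in instances.items()
    if missing_tags(tags)
}
```

Running a check like this on every allocation, or on a schedule, keeps untagged instances from accumulating unnoticed.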
Per Bauer: 00:37:06 Then of course cost allocation. This is the old idea that by having customers or tenants pay for what they use, they will be much more thoughtful about how they request and consume resources. Especially if you have a self-provisioning policy, it's a must.
Per Bauer: 00:37:28 If people don't pay for what they use, they're always going to overallocate and always going to go for the most expensive, et cetera, to avoid situations where they run out of resources. They're not mindful about how they allocate any resources at all in that environment.
Per Bauer: 00:37:46 Whether you actually do chargeback, where you get full recovery and the different departments or business units pay the fully burdened cost of what they use, or only do showback, where you just present those numbers to increase accountability, is more of a cultural decision really. Not all organizations are prepared or ready to do chargeback culturally, but at least showback is a must in cloud environments. Otherwise you'll never get your tenants or customers to be careful about how they allocate resources.
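A showback report is essentially a cost rollup by tag. The following is a minimal sketch, assuming billing line items have already been reduced to (business unit, cost) pairs; the `showback` function and the sample data are illustrative only. Cost without a business unit tag is reported separately so that it stays visible.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def showback(line_items: List[Tuple[str, float]]) -> Dict[str, float]:
    """Sum cost per business unit; untagged cost is reported under 'UNTAGGED'."""
    totals: Dict[str, float] = defaultdict(float)
    for business_unit, cost in line_items:
        totals[business_unit or "UNTAGGED"] += cost
    return dict(totals)

# Hypothetical line items: (business unit tag, cost for the period).
items = [("retail", 120.0), ("finance", 80.0), ("retail", 30.0), ("", 15.0)]
report = showback(items)
```

The same rollup serves chargeback as well; the difference is whether the totals are only presented or actually invoiced.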
Per Bauer: 00:38:29 Those are how you deal with the charge and the costs. Then as a natural progression of that, once you have control of that and you know how much your different applications are costing you and how your different workloads are performing and what the charge is, you can move over to trying to forecast capacity requirements.
Per Bauer: 00:38:52 Unless you have control over which applications are running in which instances and how those are expected to grow, et cetera, et cetera, it will be really, really tough. The charging and the costing information becomes a platform for doing any type of forecasting in the cloud.
Per Bauer: 00:39:10 Forecasting spend in the cloud is fairly simple, straightforward. You have a current period, this is how much you've spent so far. Looking back at previous periods, you can do a simple line extrapolation or linear extrapolation or trend into the future and find out what is the expected organic growth, are we going to increase our spend based on historical trends.
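The linear trend over previous periods can be computed with an ordinary least-squares fit. This is a sketch under simple assumptions: periods are equally spaced, and the `fit_line` and `forecast` names and the spend figures are made up for the example.

```python
from typing import List, Tuple

def fit_line(values: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = intercept + slope * x for x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def forecast(values: List[float], periods_ahead: int) -> List[float]:
    """Extend the fitted trend line periods_ahead periods past the history."""
    intercept, slope = fit_line(values)
    n = len(values)
    return [intercept + slope * (n + k) for k in range(periods_ahead)]

# Hypothetical spend for the last four periods.
monthly_spend = [100.0, 110.0, 120.0, 130.0]
next_two = forecast(monthly_spend, 2)
```

This captures organic growth only. Migrations and new initiatives are not visible in the history and have to be added on top.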
Per Bauer: 00:39:44 But then there's also the migration activities going on. Right now I think most organizations are still in the mode where they are migrating a lot of workloads. There is a constant influx of new applications that are going to end up in the cloud, and that is going to increase the spend in the cloud.
Per Bauer: 00:40:03 Then of course there's what we've always planned for in capacity management. New initiatives, whether it's new applications, or whether it's new business models, or new marketing activities, or new acquisitions, whatever that is, anything from a business perspective that happens that will increase the demand for our services also needs to be baked into this.
Per Bauer: 00:40:30 There's three elements that we need to identify and understand in order to do a meaningful prediction of future cloud spend. Let's see what those three are and how we treat them.
Per Bauer: 00:40:45 The first one is this expected growth. That seems fairly straightforward. In the previous slide, there was a simple linear extrapolation of that information, and then we could determine how much we would grow based on that. But the reality is a bit more complex.
Per Bauer: 00:41:02 This is an example, a snapshot of different types of services and components running in a public cloud environment: how many instances there are, how each of those instances is being used at a certain time, and how many containers have been spawned in order to deal with this.
Per Bauer: 00:41:22 Typically a snapshot tells you that, given a certain number of business transactions, this is how our application behaves, or this is how it has provisioned resources. This is information that we can get from a public cloud provider through their standard monitoring solution. It could also be a third party monitoring solution that keeps track of this.
Per Bauer: 00:41:45 If it is information from the cloud provider, this data is typically only there for a very short time. It's not saved for historical purposes. As soon as you turn off an instance, everything that had to do with that instance in terms of utilization and behavior is wiped and gone. There is no memory, there is no recollection of things from the past.
Per Bauer: 00:42:14 You need a tool that can take those snapshots. You don't need to do it in real time. You don't need to take snapshots every minute. But at least during specific business events or on a daily basis or an hourly basis even, you need to take these snapshots in order to understand how different behaviors in terms of business transactions, what that meant for your application. Otherwise, you're never going to understand the dynamics and how it grows based on different circumstances. This is really important information.
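One way to picture such snapshots is as small records that pair business volume with the resources provisioned at that moment. The `Snapshot` fields and the sample values below are assumptions for illustration; a real tool would populate them from the provider's monitoring API and persist them outside the instances being observed.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass(frozen=True)
class Snapshot:
    taken_at: datetime
    business_transactions: int  # e.g. orders per hour at snapshot time
    instances: int              # instances running at snapshot time
    containers: int             # containers spawned at snapshot time

def transactions_per_instance(history: List[Snapshot]) -> float:
    """Business transactions served per running instance, across all snapshots."""
    total_tx = sum(s.business_transactions for s in history)
    total_instances = sum(s.instances for s in history)
    if total_instances == 0:
        raise ValueError("no instances recorded in the snapshot history")
    return total_tx / total_instances

# Hypothetical snapshots taken at a quiet and a busy hour.
history = [
    Snapshot(datetime(2018, 6, 1, 9), 10_000, 4, 16),
    Snapshot(datetime(2018, 6, 1, 15), 25_000, 10, 40),
]
ratio = transactions_per_instance(history)
```

Kept over time, a ratio like this links business volume to provisioned capacity, which is the relationship that disappears when only the provider's short-lived data is available.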
Per Bauer: 00:42:47 Another important factor is the migrations that come in. We saw that. Most organizations in 2018 are in a state where they keep moving things over to public clouds. There will be an influx of new workloads all the time.
Per Bauer: 00:43:04 For the lift and shift, it's important to determine whether there is a rightsizing before the migration. That has a huge impact on how much you will grow. To what extent are the things that are coming in being rightsized before they move in?
Per Bauer: 00:43:22 It's important to look at all metrics that make sense. I've seen a lot of companies focusing primarily on CPU, because that is typically the one parameter that describes the size of an instance. But in many cases, storage is just as important, especially in AWS where S3 storage is a separate entity.
Per Bauer: 00:43:49 Just allocating an instance with CPU and memory resources is not enough for a lot of workloads. You need to look at both the EC2 and the S3 aspect of it and make sure that you get both of those into your calculations. That may be an obvious thing for most of you, but it is forgotten more often than you would expect when that analysis is done.
Per Bauer: 00:44:15 Then of course when you lift and shift things into this environment, you also need to understand, how are we going to deal with reservations? What will be the immediate growth? Is there any immediate growth expected that we should account for already when we make the migrations so that we don't have to reconfigure or change instance type after a couple of weeks? If there is an imminent or short-term growth that needs to be accounted for, it's probably better to add some extra capacity to the instances that we're allocating right away.
Per Bauer: 00:44:52 Understanding all of those things has a major impact on the influx of migrations and how they will impact you.
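Rightsizing with headroom for imminent growth can be sketched as picking the cheapest instance type that still fits the observed peak plus a growth allowance. The catalog below is entirely hypothetical (names, sizes, and prices are invented), and only CPU and memory are considered; storage would need the same treatment.

```python
from typing import List, NamedTuple, Optional

class InstanceType(NamedTuple):
    name: str
    vcpus: int
    memory_gib: float
    hourly_price: float

# Hypothetical catalog; real sizes and prices come from the provider's price list.
CATALOG: List[InstanceType] = [
    InstanceType("small", 2, 4.0, 0.05),
    InstanceType("medium", 4, 8.0, 0.10),
    InstanceType("large", 8, 16.0, 0.20),
]

def rightsize(peak_vcpus: float, peak_memory_gib: float,
              growth: float = 0.0,
              catalog: List[InstanceType] = CATALOG) -> Optional[InstanceType]:
    """Cheapest type that fits the observed peak plus short-term growth headroom.

    Returns None when nothing in the catalog is large enough.
    """
    need_cpu = peak_vcpus * (1 + growth)
    need_mem = peak_memory_gib * (1 + growth)
    fits = [t for t in catalog if t.vcpus >= need_cpu and t.memory_gib >= need_mem]
    return min(fits, key=lambda t: t.hourly_price, default=None)

# Hypothetical workload: peaks at 3 vCPUs and 6 GiB before migration.
as_is = rightsize(3.0, 6.0)
with_growth = rightsize(3.0, 6.0, growth=0.5)  # 50% growth expected shortly
```

In this made-up case the growth allowance moves the workload up one size, which avoids changing instance type a few weeks after the migration.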
Per Bauer: 00:44:59 From a refactoring perspective, it's important to define your capacity units. What do we mean by capacity units? Basically, for each type of software component, or microservice if you've done the refactoring, you need to understand what the ideal type of instance, or the ideal size of container, is for each instance of that software component.
Per Bauer: 00:45:25 That means understanding which instance available from the cloud provider gives the best price performance. Should you group several of those microservices together? Or should you buy the smallest possible instances and have them run individually? What is the amount of resources needed to execute normal activity versus what is the seasonality and variability in demand for that to avoid having to allocate and deallocate instances all the time? You may find the perfect breaking point between size of an instance and coping with peaks and troughs in utilization.
Per Bauer: 00:46:08 Understanding the capacity unit of a cloud native application and the different components of that is also very important when it comes to understanding the effects of migration activities.
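The trade-off between many small instances and a few large ones can be illustrated by costing a fleet for a given peak. This is a rough sketch: the `OPTIONS` table, its capacities, and its prices are invented, and it ignores everything except hourly price at peak.

```python
import math
from typing import Dict, Tuple

# Hypothetical options: name -> (capacity units one instance can host, hourly price).
OPTIONS: Dict[str, Tuple[int, float]] = {
    "small": (2, 0.05),
    "large": (8, 0.18),
}

def cheapest_fleet(peak_units: int,
                   options: Dict[str, Tuple[int, float]] = OPTIONS
                   ) -> Tuple[str, int, float]:
    """Return (instance name, instance count, hourly cost) for the cheapest
    single-type fleet able to host peak_units capacity units."""
    best = None
    for name, (units_per_instance, price) in options.items():
        count = math.ceil(peak_units / units_per_instance)  # round up to whole instances
        cost = count * price
        if best is None or cost < best[2]:
            best = (name, count, cost)
    if best is None:
        raise ValueError("no instance options given")
    return best

at_10_units = cheapest_fleet(10)
at_16_units = cheapest_fleet(16)
```

With these invented prices the small type wins at a peak of 10 units and the large type wins at 16, which is the kind of breaking point referred to above.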
Per Bauer: 00:46:25 Then the last one, which is the planning for future demand, which is stuff we've done over the years in capacity management. In the cloud, it's slightly different, but not completely.
Per Bauer: 00:46:39 We recommend a planning period of three to six months. Anything beyond six months is typically hard to rely on, because there's so many other parameters and so many other things that may change. Planning beyond six months is really difficult many times.
Per Bauer: 00:47:01 You need to look at things like historical trends of course, seasonality and behavior over a longer time. You need to look at replatforming plans. You need to look at new projects going on. PMO sources like Jira and Rally, et cetera, would be a good source for that type of information. Then looking at business activities that are going on and trying to weave those in.
Per Bauer: 00:47:28 Creating those rolling forecasts that go three to six months into the future, so one or two quarters into the future, where you focus on the biggest drivers of variance, so the biggest projects, the biggest initiatives, et cetera.
Per Bauer: 00:47:42 Trying to automate as many of the repetitive tasks as possible, so integrating with your PMO data sources automatically and pulling that data in without any manual work for each forecast that you produce would make sense. That would allow you to focus more on analyzing the input, and the second part of that sentence, socialized resources is really important.
Per Bauer: 00:48:12 In order to get the right forecast, the right demand forecast from the business units, you need to work together with them. You need to socialize them, you need to become a trusted advisor, trusted partner to them to get them to understand the importance of those forecasts and why they need to be accurate, et cetera.
Per Bauer: 00:48:34 Then of course you need to as always do regular reviews with your stakeholders to make sure.
Per Bauer: 00:48:40 Building this demand calendar, that is an ever ongoing rolling forecast one or two quarters into the future, should go straight into your cloud cost prediction.
Per Bauer: 00:48:56 When you work with that forecast, it's important to always try to continuously improve it. How do you do that? First of all, you need to record the forecast and always compare that to the actual outcome to understand how good you are at doing this and if there is a trend, negative or positive trend, if you're getting better at your job, or if you're actually getting worse.
Per Bauer: 00:49:23 You need to analyze those different growth factors individually to try to understand where the differences come from. If it's from organic growth and migration, your statistical models may not be accurate. You may not have all the data you need in order to do your job, so you need to focus on those things.
Per Bauer: 00:49:42 Whereas if it's around the new initiatives, it's probably more around the quality of the information that you're receiving from the business, so what is reality versus aspirations? People may be very positive and very hopeful about a new initiative, but it's not necessarily very realistic. Make sure that you try to compensate: if there are systemic errors in the forecasts that you get, discount for those every time going forward.
Per Bauer: 00:50:17 This sort of Deming wheel of constant improvements, plan, do, check, act mantra is as important here as everywhere else.
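Comparing recorded forecasts with actual outcomes can be reduced to two numbers: the average size of the error and its direction. The sketch below uses mean absolute percentage error and mean signed percentage error, which are common choices but not ones named in the talk. The function names and figures are assumptions for the example.

```python
from typing import List, Tuple

def accuracy(forecasts: List[float], actuals: List[float]) -> Tuple[float, float]:
    """Return (mape, bias) as fractions of the actual values.

    mape: mean absolute percentage error, i.e. how far off forecasts are on average.
    bias: mean signed percentage error; positive means systematic over-forecasting.
    """
    if not forecasts or len(forecasts) != len(actuals):
        raise ValueError("need equally long, non-empty forecast and actual series")
    pct_errors = [(f - a) / a for f, a in zip(forecasts, actuals)]
    mape = sum(abs(e) for e in pct_errors) / len(pct_errors)
    bias = sum(pct_errors) / len(pct_errors)
    return mape, bias

def discount(new_forecast: float, bias: float) -> float:
    """Correct a new forecast for a known systematic bias."""
    return new_forecast / (1 + bias)

# Hypothetical history: three forecasts against three actual outcomes.
mape, bias = accuracy([110.0, 120.0, 90.0], [100.0, 100.0, 100.0])
```

Tracking these two values per growth factor, period after period, shows whether forecasting is improving and where a standing discount is justified.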
Per Bauer: 00:50:29 As I said, you need to adjust your forecast with probability scores based on the outcome, but also based on the time factor. Because the further out those predictions go, the more unreliable they are.
Per Bauer: 00:50:41 If you forecast something that will happen next month versus something that'll happen two quarters out, you're probably in a better position to tell what's going to happen in a month than towards Christmas. Important to understand that and always have those declining probability scores over time. So working on this part.
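Declining probability scores can be modeled in many ways. The sketch below assumes a simple linear decay with a floor, which is one possible scheme and not one prescribed in the talk. The decay rate, the floor, and the entry format are all hypothetical, and `months_out` is assumed to be zero or positive.

```python
from typing import List, Tuple

def time_weight(months_out: int, decay_per_month: float = 0.1,
                floor: float = 0.2) -> float:
    """Probability score that declines linearly with distance into the future."""
    return max(floor, 1.0 - decay_per_month * months_out)

def weighted_demand(entries: List[Tuple[int, float, float]]) -> float:
    """Sum of forecast costs, each scaled by its own confidence and by time.

    Each entry is (months_out, monthly_cost, confidence from the business).
    """
    return sum(cost * confidence * time_weight(months_out)
               for months_out, cost, confidence in entries)

# Hypothetical demand: one firm item next month, one uncertain item in six months.
expected = weighted_demand([(1, 1000.0, 1.0), (6, 1000.0, 0.5)])
```

The decay rate itself is something to calibrate from past forecast-versus-actual comparisons rather than to guess once and keep.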
Per Bauer: 00:51:08 Once you have that, you have the usage profiles, you record data about trends and seasonality patterns, you understand how your cloud service is behaving under different business circumstances.
Per Bauer: 00:51:22 You're in control of the rates and the charges, you have made a bet for the best policy when it comes to on demand versus reserved instances, et cetera, et cetera, you have the best types of instances employed to run your workloads because you've understood the true performance of them, et cetera, et cetera.
Per Bauer: 00:51:43 Then you have this demand calendar which allows you to do some level of qualified forecast into next quarter and the quarter after.
Per Bauer: 00:51:52 Once you have that, it's fairly easy to combine those. They're not trivial in themselves, any of them, but once you have those, you can make very good forecasts about future costs and weave that into your reporting and produce those types of reports.
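Combining the three growth factors into one cost forecast might look like the sketch below. It assumes compound organic growth on the current baseline, migrations that add a fixed monthly cost from their start month, and demand calendar entries weighted by probability. The function name, parameters, and figures are illustrative, not a description of any particular tool.

```python
from typing import Dict, List, Tuple

def forecast_monthly_cost(baseline: float,
                          organic_rate: float,
                          migrations: Dict[int, float],
                          initiatives: List[Tuple[int, float, float]],
                          months: int) -> List[float]:
    """Monthly cost forecast built from three separately modeled growth factors.

    baseline:     current monthly spend
    organic_rate: expected organic growth per month (0.02 means 2%)
    migrations:   start month -> added monthly cost once the workload lands
    initiatives:  (start month, added monthly cost, probability) entries
                  taken from the demand calendar
    """
    result = []
    for month in range(1, months + 1):
        organic = baseline * (1 + organic_rate) ** month
        migrated = sum(cost for start, cost in migrations.items() if start <= month)
        planned = sum(cost * prob for start, cost, prob in initiatives if start <= month)
        result.append(organic + migrated + planned)
    return result

# Hypothetical inputs: flat baseline, one migration in month 2,
# one initiative in month 3 that is judged 50% likely.
costs = forecast_monthly_cost(1000.0, 0.0, {2: 200.0}, [(3, 500.0, 0.5)], months=3)
```

Keeping the three terms separate in the calculation makes it possible to compare each one with its actual outcome later and see which factor drives the forecast error.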
Per Bauer: 00:52:19 To summarize, capacity management in the cloud as we've talked about, there's two things you really need to focus on. The optimization aspect, so making sure that you're running as lean as possible, you rightsize your migrations, you continuously analyze your environment and make sure that you're running it in the best possible way.
Per Bauer: 00:52:40 Then that you tag and charge out your costs. It's important to make sure that the consumers of resources are actually paying for them and are actually being aware of what they cost at least.
Per Bauer: 00:52:54 Then for the predictability aspect of it, you need to model the different growth factors individually. You need to treat them individually. You need to understand them individually. You need to have this demand calendar if you don't have it already. That's not unique to cloud capacity management. That's something we've always aspired to build and create as part of our job.
Per Bauer: 00:53:19 Then you put that to work together with those different growth factors. Then you always look to continuously improve how you do things. So reviewing your outcome versus what you had forecasted and trying to find systemic differences and trends, either improving or things getting worse. Certainly in the case of getting worse, you need to address those and find the root cause of that.
Per Bauer: 00:53:47 This sort of summarizes what you need to do to do successful capacity management in the cloud.
Per Bauer: 00:53:53 With 5, 10 minutes to go, I'll open up the floor for questions.
Amanda Hendley: 00:54:02 I have a couple of questions that were submitted online. The first one is, does the capacity management strategy, in particular from a cost perspective, change from a private to a hybrid to a public cloud?
Per Bauer: 00:54:18 Yeah, absolutely. The part that we describe here is mostly around public cloud, because the only thing that matters is the cost. Whereas in private cloud, you need to treat it quite differently.
Per Bauer: 00:54:37 Private cloud you need to make sure that the cloud service is operating and it is scalable and you can respond to the requirements of your customers or tenants with that private cloud. So making sure that your cloud service is fit for purpose to put it simply.
Per Bauer: 00:54:55 You also need to make sure that you sweat your assets. The cost of that cloud service or the efficiency of that cloud service relies on how well you're planning and how you're sourcing your cloud service basically, your private cloud service. You need to make sure that you have the right utilization levels for the resources that you have in your data center, not running out of them, but not over provisioning either.
Per Bauer: 00:55:26 It has elements of traditional capacity planning as well as planning for a cloud service where you have a different type of abstraction.
Amanda Hendley: 00:55:40 Great. Thank you. Which tools can you talk about or maybe recommend that can help you keep track of those costs?
Per Bauer: 00:55:54 Okay. It's very tempting to bring up TeamQuest here and HelpSystems. We provide this. There's probably other tools out there.
Per Bauer: 00:56:04 The key aspect of this is really to have a tool that can integrate with... because some of the information will have to come from the cloud provider, because they are the ones who have the full picture.
Per Bauer: 00:56:18 You can still deploy extremely lean and small footprint agents in those instances and get very detailed information about how separate instances are being used. But in order to get the full picture, you need a combination.
Per Bauer: 00:56:33 You need a tool that can pull data on an instance level, but you also need a tool that can pull data from a macro level, so integrate with CloudWatch or Azure Monitor or a rate card API or whatever that is and pull that data out. Then that tool needs to have a conceptual model of how those different components fit together and what the formulas for doing the forecast are.
Per Bauer: 00:57:00 A tool like TeamQuest or Vityl Capacity Management in HelpSystems provides you with that. Out of the box, we can do all of those things. But you could probably build it in a spreadsheet if that's your desire as well.
Amanda Hendley: 00:57:13 Thank you. When it comes to predicting your future cloud costs, how high should you set the bar in terms of accuracy and getting that cost predicted?
Per Bauer: 00:57:30 I think it's important to have realistic expectations or set realistic expectations with the rest of the organization or your managers.
Per Bauer: 00:57:38 The complexity, both in terms of accurately forecasting demand, which has always been a challenge in all types of environments, but also predicting the impacts of all those different pricing policies that we talked about here from a service provider, combining those two, it's really hard to get a full predictability down to nickels and dimes. You'll never get that.
Per Bauer: 00:58:06 It's about making predictions that are sort of in the same range as the bill that you will eventually get. Beyond the next bill, you need to make sure that you're on track to stay within budget constraints from quarterly or annual perspective. That's really the ambition.
Per Bauer: 00:58:20 It's not about telling specifically how much the bill is going to be, but making sure that you're within the boundaries that you agreed are reasonable, or the budget that you've committed to.
Amanda Hendley: 00:58:37 Great. I think we've got time for one more. Are there any tools that you can recommend in regards to running benchmarks on cloud VMs as you compare cloud providers?
Per Bauer: 00:58:50 Yeah. Interestingly enough, again, Vityl Capacity Management comes to mind.
Per Bauer: 00:58:55 All those cloud instances if you look at the specification from the provider, they are equipped with a certain type of CPU from Intel I think, there may be GPUs as well that are being used for some instances. But typically, CPUs from Intel, storage devices of certain performance, et cetera, et cetera.
Per Bauer: 00:59:20 If you put those into a modeling tool like our capacity plans, Vityl Capacity plans, you can easily do a benchmark. It will normalize those numbers and bring it down to what you'll actually get out of it.
Per Bauer: 00:59:35 Then there's always the multi-tenant aspect of cloud that may come into play, but that's across the board. You don't really have to account for that when it comes to specific instance types.
Per Bauer: 00:59:49 Doing an objective comparison between different instance types, I think a modeling tool like Vityl Capacity Management capacity plans is a good tool for doing that.
Amanda Hendley: 01:00:05 Great. Thank you. Per, we are out of time for today, but I do want to thank you for putting on this webinar for us. It was certainly very interesting and informative.
Amanda Hendley: 01:00:17 For everyone on today's session, we have this recorded, and you'll get a copy of it. If you have any other questions, you can email them over to us, and we'll be happy to answer those on the blog, along with the ones we got today and couldn't get to.
Amanda Hendley: 01:00:34 Per, thank you so much.
Per Bauer: 01:00:36 Thank you very much.