On-Demand Webinar

How to Do Capacity Management in the Cloud

Solaris, Windows, UNIX, Linux, AIX
Recorded:
April 19, 2018


You’ve been managing IT capacity on-premises for years. But as your infrastructure migrates to the cloud, how will that change your approach?

When it comes to the cloud, many organizations make the mistake of overprovisioning. While it’s nice not to worry about having too few resources, it comes at a cost.

That’s why your cloud infrastructure needs capacity analytics: to ensure efficient utilization (without overprovisioning) and achieve return on investment (ROI).

So find out how to apply capacity analytics to right-size your cloud infrastructure.

You'll get the answers to questions like:

  • What are the differences between on-premises and cloud?
  • What are the new challenges with capacity analytics in the cloud?
  • What does it take to successfully manage capacity in the cloud?

Amanda Hendley:            00:00:07               Hello everyone, this is Amanda Hendley, managing director of Computer Measurement Group. We're very happy you could join us this morning. If you're new to CMG, our mission is to provide a forum for sharing, learning, and networking among professionals who are charged with IT planning and support for their organizations.

                                                                                I want to remind you that you can ask questions throughout the presentation in the question section of our control bar. We will break midway through to do some Q&A, and then at the end we'll have a longer Q&A session. Afterwards, this recording will be made available to you and we will also follow up with all of the questions asked after the presentation if we didn't have time to get to them in the presentation today.

                                                                                Now I am very pleased to introduce you to our presenter, Per Bauer. He is the director of global services for TeamQuest and has global responsibility for the strategic services delivered to TeamQuest customers. He's also the originator of the company's capacity management maturity model. He frequently presents internationally on the topic and has deep practical experience, as well as an understanding of the business case for capacity management.

                                                                                With that, thank you so much for joining us Per.

Per Bauer:                           00:01:29               Thank you for those kind words. Welcome everyone. This is how I look if you don't know me from before. Today's presentation will be about capacity management in the cloud. These are the different subsections or topics that we will cover. I'll go through each of them fairly quickly and hopefully leave plenty of room for Q&A at the end.

                                                                                Basically we'll build it from the ground up discussing some of the fundamental concepts and some of the misconceptions about capacity in the cloud, then we'll talk more in detail about how to actually optimize private and public clouds.

                                                                                It will be a tool- and technology-agnostic session where we talk about concepts rather than specific technologies and how to use those. If you need more details, or if you want to have an in-depth discussion about those kinds of matters, you can always reach out to us at HelpSystems to discuss these things, of course.

                                                                                Let's start from the bottom, setting the foundation. When you talk about cloud, even though it's a fairly mature concept by now, there's still a lot of confusion around different terminologies.

                                                                                We basically have three different entities that we're working with in capacity management. The first one is traditional IT. That's the stuff we've had for a long time in the data center, ranging from workloads running on legacy platforms all the way up to highly virtualized resources from more recent years. Then we have the private cloud, which is basically the same thing but wrapped into a cloud service. So, it's on-prem infrastructure, or potentially outsourced infrastructure, but it's still infrastructure that you control, wrapped in a cloud metaphor and operated as a cloud. You have a solution like OpenStack or Microsoft Azure Stack or Oracle Cloud Machine, et cetera, et cetera.

                                                                                Then we have the public cloud, which is what's normally associated with the term cloud. Things like AWS, Azure, Google Cloud, et cetera, et cetera. So, hybrid cloud, that's the first definition that we need to keep straight. Hybrid cloud is really those two types of clouds. So, it's public and private cloud. It doesn't include the traditional IT. If you want to include the traditional IT, we should rather talk about hybrid IT, because cloud and traditional IT are so different.

                                                                                A hybrid solution, where you move things from traditional IT over to the cloud, is typically a one-way operation. You don't really move things back and forth between the two. So, often you hear the term hybrid cloud where people actually mean hybrid IT, and I think it's important to keep that distinction when we talk about this. Managing hybrid IT requires you to have a solution that can take care of your traditional legacy IT, but also your cloud solution. It makes the management a bit more complex, but it's a necessity to be efficient when you do this.

                                                                                Inside the cloud there are multiple different categories of solutions. These are very well established as well. So, you have infrastructure, platform, and software as a service. When we talk about capacity management and performance management, we normally discuss IaaS and PaaS. For software as a service, that's included in the service. It's the vendor's responsibility to make sure that the performance is adequate and that the capacity is adequate, et cetera. So, what we will discuss here is IaaS and platform as a service, focusing mostly on infrastructure as a service, because that's where you get closest to what capacity and performance management is normally about.

                                                                                Another important distinction is to agree on what we're supposed to run in the cloud. Even though it's not the case every time, the cloud is built, or the whole idea about the public cloud is built, around cloud native applications. So, stateless applications that run in a scale-out environment where you grow them or retract them by adding nodes to the infrastructure. Not scaling up, but scaling out. You build microservices, small components that talk to each other through APIs or other interfaces, and those microservices can then be burst to cope with increasing load.

                                                                                So, it's a completely different way of scaling an application or scaling a workload than we're normally used to. Cloud native applications are typically also designed to be highly automated. So, they have integrations with management frameworks like Kubernetes, or Mesos, or [inaudible 00:06:44], or any kind of automation solution that can operate and manage them. If we then move over to the capacity management side of things: the driver behind why you should do capacity management in the first place was organizations that have business critical services. Nowadays that's true for pretty much every organization, with the digitalization of companies. Every company has some business critical services, services that can't be jeopardized, where the quality of the service needs to be very consistent.

                                                                                You can have high variations of seasonality and peaks in your business activity, which forces you to manage the capacity. You need to provision for the peaks, but you also need to make good use of that capacity when there's ... in the troughs, or in the valleys, or in between those peaks. Planning for and managing seasonality and peaks in the most efficient way. Another driving factor behind capacity management is business growth. If you're in an organization that is growing very quickly through acquisitions, or that lands new customers at a regular, high rate, you need to plan for that growth. And you need to be able to onboard new customers or assimilate new acquisitions into your organization in a similar way, and that takes a lot of planning and a lot of understanding of the dependencies between your infrastructure and business demand, et cetera, et cetera.

                                                                                It could be regulatory requirements. Things like financial services, where the authorities require you to have a contingency plan. And a big part of that contingency plan is to be on top of your current capacity positions, how much headroom you have to grow, and what you need to do if there is a sudden spike in demand for services, et cetera, et cetera. Those things require you to have a capacity management discipline in place. It could be provisioning lead times. In the past, a lot of organizations were plagued by very long lead times for provisioning new infrastructure. That made it impossible for them to have a reactive approach to this, because it took them a couple of months to stand up new infrastructure or new resources if there was a demand. And that required them to plan ahead and make sure that they incorporated those lead times into the plans.

                                                                                It could be cost optimization. So, doing more with less, or reusing existing infrastructure. Past investments, making sure that they actually are being used, fully used. And if not, reclaiming them and using them for something different. It could also be around agility. All the other ones above basically boil down to agility. So, being able to cope with changes in a much quicker way, and having a good understanding of where you are from a capacity perspective, allows you to exploit new opportunities in a better way.

                                                                                All these things, the way we've described them here, basically also relate to the cloud, if you're in the cloud. The one that could potentially go away is provisioning lead times. Especially if you're using public cloud services, the provisioning lead times are much shorter, and it's probably not a concern, not something you need to plan for. But all the other ones are basically still true. If you're still growing very quickly, or if you have very seasonal workloads, et cetera, et cetera, you still need to plan. Moving over to the cloud doesn't deal with that and won't take care of that for you. You still need to do your planning the way we did it before.

                                                                                And if you look at almost any market study, there are plenty of those. This is the State of the Cloud report from RightScale for 2017. If you look at the top two here, they asked thousands of executives about what their cloud initiatives for this year would be about. The first one is about optimizing existing cloud use. Cost savings in the cloud takes planning, it takes capacity and performance management, understanding how the infrastructure is being used and whether there are savings to be made. The second one is moving more workloads to the cloud, and that also very much points to having a capacity management discipline for the cloud. Being able to translate the requirements of legacy on-prem workloads to the cloud and understand what they need to operate in the cloud, and how they should be architected to work in the cloud.

                                                                                A lot of those things that are high on the list of prioritized cloud initiatives actually point to capacity management. So, there is a recognition, I think, in the industry that capacity management, if anything, becomes even more important when you move over to the cloud. In connection to that, some of the misconceptions about capacity in the cloud. Some of these are based on an immature view of what the cloud can actually offer you, and some of them are based on other kinds of misconceptions, but we'll go through a couple of those and discuss why they happened and why they're not really true.

                                                                                The first one is this: capacity is cheap in the cloud. Yeah, it can be. And certainly if you optimize your workloads, it probably is. But it depends on a large number of different parameters. To get cheap capacity in the cloud, you need to put some effort into optimizing the usage. You need to know the workload profile, for example. So, if you have a low sustained level of activity and sudden peaks, and you allocate resources that cover that peak and leave it like that, it's certainly not going to be any cheaper. You need to understand what type of instance you should use. For workloads with a fairly high sustained level of activity, you should probably go for reserved instances, whereas in the case of highly flexible or peaking workloads, it's probably an on-demand solution that is best for you. And finding the tipping point for that is a big part of the capacity management effort.

                                                                                You also need to understand what kind of data volumes are being exchanged between the different instances, and potentially between on-prem and cloud instances, because that's also going to have a major impact on the cost for this. So, capacity is not cheap by default or by itself. You need to have a good grip on this and understand the break-even point between on-demand and reserved instances, et cetera, et cetera, to actually make it cheap.
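
[Editor's note: a minimal sketch in Python of the break-even calculation Per describes between on-demand and reserved pricing. The hourly rates are hypothetical examples, not actual AWS prices, which vary by region and term.]

```python
# Break-even between on-demand and reserved pricing: a minimal sketch.
# Rates are hypothetical examples, not actual AWS prices.

HOURS_PER_MONTH = 730

on_demand_rate = 0.10   # $/hour, billed only while the instance runs
reserved_rate = 0.06    # $/hour effective, billed for every hour of the term

def monthly_cost(rate_per_hour: float, hours_used: float) -> float:
    """Cost for one month at a given hourly rate and usage."""
    return rate_per_hour * hours_used

# A reserved instance is billed around the clock, so its monthly cost is fixed.
reserved_monthly = monthly_cost(reserved_rate, HOURS_PER_MONTH)

# Find the utilization level where on-demand starts costing more.
break_even_hours = reserved_monthly / on_demand_rate
print(f"Break-even at {break_even_hours:.0f} hours/month "
      f"({100 * break_even_hours / HOURS_PER_MONTH:.0f}% utilization)")
# Above that utilization, reserve; below it, stay on demand.
```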

                                                                                Another one is that capacity can be added instantaneously, and that's certainly true. We talked about this a moment ago: the problem with long provisioning lead times is probably gone. But as a matter of fact, in order to make use of that quick provisioning of instances, first of all, you need to design your applications so that they can scale out. If they can't, if you hit some sort of soft limit in your application as to how many instances you can scale across, it doesn't matter how quickly you can add them. Another one is data center location affinity.

                                                                                So, if you all of a sudden have to provision instances in another data center, in another zone, because you're running out of instances in the zone that you are operating in, that may have a major impact on performance and latency for the workloads that need to communicate across boundaries in the cloud that you're using. Another one is cost. So, yeah, it can probably be provisioned, but can it be justified? Cost justified? Have you budgeted for it? Have you predicted this growth, and is there a budget in place for this? Because capacity may be infinite, but so is the cost. So, if you just scale up or scale out in accordance with demand, you'd better make sure that you have a prediction that allows you to plan for this and budget for this.

                                                                                Capacity can be added instantaneously, but it needs to be understood before you do it. Another one that was very common in the beginning, when people started to adopt cloud technology, is that you can burst into the cloud during peak demand. So, moving workloads to accommodate changing demand. It's a very compelling idea, and a very compelling scenario, but it's actually quite hard to do. There is a lack of compatibility both between private clouds and public clouds, and between different public cloud solutions. So, making sure that your workloads are designed in a way so that they can move across different environments is going to be quite expensive and quite ... it's actually going to prevent you from doing a lot of things.

                                                                                So, having to customize your workloads and test them and validate them and nurse that solution many times becomes more expensive than actually using a bit more resources than the workload actually needs, or provisioning for peak demand in your on-prem environment, et cetera. So, if you can't use some of the features of a cloud platform, just to make sure that a workload stays compatible with another cloud provider, that may actually eat up all the benefits of being able to move.

                                                                                So, it's a very compelling concept, but as far as I know, there are very few successful real-life examples. There may be a few, but it's not a sort of general [inaudible 00:17:13], so this cloud bursting thing is more in theory than in practice, actually. Another common view in the beginning of this: the planning can be delegated to the provider. But that is a bit like asking the fox to watch the hen house. The provider will not be incentivized to bring down costs or drive down costs. Also, any kind of forecasting about demand and business demand requires an understanding of the business, and you won't get that from your public cloud provider. So, long-term planning needs to be done by people who understand your specific circumstances and your specific business requirements, and only you can do that.

                                                                                Also, to make matters worse, if you have multiple providers of cloud services, no one is going to be able to plan and optimize across those different providers if you leave it to the providers. So, that's not really a solution. And then the last myth that we're going to debunk: that everything that runs in the cloud is cloud native, that it's been refactored prior to the migration. So, the textbook example is that you refactor your application to be cloud native, you test and verify it, and then you deploy it in the cloud, and then you operate it there. That's the best way of doing it in theory, but in practice, it's really hard to do this. It's very hard to mimic your target platform during [inaudible 00:19:01] test. So, the reason you're moving over to the cloud is to get to that. You get a completely new architecture there that you need to operate in, and you typically don't have that in your own environment.

                                                                                It's also a bit tricky, because if you do the refactoring prior to the migration, you'll be introducing multiple different changes at once. You have this refactor application that you're supposed to host, and you move it to a completely new environment that you don't really know as well as your old one. So, it's never a good idea to introduce multiple changes at once. So, rather than doing that, a lot of companies are actually doing a lift and shift kind of operation where they lift and shift the workload, operate it as is in the cloud, and then gradually refactor test and verify to get to some sort of cloud native architecture, and then you continuously try to maintain that one, and continuously deploy that application as you make updates.

                                                                                This is the reality. It's going to be quite expensive. It's going to incur a lot of cost in the beginning, because those legacy workloads that are perhaps monolithic and stateful will not fit very well into the typical scenario for a public cloud provider. You probably need to run reserved instances that are much larger than ... The sweet spot for moving into the cloud is using whole instances with a lower amount of resources. You probably have to allocate fairly large instances and run on those in the beginning. So, it will be quite expensive in the beginning, and then you can work back from that by gradually refactoring.

                                                                                So, this is the reality for a lot of organizations. And a lot of the workloads that are being moved have not really been fully refactored before they are moved. So, those are a couple of the misconceptions about capacity in the cloud and what it means. So, how do you manage capacity in the cloud? A couple of general remarks there. The first one is that it has a major impact on ITIL. ITIL has been this sort of natural habitat for capacity management. Both in terms of what the responsibilities of capacity management are, but also, perhaps more importantly, describing the adjacent disciplines and how capacity management can rely on and benefit from interacting with them. Things like configuration management, for example, where we get a lot of our definitions of components and services. Financial management, perhaps. Service level management, around what the performance objectives or performance requirements are for the applications, et cetera, et cetera.

                                                                                With the migration into the cloud, a lot of these disciplines change the same way capacity management does, and some of them even almost cease to exist. So, it's important to understand how those changes will impact you. You can't just move over to the cloud and expect to have the same support from other IT disciplines that you had before. Another one is the self-service, self-provisioning mechanism that is associated with the cloud. That is one of the major reasons why organizations move to the cloud. Price is probably number one, but next to that is the flexibility and the agility that the cloud offers. Traditionally, capacity management has been a lot about controlling things and doing things proactively. Doing assessments and costing prior to things happening, and trying to avoid them.

                                                                                That can have a negative impact. If you try to implement all those vetting procedures in a cloud environment, that will most likely impact the provisioning lead times, and that's not a very good idea. So, you need to avoid anything that can impact the provisioning lead time negatively, and you need to be wary of policies or control mechanisms. Because that's only going to upset people, and it's going to give you ... You will be seen as a blocker of innovation and moving forward. So, try to resist interfering with provisioning mechanisms, et cetera, and learn to deal with it in different ways instead of putting up obstacles.

                                                                                Another one that is important is the speed of change. So, when you move over to the cloud, assuming that you refactor your applications, they will be built on microservices. It will allow you to have a completely different way of developing your software. So, you will have much more frequent releases. You will have to adopt agile frameworks and short release cycles. A lot of the changes made in those release cycles will be deployed straight into production environments in a way that they never were before. So, the number of changes that can potentially impact performance will increase dramatically. So, you need to focus more on having access to real-time data. In the past, capacity management has been a lot about snapshots. Maybe a 10-minute average, or it could go down to one minute.

                                                                                But in a lot of cloud environments, and for a lot of workloads running in the public cloud, that will not be enough. You need to have this real-time data access. You need to get streaming data about things going on in the very minute that you are operating in, to find those issues and to deal with them as quickly as possible. So, the speed of change is much quicker, and that forces you to adopt more of a real-time perspective on things.

                                                                                Another thing that we've already sort of mentioned is this move of some of the activities from proactive to reactive. So, proactive has always been the guiding principle, or the guiding ... what defined capacity management, to do things proactively. Do proactive assessments. But that's really hard to do in auto-provisioning or self-service environments. So you have to accept the fact that resources will be allocated without proper analysis. And proper in this case meaning capacity management having been involved. So, you can probably define T-shirt sizes for your instances and inform your customers and ask them to use them wisely, but that's about as far as you can go.

                                                                                The rest of the stuff, a lot of the other stuff when it comes to efficiency and cost optimization, will have to happen as reactive cleanup or right-sizing procedures. Rather than doing a lot of proactive assessments and vetting of requirements or requests, you have to reactively find those underused or dormant resources and reclaim them, rather than preventing them from ever being provisioned. So, a slight shift of focus from proactive to reactive. Those were some of the general recommendations around capacity management. I thought we should perhaps break here and have a quick Q&A. A chance for questions. So, Amanda, are there any questions that have come in?

Amanda Hendley:            00:26:54               I've got a question for you, and that is, can you talk about the importance of cloud portability?

Per Bauer:                           00:27:05               Maintaining cloud portability means being provider agnostic. So, making sure that you're not getting locked into using just one provider. If you go with AWS, if you avoid using some of the proprietary services that they offer, that will allow you to one day choose another vendor, like Google Cloud, for example. But in reality, I think it's often more expensive to actually maintain that. The cost of maintaining cloud portability is actually higher than the potential effect of a lock-in. And also, even if you try to avoid it, there are always going to be differences between the different offerings that could cause that lock-in effect anyway. So, you can never be certain of being completely cloud portable. I think the whole portability thing with cloud is an interesting concept, but I think in reality it's really hard to maintain.

                                                                                We see with most of the customers that we've been working with that you make a choice of one provider, and then you use as many of the features that that provider offers as possible, because that's how you get your best ROI.

Amanda Hendley:            00:28:32               Great. Noah Hall, I'm sorry if I'm mispronouncing your name, is asking what some of the key metrics to ensure cloud efficiency are.

Per Bauer:                           00:28:46               Yeah. That's a question that I would like to defer, because I think we're going to touch on a lot of those. In the next couple of sections we're going to get more specifically into public and private cloud, and how you manage those, and we're going to answer a lot of that. In essence, it depends on whether it's public or private cloud. But if we're talking about public cloud, a lot of the key metrics are actually around cost. So, how much cost are you incurring compared to what you expected? Because the metrics around resource usage are not really as interesting anymore. They're interesting for finding underused or dormant resources, but outside of that, you don't really have to care so much about the resources anymore the way you did. But let's come back to that one after the next sections.

Amanda Hendley:            00:29:44               Okay. Do you have time for one more or should we move on?

Per Bauer:                           00:29:48               Yeah, let's move on and save questions for the end to make sure that we get through all the slides that we have.

Amanda Hendley:            00:29:56               Sounds good.

Per Bauer:                           00:29:57               Good, thank you. We talked about the fundamental concepts of the cloud and some of the misconceptions about it, and, on a conceptual level, how you would manage capacity in the cloud. So, how does this translate into specific activities that you need to do in private and public cloud? Let's take a look at that. We'll start with private cloud. Private cloud, as we said before, is normally, or in most cases, on-prem infrastructure that you have, wrapped into a cloud service. It can also be infrastructure that is hosted by someone else, or in a [inaudible 00:30:38] somewhere. But at least you control that infrastructure. You probably pay for it, or you invest in it. Or at least you are charged for it.

                                                                                Private cloud optimization is sort of a twofold activity. You need to focus on optimizing the cloud service, making sure that you have the optimal pre-defined configurations, typically T-shirt sizes, small, medium, large, for the types of instances that your customers are requesting and provisioning. And you need to regularly reclaim dormant allocations. So, finding those instances that were provisioned but are no longer used, or didn't turn out to ... The usage pattern is not what was expected when they were provisioned, so you can reclaim some of those resources. So, making sure that the cloud service is operating according to those two points.
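
[Editor's note: a minimal sketch of the kind of reactive reclaim check Per describes. The thresholds, instance names, and usage figures are hypothetical; real numbers would come from your monitoring data.]

```python
# Flagging dormant allocations: a minimal sketch. The thresholds and the
# usage records are hypothetical; real data would come from your monitoring.

from dataclasses import dataclass

@dataclass
class InstanceUsage:
    name: str
    avg_cpu_pct: float      # average CPU over the observation window
    peak_cpu_pct: float     # peak CPU over the same window
    days_observed: int

def is_dormant(u: InstanceUsage,
               avg_threshold: float = 5.0,
               peak_threshold: float = 20.0,
               min_days: int = 30) -> bool:
    """An instance is a reclaim candidate if it has been consistently idle."""
    return (u.days_observed >= min_days
            and u.avg_cpu_pct < avg_threshold
            and u.peak_cpu_pct < peak_threshold)

fleet = [
    InstanceUsage("app-01", avg_cpu_pct=42.0, peak_cpu_pct=88.0, days_observed=60),
    InstanceUsage("test-07", avg_cpu_pct=1.2, peak_cpu_pct=6.5, days_observed=45),
]

for u in fleet:
    if is_dormant(u):
        print(f"Reclaim candidate: {u.name}")
```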

                                                                                But then you also need to do the regular stuff. Making sure that your resources are used in a proper way, sweating the assets. Making sure that you make efficient use of your infrastructure, and also planning for requirements for new infrastructure. So, having this planning cycle where you look at the future demand from your tenants or your customers and predict what that will mean for your cloud service, based on both trending and things like specific business events, or new customers coming in, or acquisitions of other companies, et cetera. Significant events that are going to drive the need for your services.

                                                                                So, try to predict those and source for those in the best possible way. Private cloud optimization takes two different views, basically. The process for that is: you need to make sure that you have data about all the relevant resources. So, you need the good old-fashioned monitoring solutions that pick up performance data from your compute, your storage, and your network, saving that somewhere or making sure that it's available for historical purposes so you can study it. You need to identify organic growth patterns across that. With cloud, the supply and demand patterns are less obvious, because customers are going to allocate resources or allocate instances themselves, et cetera. So I think identifying your organic growth patterns is even more important, because you don't really have visibility into exactly what is running inside the cloud.

                                                                                You need to establish some sort of demand planning process with your most significant tenants. So, if you have some larger tenants in your private cloud, you need to talk to them, and you need to establish this process of getting forecasts of future capacity needs. You also need to understand the lead time for provisioning new capacity. If it's your own resources, or your infrastructure that you're provisioning, you need to understand the lead time and incorporate that into your planning process, of course. And then number five is repeated over and over again. So, you need to forecast and provision to meet the demand, accommodating for those lead times. And the forecast is built on both the organic growth that is always present and will always go on, and some sort of estimates for specific events that are going to drive demand for your services.
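
[Editor's note: a minimal sketch of a forecast built from a linear organic-growth trend plus an event-driven uplift, along the lines Per describes. The monthly allocation history and the event estimate are made-up illustrations.]

```python
# Forecasting demand as organic growth plus event-driven uplift: a sketch.
# The monthly allocation figures and event estimate are made-up examples.

import numpy as np

# Allocated capacity units per month (historical observations).
history = np.array([100, 104, 109, 112, 118, 121], dtype=float)
months = np.arange(len(history))

# Organic growth: fit a linear trend and extrapolate six months ahead.
slope, intercept = np.polyfit(months, history, deg=1)
future_months = np.arange(len(history), len(history) + 6)
organic = slope * future_months + intercept

# Specific business events (e.g. onboarding a new tenant) are layered on top.
event_uplift = np.array([0, 0, 15, 15, 15, 15], dtype=float)  # estimate

forecast = organic + event_uplift
for m, f in zip(future_months, forecast):
    print(f"month +{m - len(history) + 1}: ~{f:.0f} units")
```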

                                                                                And then, of course, in the spirit of continuous improvement, you need to revisit the above and adjust for changing circumstances when you see fit. But this is sort of the standard process for doing this. Some rules of thumb for managing a private cloud, based on what we've seen with customers we work with. The larger the cloud, the fewer surprises. If you have a large-scale private cloud implementation, each tenant and each specific business event or each specific launch of a new application, et cetera, et cetera, will have a smaller impact because of the size of the whole.

                                                                                If you have a small private cloud with just a few tenants and a few workloads running, of course any change to that will have a major impact. So, the larger the cloud gets, the less you need to care about specific events and specific things. You can pretty much rely on this trending, looking at the organic growth of the cloud, and assume that it is going to continue. Number two, you need to understand your most significant tenants and what they are. What are their business cycles? Are they offering a retail service, or is it a big data kind of workload where you have a relatively predictable and sustained level of activity? So, understanding what those tenants are doing and what their business cycles are. And also understanding the length of those business cycles. Because if you cut in between and don't use a full business cycle for your planning, that's going to lead to completely wrong decisions.

                                                                                So, understanding the length of the business cycles and the seasonality of your tenants, et cetera, will become important. Make sure that you base your decisions on the right data. And as we said before, it will probably be a combination of linear prediction and some sort of capacity modeling. So, linear predictions for the organic growth that is always going to happen, and then some sort of modeling approach for the specific business events that are going to drive the demand. So, this is how you operate a private cloud.

                                                                                If we move over to the public cloud and how to optimize that, it's quite a different story with quite a different approach. The major difference is that, of course, in a public cloud there is no infrastructure. There are no physical resources that you need to bother about. You pay for what you use.

                                                                                So, the only thing that should bother you is the cost. So, you should optimize the provisioning process by offering advice on optimal instance sizes and configurations to your customers, to make sure that your self-service provisioning or your centralized provisioning mechanism is using the optimal sizes of the different types of instances available. Even though you do that, you will still have to retrospectively right-size and clean up, find dormant resources and dormant allocations that are no longer in use, and make sure that they're decommissioned. The first two activities are clearly focused on the resource allocation. The third one is about providing predictability on the cost. So, making sure that you're using the right type of offerings, the right type of instances. Making sure that you're looking at them from all the relevant perspectives, and trying to predict the cost, so that at the end of the month, or the end of the billing period, you get invoices from your public cloud provider that are predictable.

                                                                                So, what are you optimizing? We talked about this before. You have the type of monolithic legacy workloads that are stateful, that are using ... they may not be complete monoliths, but they're not designed around microservices, they are not designed to scale out. So you need to allocate larger instances and have them run inside those. And for those, it's typically a good idea to use reserved instances. Reserved instances provide you with a lower unit price, but you commit to using them for a longer time. So, if you know that you're going to be stuck with this legacy workload for a couple of years before you have time to refactor it, it's obviously a good idea to reserve an instance for that time and get the best running rate possible.

                                                                                For cloud native workloads it's a completely different thing. They are designed to scale out; you deal with varying demand by adding and retracting instances in your environment. For those, you need to have some sort of cookie-cutter approach, where you find the optimal instance size and add and retract those kinds of instances. How you get there will be discussed in the coming slides. The rest of the slides are going to focus on the assumption that you've done the refactoring and that you're optimizing for cloud native workloads. Because if the workloads are not cloud native, it's the same kind of exercise that you've done before in traditional capacity management. So, this is going to be focused on cloud native workloads.

                                                                                Cloud native applications are designed to manage an increasing load by spawning new instances and scaling out across those. So, defining a capacity unit becomes very important. Understanding what the optimal increment size is for this specific workload. So, how do I scale up quickly enough, and how do I retract quickly enough? Or how can I optimize that? The capacity unit is defined by a combination of different measurements. You look at the types of instances available from your vendor. In this case we've used AWS as an example. You have t2.micro, small, medium, large, xlarge, 2xlarge, et cetera. So, finding out what the alternatives are, how they are configured, and what the optimal size is for me.
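
[Editor's note: a minimal sketch of picking a capacity unit from an instance catalog. The catalog and the requirement figures below are illustrative; check your provider's current instance types and pricing.]

```python
# Choosing a capacity unit from an instance catalog: a minimal sketch.
# The catalog below is illustrative; check your provider's current offerings.

CATALOG = {
    # name: (vCPUs, memory_GiB)
    "t2.micro":  (1, 1.0),
    "t2.small":  (1, 2.0),
    "t2.medium": (2, 4.0),
    "t2.large":  (2, 8.0),
    "t2.xlarge": (4, 16.0),
}

def pick_capacity_unit(vcpus_needed: float, mem_needed_gib: float) -> str:
    """Smallest instance type that covers one increment of the workload."""
    candidates = [
        (name, spec) for name, spec in CATALOG.items()
        if spec[0] >= vcpus_needed and spec[1] >= mem_needed_gib
    ]
    if not candidates:
        raise ValueError("no single instance type fits one capacity unit")
    # Prefer the smallest fit; here we approximate cost by vCPU, then memory.
    return min(candidates, key=lambda c: (c[1][0], c[1][1]))[0]

# One scale-out increment for a hypothetical microservice:
print(pick_capacity_unit(vcpus_needed=1.5, mem_needed_gib=3.0))  # t2.medium
```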

                                                                                You also need to look at the profile of your application. What is the seasonality and the magnitude of that seasonality? What are the business cycles? Et cetera, et cetera. And then you also need to look at the type of workload that you are hosting. So, what is the type of workload that we're designing for, and based on that you sort of need to optimize the image size for your specific application, and that's the size of unit that you add to your environment when you need to grow and when you need to scale out. That's the concept of a capacity unit. Another thing that you need to take into consideration when you're hosting things in the public cloud is that the cloud will always be multi-tenant by design. So, there is always a risk that concurrency of events across those tenants will impact your performance. You can't rule that out at any point.

                                                                                Whatever you do, you always need to keep this in mind, and you need to make sure that you accommodate those characteristics of the cloud platforms. Another important aspect is the distribution of instances. Cloud providers typically don't centralize your processing in a single physical data center, unless that's arranged upfront. We use Amazon Web Services as an example again here. They have the notion of availability zones and regions. Availability zones inside a region are basically data centers, and regions are continents, I guess. So, you don't really have any visibility between different regions. But within a region you can reach everything that is inside that region. If you spawn new instances, if you want to grow your workload, and that workload is sensitive to latency, or there is a large amount of data being exchanged by those different instances, you need to make sure that they can operate in the same availability zone. Otherwise, they will be communicating across availability zones, and that can incur a lot of additional fees on your bill.

                                                                                Reserving instances in an availability zone may be a good idea, but again, that comes at a price. So, you need to find the breaking point for that: what is it worth for me to be in the same availability zone, versus what does it cost me to make sure that it actually happens? That's another aspect you always need to look into. Then, of course, the last point here. When you do this, it may actually be desirable to put them across different availability zones, if you can afford it, from an availability perspective. Because if one data center goes down, and it's quite rare with the large public cloud providers, but it can still happen, you're going to be hurt pretty badly if you have everything in the same availability zone. So, it makes sense to distribute them across multiple zones.

                                                                                Another thing that you need to keep in mind is that you need to focus on services rather than components. The traditional capacity management approach has been to go from components to applications to business services. The component aspect is completely replaced by microservices. You need to focus on those, you need to instrument those. You need to understand how many resources each microservice uses. And then, by aggregating those into applications, you can understand how much an application is using, and eventually a business service. Microservices and containers become very important. Having that level of granularity in your monitoring solution and aggregating that into applications and services. Not aggregating component-level data anymore. That's a big difference from how you used to do things as well.
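
[Editor's note: a minimal sketch of aggregating microservice-level usage up to business services. The usage figures and the service map are hypothetical; in practice they would come from container monitoring and from tags or a CMDB.]

```python
# Aggregating microservice-level usage up to business services: a minimal
# sketch with made-up numbers and a hypothetical service map.

from collections import defaultdict

# CPU-seconds per microservice over some interval (from container monitoring).
usage = {"auth": 120.0, "catalog": 340.0, "checkout": 210.0, "search": 95.0}

# Which microservices make up which business service.
service_map = {
    "webshop": ["auth", "catalog", "checkout", "search"],
    "partner-api": ["auth", "catalog"],
}

# Shared microservices are counted toward every service they support.
totals = defaultdict(float)
for service, components in service_map.items():
    for ms in components:
        totals[service] += usage.get(ms, 0.0)

for service, cpu in totals.items():
    print(f"{service}: {cpu:.0f} CPU-seconds")
```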

                                                                                And then the last very important aspect is the efficiency. As we said before, probably the biggest driver behind cloud adoption is cost, because it comes with a relatively attractive ROI in an ideal case. But if you're not deliberate about optimizing, you're not going to realize that ROI. So, in the public cloud, capacity equals cost. The more capacity used, the more it's going to cost you. Capacity management becomes a practice of delivering the right capacity at the lowest possible cost. Capacity is no longer limited. You can get any amount of capacity, especially if you have an application that is designed to scale out. There's always going to be capacity available, but the cost will also grow in relation to how much capacity you allocate.

                                                                                So, understanding the cost model, to make informed decisions and get predictable bills, becomes the most important aspect, and the overall objective of capacity management changes to cost optimization rather than optimizing limited resources. How do we do this conceptually? The goal is to predict future allocation and cost based on trends and seasonal profiles. You need data about usage profiles, you need trends and seasonality patterns, as we said before. You need to store data about how many instances are being allocated during a specific phase of a business cycle, how that looks from month to month or from week to week, what the trend is, et cetera, et cetera.

                                                                                Understanding what the usage profiles and the business cycles are. You also need to understand the charges. So, what are the current rates, and what are the accumulated charges so far? Also, things like inter-instance communication, or the types of instances, whether they're reserved or on demand, et cetera, et cetera. So, the charges and the cost structure of the public cloud provider become a very important aspect as well. And then you still need to work with forecasts. Significant tenants and new initiatives on a business level need to be incorporated as well. So, you need to forecast those demands and weigh that into your usage profiles. Because they are potentially, at least, going to throw your historical profiles completely overboard and change them dramatically. So, you need to take that into consideration as well.
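
[Editor's note: a minimal sketch of the month-end bill prediction Per describes, combining accumulated charges with a seasonality factor derived from historical usage profiles. All figures are hypothetical.]

```python
# Projecting the month-end bill from accumulated charges: a minimal sketch.
# Figures are hypothetical; real charges come from your provider's billing API.

accumulated_charges = 4_200.0   # $ spent so far this billing period
days_elapsed = 12
days_in_period = 30

# Naive run-rate projection assuming a flat daily spend...
run_rate_projection = accumulated_charges / days_elapsed * days_in_period

# ...adjusted by a seasonality factor derived from historical usage profiles
# (e.g. the back half of the month is typically 20% busier).
seasonality_factor = 1.20
remaining = (accumulated_charges / days_elapsed
             * (days_in_period - days_elapsed) * seasonality_factor)
adjusted_projection = accumulated_charges + remaining

print(f"run-rate projection:  ${run_rate_projection:,.0f}")
print(f"seasonality-adjusted: ${adjusted_projection:,.0f}")
```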

                                                                                There are three things you need to keep track of. And it's important to ... The charges you will typically get from the cloud provider. The forecast is something you need to work with the business on. Usage profiles, yeah, the data may come from the cloud provider, but it's probably your responsibility to make sure that you build those trends and build those historical trails. Because that's not necessarily going to be available from the provider of the cloud service. That becomes your responsibility. Also, as a piece of advice in this: it's very important to work with resource tagging. Different public cloud providers have different capabilities in this space, but it's very important to tag resources.

                                                                                When you provision new resources in the public cloud, you need to tag them with projects or lines of business or customers or whatever makes sense to you. Because if you don't do that, then even if you try to forecast and you try to understand the cost, if the bill doesn't tell you anything about which resources were used for what, et cetera, it's really hard to improve and to learn from your efforts. So, having that granularity and that understanding and that factoring of your bill becomes very important in order to improve your ability to predict. So, resource tagging is key to all of this. This is sort of the general approach to how you optimize and predict the cost in the public cloud.
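
[Editor's note: a minimal sketch of tagging at provisioning time with boto3, the AWS SDK for Python. The instance ID and tag values are hypothetical examples; other providers have equivalent labeling mechanisms.]

```python
# Tagging resources at provisioning time so the bill can be broken down later:
# a minimal sketch using boto3. The instance ID and tag values are examples.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Tag an instance with the dimensions you want to see in cost reports.
ec2.create_tags(
    Resources=["i-0123456789abcdef0"],   # hypothetical instance ID
    Tags=[
        {"Key": "project", "Value": "webshop"},
        {"Key": "line-of-business", "Value": "retail"},
        {"Key": "environment", "Value": "production"},
    ],
)
# With tags like these activated as cost allocation tags, the monthly bill
# can be grouped by project or line of business instead of one opaque total.
```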

                                                                                A quick summary before we go into Q&A. What do you need in a cloud capacity management solution? A tool that allows you to migrate and deploy workloads into the public or private cloud, a tool that understands the difference between your legacy platforms and the capabilities and the performance of your cloud instances. You need to have monitoring of ongoing resource usage and charges. Typically you get that from the monitoring solutions that the cloud providers have. So, CloudWatch for AWS, Azure Monitor for Azure, Stackdriver for Google Cloud, et cetera, et cetera. Using those and making sure that you pick up that data.

                                                                                You need end-to-end resource visibility for hybrid IT. Going to the cloud will not be a binary switch where you turn everything else off and all of a sudden everything runs in the cloud. You will likely have things running in your data center still, and you need to federate the view, have one foot in each camp, and understand both the cloud environment as well as your traditional on-prem IT. You need to optimize and predict the financial impact, as we said before. You don't really have to care so much about your infrastructure anymore. Infrastructure is infinite in the cloud, but you need to predict the financial impact and make sure that you run your workloads as efficiently and as lean as possible.

                                                                                And then the last point. You need an ability to do long-term planning and meet business demands. That's the whole point of doing capacity management. It's easy to forget about this. In order to understand seasonality and long-term growth patterns, et cetera, you probably will have to store the data from those monitoring solutions somewhere. Because if you look at CloudWatch, for example, it only provides you with data about an instance as long as that instance exists. With the whole cloud native approach, you're going to add new instances and remove instances on a daily basis. If all the data disappears when you decommission an instance, how are you going to remember what yesterday looked like, or last week, or this peak in business demand? How many instances were provisioned to cope with that? There will be no memory of that in the cloud platform, in the cloud framework.

                                                                                So, you need to store this data somewhere. You need to tap off data from those solutions into a historical database that you maintain yourself in order to do long term planning and understand business demands over time. In summary, public cloud is about providing required capacity at the lowest possible cost. It's not about optimizing use of limited resources, but actually getting the job done at the lowest possible cost. Private cloud is about optimizing both allocation of resources in the cloud service, but also about sourcing the right infrastructure for that in the cheapest possible way and just in time et cetera, et cetera. All the stuff we used to do in traditional capacity management still needs to be done in the private cloud. It will be more around organic growth and trending than it used to be because you don't have the same visibility of the workload. So, you don't have the same understanding of the workloads running in the private cloud service. But you still need to add an element of understanding the business demands of your biggest tenants at least.
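
[Editor's note: on the point above about tapping off monitoring data into a historical database before instances disappear, here is a minimal sketch using boto3. The instance ID is hypothetical, and the CSV file stands in for whatever long-term store you maintain.]

```python
# Tapping off CloudWatch datapoints into your own historical store:
# a minimal sketch. The instance ID is a made-up example.

import csv
from datetime import datetime, timedelta, timezone

import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

now = datetime.now(timezone.utc)
resp = cw.get_metric_statistics(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    Period=300,                # 5-minute datapoints
    Statistics=["Average"],
)

# Persist the datapoints so they outlive the instance itself.
with open("capacity_history.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for dp in sorted(resp["Datapoints"], key=lambda d: d["Timestamp"]):
        writer.writerow([dp["Timestamp"].isoformat(), dp["Average"]])
```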

                                                                                Another [inaudible 00:53:46] important to remember: most organizations will have both, hybrid IT. So, you need to look for a solution that can address both sides of this. You need to have one foot in the traditional legacy data center IT, as well as one foot in the public cloud, with the workloads running in the public cloud. Because you will have business services that go across, and in order to optimize those and to provide optimization and management of those, you need to understand both. And then the last one, really important: the ROI of cloud initiatives. Capacity management will have a major impact on that. The typical ROI business case for moving over to the cloud is built on the assumption that you optimize your resources. You use resources in the best possible way, you get the most mileage out of your resources, et cetera, et cetera. And none of that is going to happen by itself.

                                                                                You need to make sure that you understand your workloads and that you use the right type of resources to host them. You shouldn't strive for perfection, but at least you need to do a decent job of optimizing and avoiding the most wasteful types of behavior. Completely optimizing it will be difficult, because there are so many different parameters, and so many different characteristics of the cloud solution that you can optimize. But at least you need to do the stuff that we talked about in this presentation in order to get the kind of ROI that you expected.

                                                                                So, with that, I think we're done. And we have time for some more questions. Amanda, are there any more questions?

Amanda Hendley:            00:55:38               Yes. We have a few questions. This is from Marty Pruden. He said, "How do you ensure that your cloud provider is not giving you less capacity? For instance, fractional CPUs and charging for full CPUs. Are there metrics that can be used to keep your cloud provider honest?"

Per Bauer:                           00:56:04               No, I think it's kind of difficult to do that. Either you use their own instrumentation, and then you're obviously left in their hands, and they are in control of it. Or you could use a third-party monitoring solution, but it is still going to use the same APIs to get to the data. The way public clouds are designed, there is no way that you can get to the actual metrics on the operating system level. It's always going to be through some sort of hypervisor that provides an API. So, resources are monitored through APIs, and it's the quality of those APIs that determines how accurate the measurements are. I think we're going to have to operate under the assumption that a serious cloud provider will actually provide you with the type of resources that you asked for.

Amanda Hendley:            00:57:02               Here's another one for you. From your experience, what is the number one driver for cloud adoption in companies?

Per Bauer:                           00:57:12               It depends, I guess. For some companies, the biggest driver of cloud adoption is cost. Whether that is because the CEO read about cloud in an in-flight magazine, or it's actually based on real insights, it depends. But that is mostly the driver, I guess. The expectation is that you will be able to provide IT cheaper than before. That is sort of the general reason for moving into the cloud. For some other organizations, their online presence and their performance is crucial to their success. Especially if you have a global approach, where you need to be available to people in all parts of the world at the same time, et cetera.

                                                                                Cloud adoption is seen as the best way to improve that and improve that agility. So, that becomes the biggest driver sort of. It's sort of a mix between cost optimization and online presence, I guess.

Amanda Hendley:            00:58:16               Okay. Here's a question for you. How do you distinguish between a performance anomaly that might be consuming capacity for the wrong reasons, versus a genuine need for capacity based on business growth?

Per Bauer:                           00:58:35               That's a very good question. I think the best way of doing that, since you don't have access to the same granular data as you used to, is this monitoring and the ability to compare release cycles or releases to each other. That is very important. You need to understand the business cycles and the business seasonality, to make sure that you're comparing apples to apples. But as long as you have that understanding, you can always go back and compare. Because anomalies are typically introduced through new releases. And new releases happen more frequently in the cloud, as we discussed before, with the adoption of agile methods, et cetera.

                                                                                And a good way of doing that is to compare with a previous release during the same business cycle. How did that behave, and is there a significant difference in how this one behaves compared to before? And then compare that to the level of business activity. That's probably the best way of doing it. So, you kind of do an analysis against historical data.

Amanda Hendley:            00:59:47               I think we're out of time for today. But I believe that you promised that you would answer some of the remaining questions offline.

Per Bauer:                           00:59:56               Sure.

Amanda Hendley:            00:59:58               We'll make sure that we get that information out to everyone after the session. Per, thank you so much for an interesting and engaging session today, and I appreciate your time.

Per Bauer:                           01:00:12               Thank you very much.

Amanda Hendley:            01:00:13               All right.

Per Bauer:                           01:00:14               Thank you everyone for attending.

Move forward with capacity management in the cloud

Try Vityl Capacity Management free for 30 days to see how it can solve your organization's cloud capacity management problems.