XenApp Scalability v2013 - Part 2

Overview

In the first part of this article (which you should read first if you haven't already), I shared some interesting results from a deep-dive scalability test we recently completed for a major customer. And I'm sure some of those results had you scratching your head a little... and maybe you had questions like I did at first, such as:

  • Why was 4 vCPUs chosen as the VM spec? Don't Citrix and VMware always recommend 2 vCPUs for XA?!?
  • Why did we only get 130 users on each physical host? Didn't the article Andy wrote a year ago say we should expect 192 users with this hardware and a medium workload?!?
  • Why did we only over-commit "half way" (i.e. 24 vCPUs)? Shouldn't you always use all of the logical CPUs available in the box (i.e. 32 vCPUs)?
  • How did NUMA factor into this particular test? And why is NUMA even an issue when sizing VMs these days?
  • Why do users "work" much less than we think? (Kidding... I'll stay away from that one!)

If you have more questions than that, please feel free to leave me a comment below. But let's address the questions above first.

CPU Over-Subscription

Whether you use 2 vCPUs, 3 vCPUs, 4 vCPUs, 6 vCPUs, 8 vCPUs (or some other number) for each XA VM is the tougher question that really requires testing and an understanding of NUMA (which I'll get to next). But a question best addressed first is how much CPU over-subscription you should do. It's easier to explain with an example... so remember that for this particular test we had a box with a "2 × 8" CPU configuration (my shorthand for 2 sockets and 8 cores per socket). And remember we enabled hyper-threading. So we have 16 "physical" processors or cores and 32 "logical" or "virtual" processors. Most people refer to those as 16 pCPUs and 32 vCPUs. So the question is: do I deploy enough XA VMs to use 16 processors, 32 processors, or somewhere in between? The short answer is somewhere in between, and most of the time the best "sweet spot" will be an over-subscription ratio of 1.5:1, meaning 24 vCPUs in this case. So I could deploy 12 XA VMs on each host if I'm using a 2 vCPU spec, or 6 XA VMs if I'm using a 4 vCPU spec, or perhaps only 3 XA VMs if I'm using an 8 vCPU spec. You basically want the math to add up to 24 vCPUs, in case you didn't catch that. Now the question is... why is a 1.5:1 over-subscription ratio, or "splitting the difference" between pCPUs and vCPUs, generally the best sweet spot? For that, let's look at some more detailed results from our test, and then we'll do our best to interpret the data.
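
To make the arithmetic concrete, here is a minimal sketch of the over-subscription math described above (my own illustration, not part of the original test harness):

    # Hypothetical helper illustrating the over-subscription math described above.
    def vm_count(sockets, cores_per_socket, ratio, vcpus_per_vm):
        """How many VMs of a given vCPU spec hit a target over-subscription ratio."""
        pcpus = sockets * cores_per_socket      # physical cores (hyper-threads not counted)
        vcpu_budget = int(pcpus * ratio)        # e.g. 16 * 1.5 = 24 vCPUs
        return vcpu_budget // vcpus_per_vm      # whole VMs only

    # The 2 x 8 box from the test, at a 1.5:1 ratio:
    for spec in (2, 4, 8):
        print(spec, "vCPUs per VM ->", vm_count(2, 8, 1.5, spec), "VMs")
    # Prints 12, 6 and 3 VMs respectively -- the three scenarios discussed in this article.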

We tested different over-subscription ratios using LoginVSI to find the optimal sweet spot. To keep things simple, I'll provide the test results for the 4 vCPU VM configuration. So in summary, we tested 4 XA VMs, 6 XA VMs and 8 XA VMs (again, each with 4 vCPUs) to see whether using 16, 24 or 32 total processors gave the best results. The VSImax score for each test was 119, 130 and 122, respectively. Meaning the test with 6 XA VMs of 4 vCPUs each resulted in the optimal density while maintaining a good user experience (which LoginVSI measures in a ton of different ways in these tests). Now the question becomes WHY the 1.5:1 over-subscription ratio "won" or gave the best results.

This sort of makes sense if you think about it. When using only the 16 physical cores in the box, you don't really take advantage of hyper-threading (which should give you a performance bump of 20-30%). So in that scenario, with a 1:1 over-subscription ratio, you're not making full use of the box. On the other hand, when we try to use all 32 logical cores in the box, we stress the CPU scheduler too much, so we see diminishing returns. Remember, hyper-threading only gives you a 20-30% performance increase... not 100%. In addition, we need to save some resources (CPU cycles) for the CPU scheduler in ESX itself - the process of deciding which logical CPU gets the next chunk of work doesn't come for "free." So in that scenario, with a 2:1 over-subscription ratio, we're really hammering the box and it's not the optimal configuration either. That's where splitting the difference between the number of pCPUs and vCPUs makes a lot of sense and really shines - it leaves some valuable resources for the scheduler itself while still taking advantage of hyper-threading. And don't take my word for it - believe the data... the scenario with a 1.5:1 over-subscription ratio gave the best VSImax score.

Another thing I wanted to share: this isn't the first time I've seen this over-subscription ratio yield the best results. In fact, one of our most important partners in the industry (which makes EMR/EHR software) always recommends a 1.5:1 over-subscription ratio, and they have done more testing with LoadRunner on a wider variety of hardware configurations than anyone I know. So with this test, we really just validated those results.

Now, does that mean a 1.5:1 over-subscription ratio should always be used? Not necessarily. It depends on the concurrency rate, the hardware, the version of the hypervisor and its scheduler, etc. For example, I might find a slightly different sweet spot if my concurrency rate is 50% versus 85%... if my hardware is a 4 × 6 versus a 2 × 8... or if I'm using XS versus VMW. The sweet spot might be 1.7:1 or 1.25:1... and the only way to find out is through proper testing with a tool like LoginVSI or LoadRunner. But what I am saying is that if you don't know, or don't have time to test, then a 1.5:1 CPU over-subscription ratio for XA is my recommendation.

NUMA

It always amazes me how many people don't know what NUMA is, or don't care to understand how it impacts design exercises like the one we did. And that's not specific to the XA workload - this concept applies across the board, and you need to consider NUMA when virtualizing any big server workload such as Exchange, SQL, etc. Now, I won't explain NUMA in detail in this article, as it's been done 100 times before on the Interwebs. But it stands for Non-Uniform Memory Access, and I like to think of it as "keeping things local." The idea is pretty simple - when you have a box with multiple sockets, each socket has local memory it can access very quickly... and remote memory (on the other socket, reached over an inter-socket bus or interconnect) that it cannot access quite as quickly. And if the CPU scheduler of a hypervisor is not "NUMA aware," as some say, then bad things can happen - we could be sending processes and threads to a remote CPU or remote memory, as opposed to the local CPU or memory. And when we do that, it introduces latency and ultimately affects performance (user density, in the case of XA). How much of an impact on performance can NUMA really have? We did a study on this a few years ago when we introduced NUMA support in Xen, and it was about a 25% hit on user density! Meaning we could get 100 users on a box without NUMA awareness... and 125 with NUMA awareness. Quite significant. Fortunately for you and me, all the major hypervisors these days are NUMA aware. This means they understand how the underlying hardware is configured in terms of NUMA nodes, and their scheduling algorithms are optimized with NUMA in mind. So that's fantastic - but what the heck does it mean for how I should size my XA VMs? 😉

Notice I said "underlying NUMA nodes" above - what I mean by that is really important stuff. Believe it or not, all Intel chips are not created equal. And not all sockets have just one (1) underlying NUMA node. What I mean is that a socket with 8 cores (as in our example) could actually be divided into multiple underlying NUMA nodes. Each 8-core socket might present 4 NUMA nodes, 2 NUMA nodes or a single NUMA node. And if the socket is divided into, say, 4 NUMA nodes, each node has its own "local" CPU and memory resources that it can access fastest - in this case, 2 cores each; if it's divided into two nodes, each node would have 4 cores. In other words, the concept of NUMA doesn't just apply to multi-socket boxes... it also applies within each socket or die! So the next obvious question is: how do you know what the underlying NUMA configuration is for the hardware you bought (or are hopefully considering buying)? Well, there are tools like Coreinfo you can run on Windows... and there are commands like 'xl info -l' you can run on Xen, or 'numactl --hardware' you can run on Linux. Those spit out all sorts of good information about the CPUs in the box, and especially the NUMA configuration, to help you understand where the NUMA "boundaries" lie. If you don't want to use the CLI, you can simply ask the vendor or the manufacturer (Intel... or HP, Dell, IBM, Cisco, etc.). Sometimes the hardware data sheets even have it in there. But I can tell you from experience that a few years ago, sockets were almost always divided into multiple NUMA nodes. Today (i.e. on the newer Intel chips), they are divided less, or perhaps not at all. This is one reason we always recommended 2 vCPUs per XA VM back in the day - because most of the boxes we were using at the time were 2 × 2s or 2 × 4s, and those sockets were typically divided into nodes of 2 cores each. So 2 vCPUs was really the sweet spot and led to optimal performance on the older chips/boxes. But fast forward to today... and the sockets on the latest boxes aren't divided as much, or they're divided into much larger nodes. So you might get a 2 × 8 where each socket has only one node, which makes sizing things super easy. I'll explain why that's great next.
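
If you'd rather script the check than eyeball the numactl output, here's a minimal sketch that lists each NUMA node and its CPUs. It assumes a Linux host with the standard sysfs layout under /sys/devices/system/node, and it's just my own illustration - not a substitute for Coreinfo, 'xl info -l' or 'numactl --hardware':

    # Minimal sketch: enumerate NUMA nodes via sysfs on a Linux host.
    from pathlib import Path

    def numa_topology():
        """Return a mapping of NUMA node name -> CPU list string (e.g. '0-7,16-23')."""
        nodes = {}
        for node_dir in sorted(Path("/sys/devices/system/node").glob("node[0-9]*")):
            # cpulist includes hyper-threaded siblings as well as physical cores
            nodes[node_dir.name] = (node_dir / "cpulist").read_text().strip()
        return nodes

    if __name__ == "__main__":
        for node, cpus in numa_topology().items():
            print(node, "-> CPUs", cpus)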

XA VM Spec - How Many vCPUs?

So now that you know a thing or two about NUMA (and if you're still fuzzy on it, I highly recommend this and this article - they explain it better than I can), can you guess what the underlying NUMA configuration was on the 2 × 8 box used in our tests? Well, since I already told you that 4 vCPUs gave the best results, you might assume that each socket is actually composed of two NUMA nodes (with 4 cores each). When I first saw this, I immediately looked up the specific Intel chip we used in the Dell boxes (because I was under the impression that they were newer Intel processors and therefore wouldn't be divided into multiple nodes per socket), and sure enough, the Intel chips we used were about 2 years old (circa 2011). So it made sense. But chances are, if you buy hardware today with the latest Intel chips, each socket will probably present a single NUMA node. And if you have a 2 × 8 box with a single NUMA node per socket, then chances are you'll get linear scalability with 8 vCPUs assigned to each XA VM. Let's dig in a little more...

To better illustrate the concepts of "NUMA thrashing" and "linear scalability," let's again look at the actual results of our tests. To keep things simple, assume we used a 1.5:1 over-subscription ratio for all three of these scenarios. We wanted to test 2, 4 and 8 vCPUs (with 12, 6 and 3 VMs, respectively) to see whether the underlying NUMA nodes were really a factor and whether things scaled linearly or not. We achieved VSImax scores of 125 and 130 for the first two scenarios, and a score roughly 30% lower for the 8 vCPU scenario. So how should we interpret these results?

Well, these results tell me the difference between a 2 vCPU and a 4 vCPU VM spec is almost negligible... and things scale linearly (i.e. we get about 10 users per XA VM with 2 vCPUs and 12 GB of RAM, and about 20 users per XA VM with 4 vCPUs and 24 GB of RAM). That's really what you'd expect, knowing that all of these VMs fit neatly within each NUMA node and never cross a NUMA boundary. But what happens when we have larger VMs with 8 vCPUs that don't fit neatly into these 4-core NUMA nodes? That's when we have to use both local and remote resources across NUMA nodes... or maybe the scheduler moves things around too often trying to compensate for locality... which can lead to something called "NUMA thrashing." And what does NUMA thrashing mean for XA scalability and for us? It means latency goes up, the scheduler works harder and more often, and the result is fairly poor user density. In our case, it meant we didn't scale anywhere close to linearly - we took about a 30% drop in performance when switching to the 8 vCPU XA VM spec! The moral of the story is to try to size your VMs as a multiple of the NUMA node size of your physical host.
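
To make that rule of thumb concrete, here's a tiny sketch (my own heuristic illustration, not a Citrix tool) that flags VM specs that don't pack evenly into the host's NUMA nodes:

    # Hypothetical heuristic for the sizing guidance above: a vCPU spec is treated as
    # NUMA-friendly if whole VMs pack evenly into a single NUMA node.
    def numa_friendly(vcpus_per_vm, cores_per_numa_node):
        return (vcpus_per_vm <= cores_per_numa_node
                and cores_per_numa_node % vcpus_per_vm == 0)

    # Our 2 x 8 test box: each 8-core socket was split into two 4-core NUMA nodes.
    for spec in (2, 4, 8):
        verdict = "packs cleanly" if numa_friendly(spec, 4) else "risk of NUMA thrashing"
        print(spec, "vCPUs on 4-core nodes:", verdict)
    # 2 and 4 vCPUs pack cleanly; 8 vCPUs spills across nodes -- consistent with the ~30% drop we measured.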

And one more fun fact I wanted to mention on this note... if you have a 2 × 6 box (still popular) and its sockets are composed of NUMA nodes with 3 cores each... then using an XA VM spec with an odd number of vCPUs could actually give the best results! So while 3 vCPU XA VMs may sound weird... NUMA is why some people might recommend that seemingly odd VM spec. Going with 2 or 4 vCPUs could actually cause more NUMA thrashing and worse response times and user density (and this has been tried and tested by several of our major health care clients, I might add!). The sketch below applies the same packing check to that case.
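
Applying the same hypothetical packing check to 3-core NUMA nodes (the 2 × 6 box above) shows why the odd-looking 3 vCPU spec can come out on top:

    # Same packing heuristic as the previous sketch, applied to 3-core NUMA nodes (a 2 x 6 box).
    for spec in (2, 3, 4):
        fits = spec <= 3 and 3 % spec == 0      # does the spec pack evenly into a 3-core node?
        print(spec, "vCPUs on 3-core nodes:", "packs cleanly" if fits else "risk of NUMA thrashing")
    # Only 3 vCPUs packs cleanly, which lines up with the health care results mentioned above.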

A Word about LoginVSI and Work Ratios

I know this article is long, but stay with me, because there's one more thing I want to quickly address if I can - and that's why we "only" got 130 XA users on each physical piece of hardware (compared to the 192 we got a year ago on seemingly the same hardware). Well, this has to do with the new default or medium workload in Login VSI 4.0 (the latest shipping release). The scalability tests that Andy and his team conducted a year ago (and even the ones the Project VRC guys did a few years ago) were performed with LoginVSI 3.x... and the "normal" or medium workload at the time was fairly, well... medium'ish! The script was much shorter and less intense. Fast forward to LoginVSI 4.0 and the script is much longer and more intense. In fact, a colleague and I did a quick test, and the average bandwidth used by a single user was around 260 kbps with the new v4 script! And the script is "active" (versus idle) nearly 100% of the time!!! I don't know about you, but I wouldn't call that a typical XA task worker or even a medium/normal user... that, to me, is a heavy user that works way more than the average bear. I really hope LoginVSI takes this feedback to heart and changes their script - most average XA users still consume around 20 kbps today, and most users "work" about 50-60% of the time... so that's what really skewed the numbers in our test with LoginVSI 4. Once we introduced more idle (sleep) time and "softened" the test script a little, we were magically getting significantly more users per box. Again, it's up to you to make sure you simulate what your users actually do - shame on us for using the default or medium script that comes with the free version of LoginVSI.

UPDATE: LoginVSI reached out to me after the publication of this article and let me know that they have some of this documented on page 9 of their new v4 Upgrade Guide. But it seems they tested XD (versus XA) and only found that the new script increases CPU usage by 22%. That's still a lot, but I think it's more like 33% based on my tests. LoginVSI is also thinking about changing their default medium workload, and I'm meeting with them in November to discuss it further. Good stuff... kudos to LoginVSI for listening to our feedback - and one for the community!

Wrap-Up and Key Takeaways

That was a long article... I know. So I'll try to summarize the main points:

  • If you can't do testing to find the optimal CPU over-subscription ratio, "split the difference" between the number of pCPUs and vCPUs (i.e. a 1.5:1 ratio).
  • Don't forget to take NUMA into account when sizing your XA VMs - if possible, go with a multiple of the NUMA node size of the physical host.
  • Don't be afraid to go with larger XA VM specs, such as 4 or 8 vCPUs each. On the latest Intel chips, we're seeing linear or even slightly better scalability with the larger VM specs! And it also means fewer Windows OS license costs and far fewer VMs to manage overall!!! That's a big advantage I sort of failed to explicitly mention above. I know a customer that used to manage 800 XA VMs with 2 vCPUs each - they are now running 200 VMs with 8 vCPUs each.
  • Don't be afraid to use an odd number of vCPUs for your XA VMs - on a 2 × 6 box, 3 vCPUs could be the best bet. On 2 × 10 boxes with newer E7 procs, 5 vCPUs could be better if the sockets are divided into two NUMA nodes each. Or maybe you use 10 vCPUs each and configure vNUMA if you're on VMware (a topic for another day, but see page 42 of the latest vSphere 5.5 performance best practices whitepaper for information on vNUMA and 8+ vCPU VMs).
  • Be careful with the new default/medium workload that ships with LoginVSI 4.0 - it's quite heavy and could skew your results by 22-33%! That means you could end up buying too much hardware - it's always best to customize your workload or test script to match what your users really do.
I hope you enjoyed this update on XenApp scalability - the 2013 edition! Please leave me a comment below if you learned anything, or if you have other questions I haven't managed to answer. Who knows... maybe there will be a part 3 down the road.
Cheers, Nick
Nick Rintalan, Lead Architect, Citrix Consulting