VMworld 2017 Breakout Session – Extreme Performance Series: vSphere Compute & Memory Schedulers

One of the longest blog post headings I have seen:)

After a 30-minute lunch break, which meant I didn’t get lunch, I attended the breakout session “Extreme Performance Series – vSphere Compute & Memory Schedulers SER2344”, presented by Xunjia Lu, who is a Staff Engineer.

!! Resources cost money, so we have to schedule them accordingly !!

This is one session I try to attend every year since some new material is always added. Here are a few takeaways I want to point out:

  • esxtop – %OVRLP and %SYS moved to %RUN in vSphere 6.5.
  • Be aware of the difference between VM (group) and world (vCPU), specifically when looking at %RDY: the group shows an aggregate of all its worlds’ %RDY numbers, so a large VM can easily show more than the critical 5% level.
  • VM configuration for vNUMA: you no longer have to think about vSockets vs. Cores per vSocket. Just configure the number of vCPUs you need as the number of vSockets and keep the default Cores per vSocket, which is 1. vNUMA will figure out the optimal virtual hardware layout for you, matching the underlying NUMA architecture.
  • Size VMs based on physical cores, not on available threads, which include hyperthreading.
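To make the %RDY point concrete, here is a minimal Python sketch that normalises a group-level %RDY value to an average per vCPU. The function names and the even split across vCPUs are my own assumptions for illustration: a real group also contains non-vCPU worlds, and ready time is rarely spread evenly, so for real troubleshooting expand the group in esxtop and look at the individual worlds.

```python
def per_vcpu_ready(group_rdy_pct: float, num_vcpus: int) -> float:
    """Average %RDY per vCPU, assuming the group value is a plain sum."""
    if num_vcpus <= 0:
        raise ValueError("num_vcpus must be positive")
    return group_rdy_pct / num_vcpus


def exceeds_threshold(group_rdy_pct: float, num_vcpus: int,
                      threshold_pct: float = 5.0) -> bool:
    """True if the average per-vCPU %RDY is above the threshold (default 5%)."""
    return per_vcpu_ready(group_rdy_pct, num_vcpus) > threshold_pct


# An 8 vCPU VM showing 16% at group level averages only 2% per vCPU.
print(per_vcpu_ready(16.0, 8))
print(exceeds_threshold(16.0, 8))
```

So a group value well above 5% does not by itself mean that any single vCPU is above the critical level.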

A short summary of the different session topics

CPU Scheduler

The goals are to ensure fairness (shares, reservations and limits) and to provide high CPU utilisation together with high application throughput.

CPU Scheduler overview includes

  • When to run
    • Workload wakes up
    • Workload wait
    • Preemption
  • What to do
    • Pick the world (VM vCPU) in the ready queue with the least consumed CPU time.
  • Where
    • Balance across PCPUs
    • Preserve cache state
    • Avoid HT/LLC contention
    • Close to worlds that have frequent communication pattern
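The "least consumed CPU time first" idea can be sketched with a min-heap. This is only a toy model under my own assumptions (the World class, run_slice and the fixed time slice are hypothetical names, and placement, shares and preemption are ignored); it is not how the ESXi scheduler is actually implemented.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class World:
    consumed_ms: float                 # ordering key: CPU time consumed so far
    name: str = field(compare=False)   # not used when comparing worlds


def run_slice(ready_queue: list, slice_ms: float) -> str:
    """Run the world with the least consumed CPU time for one time slice."""
    world = heapq.heappop(ready_queue)
    world.consumed_ms += slice_ms
    heapq.heappush(ready_queue, world)  # the world becomes ready again
    return world.name


queue = [World(30, "vcpu-a"), World(10, "vcpu-b"), World(20, "vcpu-c")]
heapq.heapify(queue)
print([run_slice(queue, 15) for _ in range(4)])
```

Worlds that have consumed less CPU time catch up first, which is the fairness goal described above.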

esxtop: %USED vs %RUN was explained. %RUN is based on wall-clock time: if a vCPU runs for 1 second during a 2-second window (the minimum interval in esxtop), you’ll see 50% %RUN, and if it runs for 2 seconds you’ll see 100% %RUN.
%USED tries to capture the amount of work done and needs to take things like hyperthreading and turbo mode into account. That’s why you can see e.g. 70% %USED while %RUN shows 100%.
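The %RUN arithmetic from the example can be written down in a couple of lines of Python. The function name run_pct is just an illustrative choice; the 2-second default is the minimum esxtop interval mentioned above.

```python
def run_pct(run_seconds: float, interval_seconds: float = 2.0) -> float:
    """%RUN: share of the sampling interval the vCPU spent running."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return 100.0 * run_seconds / interval_seconds


print(run_pct(1.0))  # ran 1 s of a 2 s window
print(run_pct(2.0))  # ran the whole window
```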

Even though the general recommendation is to turn on hyperthreading, be aware that when the system is under load each thread (and VM vCPU) runs slower with hyperthreading enabled than with hyperthreading turned off.
However, even though each thread might run slower, you’ll most likely still see improved overall throughput, and increasing throughput is what enabling hyperthreading is about.
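A back-of-the-envelope sketch of that trade-off. The per-thread speed of 0.65 is a made-up number purely for illustration (it was not given in the session); the point is only that two slower threads can still do more total work than one full-speed thread.

```python
def core_throughput(per_thread_speed: float, threads: int) -> float:
    """Total work per unit of time delivered by one physical core."""
    return per_thread_speed * threads


ht_off = core_throughput(1.0, 1)   # one thread at full speed
ht_on = core_throughput(0.65, 2)   # two threads, each slower (hypothetical value)
print(ht_off, ht_on, ht_on > ht_off)
```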

vCPU scheduler breakdown example:

  • t0 – t1 waiting
  • t1 – t2 CPU scheduling cost
  • t2 – t3 time in ready queue
  • t3 – t4 running
  • t4 – t5 interrupted
  • t5 – t6 running
  • t6 – t7 efficiency loss from e.g. power management, HT
  • t7 – t8 running
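To see how such a timeline turns into per-state totals, here is a small sketch that sums the time spent in each state. The timestamps in the usage example are invented, and the SEGMENTS list simply mirrors the eight intervals listed above.

```python
from collections import defaultdict

# State of the vCPU in each interval t0-t1, t1-t2, ..., t7-t8.
SEGMENTS = ["waiting", "scheduling cost", "ready", "running",
            "interrupted", "running", "efficiency loss", "running"]


def breakdown(timestamps):
    """Sum the time per state, given the nine timestamps t0..t8."""
    if len(timestamps) != len(SEGMENTS) + 1:
        raise ValueError("expected 9 timestamps (t0..t8)")
    totals = defaultdict(float)
    for state, start, end in zip(SEGMENTS, timestamps, timestamps[1:]):
        totals[state] += end - start
    return dict(totals)


# Hypothetical timestamps in milliseconds.
print(breakdown([0, 1, 2, 3, 5, 6, 8, 9, 12]))
```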

%USED is calculated according to

%USED = %RUN + %SYS - %OVRLP - E (E stands for efficiency loss, but it can be a gain or a loss)
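The same formula as a tiny Python function, using the 70% vs 100% example from earlier. The argument names are mine, and the 30% efficiency loss is simply the value that makes that example work out, not a measured number.

```python
def used_pct(run_pct: float, sys_pct: float, ovrlp_pct: float,
             efficiency_loss_pct: float) -> float:
    """%USED = %RUN + %SYS - %OVRLP - E; a negative E models a gain (e.g. turbo)."""
    return run_pct + sys_pct - ovrlp_pct - efficiency_loss_pct


print(used_pct(100.0, 0.0, 0.0, 30.0))   # loss, e.g. from hyperthreading
print(used_pct(50.0, 2.0, 1.0, -5.0))    # gain, e.g. from turbo mode
```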

Co-scheduling happens because we don’t want one vCPU to take all the CPU resources of a specific VM. ESXi allows a subset of vCPUs to run simultaneously. If you see high %COSTOP, check for:

  • %RDY
  • vCPU %VMWAIT

Memory management

Memory management is less flexible than CPU scheduling. The memory scheduler reclaims memory if consumed > entitled.

  • Entitlement: shares, limit, reservation, active estimation
  • Reclamation techniques, in order: Page Sharing > Ballooning > Compression > Host Swapping

Large memory pages give a 10-30% performance improvement, but large pages can’t be shared.

Ballooning is not always bad, as long as the VMs don’t use more memory than is available on the ESXi host. The most important parameter to check when investigating slow memory performance is the swap-in rate, SWR/s.

There are two types of memory overcommitment:

  • Configured
  • Active – If active memory overcommit = 1 there can be a performance problem
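As a rough sketch, both ratios can be computed like this. I am assuming here that overcommitment is defined as the sum of the VMs’ configured (or active) memory divided by the host’s memory; the function names and the numbers are illustrative only.

```python
def configured_overcommit(vm_configured_gb, host_memory_gb: float) -> float:
    """Sum of configured VM memory divided by host memory (assumed definition)."""
    return sum(vm_configured_gb) / host_memory_gb


def active_overcommit(vm_active_gb, host_memory_gb: float) -> float:
    """Sum of actively used VM memory divided by host memory (assumed definition)."""
    return sum(vm_active_gb) / host_memory_gb


# Three VMs on a hypothetical 16 GB host.
print(configured_overcommit([8, 8, 16], 16))  # configured memory is overcommitted
print(active_overcommit([2, 2, 4], 16))       # active memory still fits
```

A host can be heavily overcommitted on configured memory and still perform fine as long as the active ratio stays well below 1.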

NUMA Scheduler

The NUMA (Non-Uniform Memory Access) scheduler has two main purposes:

  • Load balancing, including:
    • Initial placement based on CPU/memory load + round robin
    • Periodic rebalancing every 2 seconds to improve utilisation and performance
  • Expose vNUMA to the guest
    • Useful for wide VMs, meaning VMs that need more vCPUs than there are physical cores in a NUMA node, or more memory than is available in a NUMA node.

In vSphere 6.5 it does not matter if you configure vSockets with 2 cores each while the NUMA node contains 4 cores: the scheduler then places 2 vSockets (4 vCPUs in total) in the same NUMA node.
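The arithmetic behind that example, as a simplified sketch. The function name is hypothetical, and the model only looks at core counts; the real NUMA scheduler also considers memory and load.

```python
def vsockets_per_numa_node(cores_per_vsocket: int, cores_per_numa_node: int) -> int:
    """How many whole vSockets fit on one NUMA node, by core count only."""
    if cores_per_vsocket <= 0:
        raise ValueError("cores_per_vsocket must be positive")
    return cores_per_numa_node // cores_per_vsocket


sockets = vsockets_per_numa_node(cores_per_vsocket=2, cores_per_numa_node=4)
print(sockets, sockets * 2)  # vSockets per node, vCPUs per node
```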

Cluster-on-die (COD) breaks each physical CPU socket into 2 NUMA domains because of issues with having too many CPUs on the same shared bus. This means that a NUMA node does not necessarily have (but can have) the same size as a physical CPU socket in terms of cores and memory. vSphere 6.5 supports COD on Haswell, Broadwell and future CPUs. Follow the vendor’s recommendation on whether to enable COD.

VM Sizing and Host Configuration

Regarding host configuration, the Power Management Policy has been a hot topic for many years. Unless you have a system which runs under heavy load all the time, you should use balanced mode instead of high performance mode, since balanced mode has better insight into the CPU state (using C- and P-states) and can, for example, take better advantage of Turbo Boost than high performance mode.

Ideally size a VM to fit within a NUMA node.
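A quick way to check a planned VM size against the host’s NUMA layout, as a sketch. The function and its parameters are my own; it counts physical cores only (no hyperthreads, in line with the sizing takeaway) and ignores memory used by the hypervisor and other VMs.

```python
import math


def numa_nodes_needed(vcpus: int, vm_memory_gb: float,
                      cores_per_node: int, memory_per_node_gb: float) -> int:
    """Minimum number of NUMA nodes the VM spans, by cores and by memory."""
    by_cpu = math.ceil(vcpus / cores_per_node)
    by_memory = math.ceil(vm_memory_gb / memory_per_node_gb)
    return max(by_cpu, by_memory)


# Hypothetical host: 10 cores and 128 GB per NUMA node.
print(numa_nodes_needed(8, 64, 10, 128))    # fits within one node
print(numa_nodes_needed(12, 64, 10, 128))   # wide VM because of vCPUs
print(numa_nodes_needed(4, 300, 10, 128))   # wide VM because of memory
```

Any result above 1 means the VM is a wide VM and will depend on vNUMA being exposed to the guest.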

The sizing information below basically tells you that both VM oversizing and undersizing are bad, so I suggest you right-size your VMs 🙂

  • Oversizing VM CPU can lead to performance regression. One showcase showed a 40% performance loss with 8 vCPUs compared to 1 vCPU.
  • Undersizing VM CPU can also lead to performance regression because of CPU contention inside the guest. Within the VM, check CPU usage and processor queue length. On Linux-based systems you can just run the command “vmstat 1 60” to get 60 samples of CPU queue information.
  • Oversizing VM memory may lead to memory reclamation.
  • Undersizing VM memory can lead to VM/guest-level paging.
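If you want to post-process the vmstat output rather than eyeball it, a minimal Python sketch could look like this. It assumes the usual Linux vmstat layout where the run queue is the column headed "r"; the helper name and the sample text are mine. Keep in mind that the first vmstat sample reports averages since boot, so you may want to drop it.

```python
from typing import List


def run_queue_samples(vmstat_output: str) -> List[int]:
    """Extract the 'r' (run queue) column from vmstat text output."""
    samples = []
    r_index = None
    for line in vmstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "r" in fields and "b" in fields:   # column header line
            r_index = fields.index("r")
            continue
        if r_index is not None and fields[0].isdigit() and len(fields) > r_index:
            samples.append(int(fields[r_index]))
    return samples


sample = """procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 812344  10240 402112    0    0     5     8   90  150  3  1 96  0  0
 6  0      0 811900  10240 402120    0    0     0     0  300  520 95  4  1  0  0
 5  0      0 811872  10240 402120    0    0     0    12  310  540 96  3  1  0  0
"""
queue = run_queue_samples(sample)
print(queue, max(queue))
```

A run queue that stays consistently above the number of vCPUs in the guest is a hint that the VM may be undersized on CPU.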

All in all, a really good session, as always.