Notes on Memory Resource Management in VMware ESX Server

The paper: [1]esx

The context is virtual machines. A VM system has a virtual machine monitor (VMM), a privileged component, which runs one or more guest operating systems (guests) that may not know of one another’s existence (or of the existence of the VMM). Virtual machines can be hosted or native. Hosted means the VM runs as a process on an underlying OS, much like QEMU does for your labs. ESX Server is native, meaning the VMM has full privilege, with no OS underneath.

Goal

Improve the performance of VM systems whose primary memory is overcommitted. (The sum of the guests’ “physical” memory sizes is less than the host’s machine memory size.)

This goal is specific: we can assume the system bottlenecks on memory, not CPU or other devices. This allows us to trade off CPU time for reduced memory pressure, and the paper does this. (A system where guests contended on CPU, but not memory, would make different choices.) Nevertheless, the evaluation is overall system performance, so we can’t waste too much CPU.

This goal is not exactly new. Your laptop probably has overcommitted primary memory right now. Operating systems are designed to overcommit primary memory; they get better utilization that way, since most programs are mostly idle.

But the VM context adds some really interesting constraints. How can the VMM extract information from, and coordinate decisions with, guests that aren’t aware it exists?

The paper attacks this goal using a fantastic grab bag of data structure ideas and clever hacks. We can learn a lot from the way these problems were addressed.

Content-based page sharing (Section 4)

Problem: Memory is wasted on many redundant copies of the same page, for instance when multiple copies of the same version of Windows are running.

Conventional OS solution: Shared libraries, read-only program data (helps when multiple copies of an application are running), copy-on-write fork.

VM context: The conventional solutions make sharing happen explicitly, by changing how applications are written (shared libraries) or by building sharing into the API (fork). The VMM cannot do either.

Insight: Detect sharing opportunities in the most general way possible, by looking behind the scenes for pages that contain the same data. Matching physical pages can be remapped onto a single machine page, which guests access copy-on-write (even if they expect to access the page read/write).

Cooperative page reclamation (Section 3)

Problem: A VMM may need to reclaim some of guest G1’s memory to give to G2. (Perhaps G2 is more active; perhaps G2 deserves more memory by policy.) Which G1 page(s) should be reclaimed? If the VMM makes a bad choice, it will have to read G1’s page back in from disk. Worse, the VMM’s paging mechinasm can subvert the guests’ paging mechanisms, further increasing swapping. This is the double paging problem. (Say the VMM swaps out page P1, and then later G1 desides to swap out P1 to its virtual disk. The VMM, then, must (1) read P1 into memory from VMM swap space so G1 can read it, (2) write P1 to G1’s virtual disk, and then possibly (3) write P1 back to VMM swap space!)

Conventional OS solution: The OS manages most system memory using a buffer cache. Swap decisions are based on several interacting algorithms, including a least-recently-used list (so that recently-accessed memory is swapped out last) and information about likely future uses (cf. the Application Directed Prefetching paper). Applications can tell the OS about usage patterns using APIs such as madvise and posix_fadvise; the OS itself can infer usage patterns for some structures, like directories. Newer operating systems make heroic efforts to swap the right stuff out and in, with algorithms that approach machine learning (e.g. Windows SuperFetch).

VM context: The VMM could maintain an LRU list using similar techniques as the OS (on IA–32, the PTE_A bit), but is not given any feedback about future memory use. The double-paging problem cannot be avoided.

Insight: These problems could be solved with a simple VMM-guest interface: the VMM asks the guest to free a page, then the guest returns the pages it freed. This interface doesn’t exist, for obvious reasons. However, interfaces to allocate physical memory do exist in most kernels (for use by hardware drivers). And physical memory allocations are exclusive: when one OS component allocates a page, the prior owner has to free it. So we can convince the OS proper to free a page by asking a cooperating driver within that OS to allocate a page! What a clever reversal! When pressure reduces, the VMM asks the cooperating driver to free some of this pinned memory, which allows the guest to allocate it for other purposes. The cooperating driver is called the “balloon driver.”

Problem: VMware’s management interface should support guest priorities, which will affect guest memory usage. But strict partitions of memory would pointlessly restrict utilization: if a high-priority guest is idle, the machine’s resources should be given to other guests.

Conventional OS solution: Real OSes implement priorities for CPU scheduling using ad-hoc mixtures of proportional-share, strict priority, and priority with aging (with addons, such as priority inheritance to address priority inversion). Priorities are rarely applied to memory, as far as I know; it’s a very hard problem. The OS detects idle processes trivially and explicitly—they block in a system call—and can identify their memory. However, naive priority schemes might punish idleness too severely or too leniently. For instance, a hierarchical priority scheme (if (idle) low priority; else higher priority;) might swap out all idle memory, causing too much swapping when an idle process resumes.

VM context: A VMM can often detect total guest idleness, since modern machines offer explicit sleep instructions and low-power modes (for power saving and hyperthreading; see HLT, PAUSE, etc.), but there are no explicit interfaces for memory idleness. LRU tracking can detect idle memory, but this doesn’t solve the problem of how to flexibly combine idleness with priorities.

Insight: Hierarchical priority schemes either weight idleness too much or not at all. So we drop the hierarchy and parametrically combine idleness with priority, accounting for idleness on a flexible scale. The particular definition of ρ (the idle+share metric) is less important than having a single metric that combines idleness and share. Idleness is measured using the equivalent of an LRU list: n pages are sampled each second, by causing accesses to fault; the fraction f of sampled pages that fault is a good activity metric.

I/O page remapping (Section 7)

Problem: Late-model IA–32 processors support up to 64GB of machine memory (machine addresses are 36 bits long), but some devices only support the initial 4GB (“low memory”). The VMM must interpose on all guest device access; that interposition can be cheap in the common case (essentially just checking command validity and addresses), but if a guest tries to use a low-memory device using high machine pages, the VMM must copy data between the guest’s pages and “trampoline” pages in low machine memory.

Conventional OS solution: The OS uses low physical memory to talk to the problematic device.

VM context: The guest might use a low physical page that actually mapped to a high machine page. It is not possible to map all guests’ low physical memory to low machine memory: there are too many guests.

Insight: If a guest frequently uses a high machine page for device communication, the VMM can transparently remap that page to low machine memory. Waiting for actual device communication lets the VMM keep address-agnostic data (which is the majority) in high machine memory.

Dynamic reallocation (Section 6.3)

Problem: The VMM has several techniques for reducing guest memory usage: ballooning, paging, and even blocking entire VMs. How should these techniques be coordinated when the machine is under memory pressure?

Conventional OS solution: Conventional OSes also have an array of techniques: reduced prefetching, paging, blocking processes, swapping out entire processes. Roughly speaking an OS will order these techniques from least disruptive to most disruptive and try them in order until memory pressure reduces.

VM context: Not much difference (only the techniques are different). However, it’s great to have a well-described example of increasing-intensity reallocation.

Techniques

Sample guest behavior
- In the long run, sampling techniques can yield information almost as good as explicit techniques
- Sample memory idleness by snooping on accesses (mark sampled pages as inaccessible, leading to trap on access: §5.3)
- Sample page contents (§4)
Make the common case fast/cheap
- Don’t mark initially sampled pages as COW; wait for actual sharing (§4.2)
- Don’t waste memory on large sampled-page counts, keep an overflow table instead (§4.3)
Hysteresis: Avoid pathological oscillations (such as swapping a page in and out of memory) by delaying mode switches—that is, by making the system’s current mode dependent on its history, as well as the current state. Thermostats are a good real-world example. Turning a furnace on and off can quickly wear it down. Now say a thermostat is set at 75°F. If the system turns the furnace on whenever the temperature is below 75°, and then off whenever it goes above 75°, the furnace will switch modes all the time! A system with hysteresis might turn the furnace on when temperature falls below 73°, and off when the temperature goes above 77°.
- Mode switches in dynamic allocation use hysteresis (ballooning turns on when the system has <4% free memory, and off again when the system has >6% free memory; paging turns on at <2%, and then off at >4%; and so forth).
- Idle memory detection implements hysteresis with a combination of “slow” and “fast” moving averages: reductions in idleness are detected quickly, since the “fast” average goes up quickly, while increases in idleness are detected slowly, since the “slow” average stays high for longer.
Exponentially weighted moving averages (EWMAs) (§5.3)
- Very simple, low-space mechanism for tracking a smooth average
- Let x̄_i be the average at time i, and x_i be the sample at time i.
- Then x̄_i = αx_i + (1−α)x̄_i−1.
- A “fast” EWMA has high α (perhaps α=0.1 or more), and quickly reflects new data. A “slow” EWMA has low α (0.01 or less), and changes more slowly. (The exact meanings of “fast” and “slow” depend on the sampling period; one can calculate analytically how quickly a EWMA will get within a factor ε of a new steady state, given α and the sampling period.)
Track system performance at a fine-grained level, fix observed problems
- I/O page remapping detects frequent use of high machine memory for I/O and remaps to low machine memory
- In the VMware “Comparison” paper [2]comparison, the VMware binary translator detects frequent traps in system code and re-translates the problematic regions using SIMULATE or an interpreter to avoid the traps
- A form of “make the common case fast”
Squeeze space (§4)
- Hint frames make space for page backreferences by truncating the hash
- 24-bit page numbers
- No backreferences for shared pages
Avoid policy (to avoid bad interactions with guest policy)
- Ballooning (§3)
- When ballooning fails, demand paging falls back to random page replacement (§3.3)

Interactions and factor analysis

What is required to swap out a shared page? How much data must be scanned? (Think about what structures VMware stores.)
How do ballooning and content-based page sharing interact? What if the balloon driver frees up a shared page?
How do content-based page sharing and VMM paging interact?
How do idleness sampling and content-based page sharing sampling interact?

Evaluation

We leave this paper totally believing that its mechanisms work and are well designed. But we don’t leave with a feeling that all system interactions have been evaluated. And, as a result, we don’t know which, if any, mechanisms are overdesigned, and which, if any, mechanisms have pathological behavior in extreme cases. Do any pages but zero pages have a high share count? Is share-before-swap useful for any pages but zero pages? Would it ever become a problem that shared pages are difficult to swap? What are the CPU time overheads of the paper’s mechanisms, which all trade off CPU time for memory utilization? (In most cases, the mechanisms will come out ahead of the alternative, tremendously slow disk paging; but the overheads are still worth knowing precisely. We are told some overheads—for example, §4 reports that content-based page sharing is faster on a macrobenchmark, and that CPU overhead was “negligible”; this is better than nothing.) The benchmarks provided are generally macrobenchmarks for real-world workloads. Note how these real-world workloads are chosen to show off the mechanisms: in §4’s benchmarks each VMM runs multiple copies of the same OS, which surely offers more opportunities for sharing; but in §5, where sharing is not at issue, Fig.7 shows a VMM running one Linux VM and one Windows VM. In a research paper produced within academia, we might expect to see attempts at bad-case analysis—“here is a benchmark designed to make my system performed badly.” I find such benchmarks incredibly enlightening—they often show you problems you didn’t know you had—but few companies will publish such benchmarks; the average-, or at least real-world, case matters far more in industry. We’re grateful for what the paper offers anyway.

esx
Carl A. Waldspurger, “Memory Resource Management in VMware ESX Server”, in Proc. 5th OSDI, Dec. 2002
comparison
Keith Adams and Ole Agesen, “A comparison of software and hardware techniques for x86 virtualization”, in Proc. ASPLOS XII, Oct. 2006, pp.2–13