***** Disco completion *****

* Disco kernel replicated on every processor

* Recap memory management (some notes repeated from Lecture 6)

* Virtualized memory
- OS sees "physical" pages, but those are really virtual; Disco maps those numbers to "machine" pages
-- "pmap" data structure: maps physical => machine, and machine => virtual
-- To install a TLB entry: look up the 'physical' addr in the pmap to find the machine addr; install the entry; remember the mach=>virt mapping, so you can shoot down the relevant TLB entries if Disco takes the machine page away from the OS
- Requires virtualizing TLB inserts
- OS changes: hints from the OS to Disco about memory management
-- Request a cleared page, rather than zeroing a page in software; Disco needs to do this anyway
-- "I'm not using this page anymore" (freelist)

* Disks
- Non-persistent disks
-- May be shared between VMs, but changes are local per-VM
-- Changes don't survive reboots
- Read a disk block that someone else has read?
-- Memory access!
- Read a multiple of the page size?
-- Remap the TLB!
-- But map it read-only to preserve semantics; if someone writes the data, make a copy (copy-on-write)
-- For read-only data, like program memory, this works great: "Effectively we get the memory sharing patterns expected of a single shared memory multiprocessor OS even though the system runs multiple independent OSes."
- Copy-on-write only used for VM-dependent, non-persistent disks
-- No sharing! Independent machines
-- Modified sectors kept in main memory
-- Persistent disks are managed separately; see below
- Data structures
-- B-tree: disk sector => machine addr in the global disk cache; (disk sector, virtual machine) => machine addr for per-VM modifications
- Persistent disks
-- Only one virtual machine can mount a persistent disk at a time!
-- No need to virtualize disk layout
-- Corresponds to the multiple-machine abstraction: use NFS to access remote files
- Networking interface
-- Write a new virtual network device driver with no MTU limit
-- Driver detects the transfer of a disk block ==> transfer the page, not the data!
... OS already transfers pages because of scatter/gather! Convenient. (What is scatter/gather? The device can DMA to/from a list of non-contiguous buffers in a single operation, so the driver already works in whole pages.)
... Kernel NFS always copied mbuf data to the file cache; instead, use a new HAL-layer function, remap(): like bcopy(), but remaps the page when possible
... Kernel then places the mbuf data on a free-mbuf list and reuses that memory ==> lots of copy-on-write faults on the shared pages! Just fix this problem.

* More OS changes
- What about kernel mode?
-- MIPS supports "unmapped" memory in the KSEG0 segment; kernel mode only; accesses bypass the TLB
-- OSes put their code and data in this segment
... Reduced pressure on the TLB!
-- Problem: Disco would fault on each instruction
-- Solution: relink the OS so its code and data live in mapped memory
... Not necessary on other OSes
- ASIDs (address space IDs)
-- TLB entries are tagged with an ASID
-- Shoot down the TLB when you switch OSes; no need to remap ASIDs
- More TLB pressure
-- The OS is in there -- it wasn't before
-- TLB misses are more expensive, too, since they're virtualized
-- Larger virtual TLB in software -- like the exokernel (discussed before)

**** VMware ESX Server ****

* How has the goal changed from Disco to VMware?
- Disco was worried about fault isolation *in a massive processor array*
-- Mean time to failure for hardware components, etc.
-- Multiple OSes per processor was possible, but not the expected mode
- VMware insight: holy crap, the problem is *the software*
-- IT departments run a bunch of servers: mail, internal web server, directory server, ...
-- Often run on different *machines* to achieve fault isolation
... You *know* they're isolated!
-- Expensive!
-- Solution: fault isolation *in software!*
... If someone tried to sell you good software, you wouldn't believe it
... But the VM has the same fault isolation properties as different hardware boxes!

* What are the major differences between VMware and Disco?
- The OSes are *unmodified*
-- No hooks
... No cheap versions of enabling/disabling CPU interrupts or privileged registers
... No hints from the OS about memory management:
... No "request a cleared page, rather than zeroing a page in software"
... No "I'm not using this page anymore" (freelist)
... No sexy mbuf change
... No sexy memcpy change
- Reduces possible performance
- But can we get most of the benefit *anyway*?

* Techniques
- "Gray-box" techniques
-- Don't get the OS to tell you what it's doing
-- Watch what it *actually does*
-- Can actually achieve more sharing and optimization than explicit changes, because you can catch unexpected sharing opportunities!

* Content-based page sharing
- Identify pages *by their contents*
- Hash page contents into a table
- If the hash matches, compare contents
- If contents match, copy-on-write sharing
- How to find potential matches?
-- Check pages randomly
-- If no match, install the page in the hash table
-- But don't prevent writes yet!
-- Instead, mark it as a "hint" entry
-- If a later page matches this bucket, check whether the hinted page itself has changed
-- If so, remove the hint and install the new page
-- If no change, bump the refcount
-- Also check pages right before paging them out to disk
- Space saving
-- 16-bit reference count plus an overflow table for larger counts
- Space overhead: < 0.5% of system memory
- 5 MB of memory reclaimed from a single VM!!

* Ballooning
- Problem: how do you convince an OS to page memory out?
- In Disco, an OS hint was used: the freelist was visible to Disco
- Solution: load an OS module that communicates with VMware
-- Why is a module OK? It doesn't change OS interfaces, and most modern OSes have a loadable-module interface
- When VMware needs memory, the balloon requests pinned memory from the OS
-- Forces the guest OS to page
-- The balloon tells VMware which pages were allocated
-- VMware needn't actually allocate machine memory behind those pages!
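The hash-then-hint flow described above can be sketched in Python. The `PageSharer` class, the SHA-1 digests, and the frame-id bookkeeping are illustrative stand-ins, not ESX's actual data structures; real ESX operates on machine pages and write-protects shared pages in the pmap, which this sketch only notes in comments.

```python
import hashlib

PAGE_SIZE = 4096  # bytes per page; illustrative

class PageSharer:
    """Sketch of content-based page sharing with hint entries."""

    def __init__(self):
        # digest -> ("hint", frame_id) or ("shared", frame_id, refcount)
        self.table = {}
        self.frames = {}  # frame_id -> page contents (stands in for machine memory)

    def _digest(self, data):
        return hashlib.sha1(data).digest()

    def scan(self, frame_id, data):
        """Consider one randomly chosen frame for sharing.

        Returns the frame the caller should map: its own frame, or a
        shared frame to map copy-on-write.
        """
        self.frames[frame_id] = data
        key = self._digest(data)
        entry = self.table.get(key)
        if entry is None:
            # No match: record a *hint* only -- don't write-protect yet.
            self.table[key] = ("hint", frame_id)
            return frame_id
        if entry[0] == "hint":
            hinted = entry[1]
            # The hinted page may have been written since we recorded it:
            # recheck its current contents before committing to share.
            if self._digest(self.frames[hinted]) != key:
                self.table[key] = ("hint", frame_id)  # stale hint: replace it
                return frame_id
            if self.frames[hinted] != data:
                return frame_id  # full compare guards against hash collisions
            # Both pages match: share them (real ESX marks both COW here).
            self.table[key] = ("shared", hinted, 2)
            return hinted
        _, shared, refs = entry
        if self.frames[shared] != data:
            return frame_id
        self.table[key] = ("shared", shared, refs + 1)
        return shared
```

A zero-filled page scanned from three VMs collapses to one shared frame with a refcount of 3, which is exactly why this technique reclaims so much memory from homogeneous guests.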
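The balloon interaction can be modeled as a toy simulation; all class and method names here (`BalloonDriver.inflate`, `Hypervisor.reclaim`) are invented for illustration, and only the accounting is modeled, not real pinning, paging, or the private guest-to-hypervisor channel.

```python
class GuestOS:
    """Toy guest: owns a fixed set of physical page numbers (PPNs)."""

    def __init__(self, n_pages):
        self.free = set(range(n_pages))  # free PPNs
        self.pinned = set()

    def alloc_pinned(self):
        # The guest applies its own policy here, paging other memory
        # out under pressure (not modeled in this sketch).
        if not self.free:
            return None
        ppn = self.free.pop()
        self.pinned.add(ppn)
        return ppn

class Hypervisor:
    """Records which backing machine pages it may take back."""

    def __init__(self):
        self.reclaimed = []

    def reclaim(self, guest, ppn):
        self.reclaimed.append((id(guest), ppn))

class BalloonDriver:
    """Sketch of a balloon module loaded into the guest."""

    def __init__(self, guest):
        self.guest = guest
        self.held = []  # PPNs currently inflating the balloon

    def inflate(self, n_pages, hypervisor):
        """Pin n_pages in the guest and report them to the hypervisor."""
        for _ in range(n_pages):
            ppn = self.guest.alloc_pinned()
            if ppn is None:
                break  # guest is out of free pages: stop inflating
            self.held.append(ppn)
            # The hypervisor can now drop the machine page backing this
            # PPN -- nothing in the guest will touch a pinned balloon page.
            hypervisor.reclaim(self.guest, ppn)
        return len(self.held)
```

The key point the sketch preserves: the guest's own replacement policy decides what to evict to satisfy the pinned allocation, so the hypervisor never has to guess which guest pages are valuable.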
* Allocation policies
- Sysadmin wants to control how much space VMs get relative to one another
-- min size, max size, memory shares (proportional allocation if \sum max > actual physical memory)
-- 32 MB overhead/VM
- Dynamic reallocation
-- Attempt to maintain a minimum amount of free memory
-- high state: free mem >= 4% -- no reclamation
-- soft state: free mem 2-4% -- balloon up to 6% free; page if ballooning isn't possible
-- hard state: free mem 1-2% -- page up to 4% free
-- low state: free mem < 1% -- paging + stop VMs that are above their allocation threshold

* Allocation policies 2: idle memory
- Min-funding allocation
-- Client A demands space
-- Replacement algorithm selects a client B to relinquish space
-- B = the client with the fewest shares per allocated page
-- Shares per page \rho = S/P: S = number of shares, P = number of pages
- But we want to differentiate between idle OSes and active OSes
- If an OS has many shares but isn't doing anything, doesn't it make sense to give its memory to other OSes?
- How can we even tell whether an OS isn't doing anything!?
- Measuring idle memory
-- Idea: statistical sampling
-- Every 30 seconds, choose 100 pages per VM and mark them "not present"
... But don't swap them out!
-- If the OS touches a sampled page, we'll get a fault
... Mark the page as active!
-- At the end of the sampling period, the fraction active is f = t/n
- How to track changes in the idle fraction over time? Moving averages
-- Keep 3 moving averages, though!
... 1 slow average: smooth & stable
... 1 fast average: recent results
... 1 faster average: includes the current sampling period
... Estimated active fraction = max of all 3 averages (idle fraction = 1 - active)
... Goal: track allocation quickly (a fast average is the max) & deallocation slowly (the slow average is the max)
- Now we know how idle the memory is
- What do we do with this information?
- Idle memory tax!
-- f = active fraction = t/n
-- Shares-per-page with tax rate \tau:
     \rho = S / (P * (f + (1-f)/(1-\tau)))
         = S / (P * ((f - f\tau + 1 - f)/(1-\tau)))
         = S / (P * (1 - f\tau)/(1-\tau))
-- \tau = 0: pure shares-per-page (\rho = S/P)
-- \tau ~= 1: nearly all idle memory can be taken away; e.g. \tau = 0.99:
     f = 0 => \rho = S/(100P) = very small
     f = 1 => \rho = S/P
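The sampling, moving-average, and tax machinery above can be sketched together in Python. The class names, the sample size default, and the EWMA gains (0.1 / 0.5 / 0.75) are made-up placeholders, not ESX's actual constants; `shares_per_page` transcribes the formula from the notes directly.

```python
class IdleMemoryEstimator:
    """Sketch of sampled active-memory estimation with three moving averages."""

    def __init__(self, n=100):
        self.n = n  # pages invalidated (but not swapped out) per period
        # Three exponentially weighted moving averages of the active
        # fraction f = t/n: slow (smooth, stable), fast (recent results),
        # and a faster one meant to include the in-progress period.
        self.slow = self.fast = self.faster = 0.0

    def end_period(self, touched):
        """Fold in one period's fault count; return the active-fraction estimate."""
        f = touched / self.n
        self.slow   = 0.9  * self.slow   + 0.1  * f   # gains are illustrative
        self.fast   = 0.5  * self.fast   + 0.5  * f
        self.faster = 0.25 * self.faster + 0.75 * f
        # Taking the max tracks rising activity quickly (a fast average
        # dominates) and falling activity slowly (the slow one lingers).
        return max(self.slow, self.fast, self.faster)

def shares_per_page(S, P, f, tau):
    """rho = S / (P * (f + (1-f)/(1-tau))), the tax-adjusted shares-per-page."""
    return S / (P * (f + (1.0 - f) / (1.0 - tau)))
```

Plugging in the limit cases from the notes: with tau = 0 the function reduces to S/P regardless of f, and with tau = 0.99 a fully idle VM (f = 0) is valued at S/(100P), making it the first candidate for min-funding reclamation.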