***** Disco completion *****

* Disco kernel replicated on every processor

* Recap memory management (some notes repeated from Lecture 6)

* Virtualized memory
- OS sees "physical" pages, but those are really virtual; Disco maps those numbers to "machine" pages
-- "pmap" data structure: maps physical => machine, and machine => virtual
-- To install a TLB entry: look up the 'physical' addr in the pmap to find the machine addr; install the entry; remember the mach=>virt mapping, so you can shoot down the relevant TLB entries if Disco takes the machine page away from the OS
- Requires virtualizing TLB inserts
- OS changes: hints from the OS to Disco about memory management
-- Request a cleared page, rather than zeroing a page in software; Disco needs to do this anyway
-- "I'm not using this page anymore" (freelist)

* Disks
- Non-persistent disks
-- May be shared between VMs, but changes are local per-VM
-- Changes don't survive reboots
- Read a disk block that someone else has read?
-- Memory access!
- Read a multiple of the page size?
-- Remap the TLB!
-- But map it read-only to preserve semantics; if someone writes the data, make a copy (copy-on-write)
-- For read-only data, like program memory, this works great: "Effectively we get the memory sharing patterns expected of a single shared memory multiprocessor OS even though the system runs multiple independent OSes."
- Copy-on-write only used for VM-dependent, non-persistent disks
-- No sharing! Independent machines
-- Modified sectors kept in main memory
-- Persistent disks are managed separately; see below
- Data structures
-- B-tree: disk sector => machine addr in the global disk cache; (disk sector, virtual machine) => machine addr for per-VM modifications
- Persistent disks
-- Only one virtual machine can mount a persistent disk at a time!
-- No need to virtualize disk layout
-- Corresponds to the multiple-machine abstraction: use NFS to access remote files
- Networking interface
-- Write a new virtual network device driver with no MTU limit
-- Driver detects the transfer of a disk block ==> transfer the page, not the data!
... OS already transfers pages because of scatter/gather! Convenient. (What is scatter/gather? The device can DMA to/from a list of non-contiguous buffers in a single operation, so the driver already works in whole pages.)
... Kernel NFS always copied mbuf data to the file cache; instead, use a new HAL-layer function, remap(): like bcopy(), but remaps the page when possible
... Kernel then places the mbuf data on a free-mbuf list and reuses that memory ==> lots of copy-on-write faults on the shared pages! Just fix this problem.

* More OS changes
- What about kernel mode?
-- MIPS supports "unmapped" memory in the KSEG0 segment; kernel mode only; accesses bypass the TLB
-- OSes put their code and data in this segment
... Reduced pressure on the TLB!
-- Problem: Disco would fault on each instruction
-- Solution: relink the OS so its code and data live in mapped memory
... Not necessary on other OSes
- ASIDs (address space IDs)
-- TLB entries are tagged with an ASID
-- Shoot down the TLB when you switch OSes; no need to remap ASIDs
- More TLB pressure
-- The OS is in there -- it wasn't before
-- TLB misses are more expensive, too, since they're virtualized
-- Larger virtual TLB in software -- like the exokernel (discussed before)

**** VMware ESX Server ****

* How has the goal changed from Disco to VMware?
- Disco was worried about fault isolation *in a massive processor array*
-- Mean time to failure for hardware components, etc.
-- Multiple OSes per processor was possible, but not the expected mode
- VMware insight: holy crap, the problem is *the software*
-- IT departments run a bunch of servers: mail, internal web server, directory server, ...
-- Often run on different *machines* to achieve fault isolation
... You *know* they're isolated!
-- Expensive!
-- Solution: fault isolation *in software!*
... If someone tried to sell you good software, you wouldn't believe it
... But the VM has the same fault isolation properties as different hardware boxes!

* What are the major differences between VMware and Disco?
- The OSes are *unmodified*
-- No hooks
... No cheap versions of enabling/disabling CPU interrupts or privileged registers
... No hints from the OS about memory management:
... No "request a cleared page, rather than zeroing a page in software"
... No "I'm not using this page anymore" (freelist)
... No sexy mbuf change
... No sexy memcpy change
- Reduces possible performance
- But can we get most of the benefit *anyway*?

* Techniques
- "Gray-box" techniques
-- Don't get the OS to tell you what it's doing
-- Watch what it *actually does*
-- Can actually achieve more sharing and optimization than explicit changes, because you can catch unexpected sharing opportunities!

* Content-based page sharing
- Identify pages *by their contents*
- Hash page contents into a table
- If the hash matches, compare contents
- If contents match, copy-on-write sharing
- How to find potential matches?
-- Check pages randomly
-- If no match, install the page in the hash table
-- But don't prevent writes yet!
-- Instead, mark it as a "hint" entry
-- If a later page matches this bucket, check whether the hinted page itself has changed
-- If so, remove the hint and install the new page
-- If no change, bump the refcount
-- Also check pages right before paging them out to disk
- Space saving
-- 16-bit reference count plus an overflow table for larger counts
- Space overhead: < 0.5% of system memory
- 5 MB of memory reclaimed from a single VM!!

* Ballooning
- Problem: how do you convince an OS to page memory out?
- In Disco, an OS hint was used: the freelist was visible to Disco
- Solution: load an OS module that communicates with VMware
-- Why is a module OK? It doesn't change OS interfaces, and most modern OSes have a loadable-module interface
- When VMware needs memory, the balloon requests pinned memory from the OS
-- Forces the guest OS to page
-- The balloon tells VMware which pages were allocated
-- VMware needn't actually allocate machine memory behind those pages!
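The hash-then-hint flow described above can be sketched in Python. The `PageSharer` class, the SHA-1 digests, and the frame-id bookkeeping are illustrative stand-ins, not ESX's actual data structures; real ESX operates on machine pages and write-protects shared pages in the pmap, which this sketch only notes in comments.

```python
import hashlib

PAGE_SIZE = 4096  # bytes per page; illustrative

class PageSharer:
    """Sketch of content-based page sharing with hint entries."""

    def __init__(self):
        # digest -> ("hint", frame_id) or ("shared", frame_id, refcount)
        self.table = {}
        self.frames = {}  # frame_id -> page contents (stands in for machine memory)

    def _digest(self, data):
        return hashlib.sha1(data).digest()

    def scan(self, frame_id, data):
        """Consider one randomly chosen frame for sharing.

        Returns the frame the caller should map: its own frame, or a
        shared frame to map copy-on-write.
        """
        self.frames[frame_id] = data
        key = self._digest(data)
        entry = self.table.get(key)
        if entry is None:
            # No match: record a *hint* only -- don't write-protect yet.
            self.table[key] = ("hint", frame_id)
            return frame_id
        if entry[0] == "hint":
            hinted = entry[1]
            # The hinted page may have been written since we recorded it:
            # recheck its current contents before committing to share.
            if self._digest(self.frames[hinted]) != key:
                self.table[key] = ("hint", frame_id)  # stale hint: replace it
                return frame_id
            if self.frames[hinted] != data:
                return frame_id  # full compare guards against hash collisions
            # Both pages match: share them (real ESX marks both COW here).
            self.table[key] = ("shared", hinted, 2)
            return hinted
        _, shared, refs = entry
        if self.frames[shared] != data:
            return frame_id
        self.table[key] = ("shared", shared, refs + 1)
        return shared
```

A zero-filled page scanned from three VMs collapses to one shared frame with a refcount of 3, which is exactly why this technique reclaims so much memory from homogeneous guests.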
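The balloon interaction can be modeled as a toy simulation; all class and method names here (`BalloonDriver.inflate`, `Hypervisor.reclaim`) are invented for illustration, and only the accounting is modeled, not real pinning, paging, or the private guest-to-hypervisor channel.

```python
class GuestOS:
    """Toy guest: owns a fixed set of physical page numbers (PPNs)."""

    def __init__(self, n_pages):
        self.free = set(range(n_pages))  # free PPNs
        self.pinned = set()

    def alloc_pinned(self):
        # The guest applies its own policy here, paging other memory
        # out under pressure (not modeled in this sketch).
        if not self.free:
            return None
        ppn = self.free.pop()
        self.pinned.add(ppn)
        return ppn

class Hypervisor:
    """Records which backing machine pages it may take back."""

    def __init__(self):
        self.reclaimed = []

    def reclaim(self, guest, ppn):
        self.reclaimed.append((id(guest), ppn))

class BalloonDriver:
    """Sketch of a balloon module loaded into the guest."""

    def __init__(self, guest):
        self.guest = guest
        self.held = []  # PPNs currently inflating the balloon

    def inflate(self, n_pages, hypervisor):
        """Pin n_pages in the guest and report them to the hypervisor."""
        for _ in range(n_pages):
            ppn = self.guest.alloc_pinned()
            if ppn is None:
                break  # guest is out of free pages: stop inflating
            self.held.append(ppn)
            # The hypervisor can now drop the machine page backing this
            # PPN -- nothing in the guest will touch a pinned balloon page.
            hypervisor.reclaim(self.guest, ppn)
        return len(self.held)
```

The key point the sketch preserves: the guest's own replacement policy decides what to evict to satisfy the pinned allocation, so the hypervisor never has to guess which guest pages are valuable.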
* Allocation policies
- Sysadmin wants to control how much space VMs get relative to one another
-- min size, max size, memory shares (proportional allocation if \sum max > actual physical memory)
-- 32 MB overhead/VM
- Dynamic reallocation
-- Attempt to maintain a minimum amount of free memory
-- high state: free mem >= 4% -- no reclamation
-- soft state: free mem 2-4% -- balloon up to 6% free; page if ballooning isn't possible
-- hard state: free mem 1-2% -- page up to 4% free
-- low state: free mem < 1% -- paging + stop VMs that are above their allocation threshold

* Allocation policies 2: idle memory
- Min-funding allocation
-- Client A demands space
-- Replacement algorithm selects a client B to relinquish space
-- B = the client with the fewest shares per allocated page
-- Shares per page \rho = S/P: S = number of shares, P = number of pages
- But we want to differentiate between idle OSes and active OSes
- If an OS has many shares but isn't doing anything, doesn't it make sense to give its memory to other OSes?
- How can we even tell whether an OS isn't doing anything!?
- Measuring idle memory
-- Idea: statistical sampling
-- Every 30 seconds, choose 100 pages per VM and mark them "not present"
... But don't swap them out!
-- If the OS touches a sampled page, we'll get a fault
... Mark the page as active!
-- At the end of the sampling period, the fraction active is f = t/n
- How to track changes in the idle fraction over time? Moving averages
-- Keep 3 moving averages, though!
... 1 slow average: smooth & stable
... 1 fast average: recent results
... 1 faster average: includes the current sampling period
... Estimated active fraction = max of all 3 averages (idle fraction = 1 - active)
... Goal: track allocation quickly (a fast average is the max) & deallocation slowly (the slow average is the max)
- Now we know how idle the memory is
- What do we do with this information?
- Idle memory tax!
-- f = active fraction = t/n
-- Shares-per-page with tax rate \tau:
     \rho = S / (P * (f + (1-f)/(1-\tau)))
         = S / (P * ((f - f\tau + 1 - f)/(1-\tau)))
         = S / (P * (1 - f\tau)/(1-\tau))
-- \tau = 0: pure shares-per-page (\rho = S/P)
-- \tau ~= 1: nearly all idle memory can be taken away; e.g. \tau = 0.99:
     f = 0 => \rho = S/(100P) = very small
     f = 1 => \rho = S/P
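The sampling, moving-average, and tax machinery above can be sketched together in Python. The class names, the sample size default, and the EWMA gains (0.1 / 0.5 / 0.75) are made-up placeholders, not ESX's actual constants; `shares_per_page` transcribes the formula from the notes directly.

```python
class IdleMemoryEstimator:
    """Sketch of sampled active-memory estimation with three moving averages."""

    def __init__(self, n=100):
        self.n = n  # pages invalidated (but not swapped out) per period
        # Three exponentially weighted moving averages of the active
        # fraction f = t/n: slow (smooth, stable), fast (recent results),
        # and a faster one meant to include the in-progress period.
        self.slow = self.fast = self.faster = 0.0

    def end_period(self, touched):
        """Fold in one period's fault count; return the active-fraction estimate."""
        f = touched / self.n
        self.slow   = 0.9  * self.slow   + 0.1  * f   # gains are illustrative
        self.fast   = 0.5  * self.fast   + 0.5  * f
        self.faster = 0.25 * self.faster + 0.75 * f
        # Taking the max tracks rising activity quickly (a fast average
        # dominates) and falling activity slowly (the slow one lingers).
        return max(self.slow, self.fast, self.faster)

def shares_per_page(S, P, f, tau):
    """rho = S / (P * (f + (1-f)/(1-tau))), the tax-adjusted shares-per-page."""
    return S / (P * (f + (1.0 - f) / (1.0 - tau)))
```

Plugging in the limit cases from the notes: with tau = 0 the function reduces to S/P regardless of f, and with tau = 0.99 a fully idle VM (f = 0) is valued at S/(100P), making it the first candidate for min-funding reclamation.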