Advanced Operating Systems, Fall 2004
Lecture 6 Preliminary Notes

Virtualization

One definition of virtualization: the environment where the code is actually running differs from the environment the code seems to expect. For example, take virtual addresses. Application code is written as if the address space were simple and linear, but in reality, a single virtual page might be located at different physical addresses at different times -- and there's no direct relationship between consecutive virtual pages and consecutive physical pages.

The most complete form of virtualization is an interpreter, like Bochs, that processes machine code using other software. For example, here's the Bochs code for the x86 CMP instruction (comparing EAX against an immediate):

    #define EAX BX_CPU_THIS_PTR gen_reg[0].dword.erx

    bx_gen_reg_t gen_reg[BX_GENERAL_REGISTERS];
    // ...

    void BX_CPU_C::CMP_EAXId(bxInstruction_c *i)
    {
        Bit32u op1_32, op2_32;
        op1_32 = EAX;
        op2_32 = i->Id();

    #if defined(BX_HostAsm_Cmp32)
        Bit32u flags32;
        asmCmp32(op1_32, op2_32, flags32);
        setEFlagsOSZAPC(flags32);
    #else
        Bit32u diff_32 = op1_32 - op2_32;
        SET_FLAGS_OSZAPC_32(op1_32, op2_32, diff_32, BX_INSTR_COMPARE32);
    #endif
    }

The machine "registers" are explicitly stored in a C array, gen_reg.

Binary rewriting

It's faster to run the code on the hardware than to interpret it. An extremely powerful virtualization technique is to rewrite machine code. The rewriting process changes the code to ensure desired properties, like memory safety. The Valgrind debugging and profiling system is a powerful example. Valgrind rewrites binary code to detect and report errors like using uninitialized memory, using memory after it's been freed, leaking memory, and so forth. It works by rewriting code a handful of basic blocks at a time, so that the code itself detects and reports all errors.

Let's work through an example. Say we want to ensure that a program never accesses uninitialized memory. This program should cause an error:

    int f() {
        int y;
        return y;   /* uninitialized! */
    }

Any decent compiler will tell us that the return statement uses an uninitialized variable. But a compiler can't catch every case:

     1  int f();
     2  int g() {
     3      int y = f();   /* value unknown */
     4      int x;
     5      if (y == 1)
     6          x = 2;
     7      /* code that doesn't change or refer to x or y */
     8      if (y == 1)
     9          return x;  /* same as 'return 2' -- compiler doesn't know that! */
    10      else
    11          return 0;
    12  }

This code will never access uninitialized data, but most compilers will
complain on line 9 that x might be used uninitialized. How can we report fewer false positives? Let's change the machine code to explicitly keep track of whether variables are initialized. We'll do this by reserving a separate area, the memory statistics buffer, that remembers, for every piece of memory, whether or not that memory has been initialized. Here's the example code we'll change:

    int i;
    int j;   /* initialized elsewhere */
    if (j == 1)
        i = 2;
    if (j == 1)
        return i;
    else
        return 0;

Here's some pseudo-assembly corresponding to the normal code. The line numbers are for reference; i is stored at -8(%ebp) and j at -4(%ebp).

    1      cmp -4(%ebp), $1      # if (j == 1)
    2      jne 1f
    3      mov $2, -8(%ebp)      #     i = 2;
    4   1: cmp -4(%ebp), $1      # if (j == 1)
    5      jne 1f
    6      mov -8(%ebp), %eax    #     return i;
    7      ret
    8   1: mov $0, %eax          # else
    9      ret                   #     return 0;

The binary rewriter, then, needs to do three things: mark i's memory as uninitialized when the function begins, mark it as initialized whenever it is written, and check that it is initialized whenever it is read.
Sounds simple, no? Let's assume that the memory statistics buffer is located at a special register, %msb. Here's the rewritten code, with the added instructions left unnumbered:

           mov $0, 0(%msb)       # mark 'i' as uninitialized
    1      cmp -4(%ebp), $1      # if (j == 1)
    2      jne 1f
    3      mov $2, -8(%ebp)      #     i = 2;
           mov $1, 0(%msb)       # mark 'i' as initialized
    4   1: cmp -4(%ebp), $1      # if (j == 1)
    5      jne 1f
           cmp 0(%msb), $0       # if ('i' is not initialized)
           jne 2f
           call uninit_error     #     uninit_error(); // reports error or warning
    6   2: mov -8(%ebp), %eax    #     return i;
    7      ret
    8   1: mov $0, %eax          # else
    9      ret                   #     return 0;

There are a couple of things missing from this simple example, of course.
Binary rewriting introduces significant overhead, but still runs much faster than an interpreter. Valgrind in "memcheck" mode, for example, maintains 9 bits of statistics data for each byte of memory (one bit to remember whether the byte has been freed, 8 bits to remember whether each bit has been initialized -- this supports bitfields), and still runs only 10-30x slower than native code. The lighter-weight addrcheck, which detects wild memory accesses and dynamic memory bugs but not uninitialized memory accesses, runs 5-20x slower.

It's also easier to build a binary rewriter than you might think. You have to implement many corner cases, but machine code semantics are very well documented: it's clear what you need to do.

Binary rewriting can make machine code safe to execute through virtualization. Valgrind only warns on uninitialized access, for example, but it could kill the offending program instead! Because of this, it's an extremely useful technique for operating systems. The exokernel uses it in several different ways: for application-specific handlers, for example. When binary rewriting is used to make machine code safe, it's often called "sandboxing". A closely related technique is binary translation, which translates other binary code formats into executable machine code. Exokernel network message filters work this way: they translate a hand-built filter language format into machine code (link).

Virtual operating systems

There's a useful analogy between exokernel techniques and virtualization. An exokernel aims to securely expose hardware, including physical names. Thus, the code "sees" a version of the hardware that's been "virtualized" for safety. This analogy will only take you so far: exokernel applications use explicit exokernel system calls to access hardware, so the code is aware of its environment. But it can be useful to virtualize entire operating systems, running them on a virtual machine rather than physical hardware.
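To see what doing this purely in software costs, consider a stripped-down interpreter loop in the style of the Bochs fragment above: every fetch, decode, and flag update becomes explicit C code. The mini-ISA here is invented for illustration; it is not Bochs's real instruction set.

```c
#include <stdio.h>
#include <stdint.h>

/* A toy interpreter for an invented two-register mini-ISA. Registers live
   in an ordinary C array, just like Bochs's gen_reg[]. */

enum { R0, R1, NREGS };
enum { OP_LOADI, OP_ADD, OP_CMP, OP_HALT };   /* opcodes */

struct insn { int op, dst, src, imm; };

static uint32_t reg[NREGS];
static int zero_flag;        /* a condition flag, maintained in software */

static void run(const struct insn *prog) {
    for (const struct insn *i = prog; ; i++) {
        switch (i->op) {                          /* decode + execute */
        case OP_LOADI: reg[i->dst] = i->imm;           break;
        case OP_ADD:   reg[i->dst] += reg[i->src];     break;
        case OP_CMP:   zero_flag = (reg[i->dst] - i->imm) == 0; break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    struct insn prog[] = {
        { OP_LOADI, R0, 0, 40 },   /* R0 = 40 */
        { OP_LOADI, R1, 0, 2 },    /* R1 = 2 */
        { OP_ADD,   R0, R1, 0 },   /* R0 += R1 */
        { OP_CMP,   R0, 0, 42 },   /* sets zero_flag, like CMP_EAXId */
        { OP_HALT,  0, 0, 0 },
    };
    run(prog);
    printf("R0 = %u, zero_flag = %d\n", reg[R0], zero_flag);
    return 0;
}
```

Each simulated instruction costs a memory load, a dispatch, and several host instructions, which is why interpreting a whole operating system this way is painful.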
Bochs does this, but it's terribly slow. Disco uses a bunch of techniques to make it fast.

Disco

* ccNUMA - cache-coherent, non-uniform memory access times
  - a cache, plus network substrate for accessing nonlocal memory
    -- draw picture from FLASH paper
  - not in cache: 24 cycles to first word, 39 total [128-byte cache line]
    -- R4000
  - block transfer: 100 cycles + 30-40 cycles/cache line
    -- plus[?] additional time to get the stuff to the processor
    -- best case
  - model used in paper: 300ns local, 900ns remote minimum
  - + cache coherency protocol
    -- but thank god the hardware takes care of this, not the software

* Why be so worried about hardware/software faults?
  - So many processors == low mean-time-to-failure
  - So many copies of the operating system == same thing
  - Wild writes: one processor writes into another's memory

ISSUES WITH VIRTUAL MACHINES

* Overhead
  - Execution of privileged instructions emulated in software
  - Access to I/O virtualized in software
  - Additional memory cost: multiple copies of system software, applications

* Resource management issues
  - No information available for making good policy decisions
  - Too low level

INTERFACE

* Processor: abstracts a MIPS R10000
  - OS change to optimize performance: To enable/disable CPU interrupts or access privileged registers, client OS can load/store special addresses. Prevents "kernel crossing" overhead.

* Memory: abstracts a flat, uniform physical memory space

* I/O
  - OS change to optimize performance: Similar to processor -- support special abstractions for particular devices
  - Communicate among virtual machines using a virtual "Ethernet"

IMPLEMENTATION

* Disco kernel replicated on every processor

* Virtual CPUs
  - Disco: kernel mode -- full access to hardware
  - Emulated kernel: supervisor mode -- access to memory, but not privileged instructions or physical memory
  - Application: user mode
  - How to fool the kernel into thinking it's the kernel? Keep a "process table" for the virtualized OS: saved registers, saved privileged registers, saved TLB contents
  - Just run code
  - Encounter a privileged instruction, trap to Disco, which emulates the trap's effect on the current virtualized OS; update virtual registers & jump to OS trap vector
  - Reduced power mode = swap me out

* Virtualized memory
  - OS sees "physical" pages, but those are really virtual; Disco maps those numbers to "machine" pages
    -- "pmap" data structure: maps physical => machine, and machine => virtual
    -- To install a TLB entry <virt,phys,prot>: look up 'physical' addr in pmap, which maps <phys,mach>; install <virt,mach,prot>; remember the mach=>virt mapping, so you can shoot down the relevant parts of the TLB if Disco takes the mach page away from the OS
  - Requires virtualizing TLB inserts

* Memory management
  - Need to allocate machine pages for physical pages
  - Easy: have a lot of memory
    -- hard: that memory is distributed all over the machine, and if you access the memory poorly the machine will be terribly slow
    -- Cache coherency helps: Disco is only trying to optimize
  - Pages heavily accessed by only one node are migrated to that node; pages primarily read-shared are replicated to the nodes that need them; write-shared pages are not moved; limit the number of times a page can be moved
    -- Needs hardware support: FLASH counts cache misses
    -- If a page is hot, choose to migrate/replicate/do nothing
  - pmap also helps: when a page is migrated, shoot down the relevant virt addrs
    -- they call this memmap
  - OS change: hints from the OS to Disco about memory management
    -- Request a cleared page, rather than zeroing a page in software; Disco needs to do this anyway
    -- I'm not using this page anymore (freelist)