Advanced Operating Systems, Fall 2004
Lecture 6 Preliminary Notes

Virtualization

One definition of virtualization: the environment where the code is actually running differs from the environment the code seems to expect. For example, take virtual addresses. Application code is written as if the address space were simple and linear, but in reality, a single virtual page might be located at different physical addresses at different times -- and there's no direct relationship between consecutive virtual pages and consecutive physical pages.

The most complete form of virtualization is an interpreter, like Bochs, that processes machine code using other software. For example, here's the Bochs code for the x86 CMP instruction (comparing EAX against an immediate):

    #define EAX BX_CPU_THIS_PTR gen_reg[0].dword.erx

    bx_gen_reg_t gen_reg[BX_GENERAL_REGISTERS];
    // ...

    void BX_CPU_C::CMP_EAXId(bxInstruction_c *i)
    {
        Bit32u op1_32, op2_32;
        op1_32 = EAX;
        op2_32 = i->Id();

    #if defined(BX_HostAsm_Cmp32)
        Bit32u flags32;
        asmCmp32(op1_32, op2_32, flags32);
        setEFlagsOSZAPC(flags32);
    #else
        Bit32u diff_32 = op1_32 - op2_32;
        SET_FLAGS_OSZAPC_32(op1_32, op2_32, diff_32, BX_INSTR_COMPARE32);
    #endif
    }

The machine "registers" are explicitly stored in a C array, gen_reg.

Binary rewriting

It's faster to run the code on the hardware than to interpret it. An extremely powerful virtualization technique is to rewrite machine code. The rewriting process changes the code to ensure desired properties, like memory safety. The Valgrind debugging and profiling system is a powerful example. Valgrind rewrites binary code to detect and report errors like using uninitialized memory, using memory after it's been freed, leaking memory, and so forth. It works by rewriting code a handful of basic blocks at a time, so that the code itself detects and reports all errors.

Let's work through an example. Say we want to ensure that a program never accesses uninitialized memory. This program should cause an error:

    int f() {
        int y;
        return y;   /* uninitialized! */
    }

Any decent compiler will tell us that the return statement uses an uninitialized variable. But a compiler can't catch every case:

     1  int f();
     2  int g() {
     3      int y = f();   /* value unknown */
     4      int x;
     5      if (y == 1)
     6          x = 2;
     7      /* code that doesn't change or refer to x or y */
     8      if (y == 1)
     9          return x;  /* same as 'return 2' -- compiler doesn't know that! */
    10      else
    11          return 0;
    12  }

This code will never access uninitialized data, but most compilers will
complain on line 9 that x might be used uninitialized. How can we report fewer false positives? Let's change the machine code to explicitly keep track of whether variables are initialized. We'll do this by reserving a separate area, the memory statistics buffer, that remembers, for every piece of memory, whether or not that memory has been initialized. Here's the example code we'll change:

    int i;
    int j;   /* initialized elsewhere */
    if (j == 1)
        i = 2;
    if (j == 1)
        return i;
    else
        return 0;

Here's some pseudo-assembly corresponding to the normal code. The line numbers are for reference; i is stored at -8(%ebp) and j at -4(%ebp).

    1      cmp -4(%ebp), $1      # if (j == 1)
    2      jne 1f
    3      mov $2, -8(%ebp)      #     i = 2;
    4   1: cmp -4(%ebp), $1      # if (j == 1)
    5      jne 1f
    6      mov -8(%ebp), %eax    #     return i;
    7      ret
    8   1: mov $0, %eax          # else
    9      ret                   #     return 0;

The binary rewriter, then, needs to do three things: mark i's memory as uninitialized when the function begins, mark it as initialized whenever it is written, and check that it is initialized whenever it is read.
Sounds simple, no? Let's assume that the memory statistics buffer is located at a special register, %msb. Here's the rewritten code, with the added instructions left unnumbered:

           mov $0, 0(%msb)       # mark 'i' as uninitialized
    1      cmp -4(%ebp), $1      # if (j == 1)
    2      jne 1f
    3      mov $2, -8(%ebp)      #     i = 2;
           mov $1, 0(%msb)       # mark 'i' as initialized
    4   1: cmp -4(%ebp), $1      # if (j == 1)
    5      jne 1f
           cmp 0(%msb), $0       # if ('i' is not initialized)
           jne 2f
           call uninit_error     #     uninit_error(); // reports error or warning
    6   2: mov -8(%ebp), %eax    #     return i;
    7      ret
    8   1: mov $0, %eax          # else
    9      ret                   #     return 0;

There are a couple of things missing from this simple example, of course.
Binary rewriting introduces significant overhead, but still runs much faster than an interpreter. Valgrind in "memcheck" mode, for example, maintains 9 bits of statistics data for each byte of memory (one bit to remember whether the byte has been freed, 8 bits to remember whether each bit has been initialized -- this supports bitfields), and still runs only 10-30x slower than native code. The lighter-weight addrcheck, which detects wild memory accesses and dynamic memory bugs but not uninitialized memory accesses, runs 5-20x slower.

It's also easier to build a binary rewriter than you might think. You have to implement many corner cases, but machine code semantics are very well documented: it's clear what you need to do.

Binary rewriting can make machine code safe to execute through virtualization. Valgrind only warns on uninitialized access, for example, but it could kill the offending program instead! Because of this, it's an extremely useful technique for operating systems. The exokernel uses it in several different ways: for application-specific handlers, for example. When binary rewriting is used to make machine code safe, it's often called "sandboxing". A closely related technique is binary translation, which translates other binary code formats into executable machine code. Exokernel network message filters work this way: they translate a hand-built filter language format into machine code (link).

Virtual operating systems

There's a useful analogy between exokernel techniques and virtualization. An exokernel aims to securely expose hardware, including physical names. Thus, the code "sees" a version of the hardware that's been "virtualized" for safety. This analogy will only take you so far: exokernel applications use explicit exokernel system calls to access hardware, so the code is aware of its environment. But it can be useful to virtualize entire operating systems, running them on a virtual machine rather than physical hardware.
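To see what doing this purely in software costs, consider a stripped-down interpreter loop in the style of the Bochs fragment above: every fetch, decode, and flag update becomes explicit C code. The mini-ISA here is invented for illustration; it is not Bochs's real instruction set.

```c
#include <stdio.h>
#include <stdint.h>

/* A toy interpreter for an invented two-register mini-ISA. Registers live
   in an ordinary C array, just like Bochs's gen_reg[]. */

enum { R0, R1, NREGS };
enum { OP_LOADI, OP_ADD, OP_CMP, OP_HALT };   /* opcodes */

struct insn { int op, dst, src, imm; };

static uint32_t reg[NREGS];
static int zero_flag;        /* a condition flag, maintained in software */

static void run(const struct insn *prog) {
    for (const struct insn *i = prog; ; i++) {
        switch (i->op) {                          /* decode + execute */
        case OP_LOADI: reg[i->dst] = i->imm;           break;
        case OP_ADD:   reg[i->dst] += reg[i->src];     break;
        case OP_CMP:   zero_flag = (reg[i->dst] - i->imm) == 0; break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    struct insn prog[] = {
        { OP_LOADI, R0, 0, 40 },   /* R0 = 40 */
        { OP_LOADI, R1, 0, 2 },    /* R1 = 2 */
        { OP_ADD,   R0, R1, 0 },   /* R0 += R1 */
        { OP_CMP,   R0, 0, 42 },   /* sets zero_flag, like CMP_EAXId */
        { OP_HALT,  0, 0, 0 },
    };
    run(prog);
    printf("R0 = %u, zero_flag = %d\n", reg[R0], zero_flag);
    return 0;
}
```

Each simulated instruction costs a memory load, a dispatch, and several host instructions, which is why interpreting a whole operating system this way is painful.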
Bochs does this, but it's terribly slow. Disco uses a bunch of techniques to make it fast.

Disco

* ccNUMA - cache-coherent, non-uniform memory access times
  - a cache, plus network substrate for accessing nonlocal memory
    -- draw picture from FLASH paper
  - not in cache: 24 cycles to first word, 39 total [128-byte cache line]
    -- R4000
  - block transfer: 100 cycles + 30-40 cycles/cache line
    -- plus[?] additional time to get the stuff to the processor
    -- best case
  - model used in paper: 300ns local, 900ns remote minimum
  - + cache coherency protocol
    -- but thank god the hardware takes care of this, not the software

* Why be so worried about hardware/software faults?
  - So many processors == low mean-time-to-failure
  - So many copies of the operating system == same thing
  - Wild writes: one processor writes into another's memory

ISSUES WITH VIRTUAL MACHINES

* Overhead
  - Execution of privileged instructions emulated in software
  - Access to I/O virtualized in software
  - Additional memory cost: multiple copies of system software, applications

* Resource management issues
  - No information available for making good policy decisions
  - Too low level

INTERFACE

* Processor: abstracts a MIPS R10000
  - OS change to optimize performance: To enable/disable CPU interrupts or access privileged registers, client OS can load/store special addresses. Prevents "kernel crossing" overhead.

* Memory: abstracts a flat, uniform physical memory space

* I/O
  - OS change to optimize performance: Similar to processor -- support special abstractions for particular devices
  - Communicate among virtual machines using a virtual "Ethernet"

IMPLEMENTATION

* Disco kernel replicated on every processor

* Virtual CPUs
  - Disco: kernel mode -- full access to hardware
  - Emulated kernel: supervisor mode -- access to memory, but not privileged instructions or physical memory
  - Application: user mode
  - How to fool the kernel into thinking it's the kernel? Keep a "process table" for the virtualized OS: saved registers, saved privileged registers, saved TLB contents
  - Just run code
  - Encounter a privileged instruction, trap to Disco, which emulates the trap's effect on the current virtualized OS; update virtual registers & jump to OS trap vector
  - Reduced power mode = swap me out

* Virtualized memory
  - OS sees "physical" pages, but those are really virtual; Disco maps those numbers to "machine" pages
    -- "pmap" data structure: maps physical => machine, and machine => virtual
    -- To install a TLB entry <virt,phys,prot>: look up 'physical' addr in pmap, which maps <phys,mach>; install <virt,mach,prot>; remember the mach=>virt mapping, so you can shoot down the relevant parts of the TLB if Disco takes the mach page away from the OS
  - Requires virtualizing TLB inserts

* Memory management
  - Need to allocate machine pages for physical pages
  - Easy: have a lot of memory
    -- hard: that memory is distributed all over the machine, and if you access the memory poorly the machine will be terribly slow
    -- Cache coherency helps: Disco is only trying to optimize
  - Pages heavily accessed by only one node are migrated to that node; pages primarily read-shared are replicated to the nodes that need them; write-shared pages are not moved; limit the number of times a page can be moved
    -- Needs hardware support: FLASH counts cache misses
    -- If a page is hot, choose to migrate/replicate/do nothing
  - pmap also helps: when a page is migrated, shoot down the relevant virt addrs
    -- they call this memmap
  - OS change: hints from the OS to Disco about memory management
    -- Request a cleared page, rather than zeroing a page in software; Disco needs to do this anyway
    -- I'm not using this page anymore (freelist)