Scribe Notes: CS 111, 11/15/05

This lecture will cover the remainder of Virtual Memory and the beginning of File Systems.

Virtual Memory

Performance
Let's consider our new memory mapping function, given by B(va, priv, write), where va is the virtual address for which B will return a physical address. priv is the privilege level of the currently executing code, which allows us to have user and kernel space in the same address space. write identifies whether the current access is a write, which is used to detect dirty pages. The B function returns either the physical address that va maps to, or a page fault, in which case the OS must decide what to do.
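
To make the interface concrete, here is a toy, single-level rendering in C (all names and the PAGE_FAULT sentinel are illustrative, not from the lecture; real hardware implements B in the MMU, and the x86's actual two-level structure appears near the end of these notes):

     #include <stdint.h>

     #define NPAGES     1024
     #define PAGE_SIZE  4096
     #define PAGE_FAULT 0xFFFFFFFFu             /* sentinel: trap to the OS */

     typedef enum { PRIV_USER, PRIV_KERNEL } priv_t;
     typedef enum { ACCESS_READ, ACCESS_WRITE } access_t;

     struct entry {
          uint32_t frame;                       /* physical frame number */
          int present, writable, kernel_only, dirty;
     };

     static struct entry table[NPAGES];         /* one entry per virtual page */

     uint32_t B(uint32_t va, priv_t priv, access_t write) {
          uint32_t vpn = va / PAGE_SIZE;        /* virtual page number */
          if (vpn >= NPAGES || !table[vpn].present)
               return PAGE_FAULT;               /* unmapped address: PF */
          if (table[vpn].kernel_only && priv != PRIV_KERNEL)
               return PAGE_FAULT;               /* privilege violation: PF */
          if (write == ACCESS_WRITE) {
               if (!table[vpn].writable)
                    return PAGE_FAULT;          /* read-only page (e.g., copy-on-write): PF */
               table[vpn].dirty = 1;            /* remember that the page was written */
          }
          return table[vpn].frame * PAGE_SIZE + va % PAGE_SIZE;
     }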
 
Let's look at how the OS might handle a page fault:

      PF(va) {
           If (va is swapped out to disk) {
                Choose a page to replace;
                Write to disk (if necessary);
                Read this page in from disk;
                Install mapping in B;
                Return;
           }
           ***
      }

There are a few more things that go on at the *** in a real OS. To get an idea, let's look at how fork can use a feature called "copy-on-write." First, consider how fork works without copy-on-write: the two processes each end up with a copy of the exact same data, in both virtual and physical memory.

The goal of copy-on-write is to copy a page of memory only when necessary. Ideally, this would mean copying only when a change made by one process could be observed by another. Practically, this means copying when there is a write to the page of memory. So, how do we implement this functionality, in order to catch the write before it happens? The key idea is to mark every shared page read-only, so that every write triggers a page fault. To do this, we install the following in the B function of both the parent and the child:

     B'p(va, perm, READ) = Bp(va,perm,READ)
     B'p(va, perm, W) = PF

So, this means adding the following else if statement to the *** above:

     Else if (write == W && va is marked copy-on-write) {
          Allocate new page
          Copy old data into new page
          Change B(va)->new page; W OK
          Return;
     }

With the handler above in place, our copy-on-write method is complete.

All this happens transparently: the fault is handled, control returns to the process, and execution continues without interruption. We must write to a new page because the old page might still be shared by another process. This also means the OS must keep track of how many processes map each physical page. Consider the case where only one process is left with access to a physical page that was formerly shared by several: the OS must know there is only one mapping left, so that it does not keep copying on every write. Similarly, if a process is forked twice (so processes A, B, and C map to the same physical memory) and process A is killed, the OS must know that a write by process B to memory still necessitates a copy, so as not to affect C's data.
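
As a rough sketch of this bookkeeping (all names and structure here are illustrative, not from the lecture), the OS might keep a reference count for every physical frame:

     #include <stdint.h>

     #define NFRAMES 65536

     static uint32_t refcount[NFRAMES];         /* one count per physical frame */

     /* At fork time: parent and child now share the frame read-only. */
     void share_frame(uint32_t frame) {
          refcount[frame]++;
     }

     /* At a copy-on-write fault: returns 1 if the caller must allocate
        and copy a new frame, 0 if it can simply re-enable writes. */
     int cow_fault(uint32_t frame) {
          if (refcount[frame] == 1)
               return 0;                        /* last owner: no copy needed */
          refcount[frame]--;                    /* release our share of the old frame */
          return 1;
     }

     /* At process exit or unmap. */
     void release_frame(uint32_t frame) {
          if (--refcount[frame] == 0) {
               /* frame is now free; return it to the allocator */
          }
     }

When cow_fault returns 0, the handler can just re-enable writes on the existing frame: that is the "only one process left" case described above.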

Note that the B mapping function is per-process: the OS builds a B for each process and installs it in the processor before running that process. Additionally, the OS needs its own bookkeeping for physical memory, in order to know whether a page is allocated or free and how many processes share it.

To see how this affects performance, consider a process with N pages mapped. Suppose copying a page takes C cycles (C is approximately 5,000-10,000) and a page fault costs F cycles (F is approximately 5,000). After the fork, the process writes W pages. For an eager fork (in which all memory is copied at fork time), time spent copying and fork latency are both NC. Alternatively, for a copy-on-write fork, time spent copying is W(F+C), while latency at fork time is essentially 0.
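
To make the tradeoff concrete with some illustrative numbers (not from the lecture): take N = 10,000 pages, W = 100 written pages, C = 7,500 cycles, and F = 5,000 cycles. The eager fork spends NC = 75,000,000 cycles copying, all of it as up-front latency. The copy-on-write fork spends W(F+C) = 100 x 12,500 = 1,250,000 cycles, spread across later page faults, with essentially no latency at fork time.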

Demand paging
In order to emphasize the need for demand paging, let's talk about a big binary. We want to reduce application startup latency when loading a new binary from disk. The idea is to load pages into memory only when needed. So, we use a "shadow" page map function, Bshadow, that records what the behavior should be:

                     /--> Load data from disk address
  Bshadow(va,perm,w) ---> Allocate fresh page (demand loading for new stack, heap pages)
                     \--> Segmentation fault (usually means that va is an unmapped address)
Now, we continue to add to the *** of our page fault function:

     Else if (Bshadow(va,perm,w) != seg fault) {
            If (Bshadow loads from disk) {
                 Allocate page
                 Load data from disk
                 Install page into B
                 Return;
            }
            Else if (Bshadow == fresh page) {
                 Allocate page
                 Map into B
                 Return;
            }
     }

But using only demand paging is slow: every first access to a page results in a PF, which makes loading very slow. What if we were to read more than one page at a time (i.e., perform the operations of the "If (Bshadow loads from disk)" statement above for 5 pages at a time)? This would decrease latency for pages 2-5, but increase it for page 1 by a factor of 5, since it takes 5 times as long to read the data from disk. And thus we see the need for prefetching.

Prefetching
The central idea is to fetch data in the background that is likely to be needed soon. For example, if a program counter is in page n, page n+1 would be a good candidate for prefetching. Similarly, if a row of an array is accessed, we might want to prefetch the next row. In order to implement this, we only block for the read of the first page:

               // Prefetch 5 additional pages if required
       If (Bshadow loads va from disk) {
            Allocate pages from va,...,va+5
            Send read requests to disk for pgs va,...,va+5
            Block until va is read
            Install B(va)
            Return;
       }
        When a disk read request completes (asynchronously, outside PF)
             If it was for a prefetched page
                  Install mapping into B for the right process

If there is room in memory, it thus makes sense to prefetch an entire binary. When memory is scarce, however, constantly swapping prefetched pages back out is more expensive than simply loading on demand.
 
Complete PF function
We now have a (relatively) complete function to respond to page faults in the B mapping. This includes copying-on-write, demand paging, and prefetching.

PF(va) {
      If (va is swapped out to disk) {
            Choose a page to replace;
            Write to disk (if necessary);
            Read this page in from disk;
            Install mapping in B;
            Return;
       }
       Else if (write == W && va is marked copy-on-write) {
            Allocate new page
            Copy old data into new page
            Change B(va)->new page; W OK
            Return;
       }
       Else if (Bshadow(va,perm,w) != seg fault) {
            If (Bshadow loads va from disk) {
                 Allocate pages for va,...,va+5
                 Send read requests to disk for pgs va,...,va+5
                 Block until va is read
                 Install B(va)
                 Return;
            }
            Else if (Bshadow == fresh page) {
                 Allocate page
                 Map into B
                 Return;
            }
       }
}

// Asynchronous handler, runs outside PF when a disk read completes:
When a disk read request completes
      If it was for a prefetched page
           Install mapping into B for the right process

Memory-mapped files
The idea is to make demand paging accessible to user-level applications for files. With it, we could replace this typical code:

     int fd = open(filename, ...);
     char buf[size];
     read(fd, buf, size); // blocking for disk files

In UNIX, a memory-mapped file call would look like this:

     addr = mmap(...,length,...,fd,offset);

This maps length bytes of data from fd, starting at offset, into memory and returns the address. Additionally, it marks the region in Bshadow, so data is loaded (and the process blocks) only on a PF. One advantage of this technique is the reduction in copies, since the disk driver can put data directly in the right memory location. Another advantage is that file reads now benefit from prefetching. But how do we know what to prefetch, since file reads are not necessarily sequential?

We introduce the madvise(addr, length, behavior) function, where addr and length define a region of some memory-mapped file. The behavior argument advises the kernel how the region will be used: prefetch suggests sending read requests to the disk, without blocking, for the defined region, while defetch suggests that the process no longer needs this data, making it a good candidate to swap out of memory. It is important to note that madvise improves performance but does not change functionality: a PF on a prefetched or defetched page is still handled by our normal PF function, so bad advice cannot cause incorrect behavior.
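
Putting the two calls together, a user-level program might look like the following sketch (the file name is a placeholder and error handling is abbreviated; POSIX spells the prefetch and defetch hints MADV_WILLNEED and MADV_DONTNEED):

     #include <fcntl.h>
     #include <stdio.h>
     #include <sys/mman.h>
     #include <sys/stat.h>
     #include <unistd.h>

     int main(void) {
          int fd = open("data.bin", O_RDONLY);    /* placeholder file name */
          if (fd < 0) return 1;

          struct stat st;
          fstat(fd, &st);
          size_t length = st.st_size;

          /* Map the whole file; pages are loaded on demand, at first access. */
          char *addr = mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);
          if (addr == MAP_FAILED) return 1;

          /* Advise the kernel to prefetch the region ("prefetch"). */
          madvise(addr, length, MADV_WILLNEED);

          long sum = 0;
          for (size_t i = 0; i < length; i++)
               sum += addr[i];                    /* first touch of an unloaded page PFs */
          printf("sum = %ld\n", sum);

          /* Tell the kernel we are done with the data ("defetch"). */
          madvise(addr, length, MADV_DONTNEED);

          munmap(addr, length);
          close(fd);
          return 0;
     }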

B implementation
So, now that we know what the mapping function B does and how it is used, we need to know how it is implemented. Though it varies from machine to machine, the x86 uses a two-level page table. A register, CR3, points to the current page map. Given a virtual address, the hardware walks the two levels of the page table to look up the corresponding physical address. The process is simple but requires a few indexing steps, as sketched below.
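
Here is a sketch of the 32-bit x86 walk in C, under simplifying assumptions (the PAGE_FAULT sentinel is invented, and we pretend physical addresses can be dereferenced directly):

     #include <stdint.h>

     #define PTE_P      0x1            /* "present" bit in an entry */
     #define PAGE_FAULT 0xFFFFFFFFu    /* sentinel: trap to the OS */

     typedef uint32_t pde_t;           /* page directory entry */
     typedef uint32_t pte_t;           /* page table entry */

     /* cr3_dir is the 1024-entry page directory that CR3 points to. */
     uint32_t translate(pde_t *cr3_dir, uint32_t va)
     {
          uint32_t dir_idx = (va >> 22) & 0x3FF;  /* top 10 bits: directory index */
          uint32_t tbl_idx = (va >> 12) & 0x3FF;  /* next 10 bits: table index */
          uint32_t offset  =  va        & 0xFFF;  /* low 12 bits: offset in page */

          pde_t pde = cr3_dir[dir_idx];
          if (!(pde & PTE_P))
               return PAGE_FAULT;                 /* no page table here: PF */

          /* The PDE's top 20 bits hold the page table's base address. */
          pte_t *table = (pte_t *)(uintptr_t)(pde & ~0xFFFu);
          pte_t pte = table[tbl_idx];
          if (!(pte & PTE_P))
               return PAGE_FAULT;                 /* page not mapped: PF */

          return (pte & ~0xFFFu) | offset;        /* frame base + offset */
     }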

There is usually also a cache for the page table, commonly known as the Translation Lookaside Buffer (TLB), that speeds up lookups.

Disks & File Systems

Disk storage media
Disks are necessary in order to provide persistent storage of data, and they are additionally a cheap means of storage. Their most common modern forms are magnetic storage (hard drives) and optical storage (CD-ROMs).

The hard drive is made up of a stack of magnetic platters that are read by a mechanically operated arm with a magnetic head. Each platter has two surfaces, each containing several thousand concentric tracks, and each track is divided into equal-sized sectors. The sector is the atomic unit of reading and writing on disk, commonly 512 bytes in size.

Let's look at the physical specifications of a (formerly) typical hard drive on the market, the Seagate 73.4 GB SCSI. It has 12 platters (more than most of today's HDs), each with two surfaces and 14,100 tracks per surface. It operates at 10,200 RPM with a peak transfer rate of 160-200 MB/s, or .014 ms/sector.

In order to move the read/write head from track to track, the arm must physically "seek." This is an expensive operation compared to memory access times. A seek consists of acceleration (at about 40 Gs), coasting, and slowing/locating; the whole motion takes about 6 ms on average, or about 0.6 ms to an adjacent track. An additional time cost is incurred while we wait for the desired data on a track to rotate under the head. This is called rotational latency, and it averages 2.94 ms for our example Seagate. With these times it is clear that SEEKS SUCK due to the slow physical motion; OS performance is often bottlenecked on disk seek times.
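
As a sanity check on that figure: at 10,200 RPM the disk makes 170 revolutions per second, so one revolution takes 1/170 s, about 5.88 ms; on average the desired sector is half a revolution away, giving 5.88/2, or about 2.94 ms.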

We attempt to minimize seek times by arranging data on the disk so that locality of reference corresponds to proximity on disk. This essentially means that references to data (that is, reads and writes) that occur close together in time should be close together on disk. We will look at this in further detail in the next lecture.

Issues of file systems
There are several things to take into consideration when implementing a file system, and they resemble the considerations of OS design generally. Performance is absolutely necessary to make a file system usable; to ensure it, we rely on smart caching and a good layout of data on disk. Second, a file system is nothing without robustness: the disk and its data must always recover from failures such as power outages. Lastly, efficiency is a matter of maximizing usable disk space, so that as much of it as possible goes to storing user data.

File systems generally represent the disk as an array of sectors. If we were to simply allocate files sequentially, at arbitrary sizes, we would see a great deal of external fragmentation on the disk. So, we divide the disk into blocks (commonly 4 KB), so that a file can be a disjoint series of blocks if necessary. Just as we earlier mapped virtual memory to physical memory, here we will map file names to sets of blocks.
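
As a purely illustrative sketch (not any particular file system's on-disk format), the metadata for one file might map a name to a list of block numbers:

     #include <stdint.h>

     #define BLOCK_SIZE 4096
     #define MAX_BLOCKS 16                  /* small files only, for illustration */

     struct file_meta {
          char     name[32];                /* file name */
          uint32_t size;                    /* length in bytes */
          uint32_t blocks[MAX_BLOCKS];      /* disk block numbers; need not be contiguous */
     };

     /* Which disk block holds byte `offset` of the file? */
     uint32_t block_for_offset(const struct file_meta *f, uint32_t offset) {
          return f->blocks[offset / BLOCK_SIZE];
     }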

We will continue with disk layout and file system semantics next lecture.