Lab 5 - CS 261 Research Topics in Operating Systems, Fall 2011

Disks perform reads and writes in units of sectors, which today are almost universally 512 bytes each. File systems, though, allocate and use disk storage in units of blocks. Be wary of the distinction between the two terms: sector size is a property of the disk hardware, whereas block size is an aspect of the operating system using the disk. A file system's block size must be a multiple of the sector size of the underlying disk: it can equal the sector size, but is often greater.

The original UNIX file system used a block size of 512 bytes, the same as the sector size of the underlying disk. Most modern file systems use a larger block size, however, because storage space has gotten much cheaper and it is more efficient to manage storage at larger granularities. Our file system will use a block size of 4096 bytes to match the processor's page size.
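The relationship between the two units can be sketched in a few lines of C. The sizes come from the text above; the macro and function names are hypothetical and are not taken from the lab sources.

```c
#include <stdint.h>

#define SECTSIZE 512u                  /* disk sector size from the text */
#define BLKSIZE  4096u                 /* file system block size from the text */
#define BLKSECTS (BLKSIZE / SECTSIZE)  /* sectors per block: 8 */

/* First disk sector occupied by file system block `blockno`.
 * A driver would transfer BLKSECTS consecutive sectors from here. */
static uint32_t block_first_sector(uint32_t blockno)
{
    return blockno * BLKSECTS;
}
```

With these sizes, reading one block means reading 8 consecutive sectors.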

The JOS buffer cache

JOS user environments access the file system through a combination buffer cache and lock server called bufcache. You have no exercises to complete in bufcache, but you need to understand its interface (and, of course, you may be interested in it anyway).

The buffer cache responds to IPC requests from other environments. Each request contains a block number and a request type. For most requests, the buffer cache responds by sending back a shared-memory page with the corresponding disk block. All pages are sent with PTE_P|PTE_U|PTE_W|PTE_SHARE permission (PTE_SHARE is described later in the lab). The accompanying IPC value is ≥ 0 on success and an error code < 0 on error.

File system users also coordinate using the buffer cache, using its per-block advisory locks. The locks are called advisory because any environment can always get read-write access to any page with BCREQ_MAP. However, file system implementations coordinate their updates by explicitly locking the corresponding blocks. For instance, reads from and writes to a given file should only happen while the corresponding inode block is locked. The requests are as follows.

Other IPCs: You probably won't need to use these.

The buffer cache supports shared locks with BCREQ_MAP_RLOCK. This could be useful for some operations (for instance, letting multiple environments read the same file at once), but we recommend you rely on BCREQ_MAP_WLOCK at first.

The buffer cache also tracks whether blocks have been initialized. Each block has an initialization state that starts at 0. A BCREQ_MAP_[RW]LOCK IPC returns the corresponding block's initialization state. The BCREQ_INITIALIZE IPC sets a block's initialization state to 1. The buffer cache remembers initialization states for as long as it runs. You won't need to manipulate initialization states in the regular exercises.

File system data structures

The file system you will work with is much simpler than most "real" file systems, but it is powerful enough to provide the basics: creating, reading, writing, and deleting files organized in a hierarchical directory structure. Since JOS is a "single-user" operating system, our file system doesn't support the UNIX notions of file ownership or permissions. It also currently does not support hard links, symbolic links, time stamps, or special device files.

Most UNIX file systems divide available disk space into two main types of regions: inode regions and data regions. Each file corresponds to one inode, which holds critical metadata about the file such as its stat attributes and pointers to its data blocks. The data regions are divided into much larger (typically 8KB or more) data blocks, within which the file system stores file data and directory metadata. Directory entries contain file names and pointers to inodes; a file is said to be hard-linked if multiple directory entries in the file system refer to that file's inode.

Both files and directories logically consist of a series of data blocks, which may be scattered throughout the disk much like the pages of an environment's virtual address space can be scattered throughout physical memory. User processes can read and write the contents of files directly, but the file system handles all modifications to directories itself as a part of actions such as file creation and deletion. Our file system does, however, allow user environments to read directory metadata directly (e.g., with read), so user environments can perform directory scanning operations themselves (e.g., to implement the ls program). The disadvantage of this approach to directory scanning, and the reason most modern UNIX variants discourage it, is that it makes application programs dependent on the format of directory metadata, making it difficult to change the file system's internal layout without changing or at least recompiling application programs as well.

Superblocks

Layout for JOS file system with N blocks and I inodes. N must be at least 1+⌈N/4096⌉+I. Only I-1 inode blocks are required because 0 is an invalid inode number, so inode 0 isn't stored.

File systems typically reserve certain disk blocks, at "easy-to-find" locations on the disk such as the very start or the very end, to hold metadata describing properties of the file system as a whole, such as the block size, disk size, any metadata required to find the root directory, the time the file system was last mounted, the time the file system was last checked for errors, and so on. These special blocks are called superblocks.

Our file system's superblock layout is defined by struct Super in inc/fs.h. The file system superblock will always occupy block 1 on the disk; boot loaders and partition tables use block 0, so most file systems don't use the very first disk block. Many "real" file systems maintain multiple superblocks, replicated throughout several widely-spaced regions of the disk, so that if one of them is corrupted or the disk develops a media error in that region, the other superblocks can still be found and used to access the file system.

Freemap: Managing block allocation

Just as the kernel manages physical memory allocation so that physical pages aren't inappropriately reused, a file system must manage disk blocks to ensure that a given block is used for only one purpose at a time. Many file systems keep track of free disk blocks using a bitmap rather than a linked list of free blocks. A bitmap simplifies block placement (finding a free block in a particular disk region), is simple to manage and keep consistent, and can be loaded into memory with few seeks. Though some operations are slow with a bitmap—it can take O(N) time to find a free block—they can be sped up using auxiliary memory data structures.

The JOS file system tracks whether each block is allocated using an array of bytes, not bits. The Ith byte in the freemap data structure is 1 iff block I is free. Using bytes rather than bits wastes space, but makes freemap operations much easier to code (no bit swizzling). To set up a freemap, we reserve a contiguous region of blocks large enough to hold one byte for each disk block, starting at block 2 (just after the superblock). Thus, we must reserve one block for the freemap for every 4096 blocks in the file system. Note that the freemap includes bytes for all blocks, including the superblock and the freemap itself. The bytes for these special blocks are set to 0, indicating that the corresponding blocks are in use.
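A minimal sketch of a byte-per-block freemap follows. The function names are hypothetical. The sketch assumes that block 0, the superblock, and the freemap blocks are marked in use, and it ignores inode blocks entirely.

```c
#include <stdint.h>

#define BLKSIZE 4096u

/* Blocks needed to hold one freemap byte per disk block: ceil(N / 4096). */
static uint32_t freemap_nblocks(uint32_t nblocks)
{
    return (nblocks + BLKSIZE - 1) / BLKSIZE;
}

/* Assumption: block 0 (boot), block 1 (superblock) and the freemap blocks
 * starting at block 2 are in use (byte 0); every other block is free (byte 1). */
static void freemap_init(uint8_t *freemap, uint32_t nblocks)
{
    uint32_t reserved = 2 + freemap_nblocks(nblocks);
    for (uint32_t b = 0; b < nblocks; b++)
        freemap[b] = (b >= reserved) ? 1 : 0;
}

/* Linear O(N) scan for a free block. Returns its number and marks it used,
 * or returns 0 if nothing is free (block 0 is never allocatable). */
static uint32_t freemap_alloc(uint8_t *freemap, uint32_t nblocks)
{
    for (uint32_t b = 0; b < nblocks; b++) {
        if (freemap[b]) {
            freemap[b] = 0;
            return b;
        }
    }
    return 0;
}
```

For a disk of 8192 blocks, the freemap takes 2 blocks (blocks 2 and 3), so the first block this sketch hands out is block 4.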

Inodes

The layout of a JOS inode is described by struct Inode in inc/fs.h. The inode includes the file's size, type (regular file or directory), reference count, and pointers to the blocks comprising the file. For simplicity we will use this one Inode structure to represent file metadata as it appears both on disk and in memory. Some of its fields are only meaningful in memory, and might have garbage values on disk; we must initialize these fields whenever we read an Inode structure into memory for the first time. (That's what BCREQ_INITIALIZE is for.)

Each Inode contains two reference counts. First, i_refcount is the "true" reference count; it measures the number of hard links (directory entries) pointing to the inode. (For the root directory, it is 1.) In contrast, the i_opencount value is only valid in memory. It counts the number of references to an inode from any currently running process. A file's data blocks are not reclaimed until its inode is unreferenced from the filesystem, and no process has the file open.

A single Inode structure is 4096 bytes big. This is much larger than for most file systems, and wastes a lot of space for small files. However, since the buffer cache's unit of locking is a single block, it is extremely convenient to have a separate block for each inode.

The i_direct array in struct Inode contains space to store the block numbers for the file's data blocks. There are 1018 direct pointers, limiting files to at most 4169728 bytes. Most Unix-like file systems also use indirect pointers (and doubly-indirect, triply-indirect, and so on) to support larger files. You may implement these pointers for a challenge, but our inodes are big enough to support pretty large files without indirect pointers.
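The file size limit follows directly from the number of direct pointers. The sketch below shows that arithmetic and how a byte offset selects a slot in i_direct. The macro and function names are hypothetical, not the definitions in inc/fs.h.

```c
#include <stdint.h>

#define BLKSIZE     4096u
#define NDIRECT     1018u                /* direct pointers per inode, from the text */
#define MAXFILESIZE (NDIRECT * BLKSIZE)  /* 1018 * 4096 = 4169728 bytes */

/* Index into i_direct of the block holding byte `offset`,
 * or -1 if the offset lies beyond the largest possible file. */
static int offset_to_direct_index(uint32_t offset)
{
    if (offset >= MAXFILESIZE)
        return -1;
    return (int)(offset / BLKSIZE);
}
```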

If an inode is unreferenced (i_refcount == 0 && i_opencount == 0), then the rest of its contents are ignored. In particular, any nonzero i_direct data blocks may be free or used by other inodes.

Inode 1 corresponds to the file system's root directory. All other inodes have numbers 2 or higher.

Directories and regular files

An Inode in our file system can represent either a regular file or a directory; these two types of "files" are distinguished by the i_ftype field. The file system manages regular files and directory-files in exactly the same way, except that it does not interpret the contents of the data blocks associated with regular files at all, whereas the file system interprets the contents of a directory-file as a series of Direntry structures describing the files and subdirectories within the directory.

Each Direntry contains a file name (de_name + de_namelen) and an inode number (de_inum). The file name is only valid if de_inum is nonzero.
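A directory lookup over an array of entries might look like the sketch below. The struct is a simplified stand-in for the real Direntry: the field names follow the text, but the field types, the MAXNAMELEN value, and the dir_lookup name are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAXNAMELEN 64  /* hypothetical limit, not taken from inc/fs.h */

struct DirentrySketch {
    uint32_t de_inum;              /* 0 means the slot is unused */
    int      de_namelen;           /* length of de_name in bytes */
    char     de_name[MAXNAMELEN];  /* only valid when de_inum != 0 */
};

/* Return the inode number for `name`, or 0 if no live entry matches. */
static uint32_t dir_lookup(const struct DirentrySketch *ents, size_t n,
                           const char *name)
{
    size_t len = strlen(name);
    for (size_t i = 0; i < n; i++) {
        if (ents[i].de_inum == 0)
            continue;  /* stale name in a free slot: must be ignored */
        if ((size_t)ents[i].de_namelen == len &&
            memcmp(ents[i].de_name, name, len) == 0)
            return ents[i].de_inum;
    }
    return 0;
}
```

Checking de_inum before the name matters: a deleted entry can still carry its old name bytes.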

Part A: The File System

Disk Access

The file system server in our operating system needs to be able to access the disk, but we have not yet implemented any disk access functionality in our kernel. Instead of taking the conventional "monolithic" operating system strategy of adding an IDE disk driver to the kernel along with the necessary system calls to allow the file system to access it, we will instead implement the IDE disk driver as part of the user-level buffer cache environment. We will still need to modify the kernel slightly, in order to set things up so that the buffer cache has the privileges it needs to implement disk access itself.

It is easy to implement disk access in user space this way as long as we rely on polling, "programmed I/O" (PIO)-based disk access and do not use disk interrupts. It is possible to implement interrupt-driven device drivers in user mode as well (the L3 and L4 kernels do this, for example), but it is more difficult since the kernel must field device interrupts and dispatch them to the correct user-mode environment.

The x86 processor uses the IOPL bits in the EFLAGS register to determine whether protected-mode code is allowed to perform special device I/O instructions, such as IN and OUT. The IOPL bits equal the minimum (i.e. numerically highest) privilege level allowed to perform IN and OUT instructions, so if those bits are 0, only the kernel can execute INs and OUTs. All of the IDE disk registers we need to access are located in the x86's I/O space (rather than memory-mapped I/O space), so to let the buffer cache environment access the disk, all we need to do is manipulate the IOPL bits. But no other environment should be able to access I/O space.
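The IOPL field occupies bits 12 and 13 of EFLAGS. The sketch below manipulates that field in a saved EFLAGS value. The macro and function names are hypothetical, and the sketch does not show where the kernel should apply the change.

```c
#include <stdint.h>

#define FL_IOPL_SHIFT 12
#define FL_IOPL_MASK  0x00003000u  /* EFLAGS bits 12-13 hold the IOPL */

/* Return `eflags` with its IOPL field replaced by `level` (0..3). */
static uint32_t eflags_set_iopl(uint32_t eflags, unsigned level)
{
    return (eflags & ~FL_IOPL_MASK) |
           ((uint32_t)(level & 3u) << FL_IOPL_SHIFT);
}

/* Extract the IOPL field from a saved EFLAGS value. */
static unsigned eflags_get_iopl(uint32_t eflags)
{
    return (unsigned)((eflags & FL_IOPL_MASK) >> FL_IOPL_SHIFT);
}
```

User environments run at privilege level 3, so IN and OUT succeed there only when the IOPL field is 3. Every other environment should keep IOPL 0.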

To keep things simple, from now on we will arrange things so that the buffer cache always has ID ENVID_BUFCACHE.

Do you have to do anything else to ensure that this I/O privilege setting is saved and restored properly when you subsequently switch from one environment to another? Make sure you understand how this environment state is handled.

This lab uses the file obj/kernel.img as the image for disk 0 (typically "Drive C" under DOS/Windows) as before, and the (new) file obj/fs.img as the image for disk 1 ("Drive D"). In this lab your file system should only ever touch disk 1; disk 0 is used only to boot the kernel. If you manage to corrupt either disk image in some way, you can reset both of them to their original, "pristine" versions simply by typing:

Demand-paged buffer cache

The main JOS buffer cache is stored, of course, in the buffer cache environment. The 2GB region of virtual address space from 0x50000000 (DISKMAP) up to 0xD0000000 (DISKMAP + DISKSIZE) is reserved to map disk pages. These pages are read on demand based on IPC requests.

For simplicity, other user environments use the same virtual memory region to map buffer cache blocks, although they use different names (FSMAP and FSMAP + DISKSIZE). These blocks are demand paged. If a page fault happens in the file system region, a page fault handler will load the corresponding page from the buffer cache by IPC.
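The mapping between block numbers and addresses in this region is simple arithmetic, sketched below. The DISKMAP and DISKSIZE values come from the text; the function names are hypothetical. A page fault handler could use the second function to work out which block to request for a faulting address.

```c
#include <stdint.h>

#define BLKSIZE  4096u
#define DISKMAP  0x50000000u
#define DISKSIZE 0x80000000u  /* 2GB: 0xD0000000 - 0x50000000 */

/* Virtual address at which block `blockno` is mapped, or 0 if the block
 * number does not fit in the reserved region. */
static uint32_t block_to_addr(uint32_t blockno)
{
    if (blockno >= DISKSIZE / BLKSIZE)
        return 0;
    return DISKMAP + blockno * BLKSIZE;
}

/* Block number containing virtual address `va`. Returns 0 on success and
 * -1 if `va` lies outside [DISKMAP, DISKMAP + DISKSIZE). */
static int addr_to_block(uint32_t va, uint32_t *blockno)
{
    if (va < DISKMAP || va - DISKMAP >= DISKSIZE)
        return -1;
    *blockno = (va - DISKMAP) / BLKSIZE;
    return 0;
}
```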

File descriptors

Unix file descriptors are a general notion that encompasses file I/O, pipes, console I/O, etc. In JOS, each of these device types has a corresponding struct Dev, with pointers to the functions that implement read/write/etc. for that device type. (Thus, struct Dev is like an object-oriented class.) lib/fd.c implements the general Unix-like file descriptor interface on top of this. Each struct Fd indicates its device type, and most of the functions in lib/fd.c simply dispatch operations to functions in the appropriate struct Dev.
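The dispatch pattern can be sketched with a table of function pointers. Everything below is a simplified, hypothetical stand-in: the real struct Dev and struct Fd have more fields, and this sketch stores a direct pointer to the device where the real code records a device type. The "zero" device is invented for the example.

```c
#include <stddef.h>
#include <string.h>

struct FdSketch;

/* One instance per device type, acting like a class with virtual methods. */
struct DevSketch {
    const char *dev_name;
    long (*dev_read)(struct FdSketch *fd, void *buf, size_t n);
};

struct FdSketch {
    const struct DevSketch *fd_dev;  /* which device implements this fd */
    size_t fd_offset;
};

/* Example device: every read yields zero bytes of data value 0. */
static long zero_read(struct FdSketch *fd, void *buf, size_t n)
{
    memset(buf, 0, n);
    fd->fd_offset += n;
    return (long)n;
}

static const struct DevSketch devzero = { "zero", zero_read };

/* Generic entry point: validate, then dispatch to the device's function. */
static long fd_read(struct FdSketch *fd, void *buf, size_t n)
{
    if (fd->fd_dev == NULL || fd->fd_dev->dev_read == NULL)
        return -1;
    return fd->fd_dev->dev_read(fd, buf, n);
}
```

Callers only ever use the generic function. Adding a new device type means supplying another table, not changing the callers.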

lib/fd.c also maintains the file descriptor table region in each environment's address space, starting at FDTABLE. This area reserves a page's worth (4KB) of address space for each of the up to NFD (currently 32) file descriptors the application can have open at once. At any given time, a particular file descriptor table page is mapped if and only if the corresponding file descriptor is in use. Each file descriptor also has an optional "data page" in the region starting at FDDATA, which we will use for pipes.

For nearly all interactions with files, user code will go through the functions in lib/fd.c.

File system interface

Each device type has at least one user-visible function that cannot be implemented generically: the function for opening a new file descriptor. Now you must implement this, and the rest of the incomplete functions in lib/file.c. When you're done, you'll have a working file system!

You will use the buffer cache's locking primitives to prevent race conditions between environments. To minimize the risk of deadlock, you should ensure that locks are held only during the execution of lib/file.c functions. In other words, no locks should be held when one of these interface functions returns to its caller.

The locking protocol makes sure that important file system metadata doesn't change during an operation. For example, it ensures that the Direntry pointer returned by dir_walk is actually for the right file name. (Without locking, another environment could potentially run, unlink the name, and create another file at the same position in the directory, creating a race.) It also ensures that the file size does not change during a read operation, and that write operations do not conflict.

You will now fill out the rest of the missing pieces of the file implementation. Feel free to work in any order, but the exercises guide you to pass the testfile tests in order.

Part B: Spawning Processes from the File System

In this exercise, you'll extend spawn from Lab 4 to load program images from the file system as well as from kernel binary images. If spawn is passed a binary name like "/ls" that begins with a slash, it will read the program data from disk; otherwise, it will read the program data from the kernel. Luckily, this requires just a couple of changes.

Sharing pages between environments

We would like to share file descriptor state across fork and spawn, but file descriptor state is kept in user-space memory. Right now, on fork, the memory will be marked copy-on-write, so the state will be duplicated rather than shared. (This means that running "(date; ls) >file" will not work properly, because even though date updates its own file offset, ls will not see the change.) On spawn, the memory will be left behind, not copied at all. (Effectively, the spawned environment starts with no open file descriptors.)

We will change both fork and spawn to know that certain regions of memory are used by the "library operating system" and should always be shared. Rather than hard-code a list of regions somewhere, we will set an otherwise-unused bit in the page table entries (just like we did with the PTE_COW bit in fork).

We have defined a new PTE_SHARE bit in inc/lib.h. If a page table entry has this bit set, then by convention, the PTE should be copied directly from parent to child in both fork and spawn. Note that this is different from marking it copy-on-write: as described in the first paragraph, we want to make sure to share updates to the page.
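The convention can be expressed as a function from the parent's page table entry to the permissions used in the child. This is a sketch rather than the required implementation. PTE_P, PTE_W and PTE_U are the standard x86 bits; the PTE_SHARE and PTE_COW values are assumed (both must be software-available PTE bits), and the function names are hypothetical.

```c
#include <stdint.h>

#define PTE_P     0x001u  /* present */
#define PTE_W     0x002u  /* writable */
#define PTE_U     0x004u  /* user-accessible */
#define PTE_SHARE 0x400u  /* assumed value: software-available bit */
#define PTE_COW   0x800u  /* assumed value: software-available bit */

/* Permissions for the child's mapping on fork; 0 means "do not map". */
static uint32_t fork_child_perm(uint32_t pte)
{
    if (!(pte & PTE_P))
        return 0;
    if (pte & PTE_SHARE)  /* shared: same page, same permissions */
        return PTE_P | PTE_U | PTE_SHARE | (pte & PTE_W);
    if (pte & (PTE_W | PTE_COW))  /* private and writable: copy-on-write */
        return PTE_P | PTE_U | PTE_COW;
    return PTE_P | PTE_U;  /* read-only pages can simply be shared */
}

/* On spawn only PTE_SHARE pages carry over; everything else is left behind. */
static uint32_t spawn_child_perm(uint32_t pte)
{
    if ((pte & PTE_P) && (pte & PTE_SHARE))
        return PTE_P | PTE_U | PTE_SHARE | (pte & PTE_W);
    return 0;
}
```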

Use make run-testpteshare to check that your code is behaving properly. You should see lines that say "fork handles PTE_SHARE right" and "spawn handles PTE_SHARE right".

Use make run-testfdsharing to check that file descriptors are shared properly. You should see lines that say "read in child succeeded" and "read in parent succeeded".

Part C: A Shell

In this part of the lab, you'll extend JOS to handle everything necessary to support a shell. We've done a lot of the work for you, but you must (1) make it possible to share file descriptors across environments, (2) clean up a couple loose ends, and (3) implement file redirection in the shell.

At this point, you can use make run-initsh to boot into the current version of the shell, which can already do simple commands like "ls". As you progress through the lab, the shell will become more functional, and you will be able to do things like add redirections.

Pipes

Pipes and the console are both I/O stream interfaces. This means that they support reading and/or writing, but not file positions. Like Unix, JOS represents these streams using file descriptors. To support this, the file descriptor subsystem uses a simple virtual file system layer, implemented by struct Dev, so that disk files, console files, and pipes all implement the same file descriptor functions.

A pipe is a shared data buffer accessed via two file descriptors, one for writing data into the pipe and one for reading data out of it. Unix command lines like "ls | sort" use pipes. The shell creates a pipe, hooks up ls's standard output to the write end of the pipe, and hooks up sort's standard input to the read end of the pipe. As a result, ls's output is processed by sort. You may want to read the pipe manual page for background, and the pipe section of Dennis Ritchie's UNIX history paper for interesting history.

In Unix-like designs, each pipe's shared data buffer is stored in the kernel. Of course, this is not how we implement pipes on an exokernel! Your library operating system represents a pipe, including its shared buffer, by a single struct Pipe. The struct Pipe is stored on its own page to make sharing easier, and mapped into the file mapping area of both the reading and the writing file descriptor. Here's the structure:

The bytes written to the pipe can be thought of as numbered starting from 0. The write position p_wpos gives the number of the next byte that will be written, and the read position p_rpos gives the number of the next byte to be read. After a writer writes "abc" to the pipe, it will enter this state:

Since p_rpos != p_wpos, the pipe contains data. The next read from the pipe can return up to 3 characters. For example, after a read() of one byte:

This data structure is safe for concurrent updates as long as there is a single reader and a single writer, since only the reader updates p_rpos and only the writer updates p_wpos.

Since the pipe buffer is not infinite, byte i is stored in pipe buffer index i % PIPEBUFSIZ. Thus, after a couple reads and writes, the pipe might enter this state:

If p_rpos == p_wpos, the pipe is empty. Any read call should yield until a writer adds information to the pipe. Similarly, if p_wpos - p_rpos == PIPEBUFSIZ, the pipe is full. Any write call should yield until a reader opens up some space in the pipe.
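The position arithmetic can be sketched with non-blocking helpers that copy as much as they can and report how many bytes moved. The struct is a simplified stand-in for struct Pipe, with the field names taken from the text. The tiny PIPEBUFSIZ value and the function names are assumptions for illustration. A real read or write would yield and retry when these helpers return 0.

```c
#include <stddef.h>
#include <stdint.h>

#define PIPEBUFSIZ 8u  /* hypothetical size, kept small to show wraparound */

struct PipeSketch {
    uint32_t p_rpos;             /* number of the next byte to read */
    uint32_t p_wpos;             /* number of the next byte to write */
    uint8_t  p_buf[PIPEBUFSIZ];  /* byte i lives at index i % PIPEBUFSIZ */
};

/* Copy up to n bytes into the pipe; stops early when the pipe is full. */
static size_t pipe_try_write(struct PipeSketch *p, const void *src, size_t n)
{
    const uint8_t *s = src;
    size_t done = 0;
    while (done < n && p->p_wpos - p->p_rpos < PIPEBUFSIZ) {
        p->p_buf[p->p_wpos % PIPEBUFSIZ] = s[done++];
        p->p_wpos++;  /* only the writer updates p_wpos */
    }
    return done;
}

/* Copy up to n bytes out of the pipe; stops early when the pipe is empty. */
static size_t pipe_try_read(struct PipeSketch *p, void *dst, size_t n)
{
    uint8_t *d = dst;
    size_t done = 0;
    while (done < n && p->p_rpos != p->p_wpos) {
        d[done++] = p->p_buf[p->p_rpos % PIPEBUFSIZ];
        p->p_rpos++;  /* only the reader updates p_rpos */
    }
    return done;
}
```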

Closed Pipes

There is a catch -- maybe we are trying to read from an empty pipe but all the writers have exited. Then there is no chance that there will ever be more data in the pipe, so waiting is futile. In such a case, Unix signals end-of-file by returning 0. So will we. To detect that there are no writers left, we could put reader and writer counts into the pipe structure and update them every time we fork or spawn and every time an environment exits. This is fragile -- what if the environment doesn't exit cleanly? Instead we can use the kernel's page reference counts, which are guaranteed to be accurate.

Recall that the kernel page structures are mapped read-only in user environments. The library function pageref(void *ptr) returns the number of page table references to the page containing the virtual address ptr. It works by first examining vpt[] to find ptr's physical address, then looking up the relevant struct Page in the UPAGES array and returning its pp_ref field. So, for example, if fd is a pointer to a particular struct Fd, pageref(fd) will tell us how many different references there are to that structure.

Three pages are allocated for each pipe: the struct Fd for the reading file descriptor rfd, the struct Fd for the writing file descriptor wfd, and the struct Pipe p shared by both. The struct Pipe page is mapped once per file descriptor reference. Thus, the following equation holds: pageref(rfd) + pageref(wfd) = pageref(p). A reader can check whether there are any writers left by examining these counts. If pageref(p) == pageref(rfd), then pageref(wfd) == 0, and there are no more writers. A writer can check for readers in the same manner.
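The check reduces to a single comparison once the two counts are known. In the sketch below the counts are passed in as plain integers so that the logic stands alone; in the library they would come from pageref(). The function name is hypothetical.

```c
#include <stdbool.h>

/* Given the reference count of our own end's struct Fd page and of the
 * shared struct Pipe page, report whether the other end is closed.
 * Since pageref(rfd) + pageref(wfd) == pageref(p), equal counts mean the
 * other end has zero references. */
static bool other_end_closed(int pageref_own_fd, int pageref_pipe)
{
    return pageref_own_fd == pageref_pipe;
}
```

For example, with one reader reference and two writer references the pipe page has three references, and a reader sees 1 != 3, so writers remain.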

Pipe Races

File descriptor structures use shared memory that is written concurrently by multiple processes. That creepy shiver that just ran up your back is justified: this kind of situation is rife with race conditions. We've made one race condition, concerning pipes, particularly easy to run into.

The race is that the two calls to pageref() in _pipeisclosed might not happen atomically. If another process duplicates or closes the file descriptor page between the two calls, the comparison will be meaningless. To make it concrete, suppose that we run:

Run "make run-testpiperace2" to see this race in action. You should see "RACE: pipe appears closed" when the race occurs.

This race isn't that hard to fix. Comparing the counts can only be incorrect if another environment ran between when we looked up the first count and when we looked up the second count. In other words, we need to make sure that _pipeisclosed executes atomically. Since it doesn't change any variables, we can simply rerun it until it runs without being interrupted; the code is so short that it will usually not be interrupted.

But how can we tell whether our environment has been interrupted? In the uniprocessor JOS kernel, this can be simple: just check the env_runs variable in our environment structure. Each time the kernel runs an environment, it increments that environment's env_runs. Thus, user code can record env->env_runs, do its computation, and then look at env->env_runs again. If env_runs didn't change, then the environment was not interrupted. Conversely, if env_runs did change, then the environment was interrupted.
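The retry idea can be sketched as a loop that samples env_runs before and after reading the two counts. This is an illustration under stated assumptions, not the required solution. The struct is a stand-in that holds only env_runs, the counts are read through hypothetical callbacks in place of pageref(), and all names other than env_runs are invented.

```c
#include <stdbool.h>

struct EnvSketch {
    volatile unsigned env_runs;  /* incremented each time the env is run */
};

/* Hypothetical stand-in for pageref(): returns a page's reference count. */
typedef int (*pageref_fn)(void *ctx);

/* Read both counts, and repeat until both reads happened within a single
 * scheduling of this environment (env_runs unchanged across them). */
static bool pipe_closed_retry(const struct EnvSketch *env,
                              pageref_fn ref_fd, pageref_fn ref_pipe,
                              void *ctx)
{
    for (;;) {
        unsigned runs = env->env_runs;
        int nfd = ref_fd(ctx);
        int npipe = ref_pipe(ctx);
        if (runs == env->env_runs)
            return nfd == npipe;
        /* We were interrupted between the reads, so the two counts may
         * describe different moments in time. Discard them and retry. */
    }
}
```

The loop terminates as soon as one pass completes without a context switch, which is the common case for code this short.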

Run "make run-testpiperace2" to check whether the race still happens. If it's gone, you should not see "RACE: pipe appears closed", and you should see "race didn't happen". You should also see plenty of your "avoided" messages, indicating places where the race would have happened if you weren't being so careful.

The shell itself

Run make run-initsh. This will run your kernel starting user/initsh, which sets up the console as file descriptors 0 and 1 (standard input and standard output), then spawns sh, the shell. Run ls and cat lorem.

Note that the user library routine printf prints straight to the console, without using the file descriptor code. This is great for debugging but not great for piping into other programs. To print output to a particular file descriptor (for example, 1, standard output), use fprintf(1, "...", ...). See user/ls.c for examples.

Run make run-testshell to test your shell. Testshell simply feeds the above commands (also found in fs/testshell.sh) into the shell and then checks that the output matches fs/testshell.key.


Lab 5: File System and Shell

Due 11:59pm Friday, November 11

Introduction

Lab Requirements

Merging Lab 5

File system code

Sectors and blocks