In this lab, you will implement an exokernel-style file system library,
client-side file descriptors, and a Unix-like command shell!
The file system uses a shared buffer cache server implemented,
in microkernel fashion, using a user-space environment.
Other environments access disk blocks
by making IPC requests to this special file system environment.
You will need to do all of the regular exercises described in the lab,
and complete at least one challenge problem, providing a short
(e.g., one or two paragraph) description of what you did.
Place the write-up in a file called answers.txt (plain text)
or answers.html (HTML format)
in the top level of your lab5 directory
before handing in.
To fetch the new source, use Git to commit your lab 4 work and save that code
in your lab4 branch, which should have been created in the
last lab. Then fetch the latest version of the course repository, and
update your local branch based on our lab5 branch, origin/lab5:
Before you start lab 5, make sure that your lab 4 code is still working;
make grade-lab4 should give you full credit.
The JOS file system is implemented in exokernel fashion. A complete
file system implementation is linked into each user environment. The
disk's contents are mapped into memory that can be shared by any
environment. A special buffer cache environment is given the I/O
privilege necessary to read and write the disk hardware. The buffer cache
accesses the disk on demand and uses JOS's IPC mechanism to share the
resulting blocks with other environments. The buffer cache can also
lock individual disk blocks, so environments can avoid race
conditions.
This exokernel style is unusual. Modern file systems are implemented
either inside monolithic kernels or as separate processes (a more
microkernel-like style). JOS's design has many issues. Every environment
has effective write access to the whole disk (XN solved this
problem, but you aren't implementing XN, thank gosh). An unexpected
environment crash or infinite loop can leave a block in the locked state
indefinitely. However, the exokernel style fits well with the rest of JOS,
and lets us focus more on file system internals than (for example) the
details of a client-server IPC mechanism.
inc/fs.h
- Structure and constant definitions for the file system layout
(which is shared by both memory and disk), and macros
for communication with the buffer cache.
inc/fd.h
- The JOS file descriptor interface. File descriptors use an
object-oriented design. The struct Dev structure defines
operations relevant for a class of file descriptors. File descriptor
operations like read() call out to class-specific struct Dev
methods to do most of the work. We provide three descriptor
classes: file system descriptors, console descriptors, and pipe
descriptors.
inc/lib.h
- Declares new file descriptor functions like read().
lib/fd.c
- File descriptor implementation.
lib/file.c
- File system implementation, including the struct Dev
methods for file system descriptors.
fs/
- Code for the buffer cache.
fs/ide.h, fs/ide.c (no exercises)
- Code for accessing an IDE hard drive using programmed I/O instructions.
fs/bufcache.c (no exercises)
- The buffer cache environment.
Sectors and blocks
Disks perform reads and writes in units of sectors,
which today are almost universally 512 bytes each.
File systems, though, allocate and use disk storage in units of blocks.
Be wary of the distinction between the two terms:
sector size is a property of the disk hardware,
whereas block size is an aspect of the operating system using the disk.
A file system's block size must be a multiple of
the sector size of the underlying disk.
The original UNIX file system used a block size of 512 bytes,
the same as the sector size of the underlying disk.
Most modern file systems use a larger block size, however,
because storage space has gotten much cheaper
and it is more efficient to manage storage at larger granularities.
Our file system will use a block size of 4096 bytes
to match the processor's page size.
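The sector/block relationship can be sketched in a few lines of C. The constant and function names below are illustrative, not necessarily those used in the JOS sources; the only facts taken from the text are the 512-byte sector and the 4096-byte block.

```c
#include <stdint.h>

#define SECTSIZE 512u                   /* disk sector size in bytes */
#define BLKSIZE  4096u                  /* file system block size in bytes */
#define BLKSECTS (BLKSIZE / SECTSIZE)   /* sectors per block: 8 */

/* First disk sector backing file system block `blockno`. */
static uint32_t block_first_sector(uint32_t blockno)
{
    return blockno * BLKSECTS;
}

/* File system block that contains disk sector `sector`. */
static uint32_t sector_to_block(uint32_t sector)
{
    return sector / BLKSECTS;
}
```

Reading one block therefore means transferring BLKSECTS consecutive sectors starting at block_first_sector(blockno).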
The JOS buffer cache
JOS user environments access the file system through a combination
buffer cache and lock server called bufcache.
You have no exercises to complete in bufcache, but you
need to understand its interface (and, of course, you may be interested
in it anyway).
The buffer cache responds to IPC requests from other environments.
Each request contains a block number and a request type.
For most requests, the buffer cache responds by sending back a
shared-memory page with the corresponding disk block.
All pages are sent with PTE_P|PTE_U|PTE_W|PTE_SHARE permission
(PTE_SHARE is described later in the lab).
The accompanying IPC value is ≥ 0 on success and an error code < 0 on error.
The simplest requests simply read and write disk blocks.
BCREQ_MAP
- Return the block's contents. This request always succeeds, even if
the block is locked.
BCREQ_FLUSH
- Write the current contents of the block out to disk (because another
environment has changed the corresponding memory page). Does not return
a page.
File system users also coordinate using the buffer cache, using
its per-block advisory locks. The locks are called
advisory because any environment can always get read-write access
to any page with BCREQ_MAP. However, file system
implementations coordinate their updates by explicitly locking the
corresponding blocks. For instance, reads from and writes to a given file
should only happen while the corresponding inode block is locked. The
requests are as follows.
BCREQ_MAP_WLOCK
- Return the block's contents and obtain an exclusive lock. If the block
is currently locked by some other environment, the buffer cache will delay
its response until that environment unlocks. (The lock queue can hold up
to 8 environments per block; the 9th and later environments are rejected
immediately with an -E_AGAIN error.)
BCREQ_UNLOCK
- Unlock a block previously locked by BCREQ_MAP_WLOCK. Does
not return the block's contents. The buffer cache checks to make sure that
the unlocking environment actually held a lock.
BCREQ_UNLOCK_FLUSH
- Combines the effects of UNLOCK and FLUSH.
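To make the advisory-lock semantics concrete, here is a toy single-block model. It is not the bufcache implementation: the struct, the function names, and the "return false instead of waiting" behavior are all assumptions made for illustration, and the wait queue is left out entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one block's exclusive advisory lock (not real bufcache code). */
struct blocklock {
    int32_t holder;   /* environment id of the lock holder, or -1 if unlocked */
};

/* BCREQ_MAP: always succeeds, whether or not the block is locked. */
static bool model_map(const struct blocklock *l)
{
    (void) l;
    return true;
}

/* BCREQ_MAP_WLOCK: true if the lock is granted immediately. A false result
 * stands for "the real buffer cache would delay its reply until the unlock". */
static bool model_map_wlock(struct blocklock *l, int32_t envid)
{
    if (l->holder != -1)
        return false;
    l->holder = envid;
    return true;
}

/* BCREQ_UNLOCK: succeeds only if `envid` actually holds the lock. */
static bool model_unlock(struct blocklock *l, int32_t envid)
{
    if (l->holder == -1 || l->holder != envid)
        return false;
    l->holder = -1;
    return true;
}
```

Note how model_map ignores the lock state: that is what makes the locks advisory rather than mandatory.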
Other IPCs: You probably won't need to use these.
The buffer cache supports shared locks with
BCREQ_MAP_RLOCK. This could be useful for some operations
(for instance, letting multiple environments read the same file at
once), but we recommend you rely on BCREQ_MAP_WLOCK at first.
The buffer cache also tracks whether blocks have been
initialized. Each block has an initialization state that starts at 0. A
BCREQ_MAP_[RW]LOCK IPC returns the corresponding block's
initialization state. The BCREQ_INITIALIZE IPC sets a block's
initialization state to 1. The buffer cache remembers initialization
states for as long as it runs. You won't need to manipulate initialization
states in the regular exercises.
File system data structures
The file system you will work with is much simpler
than most "real" file systems,
but it is powerful enough to provide the basics:
creating, reading, writing, and deleting files
organized in a hierarchical directory structure.
Since JOS is a "single-user" operating system,
our file system doesn't support the UNIX notions
of file ownership or permissions.
It also currently does not support hard links, symbolic links,
time stamps, or special device files.
Most UNIX file systems divide available disk space
into two main types of regions:
inode regions and data regions.
Each file corresponds to one inode,
which holds critical metadata about the file
such as its stat attributes and pointers to its data blocks.
The data regions are divided into much larger (typically 8KB or more)
data blocks, within which the file system stores
file data and directory metadata.
Directory entries contain file names and pointers to inodes;
a file is said to be hard-linked
if multiple directory entries in the file system
refer to that file's inode.
Both files and directories logically consist of a series of data blocks,
which may be scattered throughout the disk
much like the pages of an environment's virtual address space
can be scattered throughout physical memory.
User processes can read and write the contents of files directly,
but the file system handles all modifications to directories itself
as a part of actions such as file creation and deletion.
Our file system does, however, allow user environments
to read directory metadata directly
(e.g., with read),
so user environments can perform directory scanning operations
themselves (e.g., to implement the ls program).
The disadvantage of this approach to directory scanning,
and the reason most modern UNIX variants discourage it,
is that it makes application programs dependent
on the format of directory metadata,
making it difficult to change the file system's internal layout
without changing or at least recompiling application programs as well.
Superblocks

Layout for JOS file system with N blocks and I
inodes. N must be at least
1+⌈N/4096⌉+I. Only I-1 inode blocks
are required because 0 is an invalid inode number, so inode 0 isn't stored.
File systems typically reserve certain disk blocks,
at "easy-to-find" locations on the disk
such as the very start or the very end,
to hold metadata describing properties of the file system as a whole,
such as the block size, disk size,
any metadata required to find the root directory,
the time the file system was last mounted,
the time the file system was last checked for errors,
and so on.
These special blocks are called superblocks.
Our file system's superblock layout is defined by struct Super
in inc/fs.h.
The file system superblock will always occupy block 1 on the disk;
boot loaders and partition tables use block 0, so
most file systems don't use the very first disk block.
Many "real" file systems maintain multiple superblocks,
replicated throughout several widely-spaced regions of the disk,
so that if one of them is corrupted
or the disk develops a media error in that region,
the other superblocks can still be found and used to access the file system.
Freemap: Managing block allocation
Just as the kernel manages physical memory allocation
so that physical pages aren't inappropriately reused,
a file system must manage disk blocks
to ensure that a given block is used for only one purpose at a time.
Many file systems keep track of free disk blocks
using a bitmap rather than a linked list of free blocks.
A bitmap simplifies block placement (finding a free block in a particular disk region),
is simple to manage and keep consistent,
and can be loaded into memory with few seeks.
Though some operations are slow with a bitmap—it can take O(N) time to find a free block—they can be sped up using auxiliary memory data structures.
The JOS file system tracks whether each block is allocated
using an array of bytes, not bits.
The Ith byte in the freemap data structure
is 1 iff block I is free.
Using bytes rather than bits wastes space, but makes freemap operations
much easier to code (no bit swizzling).
To set up a freemap,
we reserve a contiguous region of blocks
large enough to hold one byte for each disk block,
starting at block 2 (just after the superblock).
Thus, we must reserve one block for the freemap
for every 4096 blocks in the file system.
Note that the freemap includes bytes for all blocks,
including the superblock and the freemap itself.
The bytes for these special blocks are set to 0, indicating that the corresponding blocks
are in use.
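The byte-per-block freemap can be sketched as follows. This is a simplified model with hypothetical function names: it marks only the boot block, the superblock, and the freemap blocks as in use, and ignores inode blocks and file data that a real disk image would also have allocated.

```c
#include <stdint.h>

#define BLKSIZE 4096u   /* file system block size in bytes */

/* Number of blocks needed to hold one freemap byte per disk block. */
static uint32_t freemap_nblocks(uint32_t nblocks)
{
    return (nblocks + BLKSIZE - 1) / BLKSIZE;
}

/* Mark block 0, the superblock (block 1), and the freemap itself
 * (starting at block 2) as in use (0); mark every other block free (1). */
static void freemap_init(uint8_t *freemap, uint32_t nblocks)
{
    uint32_t reserved = 2 + freemap_nblocks(nblocks);
    for (uint32_t b = 0; b < nblocks; b++)
        freemap[b] = (b >= reserved);
}

/* O(N) linear scan for a free block; returns 0 if none is free
 * (block 0 is never free, so 0 is unambiguous). */
static uint32_t freemap_find_free(const uint8_t *freemap, uint32_t nblocks)
{
    for (uint32_t b = 0; b < nblocks; b++)
        if (freemap[b] == 1)
            return b;
    return 0;
}
```

For an 8192-block disk the freemap occupies blocks 2 and 3, so the first block this model reports as free is block 4.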
Inodes
The layout of a JOS inode
is described by struct Inode
in inc/fs.h.
The inode includes the file's size,
type (regular file or directory),
reference count,
and pointers to the blocks comprising the file.
For simplicity we will use this one Inode structure
to represent file metadata as it appears
both on disk and in memory.
Some of its fields
are only meaningful in memory, and might have garbage values on disk;
we must initialize these fields whenever we read an Inode
structure into memory for the first time. (That's what BCREQ_INITIALIZE
is for.)
Each Inode contains two reference counts. First,
i_refcount is the "true" reference count; it measures the
number of hard links (directory entries) pointing to the inode. (For the
root directory, it is 1.) In contrast, the i_opencount value
is only valid in memory. It counts the number of references to an inode
from any currently running process. A file's data blocks are not
reclaimed until its inode is unreferenced from the file system and
no process has the file open.
A single Inode structure is 4096 bytes big.
This is much larger than for most file systems, and wastes a lot of space
for small files.
However, since the buffer cache's unit of locking is a single block,
it is extremely convenient to have a separate block for each inode.
The i_direct array in struct Inode contains space
to store the block numbers for the file's data blocks.
There are 1018 direct pointers, limiting files to at most 4169728 bytes.
Most Unix-like file systems also use indirect pointers (and doubly-indirect,
triply-indirect, and so on) to support larger files.
You may implement these pointers for a challenge, but our inodes are big
enough to support pretty large files without indirect pointers.
If an inode is unreferenced (i_refcount == 0 && i_opencount == 0),
then the rest of its contents are ignored. In particular, any
nonzero i_direct data blocks may be free or used by other
inodes.
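The size limit and the "unreferenced" rule can be checked with a little arithmetic. NDIRECT, MAXFILESIZE, and the two helper functions below are names invented for this sketch; only the numbers 1018 and 4096 and the field names come from the text.

```c
#include <stdbool.h>
#include <stdint.h>

#define BLKSIZE     4096u                 /* block size in bytes */
#define NDIRECT     1018u                 /* direct pointers per inode */
#define MAXFILESIZE (NDIRECT * BLKSIZE)   /* 1018 * 4096 = 4169728 bytes */

/* Index into i_direct for the block holding byte `off`,
 * or -1 if `off` lies beyond the largest possible file. */
static int direct_index(uint32_t off)
{
    if (off >= MAXFILESIZE)
        return -1;
    return (int) (off / BLKSIZE);
}

/* An inode's remaining contents matter only while something refers to it. */
static bool inode_is_unreferenced(uint32_t i_refcount, uint32_t i_opencount)
{
    return i_refcount == 0 && i_opencount == 0;
}
```

A file that has been unlinked but is still open has i_refcount == 0 and i_opencount > 0, so it is not yet unreferenced and its blocks must be kept.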
Inode 1 corresponds to the file system's root directory. All other
inodes have numbers 2 or higher.
Directories and regular files
An Inode in our file system
can represent either a regular file or a directory;
these two types of "files" are distinguished by the i_ftype field.
The file system manages regular files and directory-files
in exactly the same way,
except that it does not interpret the contents of the data blocks
associated with regular files at all,
whereas the file system interprets the contents
of a directory-file as a series of Direntry structures
describing the files and subdirectories within the directory.
Each Direntry contains a file name (de_name + de_namelen)
and an inode number (de_inum). The
file name is only valid if de_inum is nonzero.
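A directory lookup over such entries might look like the sketch below. The field names follow the text, but the field types, the MAXNAMELEN capacity, and the dir_lookup helper are assumptions; consult inc/fs.h for the real layout.

```c
#include <stdint.h>
#include <string.h>

#define MAXNAMELEN 60   /* hypothetical name capacity, not taken from inc/fs.h */

/* Illustrative directory entry; the real definition lives in inc/fs.h. */
struct Direntry {
    uint32_t de_inum;              /* inode number; 0 marks an unused slot */
    int      de_namelen;           /* length of de_name in bytes */
    char     de_name[MAXNAMELEN];  /* file name, not necessarily NUL-terminated */
};

/* Return the inode number for `name`, or 0 if no valid entry matches. */
static uint32_t dir_lookup(const struct Direntry *ents, int nents, const char *name)
{
    size_t len = strlen(name);
    for (int i = 0; i < nents; i++) {
        if (ents[i].de_inum == 0)
            continue;   /* the name is meaningless when de_inum == 0 */
        if ((size_t) ents[i].de_namelen == len
            && memcmp(ents[i].de_name, name, len) == 0)
            return ents[i].de_inum;
    }
    return 0;
}
```

Because 0 is an invalid inode number, it doubles as both the "unused slot" marker and the "not found" return value.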
Part A: The File System
Disk Access
The buffer cache environment in our operating system
needs to be able to access the disk,
but we have not yet implemented any disk access functionality in our kernel.
Instead of taking the conventional "monolithic" operating system strategy
of adding an IDE disk driver to the kernel
along with the necessary system calls to allow the file system to access it,
we will instead implement the IDE disk driver
as part of the user-level buffer cache environment.
We will still need to modify the kernel slightly,
in order to set things up so that the buffer cache
has the privileges it needs to implement disk access itself.
It is easy to implement disk access in user space this way
as long as we rely on polling, "programmed I/O" (PIO)-based disk access
and do not use disk interrupts.
It is possible to implement interrupt-driven device drivers in user mode as well
(the L3 and L4 kernels do this, for example),
but it is more difficult
since the kernel must field device interrupts
and dispatch them to the correct user-mode environment.
The x86 processor uses the IOPL bits in the EFLAGS register
to determine whether protected-mode code
is allowed to perform special device I/O instructions,
such as IN and OUT.
The IOPL bits equal the minimum (i.e. numerically highest) privilege level
allowed to perform IN and OUT instructions, so if those bits are 0, only
the kernel can execute INs and OUTs.
All of the IDE disk registers we need to access
are located in the x86's I/O space (rather than being memory-mapped),
so to let the buffer cache environment access the disk, all we need to do is
manipulate the IOPL bits.
But no other environment should be able to access I/O space.
To keep things simple, from now on we will arrange things so that the
buffer cache always has ID ENVID_BUFCACHE.
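The IOPL rule can be modeled in a few lines. On x86 the IOPL field occupies bits 12 and 13 of EFLAGS; the macro and function names below are chosen for this sketch and may differ from those in the JOS headers. The model also ignores the TSS I/O permission bitmap, which the processor consults when the privilege check fails.

```c
#include <stdbool.h>
#include <stdint.h>

#define FL_IOPL_SHIFT 12            /* IOPL occupies EFLAGS bits 12-13 */
#define FL_IOPL_MASK  0x00003000u

/* Extract the I/O privilege level (0-3) from an EFLAGS value. */
static unsigned eflags_iopl(uint32_t eflags)
{
    return (eflags & FL_IOPL_MASK) >> FL_IOPL_SHIFT;
}

/* IN and OUT are permitted when the current privilege level (CPL)
 * is numerically no greater than IOPL. */
static bool io_allowed(uint32_t eflags, unsigned cpl)
{
    return cpl <= eflags_iopl(eflags);
}
```

With IOPL = 0 only ring 0 passes the check; with IOPL = 3 user-mode code (CPL 3) passes as well.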
Exercise 0.
Did you resolve struct Page
(see "Merging Lab 5" above)? Just checking!
|
Exercise 1.
Modify your kernel's environment initialization function,
env_alloc in env.c,
so that it gives environment ENVID_BUFCACHE I/O privilege,
but never gives that privilege to any other environment.
After this exercise, make run-testfile should
print a message "bufcache can do I/O".
|
Do you have to do anything else
to ensure that this I/O privilege setting
is saved and restored properly when you subsequently switch
from one environment to another?
Make sure you understand how this environment state is handled.
This lab uses the file obj/kernel.img
as the image for disk 0 (typically "Drive C" under DOS/Windows) as before,
and the (new) file obj/fs.img
as the image for disk 1 ("Drive D").
In this lab your file system should only ever touch disk 1;
disk 0 is used only to boot the kernel.
If you manage to corrupt either disk image in some way,
you can reset both of them to their original, "pristine" versions
simply by typing:
$ rm obj/kernel.img obj/fs.img
$ make
Challenge!
Implement interrupt-driven IDE disk access,
with or without DMA.
You can decide whether to move the device driver into the kernel,
keep it in user space along with the file system,
or even (if you really want to get into the microkernel spirit)
move it into a separate environment of its own.
|
Demand-paged buffer cache
The main JOS buffer cache is stored, of course, in the buffer cache
environment.
The 2GB region of virtual address space from 0x50000000 (DISKMAP)
up to 0xD0000000 (DISKMAP + DISKSIZE)
is reserved to map disk pages.
These pages are read on demand based on IPC requests.
For simplicity, other user environments use the same virtual memory
region to map buffer cache blocks, although they use different names
(FSMAP and FSMAP + DISKSIZE).
These blocks are demand paged.
If a page fault happens in the file system region,
a page fault handler will load the corresponding page from the buffer cache
by IPC.
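Since each block is one page, translating between a block number and its address in the disk map region is simple arithmetic. The sketch below assumes block 0 is mapped at the very start of the region; the helper names are invented here, and the constants are the values quoted above.

```c
#include <stdint.h>

#define BLKSIZE  4096u
#define FSMAP    0x50000000u   /* start of the disk map region */
#define DISKSIZE 0x80000000u   /* 2GB, so the region ends at 0xD0000000 */

/* Virtual address at which block `blockno` is mapped. */
static uint32_t block_to_va(uint32_t blockno)
{
    return FSMAP + blockno * BLKSIZE;
}

/* Block number for a (possibly faulting) address,
 * or -1 if the address lies outside the disk map region. */
static int32_t va_to_block(uint32_t va)
{
    if (va < FSMAP || va - FSMAP >= DISKSIZE)
        return -1;
    return (int32_t) ((va - FSMAP) / BLKSIZE);
}
```

A page fault handler for this region would typically start with a range check like the one in va_to_block before deciding which block to request.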
Exercise 2.
Implement the bcache_pgfault_handler function
in lib/file.c.
testfile should print "initial fsck is good"
when you get this right.
|
File descriptors
Unix file descriptors are a general notion that encompasses
file I/O, pipes, console I/O, etc. In JOS, each of these device types has a
corresponding struct Dev, with pointers to the functions that
implement read/write/etc. for that device type. (Thus, struct Dev
is like an object-oriented class.) lib/fd.c
implements the general Unix-like file descriptor interface on top of this.
Each struct Fd indicates its device type, and most of the
functions in lib/fd.c simply dispatch operations to functions
in the appropriate struct Dev.
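The dispatch pattern can be illustrated with a stripped-down model. This is not the code in inc/fd.h or lib/fd.c: the field names, the single dev_read method, the direct pointer from Fd to Dev, and the devzero device are simplifications invented for the example.

```c
#include <stddef.h>
#include <string.h>

struct Fd;

/* Illustrative subset of a device class: a table of function pointers. */
struct Dev {
    const char *dev_name;
    long (*dev_read)(struct Fd *fd, void *buf, size_t n);
};

/* Illustrative descriptor: records which device class it belongs to. */
struct Fd {
    const struct Dev *fd_dev;
};

/* A trivial device whose read fills the buffer with zero bytes. */
static long zero_read(struct Fd *fd, void *buf, size_t n)
{
    (void) fd;
    memset(buf, 0, n);
    return (long) n;
}

static const struct Dev devzero = { "zero", zero_read };

/* Generic read: dispatch to the descriptor's device-specific method. */
static long fd_read(struct Fd *fd, void *buf, size_t n)
{
    if (fd == NULL || fd->fd_dev == NULL || fd->fd_dev->dev_read == NULL)
        return -1;
    return fd->fd_dev->dev_read(fd, buf, n);
}
```

The generic function knows nothing about the device; adding a new descriptor class only requires supplying another struct Dev.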
lib/fd.c also maintains the file descriptor table region in
each environment's address space, starting at FDTABLE. This
area reserves a page's worth (4KB) of address space for each of the up to
NFD (currently 32) file descriptors the application can have
open at once. At any given time, a particular file descriptor table page is
mapped if and only if the corresponding file descriptor is in use. Each
file descriptor also has an optional "data page" in the region starting at
FDDATA, which we will use for pipes.
For nearly all interactions with files, user code will go through the
functions in lib/fd.c.
Exercise 3 (no code).
Look over and analyze the code in inc/fd.h and
lib/fd.c .
To check your understanding, see if you can answer some questions
(no need to write up the answers):
When is memory mapped at a location in FDTABLE ?
What virtual memory features does fd_find_unused rely on?
If file descriptors were implemented as C++ or Java classes,
what would be their virtual functions?
|
File system interface
Each device type has at least one user-visible function that
cannot be implemented generically: the function for opening a new
file descriptor. Now you must implement this, and the rest of the
incomplete functions in lib/file.c. When you're done, you'll
have a working file system!
You will use the buffer cache's locking primitives to prevent race
conditions between environments. To minimize the risk of deadlock, you
should ensure that locks are held only during the execution of
lib/file.c functions. In other words, no locks should be
held when one of these interface functions returns to its caller.
The locking protocol makes sure that important file
system metadata doesn't change during an operation. For example, it
ensures that the Direntry pointer returned by dir_walk is
actually for the right file name. (Without locking, another environment
could potentially run, unlink the name, and create another file at the same
position in the directory, creating a race.) It also ensures that the file
size does not change during a read operation, and that
write operations do not conflict.
Exercise 4.
Start implementing open in lib/file.c .
It must find an unused file descriptor
using fd_find_unused(),
walk the path hierarchy,
open the corresponding inode,
and create a new file descriptor on success.
Be sure your code fails gracefully
if the maximum number of files are already open,
or if any of the IPC requests to the buffer cache fail.
For hints on style, consider the unlink implementation.
testfile should now pass the open
and file_stat tests.
|
You will now fill out the rest of the missing pieces of the file
implementation. Feel free to work in any order, but the exercises guide
you to pass the testfile tests in order.
Exercise 5.
Implement devfile_read in lib/file.c .
testfile should now pass the file_read
and file_read across a block boundary tests.
|
Exercise 6.
Implement devfile_write in lib/file.c .
testfile should now pass the file_write
and file_read after file_write tests.
|
Exercise 7.
Implement block allocation and inode initialization. Write block_alloc in lib/file.c and add O_CREAT support to open .
testfile should now pass the file_write create
and file_read after file_write create tests.
|
Exercise 8.
Implement block freeing. Complete inode_close ,
devfile_close , and
inode_set_size
in lib/file.c and add O_TRUNC support to open .
testfile should now pass the final fsck
test.
|
Challenge!
The JOS file system locking protocol depends on the fact that
different inodes are given different locks. If a lock covered
more than one struct Inode , we would risk deadlock:
two processes executing path_walk concurrently could
create a circular wait. Fix this, and support smaller inodes,
by writing a more generic lock server.
|
Challenge!
The buffer cache cannot recover if an environment dies while
holding a lock. Implement a revocation protocol that allows it to
reclaim locks. The simplest protocol would simply check whether an
environment had died while holding a lock, and revoke the lock if so.
A more complex protocol might explicitly revoke locks from environments
after some amount of time (a technique related to leases).
|
Challenge!
Change the lib/file.c locking protocol to use read locks when possible.
Make sure that your locking protocol prevents race conditions
and avoids deadlock. Some operations will still
require write locks---maybe more than you'd first expect.
For example, the exclusive locks obtained during read
operations also protect the file descriptor f_offset
field from race conditions on concurrent updates---each
read call updates the f_offset
independently. If you switch
read to use read locks, you'll need to protect
f_offset a different way.
|
Challenge!
Change the lib/file.c locking protocol to use read-copy-update when possible.
|
Challenge!
The block cache has no eviction policy. Once a block is read,
it never gets removed and will remain in memory
forever. Add eviction to the buffer cache and
lib/file.c . The buffer cache
cannot evict a page that is still mapped by another environment,
but it should be able to evict any other page. You may need to
add additional IPC calls to allow environments to suggest pages
to evict. In those environments, page table
"accessed" bits, which the hardware sets on any
access to a page, can track approximate usage of disk blocks
without the need to modify every place in the code that accesses
the disk map region. Be careful with dirty blocks.
|
Challenge!
The file system code uses synchronous writes
to keep the file system fairly consistent in the event of a crash.
Implement soft updates or journaling instead.
|
Challenge!
Implement an XN-like system to protect disk blocks from inappropriate
updates.
|
Challenge!
Add support to the file server and the client-side code
for files greater than 4MB in size.
|
Challenge!
Add file system interface functions
to create hard links and subdirectories.
|
Challenge!
Change the file system design to support more than one
file descriptor per page.
|
Challenge!
Implement the file system in a microkernel-like design.
This will require major IPC changes as well as buffer cache
changes.
|
Part B: Spawning Processes from the File System
In this exercise, you'll extend spawn from Lab 4
to load program images from the file system as well as
from kernel binary images.
If spawn is passed a binary name like "/ls" that
begins with a slash, it will read the program data from disk;
otherwise, it will read the program data from the kernel.
Luckily, this requires just a couple of changes.
Exercise 9.
Change your spawn in lib/spawn.c
to open and read from a file,
rather than looking up a kernel binary with sys_program_lookup ,
if the first character of progname is a slash '/' .
Also close the file descriptor before exiting.
See the new "LAB 5 EXERCISE " comment.
Use make run-icode to test your code. This program spawns
off init , which should print some messages like this:
icode: close /motd
icode: spawn /init
[00001001] new env 00001002
init: running
init: data seems okay
init: bss seems okay
init: args: 'init' 'initarg1' 'initarg2'
init: exiting
[00001002] exiting gracefully
[00001002] free env 00001002
icode: exiting
[00001001] exiting gracefully
[00001001] free env 00001001
|
Sharing pages between environments
We would like to share file descriptor state across
fork and spawn, but file descriptor state is kept
in user-space memory. Right now, on fork, the memory
will be marked copy-on-write,
so the state will be duplicated rather than shared.
(This means that running "(date; ls) >file" will
not work properly, because even though date updates its own file offset,
ls will not see the change.)
On spawn, the memory will be
left behind, not copied at all. (Effectively, the spawned environment
starts with no open file descriptors.)
We will change both fork and spawn to know that
certain regions of memory are used by the "library operating system" and
should always be shared. Rather than hard-code a list of regions somewhere,
we will set an otherwise-unused bit in the page table entries (just like
we did with the PTE_COW bit in fork).
We have defined a new PTE_SHARE bit in inc/lib.h.
If a page table entry has this bit set, then by convention,
the PTE should be copied directly from parent to child
in both fork and spawn.
Note that this is different from marking it copy-on-write:
as described at the start of this section,
we want to make sure to share
updates to the page.
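The convention can be expressed as a pure function from a parent's PTE to the permissions the child's mapping should get. This is only a model of the rule, not the lab's duppage: PTE_P, PTE_W, and PTE_U are the standard x86 bits, but the values given here for PTE_SHARE and PTE_COW are assumptions (both must simply be software-available PTE bits), and child_perm is a hypothetical name.

```c
#include <stdint.h>

#define PTE_P     0x001u   /* present */
#define PTE_W     0x002u   /* writable */
#define PTE_U     0x004u   /* user-accessible */
#define PTE_SHARE 0x400u   /* assumed value: a software-available bit */
#define PTE_COW   0x800u   /* assumed value: a software-available bit */

/* Permissions for the child's mapping of a page the parent maps with `pte`. */
static uint32_t child_perm(uint32_t pte)
{
    uint32_t perm = pte & (PTE_P | PTE_W | PTE_U | PTE_SHARE | PTE_COW);
    if (perm & PTE_SHARE)
        return perm;                        /* shared: same page, same permissions */
    if (perm & (PTE_W | PTE_COW))
        return (perm & ~PTE_W) | PTE_COW;   /* private: copy-on-write */
    return perm;                            /* read-only: safe to map as is */
}
```

The PTE_SHARE test comes first so that a writable shared page stays writable in the child instead of being downgraded to copy-on-write.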
Exercise 10.
Change your duppage code in lib/fork.c to follow
the new convention. If the page table entry has the PTE_SHARE
bit set, just copy the mapping directly, regardless of whether it is
marked writable or copy-on-write.
(This could be a one-line change, depending on your current code!)
|
Exercise 11.
Change spawn in lib/spawn.c to propagate
PTE_SHARE pages. After it finishes
setting up the child virtual address space but before it marks the
child runnable, it should call copy_shared_pages ,
which loops through all the page table entries in the current process,
copying any mappings that have the PTE_SHARE bit set.
You'll just need to modify spawn so that it calls copy_shared_pages
(a one-line change). Make sure that you copy the shared pages
very near the end of the function, after closing the file descriptor corresponding to the ELF binary! (Why?)
|
Use make run-testpteshare to check that your code is
behaving properly.
You should see lines that say "fork handles PTE_SHARE right"
and "spawn handles PTE_SHARE right".
Use make run-testfdsharing to check that file descriptors are shared
properly.
You should see lines that say "read in child succeeded" and
"read in parent succeeded".
Part C: A Shell
In this part of the lab, you'll extend JOS to handle everything
necessary to support a shell.
We've done a lot of the work for you,
but you must (1) make it possible to share file descriptors across
environments, (2) clean up a couple loose ends, and
(3) implement file redirection in the shell.
Before going further, enable keyboard interrupt handling in your
kernel.
Exercise 12. Change
trap in kern/trap.c to call
kbd_intr() every time interrupt number
IRQ_OFFSET+1 occurs. (This should be a three-line
change.)
|
At this point, you can use make run-initsh
to boot into
the current version of the shell, which can already do simple commands like
"ls". As you progress through the lab, the shell will become more
functional, and you will be able to do things like add redirections.
Pipes
Pipes and the console are both I/O stream interfaces.
This means that they support reading and/or writing,
but not file positions.
Like Unix, JOS represents these streams using file descriptors.
To support this, the file descriptor subsystem
uses a simple virtual file system layer,
implemented by struct Dev,
so that disk files, console files, and pipes all implement the same file
descriptor functions.
A pipe is a shared data buffer accessed via two file descriptors, one
for writing data into the pipe and one for reading data out of it.
Unix command lines like "ls | sort" use pipes. The shell
creates a pipe, hooks up ls's standard output to the write end
of the pipe, and hooks up sort's standard input to the read
end of the pipe. As a result, ls's output is processed by sort.
You may want to read the pipe manual page for background, and the pipe
section of Dennis Ritchie's UNIX history paper for interesting
history.
In Unix-like designs, each pipe's shared data buffer is stored in the
kernel. Of course, this is not how we implement pipes on an exokernel!
Your library operating system represents a pipe, including its shared
buffer, by a single struct Pipe.
The struct Pipe is stored on its own page to make sharing
easier, and mapped into the file mapping area of both the reading and the
writing file descriptor.
Here's the structure:
#define PIPEBUFSIZ 32
struct Pipe {
    off_t p_rpos;               // read position
    off_t p_wpos;               // write position
    uint8_t p_buf[PIPEBUFSIZ];  // shared buffer
};
This is a simple lock-free queue structure. The pipe starts in this
state:
 p_rpos = 0 ---+
 p_wpos = 0 ---|+
               VV
              +---+---+---+---+---+---+---+- ... -+---+---+---+---+
       p_buf: |   |   |   |   |   |   |   |       |   |   |   |   |
              +---+---+---+---+---+---+---+- ... -+---+---+---+---+
                0   1   2   3   4   5   6          28  29  30  31
The bytes written to the pipe can be thought of as numbered starting
from 0. The write position p_wpos gives the number of the
next byte that will be written, and the read position p_rpos
gives the number of the next byte to be read. After a writer writes "abc"
to the pipe, it will enter this state:
 p_rpos = 0 ---+
 p_wpos = 3 ---|-----------+
               V           V
              +---+---+---+---+---+---+---+- ... -+---+---+---+---+
       p_buf: | a | b | c |   |   |   |   |       |   |   |   |   |
              +---+---+---+---+---+---+---+- ... -+---+---+---+---+
                0   1   2   3   4   5   6          28  29  30  31
Since p_rpos != p_wpos, the pipe contains data. The next
read from the pipe can return up to 3 characters. For example, after
a read() of one byte:
 p_rpos = 1 -------+
 p_wpos = 3 -------|-------+
                   V       V
              +---+---+---+---+---+---+---+- ... -+---+---+---+---+
       p_buf: |   | b | c |   |   |   |   |       |   |   |   |   |
              +---+---+---+---+---+---+---+- ... -+---+---+---+---+
                0   1   2   3   4   5   6          28  29  30  31
This data structure is safe for concurrent updates as long as there is a
single reader and a single writer, since only the reader updates
p_rpos and only the writer updates p_wpos.
Since the pipe buffer is not infinite, byte i
is stored in
pipe buffer index i % PIPEBUFSIZ
. Thus, after a couple reads
and writes, the pipe might enter this state:
 p_rpos = 30 ----------------------------------------------+
 p_wpos = 33 ------+                                       |
                   V                                       V
              +---+---+---+---+---+---+---+- ... -+---+---+---+---+
       p_buf: | $ |   |   |   |   |   |   |       |   |   | ! | @ |
              +---+---+---+---+---+---+---+- ... -+---+---+---+---+
                0   1   2   3   4   5   6          28  29  30  31
Note that byte 32 was stored in slot 0.
If p_rpos == p_wpos, the pipe is empty. Any read
call should yield until a writer adds information to the
pipe. Similarly, if p_wpos - p_rpos == PIPEBUFSIZ, the pipe
is full. Any write call should yield until a reader opens up
some space in the pipe.
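The empty and full tests, and the i % PIPEBUFSIZ indexing, can be sketched
in plain C outside of JOS. This is only an illustration under stated
assumptions, not the lab's lib/pipe.c: pipe_try_write and pipe_try_read are
hypothetical helper names, pipe_off_t stands in for off_t, and where the
real code would yield and retry, these helpers simply return a short count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PIPEBUFSIZ 32

typedef long pipe_off_t;            /* assumption: stands in for JOS's off_t */

struct Pipe {
    pipe_off_t p_rpos;              /* number of the next byte to be read */
    pipe_off_t p_wpos;              /* number of the next byte to be written */
    uint8_t p_buf[PIPEBUFSIZ];      /* shared buffer */
};

/* Hypothetical helper: copy up to n bytes into the pipe, stopping when it
 * is full. Returns the number of bytes actually written. */
static size_t pipe_try_write(struct Pipe *p, const uint8_t *src, size_t n)
{
    size_t i = 0;
    while (i < n && p->p_wpos - p->p_rpos < PIPEBUFSIZ) {
        p->p_buf[p->p_wpos % PIPEBUFSIZ] = src[i++];
        p->p_wpos++;                /* only the writer advances p_wpos */
    }
    assert(p->p_wpos - p->p_rpos <= PIPEBUFSIZ);
    return i;
}

/* Hypothetical helper: copy up to n bytes out of the pipe, stopping when it
 * is empty. Returns the number of bytes actually read. */
static size_t pipe_try_read(struct Pipe *p, uint8_t *dst, size_t n)
{
    size_t i = 0;
    while (i < n && p->p_rpos != p->p_wpos) {
        dst[i++] = p->p_buf[p->p_rpos % PIPEBUFSIZ];
        p->p_rpos++;                /* only the reader advances p_rpos */
    }
    return i;
}
```

Writing 30 bytes, reading them all back, and then writing "!@$" reproduces
the wrapped state in the last diagram: byte 32 ('$') lands in slot 0.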
Closed Pipes
There is a catch -- maybe we are trying to read from an empty pipe but
all the writers have exited. Then there is no chance that there will ever
be more data in the pipe, so waiting is futile. In such a case, Unix
signals end-of-file by returning 0. So will we. To detect that there are no
writers left, we could put reader and writer counts into the pipe structure
and update them every time we fork or spawn and every time an environment
exits. This is fragile -- what if the environment doesn't exit cleanly?
Instead we can use the kernel's page reference counts, which are guaranteed
to be accurate.
Recall that the kernel page structures are mapped read-only in user
environments. The library function pageref(void *ptr) returns
the number of page table references to the page containing the virtual
address ptr. It works by first examining vpt[] to find ptr's
physical address, then looking up the relevant struct Page in the
UPAGES array and returning its pp_ref field. So, for example, if fd
is a pointer to a particular struct Fd, pageref(fd)
will tell us how many different references there are to that structure.
Three pages are allocated for each pipe: the struct Fd for
the reading file descriptor rfd, the struct Fd for the writing file
descriptor wfd, and the struct Pipe p shared by both. The struct Pipe
page is mapped once per file descriptor reference. Thus, the following
equation holds: pageref(rfd) + pageref(wfd) = pageref(p). A
reader can check whether there are any writers left by examining these
counts. If pageref(p) == pageref(rfd), then
pageref(wfd) == 0, and there are no more writers. A writer can
check for readers in the same manner.
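The counting argument can be checked with ordinary integers. The sketch
below is a simulation, not JOS code: struct PipeRefs and the two helper
functions are hypothetical names, and the three fields stand in for the
values that pageref(rfd), pageref(wfd), and pageref(p) would return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of a pipe's three page reference counts. */
struct PipeRefs {
    int rfd_refs;                   /* stands in for pageref(rfd) */
    int wfd_refs;                   /* stands in for pageref(wfd) */
    int pipe_refs;                  /* stands in for pageref(p) */
};

/* A reader's view: the pipe page has no references beyond the read ends. */
static bool no_writers_left(const struct PipeRefs *r)
{
    assert(r->rfd_refs + r->wfd_refs == r->pipe_refs);
    return r->pipe_refs == r->rfd_refs;
}

/* A writer's view: the pipe page has no references beyond the write ends. */
static bool no_readers_left(const struct PipeRefs *r)
{
    assert(r->rfd_refs + r->wfd_refs == r->pipe_refs);
    return r->pipe_refs == r->wfd_refs;
}
```

For instance, right after a fork both environments hold both ends, so the
counts are 2, 2, and 4 and neither test fires. Once every writer has closed
its end the counts might be 1, 0, and 1, and no_writers_left reports true.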
Exercise 13. Implement pipes in lib/pipe.c. We've included the code for
reading from a pipe for you. You must write the code for writing to a
pipe, and the code for testing whether a pipe is closed. Run
make run-testpipe to check your work; you should see a line
"pipe tests passed".
Pipe Races
File descriptor structures use shared memory that is written
concurrently by multiple processes. That creepy shiver that just ran
up your back is justified: this kind of situation is rife with race
conditions. We've made one race
condition, concerning pipes, particularly easy to run into.
The race is that the two calls to pageref() in _pipeisclosed
might not happen atomically. If another process
duplicates or closes the file descriptor page between the two calls, the
comparison will be meaningless. To make it concrete, suppose that we
run:
pipe(p);
if (fork() == 0) {
	close(p[1]);
	read(p[0], buf, sizeof(buf));
} else {
	close(p[0]);
	write(p[1], msg, strlen(msg));
}
The following might happen:
- The child runs first after the fork. It closes p[1] and then tries
  to read from p[0]. The pipe is empty, so read checks to see whether
  the pipe is closed before yielding. Inside _pipeisclosed, pageref(fd)
  returns 2 (both the parent and the child have p[0] open), but then
  a clock interrupt happens.
- Now the kernel chooses to run the parent for a little while.
  The parent closes p[0] and writes msg into the pipe. msg is very
  long, so the write yields halfway to let a reader (the child)
  empty the pipe.
- Back in the child, _pipeisclosed continues. It calls pageref(p),
  which returns 2 (the child has a reference associated with p[0],
  and the parent has a reference associated with p[1]).
  The counts match, so _pipeisclosed reports that the pipe is closed.
  Oops.
Run "make run-testpiperace2" to see this race
in action. You should see "RACE: pipe appears closed"
when the race occurs.
This race isn't that hard to fix. Comparing the counts can only be
incorrect if another environment ran between when we looked up the
first count and when we looked up the second count. In other words,
we need to make sure that _pipeisclosed executes
atomically. Since it doesn't change any variables, we can simply
rerun it until it runs without being interrupted; the code is so
short that it will usually not be interrupted.
But how can we tell whether our environment has been interrupted?
In the uniprocessor JOS kernel, this can be
simple: just check the env_runs variable in our
environment structure. Each time the kernel runs an environment, it
increments that environment's env_runs. Thus, user code
can record env->env_runs, do its computation, and
then look at env->env_runs again. If env_runs didn't change, then
the environment was not interrupted. Conversely, if env_runs did
change, then the environment was interrupted.
Exercise 14. Change _pipeisclosed to repeat the
check until it completes without interruption.
Print "pipe race avoided\n" when you notice an interrupt
and the check would have returned 1 (erroneously indicating
that the pipe was closed).
Run "make run-testpiperace2" to check whether
the race still happens. If it's gone, you should not see
"RACE: pipe appears closed", and you should see
"race didn't happen". You should also see plenty of
your "avoided" messages, indicating places where the
race would have happened if you weren't being so careful.
Challenge! Write a test program
that demonstrates one of the other races, such as a race
between multiple readers of a single pipe.
Challenge! Fix all these races!
The shell itself
Run make run-initsh. This will run your kernel starting user/initsh,
which sets up the console as file descriptors 0 and 1 (standard input and
standard output), then spawns sh, the shell.
Run ls and cat lorem.
Exercise 15. The shell can only run simple commands. It has no
redirection or pipes. It is your job to add these. Flesh out user/sh.c.
Once your shell is working, you should be able to run the following
commands:
echo hello world | cat
cat lorem >out
cat out
cat lorem |num
cat lorem |num |num |num |num |num
lsfd
cat script
sh <script
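Before a shell can set up redirections and pipes it has to split a line
such as those above into pipeline stages. The following is one possible
sketch in portable C, not the structure of user/sh.c: struct Stage,
MAXARGS, and parse_line are hypothetical names, the line is modified in
place, and quoting is not handled. An empty line or a dangling operator is
reported as an error.

```c
#include <assert.h>
#include <string.h>

#define MAXARGS 16

/* Hypothetical parsed form of one pipeline stage. */
struct Stage {
    char *argv[MAXARGS + 1];        /* NULL-terminated argument list */
    int argc;
    char *in_file;                  /* target of '<', or NULL */
    char *out_file;                 /* target of '>', or NULL */
};

/* Split line in place. Returns the number of stages, or -1 on error. */
static int parse_line(char *line, struct Stage *st, int max_stages)
{
    int n = 0;
    char *s = line;
    char **target = NULL;           /* where a pending redirect name goes */

    if (max_stages < 1)
        return -1;
    memset(&st[0], 0, sizeof st[0]);
    for (;;) {
        /* Overwriting separators with NUL terminates the previous word. */
        while (*s == ' ' || *s == '\t' || *s == '\n')
            *s++ = '\0';
        if (*s == '\0')
            break;
        if (*s == '<' || *s == '>' || *s == '|') {
            char op = *s;
            *s++ = '\0';
            if (target != NULL)
                return -1;          /* operator where a file name belongs */
            if (op == '|') {
                if (st[n].argc == 0 || n + 1 >= max_stages)
                    return -1;
                n++;
                memset(&st[n], 0, sizeof st[n]);
            } else {
                target = (op == '<') ? &st[n].in_file : &st[n].out_file;
            }
            continue;
        }
        char *word = s;
        while (*s != '\0' && strchr(" \t\n<>|", *s) == NULL)
            s++;
        if (target != NULL) {
            *target = word;
            target = NULL;
        } else {
            if (st[n].argc == MAXARGS)
                return -1;
            st[n].argv[st[n].argc++] = word;
        }
    }
    if (target != NULL || st[n].argc == 0)
        return -1;
    return n + 1;
}
```

Each resulting stage would then be spawned with its input and output
descriptors arranged according to in_file, out_file, and its position in
the pipeline.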
Note that the user library routine printf prints straight
to the console, without using the file descriptor code. This is great
for debugging but not great for piping into other programs.
To print output to a particular file descriptor (for example, 1, standard
output), use fprintf(1, "...", ...). See user/ls.c for examples.
Run make run-testshell to test your shell.
Testshell simply feeds the above commands (also found in
fs/testshell.sh) into the shell and then checks that the
output matches fs/testshell.key.
Challenge! Add more features to the shell.
Some possibilities include:
- backgrounding commands (ls &)
- multiple commands per line (ls; echo hi)
- command grouping ((ls; echo hi) | cat > out)
- environment variable expansion (echo $hello)
- quoting (echo "a | b")
- command-line history and/or editing
- tab completion
- directories, cd, and a PATH for command-lookup
- file creation
- Ctrl-C to kill the running environment
but feel free to do something not on this list. Be creative.
This completes the lab!