This is not the current version of the class.

# Problem set 4: Disk and buffer cache

## Handout

Our handout code offers:

• A file system definition (chickadeefs.hh, described here).
• A class chickadeefs::journalreplayer that knows how to replay journal records.
• A class ahcistate (in k-devices.hh) that interfaces to disks that speak the modern SATA standard over Intel’s AHCI. An important aspect of this standard is NCQ support, which allows the operating system to post multiple disk requests in parallel, and allows the disk to service those requests in the fastest order it can.
• A class bufcache (in k-chkfs.hh and k-chkfs.cc) that implements a buffer cache.
• A class chkfsstate (in k-chkfs.hh and k-chkfs.cc) that accesses the sata_disk’s Chickadee file system via the buffer cache.
• A class chkfs_fileiter (in k-chkfsiter.hh) that iterates over the blocks in a ChickadeeFS file (somewhat like how vmiter iterates over the pages in an x86-64 page table).
• A function chickadeefs_read_file_data (defined in k-chkfs.cc) that uses bufcache, chkfsstate, and chkfs_fileiter to read disk files.
• A system call implementation of a new system call, sys_readdiskfile, that calls chickadeefs_read_file_data.
• A program p-readdiskfile.cc that uses sys_readdiskfile to read the contents of a disk file.

Before you do anything else, uncomment this line in init_hardware:

    /* sata_disk = ahcistate::find(); */


Then give make run-readdiskfile a try!

You will see that reading disk files is woefully slow. This is because Chickadee’s buffer cache and disk programming interfaces are woefully underpowered. In this problem set, you will fix the buffer cache and ahcistate; hook the file system up to your VFS from problem set 3; and add write support.

## A. Buffer cache

In this part, you’ll make a real buffer cache capable of speeding up read requests. The current buffer cache only contains blocks that are currently being used: as soon as a block becomes unreferenced, it is evicted. Your buffer cache should not evict immediately, and it should also contain blocks that will likely be used in the future.

To implement a good buffer cache, you will need an eviction policy, which decides which (unreferenced) block to evict when the cache is full but a new block should be added. LRU (least recently used) is a good first choice, but there are other good choices, such as policies that treat different kinds of blocks differently. (For example, it’s probably a good idea to always keep the superblock in memory—or to introduce a separate variable, such as a chkfsstate member variable, that holds the superblock’s data. And it might be a good idea to treat inode blocks differently from data blocks.)

You will also want a prefetching policy, which decides which blocks to read in advance of their being needed.

You will also want to raise the number of entries in the buffer cache, although your buffer cache should work for any number of entries greater than or equal to 10.

Your buffer cache must obey the invariant that only unreferenced blocks can be evicted. That is, any block with bufentry::ref_ > 0 must not be evicted.

Before working too much on the code, you should specify your policies in a short text document and talk over your policies with a TF.

Test your work first with make run-readdiskfile, which reads a very short file one byte at a time, and with make run-wcdiskfile, which reads longer files 125 bytes at a time. You can also run the readdiskfile and wcdiskfile programs with arguments after make run-sh.

### Hints

Prefetching will require more code surgery than eviction. You will need to change the ahcistate::read_or_write function to use more than just command slot 0. You also don’t want a process to block just because its kernel task is prefetching data!

You could implement a separate kernel task that just does prefetching. But there are other possibilities. Consider, for example, adding a volatile int fetch_status_ to bufentry, and adding a new function to ahcistate with the following signature:

// returns true iff command was queued
bool read_or_write_nonblocking(idecommand cmd, void* buf, size_t sz,
size_t off, volatile int* status);


You can change the existing read_or_write function to call read_or_write_nonblocking.

The simplest Chickadee prefetcher will fetch at most 32 blocks at a time, one per NCQ slot on the disk. Most real operating systems instead implement a software I/O queue capable of storing many hundreds of I/O requests. As soon as the disk satisfies one request, the interrupt handler pops another I/O request off the I/O queue and adds it to the disk hardware queue. Various algorithms are used to schedule these software requests, subject to requirements such as fairness (one process’s prefetch requests should not starve another process’s requests) and priority (write requests might be more important than read requests). Think about how you would implement such a queue in Chickadee; and if you’re feeling ambitious, implement one!

## B. VFS integration

In this part, you’ll hook up the Chickadee file system to your VFS for reads. This takes what the VFS you developed in the last problem set and applies it to a real on-disk file system.

Chickadeefs inodes have a built-in synchronization plan that you should understand. This plan involves the buffer cache and two inode fields, mlock and mref, that are only meaningful in memory.

The chkfsstate::get_inode and put_inode functions manipulate buffer cache reference counts to ensure that the returned inode remains in memory. That means that callers must call put_inode to “free” an inode when done with it.

The mlock field is a read/write lock for the inode’s contents. The lock can be held for either reading or writing. While the read lock is held, the inode’s contents cannot change; the write lock is required to modify the corresponding file (e.g., to add blocks to it or modify its contents). mlock should not be held for long; it should be acquired immediately before a kernel read or write operation, and released immediately after. Do not, for example, call inode::lock_write() simply because a process opened the corresponding file for writing. mlock is manipulated by inode::lock_read, inode::lock_write, inode::unlock_read, and inode::unlock_write.

The mref field is available for your use. For instance, you can use mref to store the address of the in-memory vnode corresponding to the inode. (mref is only 32 bits wide, so you might store the address with ino->mref = uint32_t(ka2pa(vnode)) and extract the address with something like pa2ka<vnode_type*>(ino->mref).) Or you can use it as a reference count, indicating that some operating system task or structure holds a reference to the inode and that the inode should not be deleted or reused. (You’d add ++ino->mref to chkfsstate::get_inode and --ino->mref to chkfsstate::put_inode).

Both fields are cleared to 0 when an inode block is read.

### execv

First, change your execv implementation to use the disk file system, rather than memfiles. Use make run-execallocexit—or make run-sh, followed by some commands like echo foo—to test your work.

We’ve made some changes to the process loader interface to make this easier. You’ll likely define a new derived class of proc::loader; your derived class will call chkfsstate::get_data_block and bufcache::put_block.

### VFS

Next, add a new VFS type for Chickadee file system files. You will add a new derived type of vnode. This vnode will hold a pointer to a chickadeefs::inode, and will use chkfs_fileiter and/or chkfsstate functions to read from it. Closing the vnode will put_inode the corresponding inode.

### open

Finally, change your open implementation to read the Chickadee file system rather than memfiles.

Use make run-sh, and then cat thoreau.txt, to check your work. (The thoreau.txt file is included only on the disk file system.)

## C. Simple writes and writeback

In the next parts, you’ll extend your Chickadee file system VFS to support writes. We have divided write support into multiple subparts, each of which should take several days. You don’t need to worry too much about file system crash safety yet.

### Writes

First, support opening disk files for writing (i.e., the OF_WRITE flag without OF_TRUNC or OF_CREAT), and support writing to such files.

You’ll need to change your VFS and the buffer cache to make this work. However, you don’t need to actually write to disk yet. In the first stage, writes will simply modify buffer cache contents.

In this stage, you need not support writes that grow or shrink files. There is thus no need to write inodes or to allocate or free disk blocks.

The initial part of make run-testwritefs tests this functionality.

### Writeback

Next, support writing data back to the disk via the sys_sync system call, which is implemented by bufcache::sync().

Change the bufcache::sync() function to write all “dirty buffers” (i.e., buffer cache entries whose data is modified relative to the disk version) back to disk, at the proper locations.

Use make cleanfs run-testwritefs to test this functionality.

The cleanfs command rebuilds the disk image, which matters because testwritefs assumes it starts with a clean disk image.

### Writeback hints

Basics: You’ll need to track which buffers are dirty, meaning modified in memory relative to the on-disk version. For example, you could add a flag bufentry::f_dirty that is set when data is modified and cleared when modifications are written back to disk. Your eviction policy from Part A must not evict dirty buffers.

Synchronization: The writeback process must write an internally consistent version of each dirty buffer, meaning that buffer contents must not be modified or freed while the buffer is in flight to the disk. This requires both delaying writeback until concurrent writes complete, and delaying new concurrent writes and evictions until writeback completes.

A simple synchronization method that works is a per-bufentry write reference count that can be either zero or one. Consider completing these functions:

// Obtain a write reference on e. Blocks until there are no other write
// references, then obtains a reference. (Can also be a good place to
// maintain dirty flags.)
void bufcache::get_write(bufentry* e);

// Release the write reference on e obtained by a prior call to get_write(e).
void bufcache::put_write(bufentry* e);


Your VFS will obtain a write reference before modifying a block, and release that reference afterward. (The VFS should not hold write references long term.) Your bufcache::sync function will likewise obtain write references. You may also want to add one or more wait queues to facilitate efficient blocking.

Note that a write reference should not prevent other kernel tasks from holding read references to the entry.

If you want to use a different synchronization plan than get/put_write, you will need to change chkfs_fileiter.

Dirty lists: A first attempt at a writeback procedure might look like this. (You would additionally need locking to protect access to the buffer cache and/or bufentries.)

for each bufentry in the buffer cache {
if (bufentry is dirty) {
get write reference;
write bufentry to disk, blocking until write is complete;
clear f_dirty flag and put write reference;
}
}


But walking the whole buffer cache takes O(bufcache::ne) time, even if very little cached data is dirty. It’s faster to maintain a linked list of dirty bufentry structures. The writeback procedure would look something like this (locking still elided):

while (bufentry* e = dirty_list_.pop_front()) {
get write reference;
write bufentry to disk, blocking until write is complete;
clear f_dirty flag and put write reference;
}


But what if other processes keep writing to new blocks while the writeback procedure is running? The writeback procedure might never reach the end of the dirty_list_! A simple swapping trick avoids the problem:

list<...> mydirty;
mydirty.swap(dirty_list_);
// Now dirty_list_ is empty; mydirty contains its previous contents!
// Any newly-dirty blocks will be put onto dirty_list_; mydirty, a
// local variable, will not grow.
while (bufentry* e = mydirty.pop_front()) {
// ...
}


This trick might also let you release a long-term lock on the buffer cache during the sync process.

Parallel writes: All of these loops end up writing blocks one at a time. That’s OK, but can you instead use all available SATA slots for writes? That will be faster!

## D. File sizes and extension

In this section, you’ll support making files bigger and smaller and seeking within files. Use make cleanfs run-testwritefs2 to test.

### OF_TRUNC

Implement the OF_TRUNC flag to sys_open. When OF_TRUNC and OF_WRITE are both given, the file size should be set to zero.

This step does not actually require that you free any data blocks! A ChickadeeFS inode can have more or fewer data blocks allocated than its size would indicate (link). But the tests will ensure that you mark modified inode blocks as dirty.

### sys_lseek

Implement the sys_lseek system call, which changes a file’s position and returns the new position.

Chickadee’s sys_lseek should behave generally like Linux’s lseek. sys_lseek(fd, off, LSEEK_SET) sets the file position to off; sys_lseek(fd, off, LSEEK_CUR) should adjust the file position by off; and sys_lseek(fd, off, LSEEK_END) should set the file position to the end of the file plus off. In all these cases, the system call returns the new file position. Chickadee also supports sys_lseek(fd, off, LSEEK_SIZE), which returns the file’s size without changing the position (off is ignored). Return E_SPIPE for unseekable files, such as pipes and the keyboard/console.

Unlike Linux, you only need to support seeks within a file for now.

### File extension

Finally, implement file extension. Writing additional data to the end of file should make the file bigger. This is a big deal!

Use make cleanfs run-testwritefs2 fsck to test. When testwritefs2 completes successfully, press q to quit QEMU; then Chickadee’s file system checker should run. You should see something like this in the terminal:

kohler@ubuntu:~/chickadee\$ make cleanfs run-testwritefs2 fsck
* Run gdb -x build/chickadee.gdb to connect gdb to qemu.
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
warning: TCG doesn't support requested feature: CPUID.01H:ECX.vmx [bit 5]
unknown keycodes (unnamed)', please report to qemu-devel@nongnu.org
* File system OK


### File extension hints

• Complete the chkfsstate::allocate_block() function so it allocates a free block. This function will load the free block bitmap into the buffer cache, find a bit that is 1 (indicating a free block), set that bit to 0, and return the corresponding block number.

Although ChickadeeFS can support more than one FBB block, one FBB block suffices to handle file systems with up to 215 blocks (128 MiB). It’s OK to not handle larger file systems.

• Your buffer cache may fill up with dirty blocks. One way to handle this case: make bufcache::get_disk_entry release its locks, call sync, and then try again if the buffer cache has any unreferenced dirty entries.

## E. Creating files

Add support for the OF_CREATE flag to sys_open. When specified in combination with OF_WRITE, this flag should create a file of the supplied name if no such file exists.

To create a file, you’ll need to:

• Find an empty direntry in the directory. This may require extending the size of the directory, which may require allocating one or more blocks (just as in extending files).
• Allocate an inode.
• Store the inode number and filename in the direntry.
• Mark the inode as allocated (set its type, size, and nlink count).

Use make cleanfs run-testwritefs3 fsck to test your work. You should also now be able to do cool things like create new files from your shell.

## F. Optional features

Want to extend your file system further? There’s a lot of stuff you can add if you want and have time.

• Add support for the sys_unlink system call, which removes a file. Use make cleanfs run-testwritefs4 fsck to test your work.
• Add support for the sys_rename system call, which changes a file’s name.
• Add subdirectory support: sys_mkdir and sys_rmdir system calls, and support for moving files between directories.
• Add support for special kinds of file, such as /dev/null` or named pipes.