By Adriana Magaña, Brian Nguyen, Jesen Kwok Ha
There are three key invariants to consider in the design and use of file systems:
For most modern file systems, if these invariants are broken, the file system will become inconsistent: that is, broken, and will eventually violate what the user expects. (For instance, writing data into one file might change a completely different file as well.)
But following these invariants naively makes the file system implementation quite slow, because it strongly constrains the order in which the OS can write blocks to disk; these constraints leave the OS less free to choose a fast disk schedule. And if we don't guarantee the invariants, we risk corruption, or we need to run fsck on startup, which takes forever. So can we maintain the invariants without constraining scheduling?
Soft Updates maintains these invariants by doing the following:
1. Never point to a structure before it has been initialized (e.g., a directory entry must not reference an inode until that inode has been written to disk).
2. Never reuse a resource before nullifying all previous pointers to it (e.g., a data block must not be reallocated until the old inode's pointer to it has been cleared on disk).
3. Never reset the last pointer to a live resource before a new pointer has been set (e.g., when renaming, don't remove the old directory entry until the new one is on disk).
Follow these rules and the invariants will hold: cool! The neat aspect of soft updates is how it follows the rules without constraining scheduling order, through rollbacks and roll-forwards.
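To see rule 1 in action, here's a minimal sketch of creating a file in the required order. It models two disk blocks as ordinary files (inode_block and dir_block, both names made up); real soft updates tracks dependencies in the buffer cache and rolls writes back rather than forcing them with fsync.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Rule 1, step 1: initialize the new inode "block" first... */
        int ifd = open("inode_block", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (ifd < 0) { perror("open"); return 1; }
        const char *inode = "inode 12: size=0, blocks=[]";
        write(ifd, inode, strlen(inode));
        fsync(ifd);   /* ...and wait until it has actually reached disk */
        close(ifd);

        /* Step 2: only now write the directory entry that points to it.
         * A crash before this write merely leaks inode 12 (recoverable);
         * a crash after it leaves a fully consistent tree. */
        int dfd = open("dir_block", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (dfd < 0) { perror("open"); return 1; }
        const char *dent = "newfile -> inode 12";
        write(dfd, dent, strlen(dent));
        fsync(dfd);
        close(dfd);
        return 0;
    }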
But there are other efficient ways to achieve consistency too.
In a journaling file system, all changes are written sequentially to a log. The log is usually kept in a separate section of the file system. The log is written before the actual changes are written to the "real" data structures. Now there's no need to carefully put disk writes in order (like soft updates does); file system consistency is maintained by finishing transactions that were recorded in the log but that weren’t completed because of some type of failure like a system crash. Any unfinished log entries can be ignored; since we write the log before the real disk structures, if we crashed during a write to the log, then the real disk structures are safe.
Here are the steps taken by a Journaling File System when about to perform some FS operation:
1. Write a record describing the intended changes to the journal.
2. Wait for those journal writes to reach disk, then write a "commit" record to the journal.
3. Apply the changes to the real file system structures.
4. Once the real writes complete, mark the transaction finished so its journal space can be reclaimed.
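A minimal sketch of this protocol, with ordinary files named journal and data standing in for the log region and the real disk structures (the record format is invented for illustration):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    static void must_write(int fd, const char *s) {
        if (write(fd, s, strlen(s)) != (ssize_t)strlen(s)) {
            perror("write"); exit(1);
        }
    }

    int main(void) {
        const char *update = "set block 7 = hello";

        /* Steps 1-2: record the intended change in the journal and force
         * it to disk, then commit, BEFORE touching the real structures. */
        int jfd = open("journal", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (jfd < 0) { perror("open journal"); exit(1); }
        must_write(jfd, "BEGIN ");
        must_write(jfd, update);
        must_write(jfd, "\n");
        fsync(jfd);                 /* the journal record is durable... */
        must_write(jfd, "COMMIT\n");
        fsync(jfd);                 /* ...and now so is the commit record */

        /* Step 3: apply the change to the "real" data. A crash anywhere
         * after the commit is recoverable by replaying the journal. */
        int dfd = open("data", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (dfd < 0) { perror("open data"); exit(1); }
        must_write(dfd, update);
        fsync(dfd);

        /* Step 4: the transaction is checkpointed; its journal space
         * could now be reclaimed. */
        close(jfd);
        close(dfd);
        return 0;
    }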
This type of FS can be slower because it has to write to both the file system and the journal. On the other hand, the journal is written sequentially, so the FS will achieve good locality.
What should we write into the journal? There are two possibilities: full data journaling and metadata journaling. In full data journaling, everything is written to the journal, including data blocks. This provides a strong guarantee: No matter when the computer crashes, any write will either succeed or it will be like the write never happened. This is stronger than soft updates, which just says that the disk's METADATA will be consistent. However, it is quite expensive. Metadata journaling provides a weaker, soft-updates-like guarantee: No matter when the computer crashes, the file system metadata will be easy to make consistent. But it is a lot faster.
Aspect      | Journaling file systems
------------|------------------------------------------------------------
Performance | slower by 2-3 times
Robustness  | full data journaling: very reliable; metadata journaling: only guarantees consistency
Overhead    | Linux ext3: 32MB/disk; Linux JFS: 0.4%
Journaling file systems arose out of an idea called log-structured file systems, where the entire file system data structure is just a journal!
OK, so say that a file system needs to allocate a block for a file (because the file got bigger). How should we do it?
Why not choose any available block at random? This will work OK, but it won't help keep blocks of the same file together. That is, locality of reference will no longer imply proximity on disk. There will be a lot of seeks, so performance will be bad.
The BSD operating systems' Fast File System (FFS) allocates blocks using cylinder groups, an idea that has made its way into other FSes. The disk is divided into several contiguous groups of cylinders. Each cylinder group has its own free block bitmap. We try to allocate the block in the same cylinder group as the other blocks in the file.
But that's not all. Does it make sense to let huge files take up entire cylinder groups?
No! For instance, imagine a directory d that's in cylinder group 1. d has two files, crap.mp3 and hello.txt. Suddenly crap.mp3 grows in size to gigantic proportions and takes up the entire cylinder group. What happens if hello.txt grows even by a tiny bit? Well, we have to go to another cylinder group, causing a seek in the middle of a 2-block file! It would be far better to cause the seek in the middle of the crap.mp3 file, which, because it is big, will already take a long time to read.
Thus, FFS tries to spread large files out across the disk! When a file goes above 40KB, and then at every megabyte thereafter, FFS switches cylinder groups. This generally leaves cylinder groups enough free space to work with.
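Here's a toy sketch of that placement rule; the function name is made up, and real FFS also weighs per-group free-block counts rather than hopping blindly:

    #include <stdint.h>
    #include <stdio.h>

    #define KB 1024u
    #define MB (1024u * KB)

    /* Choose a cylinder group for the next block of a file whose current
     * size is `size` bytes and whose inode lives in group `inode_group`. */
    unsigned pick_cylinder_group(uint64_t size, unsigned inode_group,
                                 unsigned ngroups) {
        if (size < 40 * KB)
            return inode_group;   /* small file: stay near the inode */
        /* Big file: hop to a new group at 40KB and each megabyte after,
         * so no single file exhausts one group's free blocks. */
        unsigned hop = 1 + (unsigned)((size - 40 * KB) / MB);
        return (inode_group + hop) % ngroups;
    }

    int main(void) {
        /* A 100KB file whose inode lives in group 1, out of 10 groups: */
        printf("%u\n", pick_cylinder_group(100 * KB, 1, 10));  /* prints 2 */
        return 0;
    }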
Say you have a directory entry for the following file:
/home/Lei/thesis.doc
You now want to rename the file by using:
mv /home/Lei/thesis.doc /home/Lei/thesis.txt
Just change the Filename field in the directory entry: Filename=thesis.txt.
However, what happens if we want to move the file to a different directory [mv /home/Lei/thesis.doc /shared/thesis.doc]? We can create a duplicate directory entry that points to /shared/ as the directory it belongs in. There is a problem with this, though: it is a multi-step operation!
There are 2 steps involved in copying/renaming a file between two different directories with directory entries: (1) create the new directory entry in the target directory, and (2) delete the old directory entry.
What happens if this operation crashes after step 1?
This will violate invariant 3, "No block belongs to more than 1 file"! It looks like two different files share blocks with one another.
Unix solves this issue with the concept of an inode, which separates metadata from directory entries. Now a directory entry can be thought of as just a name plus an inode pointer. The following diagram shows what may happen during a rename to a new directory.
If, during a rename to a new directory, the system crashes after step 1 (copying the directory entry), the result is two directory entries pointing to the same inode. This does not break the invariants, though, since the data blocks are still only pointed to by one file (one inode).
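In code, the two-step move looks like the following, using the real link(2) and unlink(2) system calls on the paths from the example (a modern mv would use the atomic rename(2); the two-step version makes the crash window visible):

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Step 1: create the new directory entry (second name, same inode). */
        if (link("/home/Lei/thesis.doc", "/shared/thesis.doc") != 0) {
            perror("link"); return 1;
        }
        /* A crash HERE leaves two names for one inode: still consistent. */

        /* Step 2: delete the old directory entry. */
        if (unlink("/home/Lei/thesis.doc") != 0) {
            perror("unlink"); return 1;
        }
        return 0;
    }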
This capability, of having multiple links to a file, is explicitly supported by Unix. We can add multiple links to an inode. These are called "hard links"; we'll see next time why. Any writes made to the file using one of its names can be read using another name. And "deleting" a file -- removing a link from a directory to an inode -- doesn't remove the data from disk until all the links are removed. This is why the Unix system call for removing a file is called unlink and not delete.
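A small demonstration, with made-up file names:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("original", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }
        write(fd, "hello\n", 6);
        close(fd);

        link("original", "alias");   /* second directory entry, same inode */
        unlink("original");          /* removes a name, NOT the data */

        /* The contents are still reachable through the remaining link. */
        char buf[16];
        fd = open("alias", O_RDONLY);
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0)
            write(STDOUT_FILENO, buf, (size_t)n);   /* prints "hello" */
        close(fd);
        unlink("alias");   /* link count hits 0: now the data is freed */
        return 0;
    }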
A file system must be mounted before it can be available to processes on the system, similar to how a file must be opened before it is used.
When the operating system mounts a file system, it performs the following:
Typically a mount point is an empty directory. For example, in UNIX a file system with user U’s home directory can be mounted on the empty directory /home/U. We can then access a user’s directory via /home/<user name>. If we mounted the file system under /users instead, we would have to access the user’s directory via /users/<user name>.
The OS asks the device driver to read the device directory and verifies that the directory has the expected format.
From now on, any accesses to the empty directory will "jump" automatically into the mounted file system. The mounted file system can also "jump" out, by following parent-directory links.
The following figure shows what a partition looks like after it is mounted:
DOS uses drive letters. The file system is not a tree, but a forest: it has many roots. Mounts are restricted: When you mount a file system, it turns into a new drive letter.
The UNIX file system is tree-structured: there is a single root. All mounted file systems are accessible from the root, and mounts can appear anywhere in the tree.
- attaches file systems as branches to the root file system (can be mounted on any directory)
i. sets a flag in the inode for that directory which indicates that the directory is a mount point. A field points to an entry in the mount table, indicating which device is mounted there.
ii. Allows OS to traverse directory structure and switch among file systems as appropriate.
- system can be configured to mount file systems on startup
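For concreteness, here is roughly what a mount looks like from a program on Linux, using the mount(2) system call; the device name, mount point, and FS type are just examples, and this requires root:

    #include <stdio.h>
    #include <sys/mount.h>

    int main(void) {
        /* Attach the ext3 file system on /dev/sdb1 at the (presumably
         * empty) directory /home/U. Afterward, path lookups that reach
         * /home/U jump into the root of the mounted file system. */
        if (mount("/dev/sdb1", "/home/U", "ext3", 0, NULL) != 0) {
            perror("mount");
            return 1;
        }
        return 0;
    }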
When there are several processes requesting disk access (e.g. for a read or write), in what order should they be served? Disk scheduling algorithms determine the order in which disk requests are processed, which can be a crucial factor in performance. The objective is to minimize the seek time for and between requests.
First-Come-First-Served (FCFS) :
- Simply processes disk access requests in the order they arrive.
- Problem: Can be very slow. Requests for widely separated regions of the disk may cause the read/write head to swing back and forth, causing long seek times. For uniformly random requests, the read/write head seeks across about 1/3 of the disk on average per request. We could obtain better throughput if we processed groups of requests for areas of the disk that are adjacent or close to each other: if proximity on disk, rather than arrival order, influenced the processing order.
Shortest-Seek-Time-First (SSTF), the disk analogue of Shortest Job First (SJF):
- Chooses the pending request nearest the current head position, i.e. the one with the shortest seek time, first.
- Problem: Can cause starvation. A request for a part of the disk far away from the current position of the disk head may never get served.
[C]SCAN: (elevator scheduler) fairly good
- Seek across the disk in one direction and serve all jobs along that path. Once the far end is reached, service the requests in the reverse direction. This minimizes zig-zagging, so there’s less seek time and therefore better performance. SCAN searches back and forth, like an elevator that goes up as far as it can, then down as far as it can. It provides unfair service, since blocks in the middle are more frequently serviced than blocks on the ends. C-SCAN, or circular scan, searches only in one direction, and is fair. A toy comparison of all three policies follows.
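The request queue and starting head position below are arbitrary example numbers (cylinder addresses), not anything from these notes; the point is only the relative totals.

    #include <stdio.h>

    #define N 8

    static int dist(int a, int b) { return a > b ? a - b : b - a; }

    /* FCFS: serve requests in arrival order. */
    static int fcfs(const int *req, int head) {
        int total = 0;
        for (int i = 0; i < N; i++) { total += dist(head, req[i]); head = req[i]; }
        return total;
    }

    /* SSTF: always serve the pending request nearest the head. */
    static int sstf(const int *req, int head) {
        int done[N] = {0}, total = 0;
        for (int k = 0; k < N; k++) {
            int best = -1;
            for (int i = 0; i < N; i++)
                if (!done[i] && (best < 0 || dist(head, req[i]) < dist(head, req[best])))
                    best = i;
            total += dist(head, req[best]);
            head = req[best];
            done[best] = 1;
        }
        return total;
    }

    /* SCAN: sweep up past the highest request, then reverse (elevator).
     * Only the turning points matter for total head movement. */
    static int scan(const int *req, int head) {
        int top = head, bottom = head;
        for (int i = 0; i < N; i++) {
            if (req[i] > top) top = req[i];
            if (req[i] < bottom) bottom = req[i];
        }
        int up = top - head;                              /* first leg */
        int down = (bottom < head) ? top - bottom : 0;    /* reverse leg */
        return up + down;
    }

    int main(void) {
        int req[N] = {98, 183, 37, 122, 14, 124, 65, 67};
        printf("FCFS total head movement: %d\n", fcfs(req, 53));  /* 640 */
        printf("SSTF total head movement: %d\n", sstf(req, 53));  /* 236 */
        printf("SCAN total head movement: %d\n", scan(req, 53));  /* 299 */
        return 0;
    }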
A RAID is an architecture that employs multiple disk drives in a scheme that aims to maximize performance and/or minimize data loss. There are several types of RAID architectures, each with advantages and disadvantages with respect to one another.
The 2 extreme cases are described below:
RAID 0 (a.k.a. Striping)
In RAID 0, the different blocks of a file are spread out across multiple (N) disk drives.
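One common mapping (assuming a stripe unit of one block; real arrays often use larger units) sends logical block b to disk b mod N, so consecutive blocks land on different disks and a large sequential read keeps all N spindles busy:

    #include <stdio.h>

    #define NDISKS 4   /* N = 4 drives, chosen arbitrarily */

    int main(void) {
        for (unsigned b = 0; b < 8; b++)
            printf("logical block %u -> disk %u, block %u on that disk\n",
                   b, b % NDISKS, b / NDISKS);
        return 0;
    }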
RAID 1 (a.k.a. Mirroring)
In RAID 1, each write goes to all disk drives in the array. Therefore, every disk drive’s contents are identical, and there are N copies of the same data (where N = the number of disk drives in the array).
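A sketch of the mirrored write path; the function and parameter names are mine, not any real RAID implementation's. Reads, by contrast, can be served by any single copy.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NDISKS 2

    /* Write one block to every drive in the mirror; fds[] holds an open
     * file descriptor per drive. Returns 0 on success, -1 on any failure
     * (a real array would mark the failing drive dead and continue). */
    int raid1_write(int fds[], off_t off, const void *buf, size_t len) {
        for (int i = 0; i < NDISKS; i++)
            if (pwrite(fds[i], buf, len, off) != (ssize_t)len)
                return -1;
        return 0;
    }

    int main(void) {
        /* Two ordinary files stand in for the two drives. */
        int fds[NDISKS] = {
            open("disk0", O_RDWR | O_CREAT, 0644),
            open("disk1", O_RDWR | O_CREAT, 0644),
        };
        if (fds[0] < 0 || fds[1] < 0) { perror("open"); return 1; }
        if (raid1_write(fds, 0, "hello\n", 6) != 0) { perror("pwrite"); return 1; }
        /* Both "drives" now hold identical contents. */
        return 0;
    }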