CS 111 - More on File Systems
Authors: Andre Encarnacao, Ignacio Zendejas, and Jordan Saxonberg
Date: November 22, 2005
Topics:
1. File System Organization
2. File System Consistency

File System Organization
Fig. 1 - Disk setup
How do applications name all of these resources? By using a naming scheme. Operating systems take different approaches to defining these naming schemes. Here are two examples:
1. DOS uses a very familiar naming scheme. It uses letters to name
drives, where each one of these drives is associated with a particular device
and can possibly have its own file system. This sort of naming scheme is the
equivalent of a forest structure of different file systems, as seen in
figure 2 below.
The advantage, of course, is that the hardware is named explicitly, making the different devices easy to distinguish. However, it has some disadvantages:
- The names are for the system's benefit, not the user's
- The namespace is limited (up to 26 drives, one per letter)
Fig. 2 - DOS Naming Scheme
2. Linux uses a different naming scheme, in which devices are made accessible by mounting their file systems. The Linux operating system has a single root directory with its own file system, which gives the namespace a tree structure (as opposed to the forest structure of DOS). Please see figure 3 below for a visualization of this. File systems can be mounted anywhere in this tree, and all mounted file systems are accessible from the root of the tree. A file system is mounted using the mount command, which essentially overrides a portion of the existing Linux directory hierarchy (i.e., namespace) by attaching the whole file system onto a specified directory inode (the mount point). Typically, the mount point is an empty directory, but this is not required. If one were to mount a file system onto a non-empty directory, all the data previously in that directory would still be on disk, just no longer visible to the user. Figure 4 below shows how we have been mounting our file system for Lab 3.
Advantages of mounting:
- It's a user-oriented naming scheme
- It makes sense to be able to mount at any level (i.e., replace inodes at any level), and that's exactly what a mount enables one to do

Disadvantages of mounting:
- The hardware is less obvious, since devices can be located anywhere in the tree structure
Fig. 3 - Linux Naming Scheme
Even though it's possible to have many file systems
mounted in this tree structure, each individual file system only knows about
itself and doesn't have to keep any information about other file systems on the
machine. It should also be noted that these mounted file systems are not
persistent, meaning that a reboot will lose all previously mounted file systems
(except for the base file system). This isn't too bad though because the
machine can be configured to mount file systems on startup.
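Configured or not, the set of currently mounted file systems lives in a kernel table, which Linux exposes through /proc. Here is a minimal sketch (assuming a Linux machine; the helper name is ours) that lists each mounted device and its mount point:

```python
# List current mounts by parsing /proc/self/mounts (Linux-specific).
# Each line has the form: <device> <mount point> <fs type> <options> <dump> <pass>

def list_mounts(path="/proc/self/mounts"):
    mounts = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            device, mount_point, fs_type = fields[0], fields[1], fields[2]
            mounts.append((device, mount_point, fs_type))
    return mounts

for device, mount_point, fs_type in list_mounts():
    print(f"{device} on {mount_point} type {fs_type}")
```

Every entry printed here was created by some mount operation, either at boot (from the startup configuration) or by hand; a reboot rebuilds the table from scratch.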
Here's how we mounted OSPFS (our lab 3 file system):
Fig. 4 - OSPFS (Lab 3 File System) Mounting
When can the kernel throw away (i.e., garbage collect) a kernel structure (for example, a kernel inode)?

The kernel can garbage collect an inode when it is unreferenced. That is, when all of the following conditions are met:
- No files associated with the inode are open
- None of its subdirectories are open
- None of the file's components (e.g., the directories along its path) are open

This is part of the open-file guarantee that both Windows and Unix systems provide. When you open a file, the system checks for the file's existence and its permission settings. Moreover, every program has a current working directory, and that directory is itself an opened file. Consider the case below:

test -> subdir -> Netscape

(You can't throw away subdir or test while Netscape is running.)
Now consider the case where you're updating Firefox on your Linux machine. When you update, /usr/bin/firefox gets overwritten with a newer version. Imagine there were no preservation of open files (i.e., files could be modified while they are open or running), and we updated Firefox while an instance of the program was running. To understand what may happen, recall that when we run a program, we load its binary code into main memory. For huge programs like Firefox, we demand-page the binary in through the buffer cache. As we learned a few lectures ago, demand paging means pages are loaded into memory only when needed. Figure 5 below shows this scenario.
Fig.5 - Firefox references modified binary
As we can see in the figure immediately above, Firefox is updated on disk while we are still running an instance of the old version. Since we are using demand paging, we may reference a page that is not currently in memory and need to bring it in from disk. Here's the problem: the Firefox binary has been modified since we just updated it, so we will most likely fetch the wrong part of the code, since our address references come from the old version of the binary. This will almost surely crash the program. The only ways to prevent this are either to keep the old binary code around until the program is closed (allowing us to correctly reference the old binary even though Firefox has been updated) or to make sure that Firefox has been closed prior to updating. The latter is a much better idea, and it is ensured by the open-file guarantee.
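The Unix half of this guarantee is easy to observe: deleting a file's name does not destroy its inode while a descriptor still references it. A small sketch (the temporary file and its contents are our own invention):

```python
# Demonstrate the open-file guarantee on Unix: unlinking a file removes
# its name, but the inode (and its data) survives until the last open
# descriptor is closed.
import os
import tempfile

dirpath = tempfile.mkdtemp()
path = os.path.join(dirpath, "old_binary")
with open(path, "w") as f:
    f.write("version 1")

fd = os.open(path, os.O_RDONLY)   # a live reference to the inode
os.unlink(path)                   # the name is gone...
assert not os.path.exists(path)

data = os.read(fd, 100)           # ...but the data is still readable
os.close(fd)
print(data)                       # b'version 1'
```

This is exactly the "keep the old binary around until the program is closed" option: as long as the running program holds a reference, the old inode cannot be garbage collected.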
Symbolic versus Hard Links
A hard link is a directory entry that refers to a particular file inode.
Each file inode has at least one hard link; otherwise the file would be unreachable by name. We usually say that a particular file is hard-linked if multiple directory entries in the file system refer to that file's inode. This means that the file is referenced by more than one directory (and therefore by more than one name). The problem here is that we cannot allow hard links to directories, since these could create circular links and would violate the tree-structured design of the file hierarchy, as discussed earlier. Such circular links would also complicate garbage collection.
How, then, is it possible for two directories to refer to just
one directory as in the example below?
We need to add a layer of indirection to permit the situation above and
create a different type of link that does not change file system semantics.
This new type of link is a symbolic link. A symbolic link is a
special file type whose data contains a filename. This filename is
essentially a "soft link" to another file.
Whenever the kernel encounters a symbolic-link file type, it reads the filename in the data part of the file and continues name resolution at that filename. It's as if the kernel traverses the namespace for the user. Symbolic links are nice in that they are links at the namespace level rather than at the block level (like hard links). Figure 6 below shows what the symbolic link would look like for the example above.
Fig. 6 - Symbolic Links
Overview: Hard links link file system OBJECTS, whereas symbolic links link file system NAMES.
Figure 7 below illustrates the differences between hard links and symbolic
links. For the hard link part, we assume that filenames a and c are hard
linked to the same inode: A. When we delete filename a, inode A and file A
do not get deleted since all we do is decrease the hard link count in that inode
from 2 to 1. If we were to create a new file for filename a, then both the
filenames (a and c) would reference two different file inodes (and
therefore two different files). For the symbolic link part, we assume that
filename c contains a symbolic link to filename a. When we delete filename
a, inode A and file A get deleted as well and now the symbolic link is no longer
valid and any reference to filename c will cause an error. This is the
so-called dangling-link problem, where a symbolic link doesn't have any
control over what happens with the file it's referencing. The file it
points to can be deleted or its filename can change without the symbolic link
file knowing what has happened. If we were to create a new file for
filename a, then both the filenames (a and c) would still reference the same
file inode and file (the one that was just created).
Fig. 7 - Hard Links vs. Symbolic Links
Advantages of symbolic links:
- Can link to directories without creating circular links (unlike hard links)
- Can link across file systems (unlike hard links)

Disadvantages of symbolic links:
- The dangling-link problem can occur
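Both behaviors from figure 7 can be observed directly with the POSIX calls Python exposes; the file names here are made up for illustration:

```python
# Hard link vs. symbolic link: a hard link is a second directory entry
# for the same inode; a symlink is a separate file whose data is a name.
import os
import tempfile

d = tempfile.mkdtemp()
a = os.path.join(d, "a")
c_hard = os.path.join(d, "c_hard")
c_soft = os.path.join(d, "c_soft")

with open(a, "w") as f:
    f.write("file A")

os.link(a, c_hard)      # hard link: bumps the inode's link count
os.symlink(a, c_soft)   # symlink: its data is just the name of 'a'

assert os.stat(a).st_ino == os.stat(c_hard).st_ino   # same inode
assert os.stat(a).st_nlink == 2                      # two names, one file

os.unlink(a)            # delete name 'a': the hard link keeps file A alive...
with open(c_hard) as f:
    assert f.read() == "file A"

# ...but the symlink now dangles: following it fails.
dangling = False
try:
    with open(c_soft) as f:
        f.read()
except FileNotFoundError:
    dangling = True
print("hard link survives, symlink dangles:", dangling)
```

Note that deleting `a` only decrements the inode's link count from 2 to 1, exactly as described above, while the symlink is left pointing at a name that no longer resolves.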
File System Correctness/Consistency
File System Correctness Problem
Fig. 8 - File system
In the file system displayed in figure 8 above, imagine unlinking file /a. Here is the process we go through:
1. Mark the data blocks as free
2. Mark the inode as free
3. Mark the directory entry as unused

If the power shuts off after the first step above, then we have failed to completely unlink file /a. The file still exists, but its data blocks 10 and 11 are now marked as free, meaning that the file system may use these data blocks to store data for a newly created file (for example, file B).
This is a problem! The same problem can occur if file /a had an indirect block that was marked free before the power shut off. That indirect block could then be used as a data block for a newly created file, creating an even worse problem than the one just described. We have just made our file system inconsistent, and we have to find a way to prevent this.
Objective: A set of invariants that will ensure that the file system is in a consistent state. These are the invariants:
1. No object is used for more than one purpose (no block belongs to more than one file, as in the case described above).
2. Every referenced block is marked allocated (no block is marked both free and used, as in the case described above).
3. Every unreferenced block is marked free (otherwise we get a disk leak, though that is less serious than violating the previous two invariants).
4. Every referenced object (inode, block, et al.) has been initialized (for example, if you point to an indirect block without having written to it, you may follow garbage pointers and encounter the case described above).

If we maintain all of these invariants, we guarantee that the file system is in a consistent state.
There are various ways for a file system to ensure consistency and, therefore, correctness. We will look at three: a file system check, careful ordering, and journaling.

1. File System Check (fsck in Unix) - works by verifying the invariants listed above
- Runs after an unclean shutdown
- Traverses every directory and inode to cross-check against other inodes and directories, as well as the freemap
- When it finds a problem, it fixes it. For example, for the problem with files A and B in figure 8, we could fix it by performing any one of these actions:
- Delete file A
- Delete file B
- Copy the double-used block so we have 2 separate copies (but keep in mind that we can't preserve the correct data for both files A and B)
- Disadvantage: File System Check is just a
"band-aid" to place the file system in a consistent state. It has
to guess the correct state of the file system and it often guesses
wrong. This is problematic.
- Disadvantage: VERY SLOW!!!
2. Careful Ordering (Synchronous Metadata)
Order all updates to the file system structure in such a manner that only the third invariant may be violated (the invariant stating that every unreferenced block is marked free). This invariant is not as serious as the others, since violating it can only cause disk leaks, not corruption. For example, when unlinking a file, if we perform the steps listed earlier in the notes in the opposite order (mark the directory entry unused first, and mark the data blocks free last), then we can ensure that only the third invariant can be violated.
- Disadvantage: This method requires synchronous writes to disk, so we cannot take advantage of many of the nice efficiency-increasing techniques we've previously discussed. This includes use of the buffer cache (for page swapping), page pre-fetching, and the disk scheduling algorithms. This is all because writes to the disk NEED to occur in a certain order so that we can ensure that only the third invariant may be violated.

3. Journaling - File System Transactions
- The main idea is to keep a log of all disk transactions (each consisting of a few disk operations), and to write to the actual file system structures only when that transaction commits (the commit is the atomic point at which the transaction becomes persistent). Then, whenever the system reboots, we can replay the log to make sure that every committed transaction was actually performed. If a recorded transaction was not completed because of a system crash, we can go ahead and complete it.
- This log/journal is usually kept in a special section of the file
system (see figure 8 above) and is available for viewing on Linux
machines at /.journal. Note: typical journals are about 2 MB in size.
- Here are the steps taken when a file system is performing some operations that constitute a "transaction":
- Write the transaction contents to the journal (see figure 9
below for a sample journal transaction for deleting a file)
- Perform the operations of the transaction to the actual file system
- Mark the transaction as done by writing commit to the journal (this means that
the transaction has been committed to the file system)
- When the system reboots, we run through the
journal and make sure that every committed transaction was actually
performed and is on disk. Once a transaction has been checked, it
can be marked as "released"
This method is slow because we need to write both the journal and the disk, but the advantage is that journal transactions are arranged contiguously, which means fewer seeks when we are verifying that transactions were actually performed.
- We have two options for what sort of data to store in the journal:
- Write just the operation name (this is just a handful of bytes per operation)
- Write the entire block that was
modified. This is actually necessary since writing blocks is
not an atomic operation in the hardware. What if the system
crashes while the hardware is midway through writing a block into
disk? We need to be able to recover from this and storing the
entire block in our journal is a way to do this. This approach
makes non-atomic hardware appear atomic! The ext3 file system
in Linux uses this approach.
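The commit-and-replay idea can be sketched in a few lines. The record format here is invented, and (as just noted) real journals like ext3's log whole modified blocks rather than operation records:

```python
# Minimal write-ahead journaling sketch: a transaction's operations are
# logged first, a COMMIT record makes it durable, and replay after a
# "crash" applies only committed transactions.

journal = []   # in a real FS this lives in a reserved area of the disk
disk = {}      # block number -> contents

def run_transaction(ops):
    """Log a whole transaction (list of (block, data) writes), then commit."""
    journal.append(("TXN", ops))
    journal.append(("COMMIT",))

def replay():
    """On reboot: apply every transaction that is followed by a COMMIT."""
    i = 0
    while i < len(journal):
        if (journal[i][0] == "TXN"
                and i + 1 < len(journal)
                and journal[i + 1][0] == "COMMIT"):
            for block, data in journal[i][1]:
                disk[block] = data   # idempotent: safe to redo after a crash
            i += 2
        else:
            i += 1                   # uncommitted tail: ignore it

# One committed transaction, then a crash mid-way through logging another
# (its COMMIT record never made it to the journal).
run_transaction([(10, "free"), (11, "free")])
journal.append(("TXN", [(12, "free")]))   # crashed before COMMIT was written
replay()
print(disk)   # only the committed transaction's effects appear
```

Because the uncommitted tail is simply ignored, a crash at any point leaves the "disk" reflecting a whole number of transactions, which is exactly the atomicity the commit point provides.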
Fig.9 - Journaling Transaction