A filesystem consists of two data structures: the directory tree and the free list. These must be kept in sync; we maintain the invariant that every sector on the disk appears either in the free list or in the directory structure, but not both, and that it appears there exactly once (the exception being inodes, which should appear as many times as their reference counts indicate).
To maintain this invariant, block allocation and deallocation will usually require at least two steps: one to modify the free list, one to modify the directory structure.
However, a power failure can occur between these steps; because the disk is persistent, the half-finished state survives the reboot and leaves the filesystem inconsistent:
if a block is removed from the free list but not added to the file system, it is "orphaned": it can never be used.
if a block is placed in the file system but not removed from the free list, it can be reallocated to a second file; writes to either file will corrupt the other.
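A minimal in-memory sketch of the two-step protocol may make the failure windows concrete. The structures and names here are made up for illustration, not any real on-disk format:

```python
# Toy model: a set of free sector numbers and a per-file list of owned sectors.
free_list = {2, 3, 5, 7}
file_blocks = {"readme": [1, 4]}

def allocate_block(filename):
    sector = free_list.pop()              # step 1: take the sector off the free list
    # <-- power failure here: the sector is in neither structure (orphaned)
    file_blocks[filename].append(sector)  # step 2: record it in the file
    return sector

def free_block(filename, sector):
    file_blocks[filename].remove(sector)  # step 1: remove it from the file
    # <-- power failure here: orphaned again
    free_list.add(sector)                 # step 2: put it back on the free list

# Doing the steps in the other order is no better: a crash in between would leave
# the sector in *both* structures, so it could later be handed to a second file.
allocate_block("readme")
```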
Normally we would maintain invariants by breaking them only inside a critical section: for example, we could acquire a lock before modifying the free list and the file system, and only release it afterwards. In this context, that is difficult: to implement the lock, we would need to force the universe to acquire it before yanking out our power cord.
We could try to come up with a complex protocol ensuring that no matter when we are interrupted, we are in a consistent state. However, we may be thwarted because the disk is allowed to reorder writes. If the correctness of a protocol relies on the order in which writes take place, we must sync the disk between the writes: wait for the disk to acknowledge that the first write has been stored before beginning the second. This is an expensive operation.
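As a concrete illustration of "sync between the writes" using POSIX-style calls (the image name, offsets, and block contents below are invented), each fsync blocks until the disk has acknowledged the preceding write:

```python
import os

BITMAP_OFFSET = 4096            # invented locations inside a filesystem image
INODE_OFFSET = 8192
bitmap_block = b"\x00" * 512    # placeholder contents for the two updates
inode_block = b"\x01" * 512

fd = os.open("fs.img", os.O_RDWR | os.O_CREAT)
os.pwrite(fd, bitmap_block, BITMAP_OFFSET)  # write 1: update the free-list bitmap
os.fsync(fd)                                # wait until the disk acknowledges write 1...
os.pwrite(fd, inode_block, INODE_OFFSET)    # ...only then issue write 2: update the inode
os.fsync(fd)
os.close(fd)
```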
One solution is to use an uninterruptible power supply (UPS): a battery with a mechanism for raising an interrupt if the power is about to fail. With a UPS, we can avoid starting any writes if we know the power is about to go out.
This is our trick for forcing the universe to take a lock before killing the power.
An alternative solution is to check the filesystem for consistency when we reboot the machine. We can traverse the entire file system and free list to build a table telling us whether each sector occurs in exactly one of the two.
If we detect an orphaned sector, we can simply add it to the free list. A sector that appears twice in the directory structure can be repaired by duplicating it (so each file gets its own copy) or by simply removing it (some systems actually move such blocks into a special "lost-and-found" directory, allowing the user to examine them and recover them manually if necessary).
In Unix, the fsck tool (named for what it does: filesystem check, and also for what you say when you have to run it) is used to check and recover a filesystem.
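A toy version of this cross-check, reusing the in-memory structures from the earlier sketch (it ignores the special treatment of inode reference counts):

```python
def check(free_list, files, total_sectors):
    """Toy fsck pass: count how many times each sector is claimed."""
    claims = {}
    for s in free_list:
        claims[s] = claims.get(s, 0) + 1
    for blocks in files.values():
        for s in blocks:
            claims[s] = claims.get(s, 0) + 1
    orphaned = [s for s in range(total_sectors) if claims.get(s, 0) == 0]
    duplicated = [s for s, n in claims.items() if n > 1]
    return orphaned, duplicated

# Sector 8 is claimed by nothing (orphaned); sector 4 is claimed by two files.
print(check({2, 3, 5, 7}, {"a": [1, 4], "b": [0, 4, 6, 9]}, total_sectors=10))
# -> ([8], [4])
```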
As disks (and thus file systems) got larger, the process of traversing an entire filesystem became prohibitively expensive. To solve this, we can use journaling:
before an operation begins, write an entry to the journal (a special section at the beginning of the partition) indicating the intent to perform the operation. Sync the disk.
perform the operation. Sync the disk.
at some point in the future, mark the operation completed.
When recovering, only the blocks that are part of incomplete operations in the journal need to be inspected for inconsistency.
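A minimal sketch of this protocol (the journal format, file name, and helper functions are invented for illustration):

```python
import json
import os

JOURNAL = "journal.log"

def sync_append(path, line):
    with open(path, "a") as f:
        f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())          # don't continue until the disk has it

def do_operation(op_id, apply_fn):
    sync_append(JOURNAL, json.dumps({"op": op_id, "state": "intent"}))  # 1. record intent, sync
    apply_fn()                        # 2. perform the operation (and sync the blocks it touches)
    sync_append(JOURNAL, json.dumps({"op": op_id, "state": "done"}))    # 3. later, mark it completed

def recover():
    """On reboot, an intent with no matching 'done' record may be half-applied."""
    intents, done = set(), set()
    if os.path.exists(JOURNAL):
        with open(JOURNAL) as f:
            for line in f:
                rec = json.loads(line)
                (done if rec["state"] == "done" else intents).add(rec["op"])
    return intents - done             # only these operations need a consistency check
```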
A log-structured file system (LFS) takes a completely different approach to managing a filesystem.
The central idea behind LFS is that blocks are never modified; whenever an operation conceptually modifies a file, the operation instead places a new block at the end of the log. Writes always go to the end of the disk.
For example, suppose I wished to change the first byte of a file. I would create a new copy of the direct block containing that byte, and place it at the end of the log. Since the address of that block has now changed, I would also create a copy of the inode for the file.
This might require me to create a new copy of any directory containing that file (since the address of the inode has changed). However, this is difficult and expensive (consider hard links), so instead we add an extra layer of indirection: directory entries in an LFS contain the inode numbers of the files they refer to, rather than the disk addresses of the inodes themselves.
These inode numbers are then looked up in a global inode map, which maps inode numbers to the current location of the inode.
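A toy append-only model of these ideas (a Python list stands in for the disk; the names and record formats are invented):

```python
log = []        # the log: an append-only list standing in for the disk
inode_map = {}  # inode number -> log address of the newest copy of that inode

def append(block):
    log.append(block)
    return len(log) - 1                       # a block's "disk address" is its log position

def write_file(inum, data):
    """Conceptually modify a file by appending fresh copies of its blocks."""
    data_addr = append({"kind": "data", "text": data})
    inode_addr = append({"kind": "inode", "inum": inum, "direct": [data_addr]})
    inode_map[inum] = inode_addr              # directories hold inum, so they need no rewrite

def read_file(inum):
    inode = log[inode_map[inum]]
    return "".join(log[addr]["text"] for addr in inode["direct"])

write_file(7, "hello")
write_file(7, "jello")      # "changing the first byte" = a new data block plus a new inode
print(read_file(7))         # 'jello'; the old copies remain in the log
```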
The disk is divided up into large segments. Each segment contains a large number (~1000) of data blocks and inodes, as well as the most recent copy of the inode map. To keep track of the current segment of the filesystem (i.e. the end of the log), a designated "superblock" at the beginning of the filesystem contains a reference to the most recently written segment.
Periodically (or when the current segment is full) the current segment is written to disk, and the head segment number in the superblock is updated to move to the next segment. This is referred to as a checkpoint; once the superblock has been written, the filesystem now reflects everything that happened before the checkpoint.
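A sketch of the checkpoint step, assuming the toy structures above. Here a directory of ordinary files stands in for on-disk segments, and the superblock update is made atomic with a rename; this is only one plausible way to order the writes, not the real LFS layout:

```python
import json
import os

def checkpoint(seg_no, segment_blocks, inode_map, root="lfs"):
    """Toy checkpoint: make the segment durable first, then repoint the superblock."""
    os.makedirs(root, exist_ok=True)
    with open(f"{root}/segment_{seg_no}", "w") as f:
        json.dump({"blocks": segment_blocks, "inode_map": inode_map}, f)
        f.flush()
        os.fsync(f.fileno())          # the segment (and its inode map) must hit the disk first...
    with open(f"{root}/superblock.tmp", "w") as f:
        json.dump({"head_segment": seg_no}, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(f"{root}/superblock.tmp", f"{root}/superblock")  # ...then the head pointer moves
```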
Note that without garbage collection / compaction (described in the next lecture) the entire state of the file system at any checkpoint in the past can be recovered by simply changing the head reference to that checkpoint. A log-structured file system preserves history, which is a nice feature.
The downside of storing the entire history is that it can easily fill up the disk with old versions. To clean up the unused segments, the filesystem can periodically run compaction on the tail of the log.
To compact a segment, you examine each block (data block or inode) in the segment. By consulting the current inode map you can determine whether each block is the latest version. If it is, you copy it to the head of the log (along with a new copy of its inode and of the inode map), exactly as you would if you were overwriting it with new data. Once you have done this, you can safely reuse the segment, since every block still stored in it is now obsolete.
In order to determine whether a block is stale, you need to know its identity. This is stored in an additional part of the segment called the segment table. The segment table contains an entry for each block in the segment, which identifies which file (inode number) the block is part of, and which part it is (e.g. "direct block 37", or "the inode").
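Continuing the toy model, here is a sketch of the liveness test the cleaner performs; segment table entries are assumed to record an inode number and a role ("inode" or "data"), as described above:

```python
def is_live(entry, addr, inode_map, log):
    """A block is live only if the current version of its file still refers to this address."""
    inode_addr = inode_map.get(entry["inum"])
    if inode_addr is None:                        # the file has been deleted entirely
        return False
    if entry["role"] == "inode":
        return inode_addr == addr                 # is this still the newest copy of the inode?
    return addr in log[inode_addr]["direct"]      # does the newest inode still point at this block?

def clean_segment(seg_start, segment_table, inode_map, log):
    """Return the addresses that must be copied to the head of the log before the
    segment can be reused; everything else in the segment is already stale."""
    return [seg_start + off
            for off, entry in enumerate(segment_table)
            if is_live(entry, seg_start + off, inode_map, log)]
```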
This design has some notable consequences:
data can be lost if it has been written but not yet checkpointed. This can be mitigated by decreasing the time between checkpoints, or by allowing applications to explicitly wait for the next checkpoint before proceeding.
most reads are absorbed by cache; writes always append to the log, so they are sequential and very fast.
blocks are located on disk in exactly (or almost exactly) the order in which they were last written. Even if reads miss cache, they will have good locality if the order in which files are read mimics the order in which they are written.
LFS is good for flash memory (solid-state disks, or SSDs): each flash cell can survive only a limited number of writes, but LFS naturally spreads writes evenly across all segments, which levels out the wear.
SSDs also erase and rewrite data in large blocks; writing whole segments at a time fits this characteristic very well.