A filesystem needs to keep track of the unused blocks so that it can allocate new blocks when creating new files or expanding existing files. There are a number of strategies for keeping track of the free space:
A bitmap: one bit per block indicating whether the block is in use or not. Bitmaps are very compact, so for reasonably sized disks they can be stored in memory. A bitmap also supports contiguous allocation: a linear search can easily find n contiguous free blocks when allocating a large file.
A linked list: each free block contains a pointer to the next free block. Allocating a large number of blocks at once requires reading all of them, which can be inefficient.
A linked list of sets: a linked list of free blocks as above, but the remaining space in each block is used to store pointers to other free blocks. This improves on the plain linked list by allowing many blocks to be allocated at once.
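To make the bitmap strategy concrete, here is a minimal in-memory sketch of the linear search for n contiguous free blocks. The class name BitmapAllocator and its methods are hypothetical, chosen only for illustration; a real filesystem would also have to write the bitmap back to disk.

```python
class BitmapAllocator:
    """Free-space bitmap (illustrative): bit i is 1 when block i is in use."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        # One bit per block, packed eight to a byte; all blocks start free.
        self.bits = bytearray((num_blocks + 7) // 8)

    def in_use(self, block):
        return bool(self.bits[block // 8] & (1 << (block % 8)))

    def _set(self, block, used):
        mask = 1 << (block % 8)
        if used:
            self.bits[block // 8] |= mask
        else:
            self.bits[block // 8] &= ~mask & 0xFF

    def allocate(self, n):
        """Linear search for n contiguous free blocks.

        Returns the first block number of the run, or None if no run exists.
        """
        if n < 1:
            raise ValueError("n must be at least 1")
        run_start, run_len = 0, 0
        for block in range(self.num_blocks):
            if self.in_use(block):
                # The run is broken; restart just past this block.
                run_start, run_len = block + 1, 0
            else:
                run_len += 1
                if run_len == n:
                    for b in range(run_start, run_start + n):
                        self._set(b, True)
                    return run_start
        return None

    def free(self, start, n):
        """Mark n blocks beginning at start as free again."""
        for b in range(start, start + n):
            self._set(b, False)
```

With this layout a 16-block disk needs only two bytes of bitmap, which is why the whole structure can usually be held in memory.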
A filesystem consists of two data structures: the directory tree and the free list. These data structures must be kept in sync; we must maintain the invariant that every sector on the disk appears either in the free list or in the inode structure, but not both. In addition, each sector should appear once (with the exception of inodes, which should appear as many times as their reference counts indicate).
To maintain this invariant, block allocation and deallocation will usually require at least two steps: one to modify the free list, one to modify the directory structure.
However, a power failure can occur between these operations; because disks are persistent, this can leave the filesystem in an inconsistent state:
if a block is removed from the free list but not added to the file system, it is "orphaned": it can never be used.
if a block is placed in the file system but not removed from the free list, it can be reallocated to a second file; writes to either file will corrupt the other.
Normally we would maintain invariants by breaking them only inside a critical section: for example, we could acquire a lock before modifying the free list and the file system, and release it only afterwards. In this context, this is difficult: to implement the lock, we'd need to force the universe to acquire the lock before kicking out our power cord.
We could try to come up with a complex protocol ensuring that no matter when we are interrupted, we are in a consistent state. However, we may be thwarted because the disk is allowed to reorder the writes. If the correctness of a protocol relies on the order in which writes take place, we must sync the disk between the writes: wait for the disk to acknowledge that the first write has been stored before beginning the second. This is an expensive operation.
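From user space, the idea of syncing between two order-dependent writes can be sketched with Python's os.fsync, which asks the operating system to flush a file's buffered data to the storage device. The function name write_in_order is hypothetical, and whether the device itself honors the flush depends on the hardware and its configuration.

```python
import os


def write_in_order(path, first, second):
    """Write `first`, wait for it to be flushed, then write `second`.

    Illustrative only: assumes both payloads are small enough that
    os.write stores them in a single call.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, first)
        os.fsync(fd)  # block until the first write has been pushed out
        os.write(fd, second)
        os.fsync(fd)  # likewise for the second write
    finally:
        os.close(fd)
```

Each sync stalls the caller until the device responds, which is why a protocol built on many ordered writes is slow.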
One solution is to use an uninterruptible power supply (UPS): a battery with a mechanism for raising an interrupt if the power is about to fail. With a UPS, we can avoid starting any writes if we know the power is about to go out.
This is our trick for forcing the universe to take a lock before killing the power.
An alternative solution is to check the filesystem for consistency when we reboot the machine. We can traverse the entire file system and free list to build a table telling us whether each sector occurs in exactly one of the two.
If we detect an orphaned sector, we can simply add it to the free list. A sector that appears twice in the directory structure could be duplicated or simply removed (some systems actually move duplicated blocks into a special "lost-and-found" directory, allowing the user to examine them and recover them manually if necessary).
In Unix, the fsck tool (named for what it does: filesystem check, and also for what you say when you have to run it) is used to check and recover a filesystem.
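The consistency check described above can be sketched as a single pass that counts how often each block appears. This is a toy model and not how fsck itself is implemented: the function name check_consistency is hypothetical, and the filesystem is assumed to be a plain dictionary mapping file names to lists of block numbers.

```python
from collections import Counter


def check_consistency(num_blocks, free_list, files):
    """Classify the blocks that violate the free-list invariant.

    files: dict mapping a file name to the list of blocks it uses
    (an assumed stand-in for walking the real directory tree).
    Returns (orphaned, duplicated, free_and_used) as sorted lists.
    """
    used = Counter(b for blocks in files.values() for b in blocks)
    free = set(free_list)

    orphaned = []       # in neither the free list nor any file
    duplicated = []     # claimed by more than one file position
    free_and_used = []  # on the free list but also in a file
    for block in range(num_blocks):
        if used[block] == 0 and block not in free:
            orphaned.append(block)
        if used[block] > 1:
            duplicated.append(block)
        if used[block] > 0 and block in free:
            free_and_used.append(block)
    return orphaned, duplicated, free_and_used


def reclaim_orphans(free_list, orphaned):
    """Repair step from the text: return orphaned blocks to the free list."""
    return list(free_list) + list(orphaned)
```

The cost is visible in the loop: every block and every file is visited, so the running time grows with the size of the whole filesystem.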
As disks (and thus file systems) got larger, the process of traversing an entire filesystem became prohibitively expensive. To solve this, we can use journaling:
before an operation begins, write an entry to the journal (a special section at the beginning of the partition) indicating the intent to perform the operation. Sync the disk.
perform the operation. Sync the disk.
at some point in the future, mark the operation completed.
When recovering, only the blocks that are part of incomplete operations in the journal need to be inspected for inconsistency.
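The journaling steps above can be sketched with an in-memory model. Everything here is an assumption made for illustration: the Journal class and its method names are hypothetical, a Python list stands in for the on-disk journal, and the disk syncs are only marked by comments.

```python
class Journal:
    """Toy write-ahead journal: records intent before an operation runs."""

    def __init__(self):
        # Each entry records the blocks an operation touches and
        # whether the operation has been marked completed.
        self.entries = []

    def begin(self, blocks):
        """Log the intent to modify `blocks`; returns an operation id."""
        op_id = len(self.entries)
        self.entries.append({"blocks": list(blocks), "done": False})
        # A real filesystem would sync the disk here, before the operation.
        return op_id

    def complete(self, op_id):
        """Mark a finished operation (after the operation itself is synced)."""
        self.entries[op_id]["done"] = True

    def blocks_to_inspect(self):
        """On recovery, only blocks of incomplete operations need checking."""
        suspect = set()
        for entry in self.entries:
            if not entry["done"]:
                suspect.update(entry["blocks"])
        return suspect
```

If a crash happens after begin but before complete, recovery looks only at the blocks named in that entry instead of traversing the whole filesystem.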