At the end of last lecture, we introduced the notion of paging: divide a large virtual address space into many small pages, which can be independently swapped into and out of frames in physical memory.
To do so, we need to keep a data structure (the page table) for each process mapping page numbers to frame numbers.
The simplest method is to put these into an array: the ith entry in the array gives the frame number in which the ith page is stored.
note: these numbers are typical, but not worth memorizing: the process by which they are derived is more important.
The page table needs one entry per page. Assuming a 4GB (2^32 byte) virtual and physical address space and a page size of 4kB (2^12 bytes), we see that the the 2^32 byte address space must be split into 2^20 pages.
This means the page table must have 2^20 entries.
How large are the entries?
Each entry contains a frame number. Since there are 2^32 physical addresses divided into frames of size 2^12 (frame size = page size), we see that there are 2^20 frames, so we need 20 bits to store the frame number.
We also need some extra bits: since pages may be swapped out to disk, we need a valid bit in the page table entry to indicate whether the page is valid (i.e. currently present in memory) or not. We can also store a dirty bit so that we can avoid writing pages back to disk if we need to swap them out.
To support segmentation (discussed below), we can also store read, write, and execute permission bits.
This gives a total of 25 bits per entry. The math is much easier if we round to bytes: each entry is 4 bytes.
Thus the total size of the page table is 2^20 entries * 2^2 bytes/entry = 2^22 bytes = 4MB.
4MB of contiguous space per process is a lot. Moreover, if the process is only using a small part of its address space, we will only need to access a small part of the page table.
Just as with the address space, we can solve these problems by paging the page table itself. For convenience, we can make the pages of the page table (POPTs) the same size as the pages of the process's address space. This allows us to use the same set of frames to store either process data or POPTs.
In our example, each POPT holds 2^12 bytes / 4 bytes per entry = 2^10 entries. Since there are 2^20 total entries in the page table, there must be 2^10 POPTs.
Just like we needed a page table when we split up the address space into pages, we will need a second level page table to tell us where the POPTs are stored. In our example, this table must contain 2^10 entries (one for each POPT), each of which is 4 bytes (it contains a 20 bit frame pointer and additional VDRWX bits). Thus the total size of the top-level page table is 2^10 entries * 2^2 bytes per entry = 2^12 bytes = 4kb. This fits in one page, so there is no reason to split it further.
With different numbers, we could have a very large top-level page table. If so, we could repeat this process by paging the top-level page table (thus introducing another layer of page table).
To look up an address in a hierarchical paging scheme, we use the first 10 bits to index into the top level page table. This tells us which POPT to use. The next 10 bits tell us which entry on that POPT points to the data page we are interested in. The remaining 12 bits gives us the offset into that page.
A good exercise is to work out all of the above computations with small numbers, and to actually fill out the contents of RAM on a piece of paper. For example, you could try designing everything with a page size of 8 bytes, a physical address space of 128 bytes, and a logical address space of 256 bytes. Figure out how to look up address 0x3A. Roll a (# of frames)-sided die whenever you need to find a frame for a page.
Here is a blank sheet of RAM; you would only use half of it for the scenario above (since the physical address space is 128, not 256). You could also print out lots of pages and experiment with a page size of 256 bytes.
The size of the page table (hierarchical or otherwise) grows with the size of the virtual address space. If we have a large virtual address space (such as in a 64 bit architecture), the page table will become huge. Hierarchical paging will allow us to keep most of that out of main memory, but would require a 6-level hierarchy (why?). That means to look up an address you need to read at least 6 frame numbers, which is expensive.
Instead of making a very large sparse array, we can instead use a hash table with one entry per frame. The hash table maps page numbers to frame numbers.
How does the MMU actually work with these fancy paging schemes? The key component is the translation lookaside buffer (TLB).
The TLB is a small hardware associative array (think tens to hundreds of entries) that maps page numbers to frame numbers.
As the program executes, the page numbers stored in virtual addresses are compared with all of the entries in the TLB (this is done in hardware, so all comparisons can happen simultaneously). If an entry matches, the corresponding frame number is combined with the offset to give the physical address.
If no entry matches, there is a TLB miss. If using a software-managed TLB, this miss will cause an exception to be raised; the operating system is then responsible for traversing the page tables to find the corresponding frame; it then loads the mapping into the TLB and continues. If using a hardware-managed TLB, the TLB is responsible for traversing the page table structure; it only raises an exception if the page table has not yet been properly configured.
Recall that each process has its own address space, and thus its own page table. This means that when the OS context switches to a new process, it switches the pointer to the root of the page table to a the page table for the new process (this is the "VM info" stored in the PCB referred to in the first week). This will cause all of the TLB entries to become invalid. This is called a TLB flush. Repopulating the TLB is a large component of what makes context switching between processes expensive.
The cost can be mitigated somewhat by adding process identifiers to the TLB lines and allowing the TLB slots to be split between multiple processes. This is called a tagged TLB.
It can be useful to mark different regions of a process's address space with different read/write/execute privileges. For example, a process is typically divided into a kernel area, a heap area, a stack area, a code area, and so forth. These large areas are called segments.
It makes sense to read and write in your heap, but not to jump there; conversely it makes sense to jump into your code section, but not to write it. Any access to unallocated space is an error.
The TLB can help us enforce these conventions. Each TLB entry has additional read, write, and execute bits. While translating an address, the TLB will also check whether the type of access is valid for the corresponding page. If not, it can raise an exception, and the OS can handle it appropriately. This is the source of your favorite C error: a segmentation fault occurs whenever you access a "bad" pointer: a pointer to a page of memory that hasn't been allocated with the corresponding permissions.
TLB permission bits give the OS a way to be interrupted when certain pages are accessed in certain ways (by clearing the corresponding permission bit). This can be used for various things other then protecting segments:
allocating zeroed pages: if a process wants a large chunk of memory to be zeroed out, the OS can do this lazily by revoking read and write access. When the process tries to access the page, a fault will be raised, and the OS can zero that page at that point in time.
copy on write: if a process wants to duplicate a large chunk of memory, the OS can instead make page table entries of the two copies refer to the same frame, but mark them read-only. If one of the processes tries to write, the OS can make the copy at that point in time.
debugger breakpoints on memory accesses: access permissions to pages of the debugged process can be cleared; when the process tries to access those pages, the OS will receive an exception. It will then block the process and notify debugger. The debugger can then check whether there is a break point set, and resume the program if not (notifying the user if there is).