deadlock detection
Deadlock occurs when a system is unable to make progress because threads are blocking each other.
Consider the "dining philosophers" problem: n philosophers are sitting around a table, wanting to eat. Between each pair of philosophers is a single chopstick; a philosopher needs two chopsticks to eat. One possible way to write the pseudocode for each philosopher:
while hungry:
    pick up left chopstick (blocking if unavailable)
    pick up right chopstick (blocking if unavailable)
    eat
    set down left chopstick
    set down right chopstick
This solution may exhibit deadlock if all threads pick up their left chopstick before any thread picks up the right chopstick. At that point, all of the philosophers are waiting for a chopstick, but no chopstick is available, so the system is deadlocked.
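The stuck state can be reproduced deterministically, without real threads. The Python sketch below is only an illustration: it assumes each chopstick is modelled as a `threading.Lock`, and the helper name `stuck_after_left_pickups` is hypothetical. It plays out the bad interleaving by hand and uses non-blocking acquires so that the example itself cannot hang.

```python
import threading

def stuck_after_left_pickups(n):
    """Simulate the interleaving in which every philosopher takes the left chopstick first.

    Returns True if no philosopher can then obtain a right chopstick.
    """
    chopsticks = [threading.Lock() for _ in range(n)]

    # Step 1: philosopher i picks up chopstick i (assumed to be the one on the left).
    for i in range(n):
        chopsticks[i].acquire()

    # Step 2: philosopher i tries chopstick (i + 1) % n (the one on the right).
    # blocking=False makes acquire() return False instead of waiting forever.
    got_right = [chopsticks[(i + 1) % n].acquire(blocking=False) for i in range(n)]
    return not any(got_right)

print(stuck_after_left_pickups(5))  # True
```

With blocking acquires, as in the pseudocode, each of the failed attempts in step 2 would instead wait forever.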
The following conditions are necessary and sufficient for a system to be in deadlock:
- mutual exclusion: multiple threads cannot simultaneously hold the same resource
- hold and wait: threads can hold one resource while waiting for another
- no preemption: it is impossible to force a thread to relinquish a resource until it has completed its task
- circular wait: there is a thread that is currently waiting for another thread, which is currently waiting for another thread, ..., which is currently waiting on the first thread.
The circular wait condition can be easily explained using a resource allocation graph. The graph is drawn according to the following rules:
- vertices represent either threads (drawn as ovals) or resources (drawn as rectangles).
- an edge from a thread to a resource indicates that the thread is waiting to acquire the resource
- an edge from a resource to a thread indicates that the thread is holding the resource
Assuming mutual exclusion, no preemption, and hold and wait, the system is deadlocked if and only if there is a (directed) cycle in the resource allocation graph. Here is the graph for the deadlocked dining philosophers scenario described above:
Here each philosopher is holding the chopstick on her left, while trying to acquire the chopstick on her right.
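The same graph can be written down as data. The sketch below is an illustration under assumed conventions: vertices are named `P0, P1, ...` for philosophers and `C0, C1, ...` for chopsticks, chopstick `i` is taken to be the one on philosopher `i`'s left, and both function names are hypothetical.

```python
def deadlocked_philosophers_graph(n):
    """Resource allocation graph after every philosopher has picked up her left chopstick."""
    graph = {}
    for i in range(n):
        graph[f"C{i}"] = [f"P{i}"]            # resource -> thread: chopstick i is held by philosopher i
        graph[f"P{i}"] = [f"C{(i + 1) % n}"]  # thread -> resource: philosopher i waits for the next chopstick
    return graph

def walk_from(graph, start):
    """Follow the first outgoing edge from each vertex until a vertex repeats; return the path."""
    path = [start]
    current = graph[start][0]
    while current not in path:
        path.append(current)
        current = graph[current][0]
    path.append(current)
    return path

graph = deadlocked_philosophers_graph(3)
print(" -> ".join(walk_from(graph, "P0")))
# P0 -> C1 -> P1 -> C2 -> P2 -> C0 -> P0
```

The walk returns to its starting vertex, which is the directed cycle that signals deadlock.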
We can design a system to avoid deadlock by making any one of the four conditions impossible.
In some cases, deadlock can be mitigated by making resources more shareable. For example, using a reader/writer lock instead of a mutex can make deadlock less likely (since many readers can share the read lock). Using a lock-free data structure is another way to allow multiple threads to access a data structure simultaneously (without blocking).
However, many resources are inherently non-shareable (e.g. printers: can't print two documents simultaneously!). Mutual exclusion is a good condition to break if you can, but often you can't.
In some situations we can make resources preemptible. If a process tries to acquire a resource that is held by another process, we can make it possible for the new process to steal the resource.
In order to do this, we need some mechanism for rollback: the resource was held in order to protect some program invariants, and we need to be able to restore those invariants when the resource is taken away.
For example, if the resource is a lock protecting a shared variable, we could roll back the thread that holds the lock by restoring the state of the shared variable to the state it held before the lock was acquired, and restarting the process that was performing the update.
Once we allow computations to be rolled back, we introduce the possibility that two threads can continue to preempt each other forever. Although the system is not deadlocked (both threads seem to be making forward progress), the system may never actually finish its tasks. This state is called livelock: when competing threads are continuously being rolled back before they can finish.
It is not possible to make all resources preemptible. I/O is a well-known impediment to rollback: once some output has been performed, it may be impossible to return to a consistent state. Once you tell the user you've started processing their order, you can't take it back.
We can break hold-and-wait by having threads release all of their locks and then re-acquire them all at once.
Releasing locks may require rollback, which leads to the same issues described above.
Monitors partially use this strategy to avoid hold-and-wait: calling wait on a condition variable automatically releases the lock, so that acquiring the monitor lock cannot cause deadlock. However, it is still possible to create a form of deadlock with a monitor where one thread needs to wait for some predicate before updating state in a way that satisfies another predicate, while a second thread waits for the second predicate before making the first predicate true.
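The lock release performed by wait can be observed directly. The sketch below uses Python's `threading.Condition` as a stand-in for a monitor; the names `ready`, `log`, `waiter`, and `signaller` are made up for this example.

```python
import threading

cond = threading.Condition()   # a lock plus a wait queue, similar to a monitor
ready = False                  # the predicate the waiter is waiting for
log = []

def waiter():
    with cond:                          # acquire the monitor lock
        while not ready:                # re-check the predicate after every wakeup
            cond.wait()                 # releases the lock while blocked
        log.append("waiter saw ready")  # the lock is held again here

def signaller():
    global ready
    with cond:                          # succeeds even while waiter is blocked in wait()
        ready = True
        log.append("signaller set ready")
        cond.notify_all()

t = threading.Thread(target=waiter)
t.start()
signaller()
t.join(timeout=5)
print(log)  # ['signaller set ready', 'waiter saw ready']
```

If `wait()` kept the lock, `signaller` could never enter its `with cond:` block and the program would hang. The two-predicate deadlock described above is still possible with this API, because neither thread would ever reach its `notify_all()`.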
A common approach to preventing deadlock is to use lock ordering. All resources in the system are numbered, and each thread must acquire low-numbered resources before high-numbered resources. That means that when traversing any path in the resource allocation graph, the resource numbers must increase (because if resource A is held by process P, which is trying to acquire B, then B must be bigger than A). This means that there can't be cycles, because around a cycle the number would eventually have to go down.
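Applied to the dining philosophers, lock ordering means each philosopher picks up the lower-numbered of her two chopsticks first. The following is a minimal sketch, assuming chopsticks are `threading.Lock` objects numbered by list index; `run_ordered_philosophers` and its parameters are hypothetical names.

```python
import threading

def run_ordered_philosophers(n, rounds):
    """Run n philosopher threads that each eat `rounds` times; return the meal counts."""
    if n < 2:
        raise ValueError("need at least two philosophers (and two chopsticks)")
    chopsticks = [threading.Lock() for _ in range(n)]
    meals = [0] * n

    def philosopher(i):
        left, right = i, (i + 1) % n
        # Lock ordering: always take the lower-numbered chopstick first.
        first, second = min(left, right), max(left, right)
        for _ in range(rounds):
            with chopsticks[first]:
                with chopsticks[second]:
                    meals[i] += 1  # "eat"; only thread i ever writes meals[i]

    threads = [threading.Thread(target=philosopher, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return meals

print(run_ordered_philosophers(5, 100))  # [100, 100, 100, 100, 100]
```

Only the last philosopher behaves differently from the original pseudocode: her right chopstick is number 0, so she reaches for it before her left one, which is enough to break the cycle.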
Lock ordering can fail if you need to acquire a lock to determine what other locks to acquire, or if you can't predict what locks you may need for another reason.
For example, a process could read file B (which requires a lock on file B), and the contents of B tell it to update file A. If file A has a lower-numbered lock, then the process must release B before acquiring A. But once it releases the lock on B, another process can update B, so that the process should be writing to C instead of A. To safely perform the update, the process would need to release the lock on B, then acquire the locks on A and B (in that order), reread file B, and then update the file it points to. If file B has changed, the process may need to start over again to acquire another lock, and so on (potentially leading to livelock).
Another general strategy for dealing with deadlock is to simply detect it and respond if it occurs. One can respond by either killing a thread (and releasing all of its resources), or by forcing it to roll back.
A simple practical solution to detecting deadlock is to put a time limit on the acquisition of resources: a thread that waits longer than the limit is presumed to be deadlocked. You may end up killing too many threads, but if you are writing code that is expected to be run with deadlock detection, it needs to be able to handle thread death anyway, so killing a few extra threads isn't so bad (this is an example of an end-to-end argument, which we'll discuss in more detail when studying networking).
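A time limit is easy to express with Python's `Lock.acquire(timeout=...)`. Python offers no way to kill a thread, so this sketch substitutes a milder response: the thread that times out releases what it holds and reports failure, leaving the caller to retry or abort. The helper name `acquire_both` and the 0.05 second limit are arbitrary choices for illustration.

```python
import threading

def acquire_both(first, second, timeout):
    """Try to take two locks, giving up if either takes longer than `timeout` seconds.

    Returns True with both locks held, or False with neither held.
    """
    if not first.acquire(timeout=timeout):
        return False
    if not second.acquire(timeout=timeout):
        first.release()   # back off so that other threads can make progress
        return False
    return True

a, b = threading.Lock(), threading.Lock()
b.acquire()                               # pretend another thread holds b
print(acquire_both(a, b, timeout=0.05))   # False: timed out waiting for b
print(a.locked())                         # False: a was released on the way out
b.release()
print(acquire_both(a, b, timeout=0.05))   # True: both locks are now held
```

A timeout cannot tell a deadlock from a slow lock holder, which is why this approach may give up more often than strictly necessary.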
A more precise method is to keep track of the resource allocation graph and check it for deadlock. This check can be done periodically or when a new resource is requested.
In order to detect deadlock, we can use the following algorithm:
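One possible check, sketched here under the assumption that every resource has a single instance (so that, as stated earlier, deadlock corresponds exactly to a directed cycle in the resource allocation graph), is a depth-first search for a cycle. This is a generic graph algorithm and not necessarily the one these notes intend; the function name `find_cycle` and the dictionary representation of the graph are choices made for this example.

```python
def find_cycle(graph):
    """Return a list of vertices forming a directed cycle, or None if there is none.

    graph maps each vertex (thread or resource) to a list of successor vertices.
    """
    WHITE, GRAY, BLACK = 0, 1, 2      # unvisited, on the current DFS path, finished
    color = {v: WHITE for v in graph}
    path = []

    def visit(v):
        color[v] = GRAY
        path.append(v)
        for w in graph.get(v, []):
            state = color.get(w, WHITE)
            if state == GRAY:                      # back edge: w is on the current path
                return path[path.index(w):] + [w]
            if state == WHITE:
                cycle = visit(w)
                if cycle:
                    return cycle
        path.pop()
        color[v] = BLACK
        return None

    for v in list(graph):
        if color[v] == WHITE:
            cycle = visit(v)
            if cycle:
                return cycle
    return None

# T1 holds R1 and waits for R2; T2 holds R2 and waits for R1.
deadlocked = {"T1": ["R2"], "R2": ["T2"], "T2": ["R1"], "R1": ["T1"]}
print(find_cycle(deadlocked))   # ['T1', 'R2', 'T2', 'R1', 'T1']

# Same holdings, but T2 is not waiting for anything.
healthy = {"T1": ["R2"], "R2": ["T2"], "T2": [], "R1": ["T1"]}
print(find_cycle(healthy))      # None
```

The threads on the returned cycle are the candidates for being killed or rolled back.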
The banker's algorithm is a slight variation on deadlock detection: instead of detecting whether there is currently a deadlock, we keep track of the maximum potential requests that each process might make, and block before granting a request that could lead to deadlock in the future if some processes request their maximum allocation.
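The idea behind the banker's algorithm is that we keep track of every process's maximum and current allocations, as well as the current number of unallocated resources. We maintain the invariant that the state is safe: there is some sequence of processes P1, P2, P3, ... such that we can run each process in turn to completion, using only the currently unallocated resources plus the resources released by the processes that ran before it.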
Whenever a process requests a resource, we check whether granting that request would leave the system in a safe state or not. If it would, we grant the request. If not, we block the request until more resources become available.
Checking for safety is straightforward, because running a process to completion only frees up more resources for future processes. Thus, we can choose any completable process to run; we will not prevent ourselves from finding a safe schedule.
to check for safety:
    make a copy of the current allocation table
    while processes exist:
        choose any process that can run to completion with available resources
        if there are none: state is not safe
        add that process's resources to the available resources
        remove that process from the list
    if you complete the loop, the state is safe.
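The safety check and the request rule can be sketched in Python as follows. The table layout (lists indexed by process and by resource type) and the names `is_safe` and `can_grant` are assumptions made for this example; the numbers are a common textbook-style instance with five processes and three resource types.

```python
def is_safe(available, allocation, maximum):
    """Return True if some order lets every process run to completion.

    available:  free units per resource type, e.g. [3, 3, 2]
    allocation: allocation[p][r] = units of r currently held by process p
    maximum:    maximum[p][r] = most units of r that process p may ever hold
    """
    free = list(available)                       # work on a copy of the table
    remaining = set(range(len(allocation)))
    while remaining:
        # A process can finish if its worst-case remaining need fits in `free`.
        runnable = next(
            (p for p in sorted(remaining)
             if all(maximum[p][r] - allocation[p][r] <= free[r]
                    for r in range(len(free)))),
            None,
        )
        if runnable is None:
            return False                         # nobody can finish: not safe
        for r in range(len(free)):
            free[r] += allocation[runnable][r]   # the finished process frees its resources
        remaining.remove(runnable)
    return True

def can_grant(process, request, available, allocation, maximum):
    """Return True if granting `request` to `process` would leave the state safe."""
    for r, q in enumerate(request):
        if q > available[r]:
            return False                         # not enough free units right now
        if allocation[process][r] + q > maximum[process][r]:
            return False                         # exceeds the declared maximum
    new_available = [a - q for a, q in zip(available, request)]
    new_allocation = [list(row) for row in allocation]
    for r, q in enumerate(request):
        new_allocation[process][r] += q
    return is_safe(new_available, new_allocation, maximum)

available = [3, 3, 2]
allocation = [[0, 1, 0], [2, 0, 0], [3, 0, 2], [2, 1, 1], [0, 0, 2]]
maximum = [[7, 5, 3], [3, 2, 2], [9, 0, 2], [2, 2, 2], [4, 3, 3]]

print(is_safe(available, allocation, maximum))                  # True
print(can_grant(1, [1, 0, 2], available, allocation, maximum))  # True
print(can_grant(4, [3, 3, 0], available, allocation, maximum))  # False
```

The last request fits within the free resources, yet it is refused: after granting it, no process could be guaranteed to finish, so the requester would be blocked until more resources become available.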