In this survey, we’ll examine the Lovász Local Lemma (LLL), a cornerstone of the probabilistic method, through the lens of theoretical computer science. This powerful tool was developed in the 70s by Hungarian mathematicians Lovász and Erdős to guarantee the existence of certain combinatorial structures1, and it has since been borrowed by computer scientists to prove results about satisfiability problems, graph colorings, tree codes, integer programming, and much more2.
For quite some time, this method remained entirely non-constructive, until a quest to constructify the LLL began in the 90s3. For a while, all results required rather significant assumptions, but breakthrough work by Moser and Tardos in 2009 proved the efficacy of a surprisingly simple randomized search algorithm4 5. This post will describe an elegant analysis of Moser’s algorithm using the method of entropy compression. If you are already familiar with the LLL, you may want to skip to the third section.
The Probabilistic Method
The probabilistic method is a useful non-constructive technique that was pioneered by Paul Erdős in the 40s to prove foundational results in Ramsey theory6. In general, this setup begins with a search space Ω (usually large and finite) and a set B of flaws that we aim to avoid. Specifically, each B∈B is a subset of Ω whose objects share some undesired trait, and we are searching for a flawless object σ∈Ω∖∪B∈BB. The probabilistic approach to this search is to place a probability measure on Ω, so that B corresponds to a set of “bad” events. Then, we can borrow tools from probability theory to prove that the probability that no bad event occurs, Pr[∧B∈B¯B]=1−Pr[∨B∈BB], is strictly positive, certifying the existence of a flawless object. While this approach is often just a counting argument in disguise, it has proven to be quite fruitful and has helped combinatorialists avoid redeveloping existing probabilistic machinery. The classic applications of this method are to graph coloring, but k-SAT seems better suited to a computer science audience.
Let φ be a Boolean formula in conjunctive normal form (CNF) with n variables, m clauses, and k variables in each clause. The k-SAT decision problem asks, given such a φ, whether there exists an assignment of variables satisfying all of its clauses. We will diverge slightly from the traditional k-SAT definition by requiring that the variables of each clause are distinct. In this case, our search space Ω={True,False}^n consists of variable assignments, and our flaws Bi={σ∈Ω∣σ violates clause i of φ}, i=1,…,m, correspond to violated clauses. We will simply select the uniform probability measure over Ω, i.e. flip a fair coin to determine the assignment of each variable. With this choice, Pr[Bi]=1/2^k, since there is a single assignment of the k variables in a CNF clause which causes it to be violated.
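As a quick sanity check of this setup (a minimal sketch; the clause encoding and helper names are my own), one can estimate Pr[Bi] by sampling uniform assignments:

```python
import random

# A clause as a list of (variable index, negated?) literals; this one encodes x1 ∨ ¬x2 ∨ x3.
clause = [(0, False), (1, True), (2, False)]

def violates(assignment, clause):
    """A CNF clause is violated exactly when every one of its literals evaluates to False."""
    return all(assignment[i] == negated for i, negated in clause)

n, trials = 3, 100_000
count = sum(
    violates([random.random() < 0.5 for _ in range(n)], clause)
    for _ in range(trials)
)
print(count / trials)  # ≈ 1/2^3 = 0.125, matching Pr[Bi] = 1/2^k
```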
Perhaps the most generic tool we have at our disposal is the union bound. In the case of k-SAT, this guarantees that a satisfying assignment exists when the number of clauses is small. Specifically, if m<2^k, we have Pr[¯B1∧⋯∧¯Bm] = 1 − Pr[B1∨⋯∨Bm] ≥ 1 − ∑i Pr[Bi] = 1 − m·2^−k > 0, as desired. Another simple case is when all of the clauses are disjoint; then, the Bi events are independent and Pr[¯B1∧⋯∧¯Bm] = ∏i Pr[¯Bi] = (1−2^−k)^m > 0. In general, we have the following when each bad event occurs with probability at most p.
| Method | Assumptions on … | Sufficient condition for flawless object |
| --- | --- | --- |
| Union bound | n/a | p < 1/n |
| Independence | Global event structure | p < 1 |
| ??? | Local event structure | p < ? |
As the final row suggests, it is natural to ask whether one can interpolate between these two results with assumptions on local event structure.
The Lovász Local Lemma
This question was given an affirmative answer in the 70s when László Lovász developed his local lemma to establish a hypergraph colorability result7. We will first describe the so-called symmetric LLL, which is often applied when the bad events are identical up to some (measure-preserving) permutation of the objects.
Theorem (Symmetric LLL)
Let B be a set of events such that each occurs with probability at most p and is mutually independent from all but at most d other events. Then, if ep(d+1)≤1, Pr[∧B∈B¯B]>0.
For d≥1, this condition was later relaxed by Shearer8 to epd≤1. The dependency bound d serves as a characterization of local event structure, and this result fits into the table as p<1/(ed), interpolating nearly smoothly between independence and the union bound. Let’s examine the consequences of the LLL for k-SAT.
Example
Suppose that each variable of our k-CNF formula appears in at most 2^k/(ke) clauses. Then each clause can share variables with at most k⋅2^k/(ke) = 2^k/e clauses, giving a dependency bound of d = 2^k/e between our bad events. Since p = 2^−k, we have epd = 1, and the LLL condition (barely) holds, guaranteeing the existence of a satisfying variable assignment.
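To see the arithmetic concretely (a small sketch with a hypothetical clause width; the variable names are mine), here is the condition check for, say, k = 4:

```python
import math

k = 4
p = 2 ** -k                           # probability that a fixed clause is violated
occurrences = 2 ** k / (k * math.e)   # allowed number of clauses containing each variable
d = k * occurrences                   # = 2^k / e, clauses sharing a variable with a fixed clause
print(math.e * p * d)                 # 1.0 (up to floating point), so epd <= 1 just barely holds
```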
In the asymmetric case, we will require a more nuanced description of event structure.
Definition (Dependency Graph)
Let E be a set of events in a probability space. A graph with vertex set E is called a dependency graph for E if each event is mutually independent from all but its neighbors.
Example
Consider the 2-CNF formula φ=(x1∨x2)∧(¬x2∨x3)∧(x3∨¬x4). Then, we have three bad events to avoid: B1 associated with x1∨x2, B2 with ¬x2∨x3, and B3 with x3∨¬x4. The following are valid dependency graphs for this event set.
Remark
In the case of k-SAT, there is a unique edge-minimum dependency graph with edges connecting events whose corresponding clauses share variables; however, there is no canonical dependency graph for general event sets. Furthermore, when we move from an event set to a dependency graph, we lose information about the events.
Now, we can state the LLL in its general form, where we let NG(v) denote the neighbors of vertex v in graph G.
Theorem (Asymmetric LLL)
Given a set of bad events B with dependency graph G, if there exists some x∈(0,1)B such that Pr[B]≤xB∏C∈NG(B)(1−xC)∀B∈B, then Pr[∧B∈B¯B]>0.
Remark
We can recover the symmetric case as a corollary by setting each xB=1/(d+1). Indeed, this choice of x results in the requirement Pr[B] ≤ (1/(d+1))·(1−1/(d+1))^d, which is implied by the condition ep(d+1)≤1, since 1/e ≤ (1−1/(d+1))^d.
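The inequality 1/e ≤ (1−1/(d+1))^d can be checked numerically (a quick sketch, not part of the original argument):

```python
import math

for d in [1, 2, 5, 10, 100, 1000]:
    lhs = 1 / math.e
    rhs = (1 - 1 / (d + 1)) ** d   # = (1 + 1/d)^(-d), which decreases toward 1/e from above
    print(d, lhs <= rhs)           # True for every d >= 1
```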
We omit a proof of the asymmetric LLL, though it essentially follows from inductive applications of Bayes’ Theorem.
Variable LLL and Moser’s Algorithm
In most applications of the LLL (for instance, k-SAT and graph coloring), the bad events of interest are generated by a finite set of independent random variables. To capture this richer structure, we introduce the following.
Definition (Event-Variable Bigraph)
Let B be a set of events and X be a set of mutually independent random variables such that each event B∈B is completely determined by a subset of variables vbl(B)⊂X. We define the event-variable bigraph of this system to be the bipartite graph H=(B∪X,E) with {B,X}∈E if and only if X∈vbl(B).
Definition (Base Dependency Graph)
Given such an event-variable bigraph H=(B∪X,E), we define its base dependency graph GH=(B,E′), where {B,B′}∈E′ if and only if vbl(B)∩vbl(B′)≠∅.
Observe that GH is a dependency graph for B (in fact, its edge-minimum dependency graph as long as no variable set is unnecessarily large).
Example
Recall the previous 2-CNF formula φ=(x1∨x2)∧(¬x2∨x3)∧(x3∨¬x4) with corresponding bad events B1,B2,B3. This system admits the following event-variable bigraph,
where Xi is the random variable corresponding to the assignment of xi.
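A minimal sketch of this structure in code (the dictionary representation and names are mine): the event-variable bigraph is just the map B ↦ vbl(B), and the base dependency graph connects events with intersecting variable sets.

```python
from itertools import combinations

# vbl(B) for the three clauses of φ = (x1 ∨ x2) ∧ (¬x2 ∨ x3) ∧ (x3 ∨ ¬x4)
vbl = {
    "B1": {"x1", "x2"},
    "B2": {"x2", "x3"},
    "B3": {"x3", "x4"},
}

# Base dependency graph: B and B' are adjacent iff vbl(B) ∩ vbl(B') ≠ ∅
edges = {
    (b, c)
    for b, c in combinations(sorted(vbl), 2)
    if vbl[b] & vbl[c]
}
print(edges)  # {('B1', 'B2'), ('B2', 'B3')} — the edge-minimum dependency graph
```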
With this variable formulation well-defined, we are now prepared to constructify the LLL. The algorithm which I will present was published by Robin Moser9 in 2008 and extended to the general variable setting alongside his advisor Gábor Tardos10 in 2009. Their approach was a major breakthrough over previous work, which required significant assumptions on top of this variable structure. Particularly surprising was its simplicity; the algorithm is perhaps the simplest non-trivial procedure one could imagine to find a satisfying variable assignment. Specifically, given a set B of bad events generated by a finite set of random variables X, Moser’s algorithm finds an assignment of the random variables such that all of the bad events are avoided, as follows.
Algorithm (Resample)
Input: Finite event-variable bigraph (B∪X,E), procedure for sampling from each random variable X∈X
Output: Assignments v=(vX)X∈X of variables such that no event B∈B occurs
- Randomly sample assignment vX∼X for each variable X∈X
- While ∃ event B∈B which occurs under assignments v
  - Resample vX∼X for each variable X∈vbl(B)
- Return v
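For concreteness, here is one way Resample might look in code when the events come from a CNF formula (a hedged sketch with my own clause representation; Moser and Tardos state the algorithm abstractly over arbitrary events and variables):

```python
import random

def resample(clauses, n):
    """clauses: list of clauses, each a list of (variable index, negated?) literals.
    Returns an assignment (list of n booleans) satisfying every clause, if the loop terminates."""
    v = [random.random() < 0.5 for _ in range(n)]        # initial uniform sample

    def occurs(clause):                                  # bad event: the clause is violated
        return all(v[i] == neg for i, neg in clause)

    while True:
        bad = next((c for c in clauses if occurs(c)), None)
        if bad is None:
            return v
        for i, _ in bad:                                 # resample vbl(B) for the occurring event
            v[i] = random.random() < 0.5

# Example: φ = (x1 ∨ x2) ∧ (¬x2 ∨ x3) ∧ (x3 ∨ ¬x4)
phi = [[(0, False), (1, False)], [(1, True), (2, False)], [(2, False), (3, True)]]
print(resample(phi, 4))
```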
Of course, assumptions on the events and their structure are necessary for Resample to run efficiently (or even just to terminate). Initially, Moser and Tardos proved good performance under asymmetric LLL conditions11.
Theorem (Running time of Resample)
If there exists some x∈(0,1)B such that Pr[B]≤xB∏C:vbl(B)∩vbl(C)≠∅(1−xC) ∀B∈B, then Resample finds an assignment of variables such that no event occurs in at most ∑B∈B xB/(1−xB) expected resampling steps.
Entropy Compression
In a 2009 blog post, Terence Tao presented a particularly elegant analysis of this algorithm, inspired by a talk Moser gave at STOC. His write-up coined the term entropy compression, a clever principle for bounding the running time of an algorithm using a reduction from data compression. To describe this method, we first introduce some fundamental definitions and results from information theory. See David MacKay’s textbook for a more comprehensive reference.
Definition (Information Content)
The information content, or surprisal, of an event E is given by I(E)=−logPr(E), with the logarithm taken in base 2.
This notion captures the fact that we learn more from the occurrence of a rare event than from that of a frequent one.
Definition (Entropy)
The (Shannon) entropy of a discrete random variable X with finite support X is defined as H(X)=Ex∼XI(X=x)=−∑x∈XPr[X=x]log(Pr[X=x]).
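For instance (a small illustrative sketch), the entropy of a biased coin in code:

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a finitely supported distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: a fair coin is maximally surprising on average
print(entropy([0.9, 0.1]))    # ≈ 0.469 bits: a biased coin tells us less on average
```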
Entropy captures the expected amount of information obtained from measuring a random variable. Our exact choices of functions for I and H are supported by an elegant connection to coding theory.
Definition (Uniquely Decodable Codes)
For two alphabets (i.e. sets of symbols) Σ1 and Σ2, a code is a function C:Σ1→Σ2∗ mapping each symbol of Σ1 to a string of symbols over Σ2. The extension of C is the natural homomorphism it induces from Σ1∗ to Σ2∗, mapping each s1s2…sn∈Σ1∗ to C(s1)C(s2)…C(sn)∈Σ2∗. A code is uniquely decodable if its extension is injective.
The following result establishes the limits of noiseless data compression in terms of entropy.
Theorem (Shannon’s Source Coding Theorem, Simplified)
For a discrete random variable X with finite support X, let f:X→{0,1}∗ be a uniquely decodable code. Then, we can bound the expected length of the codeword f(X) from below by the entropy of X, i.e. E|f(X)|≥H(X).
Of particular interest to computer scientists is the following example.
Example
In particular, if X∼U({0,1}^r) is an r-bit string selected uniformly at random, then E|f(X)| ≥ H(X) = −∑x∈{0,1}^r Pr[X=x]log(Pr[X=x]) = −log(1/2^r) = r for any uniquely decodable f:{0,1}^r→{0,1}∗. Essentially, this means that r random bits cannot be noiselessly compressed into fewer than r bits in expectation.
Finally, we are prepared to outline Tao’s entropy compression principle.
Theorem (Entropy Compression)
Consider an algorithm A with access to a stream of bits sampled independently and uniformly at random. Suppose that A proceeds through a sequence of steps t=0,1,… until (potentially) terminating, such that the following hold for any input:
- A can be modified to maintain a bit string Lt logging its history after each step t, such that the random bits read so far can be recovered from Lt;
  - The computational cost (or even computability) of this history log modification is irrelevant.
  - The precise notion of recovery is a bit nuanced12 – we will elaborate within the proof.
- Lt has length at most ℓ0+tΔℓ after step t;
- r0+tΔr random bits have been read after step t.
If Δℓ<Δr, then the step after which A terminates is at most (ℓ0−r0)/(Δr−Δℓ) in expectation.
Proof
Fix any input to the algorithm. For each (r0+tΔr)-bit string y, sufficient to run the algorithm up to step t, let s=s(y)≤t denote the step at which the algorithm terminates when its random stream is replaced by the fixed string y. Then, we denote the final history log by Ls(y) and let z(y) be the suffix of y consisting of the (t−s)Δr bits unused due to early termination. Making the notion of recovery for the history logging protocol precise, we require that the code f:{0,1}^(r0+tΔr)→{0,1}∗ defined by the concatenation f(y)=Ls(y)∘z(y) is uniquely decodable, for all t. This is a slightly technical condition, but it will not be an obstacle for our analysis of Moser’s algorithm. Now, supplying random bits Y∼U({0,1}^(r0+tΔr)) to the algorithm, resulting in a run to step S=s(Y), we find that |f(Y)| = |LS(Y)| + (t−S)Δr ≤ ℓ0 + SΔℓ + (t−S)Δr. Next, we use the source coding bound E|f(Y)| ≥ H(Y) = r0+tΔr and rearrange to obtain E[S] ≤ (ℓ0−r0)/(Δr−Δℓ). Since A always completes its zeroth step, source coding also requires that ℓ0≥r0, ensuring that the previous bound is always sensical (non-negative). Finally, if T denotes the terminating step of the algorithm when provided with a stream of random bits (with the convention that T=∞ when the algorithm fails to halt), we see that S=min{t,T}. Since our bound on E[S] is independent of t, it also applies to E[T]=limt→∞E[S], as desired.
Essentially, if our algorithm is logging fewer bits than it reads at each step, it cannot progress too far in expectation without acting as an impossibly good compression protocol. Note that the identity protocol, which simply logs the random bits read so far without modification, satisfies our unique decodability requirement and can be appended to any algorithm. In this case, Δℓ=Δr and the bound above is vacuous, but the entropy compression principle tells us that any algorithm admitting a strictly better protocol (one with Δℓ<Δr) cannot continue indefinitely in expectation.
Remark
Alternatively, one can formalize this analysis using Kolmogorov complexity to get a bound with high probability, as described in this blog post by Lance Fortnow.
Analysis of Moser’s Algorithm
Now, we will apply our newfound principle to analyze Moser’s algorithm. For conciseness, we will restrict ourselves to the k-SAT example and consider a slightly modified procedure. Recall that we are requiring k-CNF formulas to have exactly k distinct variables in each clause.
Algorithm (Resample2)
Input: k-CNF formula φ
Output: Assignments x← solve(φ) such that φ(x) holds
- solve(φ)
  - Randomly assign each xi True or False with equal probability
  - While ∃ unsatisfied clause C
    - x ← fix(x,C,φ)
  - Return x
- fix(x,C,φ)
  - Randomly resample each xi∈C
  - While ∃ unsatisfied clause D sharing a variable with C
    - x ← fix(x,D,φ)
  - Return x
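One possible rendering of Resample2 in code (a sketch using my own clause representation; the recursion mirrors the pseudocode above, and deep recursion may require raising Python’s recursion limit):

```python
import random

def solve(phi, n):
    """phi: list of clauses, each a list of (variable index, negated?) literals over n variables."""
    x = [random.random() < 0.5 for _ in range(n)]        # random initial assignment

    def unsatisfied(c):
        return all(x[i] == neg for i, neg in c)

    def shares_variable(c, d):
        return bool({i for i, _ in c} & {i for i, _ in d})

    def fix(c):
        for i, _ in c:                                   # resample every variable of clause c
            x[i] = random.random() < 0.5
        while True:                                      # fix unsatisfied clauses sharing a variable with c
            d = next((d for d in phi if shares_variable(c, d) and unsatisfied(d)), None)
            if d is None:
                return
            fix(d)

    while True:
        c = next((c for c in phi if unsatisfied(c)), None)
        if c is None:
            return x
        fix(c)
```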
This algorithm is very similar in spirit to Resample, with the added recursive step of fixing nearby clauses before continuing to other unsatisfied clauses. We will prove an efficient running time bound for this algorithm under assumptions slightly stronger than the symmetric LLL conditions, in a model of computation where sampling a single random bit takes O(1) time.
Theorem (Running Time of Resample2)
Let φ be a k-CNF formula with n variables and m clauses. If each clause of φ shares variables with at most 2^(k−C) other clauses, then Resample2 runs in O(n+mk) expected time, where C is an absolute constant.
Remark
Note that C=log2(e) would match the symmetric LLL conditions exactly.
Proof
To apply entropy compression, we must first divide each run of Resample2 into steps. We let the initial assignments of solve comprise step 0 and let each clause resampling by fix constitute an additional step. Our key observation is that only one assignment of the relevant k variables can cause a k-CNF clause to be violated and require fixing. Thus, given a stack trace of an algorithm run, we can work backwards from the final string x to recover the sampled random bits. For example, suppose that x1∨¬x2∨x3 was the last clause of a 3-CNF formula to be fixed by Resample2. Then, the final assignments of these variables give the last three bits sampled, and their previous assignments must have been (False, True, False).
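To illustrate the recovery step on the 3-CNF example above (a toy sketch; the literal encoding is mine):

```python
def recover_bits(clause, final_assignment):
    """clause: list of (variable index, negated?) literals that was just fixed.
    The bits consumed by this fix are the variables' final values, and the pre-fix
    values are forced: they form the unique assignment violating the clause."""
    consumed = [final_assignment[i] for i, _ in clause]
    previous = [neg for _, neg in clause]   # literal (i, neg) is False exactly when x_i == neg
    return consumed, previous

# x1 ∨ ¬x2 ∨ x3 was fixed last; its variables ended up as (True, True, False)
print(recover_bits([(0, False), (1, True), (2, False)], {0: True, 1: True, 2: False}))
# ([True, True, False], [False, True, False]) — the previous bits were (False, True, False), as claimed
```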
Now, we use the dependency bound of d≤2^(k−C) to construct an efficient history logging protocol. Every history log will begin with n bits specifying the current variable assignment. Then, we mark the clauses which were fixed by non-recursive calls from solve. Since there are m clauses in total, such a subset can be described by m bits, with each bit specifying the membership status of a particular clause. Next, for each clause, we fix an ordering of the ≤d+1 clauses it shares variables with (including itself). Then, for the clauses of recursive calls to fix, we can simply log their index with respect to the “parent” clause which called them, requiring at most ⌈log(d+1)⌉+O(1) bits (with constant O(1) information used to keep track of the recursion stack). Reading through such a log as part of a larger string, one can determine when the log has terminated and how many steps were completed by the run it encodes, so it is simple enough to show that our unique decodability condition holds.
Thus, if the algorithm has called fix t times, our log has length at most n+m+t(log(d)+O(1)). However, we have sampled n bits for the initial assignments and k for each call to fix, giving a total of n+kt random bits. By the dependency bound, the history log “write rate” of log(d)+O(1)=k−C+O(1) bits per step is less than the random bit “read rate” of k bits per step, for sufficiently large C. Thus, our entropy compression principle states that the number of calls to fix is at most ((n+m)−n)/(k−(k−C+O(1))) = O(m) in expectation. The total number of random bits sampled by Resample2 is therefore O(n+mk) in expectation, as desired.
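In terms of the entropy compression theorem, the bookkeeping amounts to the following (a sketch of the arithmetic only; the constant "overhead" is my own placeholder for the O(1) stack bits):

```python
def expected_fix_calls(n, m, k, C, overhead=2):
    """Evaluate the entropy compression bound (ℓ0 − r0) / (Δr − Δℓ) for Resample2."""
    l0, dl = n + m, (k - C) + overhead   # history log: n + m bits up front, ≈ k − C + O(1) per fix
    r0, dr = n, k                        # random bits: n up front, k per fix
    assert dl < dr                       # requires C > overhead, i.e. C sufficiently large
    return (l0 - r0) / (dr - dl)         # = m / (C − overhead) = O(m)

print(expected_fix_calls(n=100, m=400, k=10, C=4))  # 200.0 expected calls to fix in this toy instance
```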
Many lower bounds in theoretical computer science are proven using information theory, but Moser was the first, to my knowledge, to provide an upper bound using such methods. The key ingredient of his original proof was that of a witness tree, similar in spirit to our history log above. Although this technique was quite novel at the time, Tao observed that it could be framed in a more familiar light. Algorithm analysis often exploits monotonic properties that consistently increase over the course of an algorithm. If one can ensure that there is a constant increase at each step and that the property is bounded from above, then this gives an upper bound on the number of steps taken by the algorithm (common examples of monotonic properties include cardinalities, weights, densities, and energies). In this case, we essentially13 have that the ratio of entropy to output plus history log length is bounded by one and increases by a constant amount with each (expected) step, ensuring termination. With this interpretation, “entropy compression” is a very natural description of our analysis method.
Remark
As a final aside, I’d like to promote the excellent Budapest Semesters in Mathematics program, where I first learned about the LLL in András Gyárfás’s class on the combinatorics of finite set systems.
Paul Erdős and László Lovász. “Problems and Results on 3-Chromatic Hypergraphs and Some Related Questions”.↩
Michael Molloy. “The Probabilistic Method”.↩
József Beck. “An Algorithmic Approach to the Lovász Local Lemma”.↩
Robin A. Moser. “A Constructive Proof of the Lovász Local Lemma”.↩
Robin A. Moser and Gábor Tardos. “A Constructive Proof of the General Lovász Local Lemma”.↩
Noga Alon and Joel H. Spencer. “The Probabilistic Method”.↩
Paul Erdős and László Lovász. “Problems and Results on 3-Chromatic Hypergraphs and Some Related Questions”.↩
James B. Shearer. “On a Problem of Spencer”.↩
Robin A. Moser. “A Constructive Proof of the Lovász Local Lemma”.↩
Robin A. Moser and Gábor Tardos. “A Constructive Proof of the General Lovász Local Lemma”.↩
Robin A. Moser and Gábor Tardos. “A Constructive Proof of the General Lovász Local Lemma”.↩
In fact, I was confused enough to ask about it on Math Overflow, where Tao answered my (misguided) question himself!↩
Formalizing this in the proof required some care, capping the length of the random stream, but the description is morally correct.↩