The claim that hash tables have O(1) expected performance for lookup and insert is based on the assumption that the number of elements stored in the table is comparable to the number of buckets. If a hash table has many more elements than buckets, the number of elements stored at each bucket will become large. For instance, with a constant number of buckets and O(n) elements, the lookup time is O(n) and not O(1).
The solution to this problem is to increase the size of the table
when the number of elements in the table gets too large compared to the
size of the table. If we let the
The linear running time of a resizing operation is not as much of a problem as it might sound (although it can be an issue for some real-time computing systems). If the table is doubled in size each time a resize is needed, then the resizing operation occurs with exponentially decreasing frequency. As a consequence, the insertion of n elements into an empty array takes only O(n) time in all, including the cost of resizing. We say that the insertion operation has O(1) amortized running time.
It is crucial that the array size grow geometrically (for example, by doubling). It might be tempting to grow the array by a fixed increment (e.g., 100 elements at a time), but this results in asymptotically linear rather than constant amortized running time per insertion.
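The difference between geometric and fixed-increment growth can be checked with a small simulation. This is a minimal sketch, not part of the original notes: the names `total_copies`, `double`, and `add_100` are hypothetical, and the only cost counted is the number of elements copied during resizes.

```python
def total_copies(n, grow):
    """Count element copies made while appending n items to an empty array.

    `grow` is a hypothetical policy function mapping the current capacity
    to the new capacity whenever the array is full.
    """
    capacity, size, copies = 0, 0, 0
    for _ in range(n):
        if size == capacity:
            capacity = grow(capacity)
            copies += size  # every existing element moves to the new array
        size += 1
    return copies


def double(capacity):
    # Geometric growth: 0 -> 1 -> 2 -> 4 -> ...
    return max(1, 2 * capacity)


def add_100(capacity):
    # Fixed-increment growth: 0 -> 100 -> 200 -> ...
    return capacity + 100


n = 100_000
print(total_copies(n, double) / n)   # fewer than 2 copies per insertion
print(total_copies(n, add_100) / n)  # about n / 200 copies per insertion
```

With doubling, the copies per insertion stay bounded as n grows; with a fixed increment, they grow in proportion to n.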
Now we turn to a more detailed description of amortized analysis.
Amortized analysis is a worst-case analysis of a sequence of operations. Its goal is to obtain a tighter bound on the overall or average cost per operation in the sequence than is obtained by separately analyzing each operation in the sequence. For instance, when we considered the union and find operations for the disjoint set data abstraction earlier in the semester, we were able to bound the running time of individual operations by O(log n). However, for a sequence of n operations, it is possible to obtain a bound tighter than O(n log n) (although that analysis is more appropriate to 4820 than to this course). Here we will consider a simplified version of the hash table problem above, and show that a sequence of n insert operations has overall time O(n).
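There are three main techniques used for amortized analysis: the aggregate method, which bounds the total running time of a sequence of operations directly; the accounting (or banker's) method, which overcharges cheap operations and uses the surplus to pay for expensive ones; and the physicist's method, which tracks the saved-up time with a potential function on the state of the data structure.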
Consider an extensible array that can store an arbitrary number of integers, like an ArrayList or Vector in Java. These are implemented in terms of ordinary (non-extensible) arrays. Each add operation inserts a new element after all the elements previously inserted. If there are no empty cells left, a new array of double the size is allocated, and all the data from the old array is copied to the corresponding entries in the new array. For instance, consider the following sequence of insertions, starting with an array of length 1:
             +--+
Insert 11    |11|
             +--+
             +--+--+
Insert 12    |11|12|
             +--+--+
             +--+--+--+--+
Insert 13    |11|12|13|  |
             +--+--+--+--+
             +--+--+--+--+
Insert 14    |11|12|13|14|
             +--+--+--+--+
             +--+--+--+--+--+--+--+--+
Insert 15    |11|12|13|14|15|  |  |  |
             +--+--+--+--+--+--+--+--+
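The doubling scheme described above can be sketched in a few lines. This is an illustrative sketch only, written in Python rather than Java, and it is not how ArrayList or Vector are actually implemented. The class name `ExtensibleArray` and its members are hypothetical, and the cost model (1 unit per element written or moved) is an assumption made for the analysis that follows.

```python
class ExtensibleArray:
    """Grow-by-doubling array backed by a fixed-length list (hypothetical sketch)."""

    def __init__(self):
        self._cells = [None]  # backing array, initial length 1
        self._size = 0        # number of cells currently in use
        self.last_cost = 0    # cost of the most recent add

    def add(self, value):
        cost = 0
        if self._size == len(self._cells):
            # No empty cells left: allocate double the length and copy over.
            bigger = [None] * (2 * len(self._cells))
            for k in range(self._size):
                bigger[k] = self._cells[k]
                cost += 1     # one unit per element moved
            self._cells = bigger
        self._cells[self._size] = value
        self._size += 1
        self.last_cost = cost + 1  # plus one unit for the insertion itself

    def capacity(self):
        return len(self._cells)

    def to_list(self):
        return self._cells[:self._size]


arr = ExtensibleArray()
for value in (11, 12, 13, 14, 15):
    arr.add(value)
    print(value, arr.capacity(), arr.last_cost)
```

Running the loop prints the array length and the cost after each of the five insertions pictured above.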
The table is doubled in the second, third, and fifth steps. As each insertion takes O(n) time in the worst case, a simple analysis would yield a bound of $O(n^2)$ time for n insertions. But it is not this bad. Let's analyze a sequence of n operations using the three methods.
Let $c_i$ be the cost of the i-th insertion:

$$c_i = \begin{cases} i & \text{if } i - 1 \text{ is a power of 2} \\ 1 & \text{otherwise} \end{cases}$$
Let's consider the size of the table $s_i$ and the cost $c_i$ for the first few insertions in a sequence:

i     1   2   3   4   5   6   7   8   9  10
s_i   1   2   4   4   8   8   8   8  16  16
c_i   1   2   3   1   5   1   1   1   9   1
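Alternatively, we can see that $c_i = 1 + d_i$, where $d_i$ is the cost of doubling the table size. That is,

$$d_i = \begin{cases} i - 1 & \text{if } i - 1 \text{ is a power of 2} \\ 0 & \text{otherwise} \end{cases}$$

Then summing over the entire sequence, all the 1's sum to O(n), and all the $d_i$ also sum to O(n). That is, for n ≥ 2,

$$\sum_{1 \le i \le n} c_i \le n + \sum_{0 \le j \le m} 2^j$$

where $m = \lfloor \log_2 (n - 1) \rfloor$. The second term is a geometric sum equal to $2^{m+1} - 1 \le 2(n - 1) - 1$, so both terms on the right-hand side of the inequality are O(n), and the total running time of n insertions is O(n).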
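The aggregate bound can be checked numerically. The sketch below simply evaluates the cost function defined above and compares the running total against 3n. The function names `insertion_cost` and `total_cost` are hypothetical, and the constant 3 is the one suggested by the geometric sum, not a figure stated in the text.

```python
def insertion_cost(i):
    """Cost of the i-th insertion (1-based): i if i - 1 is a power of 2, else 1."""
    k = i - 1
    is_power_of_two = k > 0 and (k & (k - 1)) == 0
    return i if is_power_of_two else 1


def total_cost(n):
    """Total cost of the first n insertions."""
    return sum(insertion_cost(i) for i in range(1, n + 1))


for n in (1, 5, 10, 1000):
    print(n, total_cost(n), 3 * n)  # the total never exceeds 3n
```

For n = 10 the total is 25, which matches the sum of the costs in the table above.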
The aggregate method directly seeks a bound on the overall running time of a sequence of operations. In contrast, the accounting method seeks to find a payment of a number of extra time units charged to each individual operation such that the sum of the payments is an upper bound on the total actual cost. Intuitively, one can think of maintaining a bank account. Low-cost operations are charged a little bit more than their true cost, and the surplus is deposited into the bank account for later use. High-cost operations can then be charged less than their true cost, and the deficit is paid for by the savings in the bank account. In that way we spread the cost of high-cost operations over the entire sequence. The charges to each operation must be set large enough that the balance in the bank account never becomes negative, but small enough that no one operation is charged significantly more than its actual cost.
We emphasize that the extra time charged to an operation does not mean that the operation really takes that much time. It is just a method of accounting that makes the analysis easier.
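If we let $c'_i$ be the charge for the i-th operation and $c_i$ be the true cost, then we would like

$$\sum_{1 \le i \le n} c_i \le \sum_{1 \le i \le n} c'_i$$

for all n, which says that the total amount charged for any sequence of n operations is an upper bound on its total true cost.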
Back to the example of the extensible array. Say it costs 1 unit to insert an element and 1 unit to move it when the table is doubled. Clearly a charge of 1 unit per insertion is not enough, because there is nothing left over to pay for the moving. A charge of 2 units per insertion again is not enough, but a charge of 3 appears to be:
i      1   2   3   4   5   6   7   8   9  10
s_i    1   2   4   4   8   8   8   8  16  16
c_i    1   2   3   1   5   1   1   1   9   1
c'_i   3   3   3   3   3   3   3   3   3   3
b_i    2   3   3   5   3   5   7   9   3   5

where $b_i$ is the balance after the i-th insertion, that is, $b_i = b_{i-1} + c'_i - c_i$ with $b_0 = 0$.
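In fact, this is enough in general. Let m refer to the m-th element inserted. The three units charged to m are spent as follows: one unit pays for inserting m into the table; one unit pays for moving m the first time the table is doubled after m is inserted; and one unit is donated to element m − 2^k, where 2^k is the largest power of 2 strictly less than m, to pay for moving that element the first time the table is doubled after m is inserted. The elements inserted since the previous doubling thus pay for moving themselves and the equally many older elements at the next doubling.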
In fact, we can do slightly better, by charging just 1 for the first insertion and then 3 for each insertion after that, because for the first insertion there are no elements to copy. This will yield a zero balance after the first insertion and then a positive one thereafter.
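The charging schemes discussed above can be compared by tracking the bank balance directly. This is a minimal sketch under the same cost model (1 unit to insert, 1 unit per element moved). The function `balances` and its parameters `charge` and `first_charge` are hypothetical names introduced for illustration.

```python
def balances(n, charge=3, first_charge=None):
    """Return the bank balance after each of the first n insertions.

    `charge` is the amount billed per insertion; `first_charge`, if given,
    overrides the amount billed for the very first insertion.
    """
    result, balance = [], 0
    for i in range(1, n + 1):
        k = i - 1
        doubling = k > 0 and (k & (k - 1)) == 0
        actual = i if doubling else 1  # true cost of the i-th insertion
        paid = first_charge if (i == 1 and first_charge is not None) else charge
        balance += paid - actual
        result.append(balance)
    return result


print(balances(10))                  # [2, 3, 3, 5, 3, 5, 7, 9, 3, 5]
print(balances(10, first_charge=1))  # [0, 1, 1, 3, 1, 3, 5, 7, 1, 3]
print(min(balances(10, charge=2)))   # negative: 2 units per insertion is not enough
```

The balance returns to the same small value right after each doubling, so the account neither runs dry nor accumulates an unbounded surplus.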
Above we saw the aggregate method and the banker's method for dealing with extensible arrays. Now let us look at the physicist's method.
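Suppose we can define a potential function Φ on the states of a data structure such that Φ(h0) = 0, where h0 is the initial state, and Φ(ht) ≥ 0 for every state ht that arises during the computation.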
Intuitively, the potential function will keep track of the precharged time at any point in the computation. It measures how much saved-up time is available to pay for expensive operations. It is analogous to the bank balance in the banker's method. But interestingly, it depends only on the current state of the data structure, irrespective of the history of the computation that got it into that state.
We then define the amortized time of an operation as
c + Φ(h') − Φ(h),
where c is the actual cost of the operation and h and h' are the states of the data structure before and after the operation, respectively. Thus the amortized time is the actual time plus the change in potential. Ideally, Φ should be defined so that the amortized time of each operation is small. Thus the change in potential should be positive for low-cost operations and negative for high-cost operations.
Now consider a sequence of n operations taking actual times $c_0, c_1, c_2, \ldots, c_{n-1}$ and producing data structures $h_1, h_2, \ldots, h_n$ starting from $h_0$. The total amortized time is the sum of the individual amortized times:

$$\begin{aligned}
& (c_0 + \Phi(h_1) - \Phi(h_0)) + (c_1 + \Phi(h_2) - \Phi(h_1)) + \cdots + (c_{n-1} + \Phi(h_n) - \Phi(h_{n-1})) \\
&= c_0 + c_1 + \cdots + c_{n-1} + \Phi(h_n) - \Phi(h_0) \\
&= c_0 + c_1 + \cdots + c_{n-1} + \Phi(h_n).
\end{aligned}$$
Therefore the total amortized time for a sequence of operations overestimates the actual time by $\Phi(h_n)$, which by assumption is never negative. Thus the total amortized time is always an upper bound on the actual time.
For dynamically resizable arrays with resizing by doubling, we can use the potential function
Φ(h) = 2n − m,
where n is the current number of elements and m is the current length of the array. If we start with an array of length 0 and allocate an array of length 1 when the first element is added, and thereafter double the array size whenever we need more space, we have $\Phi(h_0) = 0$ and $\Phi(h_t) \ge 0$ for all t. The latter inequality holds because the number of elements is always at least half the size of the array.
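Now we would like to show that adding an element takes amortized constant time. There are two cases. If n < m, the array has room, so the actual cost is 1, n increases by 1, and m does not change. The potential therefore increases by 2, and the amortized time is 1 + 2 = 3. If n = m, the array is full and must be doubled, so the actual cost is n + 1: n units to copy the existing elements and 1 unit to add the new one. The potential drops from 2n − n = n before the operation to 2(n + 1) − 2n = 2 after it, so the amortized time is (n + 1) + (2 − n) = 3. (For the very first insertion, where the array grows from length 0 to length 1, the amortized time is 1 + 1 = 2.)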
In both cases, the amortized time is O(1).
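The amortized cost of each add under the potential function Φ(h) = 2n − m can also be computed mechanically. The sketch below assumes the cost model used earlier (1 unit to insert, 1 unit per element copied) and the growth rule from the text (length 0, then 1, then doubling). The names `phi` and `amortized_costs` are hypothetical.

```python
def phi(n, m):
    """Potential of an array holding n elements in a backing array of length m."""
    return 2 * n - m


def amortized_costs(count):
    """Return actual cost + change in potential for each of `count` adds."""
    n, m = 0, 0  # start with an empty array of length 0
    costs = []
    for _ in range(count):
        before = phi(n, m)
        actual = 1               # one unit to write the new element
        if n == m:               # array is full: grow it first
            actual += n          # one unit per existing element copied
            m = max(1, 2 * m)    # length 0 -> 1, otherwise double
        n += 1
        after = phi(n, m)
        assert after >= 0        # the potential must never go negative
        costs.append(actual + after - before)
    return costs


print(amortized_costs(10))  # [2, 3, 3, 3, 3, 3, 3, 3, 3, 3]
```

Every add has amortized cost at most 3, whether or not it triggers a resize, which is what the constant amortized bound claims.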
The key to amortized analysis with the physicist's method is to define the right potential function. The potential function needs to save up enough time to be used later when it is needed. But it cannot save so much time that it causes the amortized time of the current operation to be too high.