The claim that hash tables have O(1) expected performance for lookup and insert is based on the assumption that the number of elements stored in the table is comparable to the number of buckets. If a hash table has many more elements than buckets, the number of elements stored at each bucket becomes large. For instance, with a constant number of buckets and O(n) elements, the lookup time is O(n), not O(1).
The solution to this problem is to increase the size of the table when the number of elements in the table gets too large compared to the size of the table. If we let the table grow in proportion to the number of elements, the expected number of elements per bucket stays constant and lookups remain O(1) in expectation. The catch is that resizing requires allocating a new bucket array and rehashing every element into it, which takes time linear in the number of elements.
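To make this concrete, here is a minimal sketch of a chained hash table that doubles its bucket array whenever the load factor would exceed 1. The class and method names are illustrative, not from any particular library; note how the resize step rehashes every element.

```java
import java.util.LinkedList;

// Illustrative chained hash table that doubles its bucket array when the
// load factor (elements per bucket) would exceed 1.
class SimpleHashSet {
    private LinkedList<Integer>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    SimpleHashSet() { buckets = (LinkedList<Integer>[]) new LinkedList[4]; }

    void insert(int x) {
        if (contains(x)) return;
        if (size + 1 > buckets.length) resize();   // keep load factor <= 1
        int b = Math.floorMod(Integer.hashCode(x), buckets.length);
        if (buckets[b] == null) buckets[b] = new LinkedList<>();
        buckets[b].add(x);
        size++;
    }

    boolean contains(int x) {
        int b = Math.floorMod(Integer.hashCode(x), buckets.length);
        return buckets[b] != null && buckets[b].contains(x);
    }

    // Doubling the bucket array costs O(n): every element is rehashed.
    @SuppressWarnings("unchecked")
    private void resize() {
        LinkedList<Integer>[] old = buckets;
        buckets = (LinkedList<Integer>[]) new LinkedList[2 * old.length];
        for (LinkedList<Integer> chain : old) {
            if (chain == null) continue;
            for (int x : chain) {
                int b = Math.floorMod(Integer.hashCode(x), buckets.length);
                if (buckets[b] == null) buckets[b] = new LinkedList<>();
                buckets[b].add(x);
            }
        }
    }

    int size() { return size; }
}
```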
The linear running time of a resizing operation isn't as much of a
problem as it might sound (although it can be an issue for some
real-time computing systems). If the table is doubled in size every time
it is needed, then the resizing operation occurs with exponentially
decreasing frequency. As a consequence, the insertion
of n elements into an empty array
takes only O(n) time in all, including
the cost of resizing. We say that the insertion operation has O(1) amortized running time.
It is crucial that the array size grow geometrically (e.g., by doubling). It might be tempting to grow the array by a fixed increment (say, 100 elements at a time), but that results in linear, rather than constant, amortized running time per insertion.
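The difference between the two growth policies can be seen by counting how many elements are copied during resizes while appending n items. The sketch below is illustrative (the names and the fixed increment of 100 are hypothetical): doubling copies O(n) elements in total, while a fixed increment copies on the order of n²/increment.

```java
// Count total elements copied during resizes under two growth policies.
class GrowthCost {
    // Doubling: copies form a geometric series 1 + 2 + 4 + ... < 2n.
    static long copiesDoubling(int n) {
        long copies = 0;
        int cap = 1;
        for (int i = 0; i < n; i++) {
            if (i == cap) { copies += cap; cap *= 2; }  // copy all i elements
        }
        return copies;
    }

    // Fixed increment: copies form an arithmetic series, Theta(n^2 / inc).
    static long copiesFixedIncrement(int n, int inc) {
        long copies = 0;
        int cap = inc;
        for (int i = 0; i < n; i++) {
            if (i == cap) { copies += cap; cap += inc; }
        }
        return copies;
    }
}
```

For a million appends, doubling copies fewer than two million elements, while growing by 100 at a time copies billions.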
Now we turn to a more detailed description of amortized analysis.
Amortized analysis is a worst-case analysis of a sequence of operations, used to obtain a tighter bound on the overall cost (or the average cost per operation) of the sequence than would be obtained by analyzing each operation separately. For instance, when we considered the union and find operations for the disjoint-set data abstraction earlier in the semester, we were able to bound the running time of individual operations by O(log n). However, for a sequence of n operations it is possible to obtain a bound tighter than O(n log n) (although that analysis is more appropriate to 4820 than to this course). Here we will consider a simplified version of the hash table problem above and show that a sequence of n insert operations takes O(n) time overall.
There are three techniques used for amortized analysis: the aggregate method, the accounting (or banker's) method, and the potential (or physicist's) method.
Consider a resizable array that can store an arbitrary number of integers,
like an ArrayList
or Vector
in Java. These are
implemented in terms of ordinary (non-resizable) arrays. Each add
operation inserts a new element after all the elements previously inserted.
If there are no empty cells left, a new array of double the size is allocated,
and all the data from the old array is copied to the corresponding entries
in the new array. For instance, consider the following sequence of insertions,
starting with an array of length 1:
Insert 11   +--+
            |11|
            +--+

Insert 12   +--+--+
            |11|12|
            +--+--+

Insert 13   +--+--+--+--+
            |11|12|13|  |
            +--+--+--+--+

Insert 14   +--+--+--+--+
            |11|12|13|14|
            +--+--+--+--+

Insert 15   +--+--+--+--+--+--+--+--+
            |11|12|13|14|15|  |  |  |
            +--+--+--+--+--+--+--+--+
The table is doubled in the second, third, and fifth steps. As each insertion takes O(n) time in the worst case, a simple analysis would yield a bound of O(n²) time for n insertions. But it is not that bad. Let's analyze a sequence of n operations using the three methods.
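Before analyzing it, here is what such a resizable array might look like. This is an illustrative sketch, not the actual java.util.ArrayList implementation; the `copies` counter is added only so we can observe the resizing cost.

```java
// Minimal resizable int array with doubling, in the spirit of ArrayList.
class DynArray {
    private int[] data = new int[1];
    private int n = 0;       // number of elements stored
    long copies = 0;         // elements copied during resizes (for analysis)

    void add(int x) {
        if (n == data.length) {                 // full: double the capacity
            int[] bigger = new int[2 * data.length];
            System.arraycopy(data, 0, bigger, 0, n);
            copies += n;
            data = bigger;
        }
        data[n++] = x;
    }

    int get(int i) { return data[i]; }
    int size() { return n; }
}
```

After 1000 adds starting from capacity 1, the total number of copied elements is 1 + 2 + 4 + ... + 512 = 1023, well under 2n.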
Let c_i be the cost of the i-th insertion:

    c_i = i   if i−1 is an exact power of 2
          1   otherwise
Let's consider the size s_i of the table and the cost c_i for the first few insertions in a sequence:
    i     1   2   3   4   5   6   7   8   9   10
    s_i   1   2   4   4   8   8   8   8   16  16
    c_i   1   2   3   1   5   1   1   1   9   1
Alternatively, we can see that c_i = 1 + d_i, where d_i is the cost of doubling the table size. That is,

    d_i = i−1   if i−1 is an exact power of 2
          0     otherwise

Then, summing over the entire sequence, all the 1's sum to O(n), and all the d_i also sum to O(n). That is,
    Σ_{1≤i≤n} c_i  ≤  n + Σ_{0≤j≤m} 2^j,

where m = ⌊log(n−1)⌋. The second sum is a geometric series totaling 2^(m+1) − 1 < 2n, so both terms on the right-hand side are O(n), and the total running time of n insertions is O(n).
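We can sanity-check the aggregate bound by summing the costs c_i directly. The helper below is illustrative; it just applies the cost formula above and lets us confirm that the total stays below 3n.

```java
// Aggregate method check: sum the per-insertion costs c_i and confirm
// the total is linear in n (at most 3n for this cost model).
class AggregateCheck {
    static boolean isPowerOfTwo(int k) {
        return k > 0 && (k & (k - 1)) == 0;
    }

    // c_i = i if i-1 is an exact power of 2, else 1.
    static long totalCost(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += isPowerOfTwo(i - 1) ? i : 1;
        }
        return total;
    }
}
```

For n = 10 this reproduces the table above: 1 + 2 + 3 + 1 + 5 + 1 + 1 + 1 + 9 + 1 = 25 ≤ 30.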
In contrast with the aggregate method, which directly seeks a bound on the overall running time of an operation sequence, the accounting method seeks to find a payment charged to each individual operation such that the sum of the payments is at least as large as the total actual cost. Intuitively, one can think of maintaining a bank account. Low-cost operations are charged just a little bit more than the true cost of the operation, and the surplus is deposited into the bank account for later use. High-cost operations are paid for by the savings in the bank account. The charges must be set just large enough that the balance always remains positive.
If we let c'_i be the charge for the i-th operation and c_i be the true cost, then we would like

    Σ_{1≤i≤n} c_i  ≤  Σ_{1≤i≤n} c'_i
for all n.
Back to the example of the dynamic table. Say it costs 1 unit to insert an element and 1 unit to move it when the table is doubled. Clearly a charge of 1 unit per insertion is not enough, because there is nothing left over to pay for the moving. A charge of 2 per insertion again is not enough, but a charge of 3 appears to be:
    i      1   2   3   4   5   6   7   8   9   10
    s_i    1   2   4   4   8   8   8   8   16  16
    c_i    1   2   3   1   5   1   1   1   9   1
    c'_i   3   3   3   3   3   3   3   3   3   3
    b_i    2   3   3   5   3   5   7   9   3   5
where b_i is the balance after the i-th insertion.
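The accounting argument can be checked mechanically: charge 3 per insertion, pay the true cost c_i out of the running balance, and verify the balance never goes negative. A small illustrative sketch:

```java
// Accounting (banker's) method check: with a charge of 3 per insertion,
// the running balance stays nonnegative over any sequence of insertions.
class BankerCheck {
    static boolean isPowerOfTwo(int k) {
        return k > 0 && (k & (k - 1)) == 0;
    }

    // Returns the minimum balance seen over n insertions with charge 3.
    static long minBalance(int n) {
        long balance = 0;
        long min = Long.MAX_VALUE;
        for (int i = 1; i <= n; i++) {
            long cost = isPowerOfTwo(i - 1) ? i : 1;  // true cost c_i
            balance += 3 - cost;                       // deposit 3, pay c_i
            min = Math.min(min, balance);
        }
        return min;
    }
}
```

Right after each doubling the balance dips to its lowest point, but it never drops below zero.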
In fact, this is enough in general. Let m refer to the m-th element inserted. The three units charged for inserting m are spent as follows:

- one unit pays for inserting m itself;
- one unit pays for moving m the next time the table is doubled; and
- one unit pays for moving some element in the first half of the table the next time the table is doubled (an element that was moved by an earlier doubling and has already spent its own charge).
In fact, we can do slightly better, by charging just 1 for the first insertion and then 3 for each insertion after that, because for the first insertion there are no elements to copy. This will yield a zero balance after the first insertion and then a positive one thereafter.
Above we saw the aggregate method and the banker's method for dealing with dynamically resizable arrays. Here, let us have a look at the physicist's method on that same problem.
Suppose we can define a potential function Φ that maps each state h of the data structure to a nonnegative real number Φ(h), with Φ(h0) = 0 for the initial state h0.
Intuitively, the potential function will keep track of the precharged time at any point in the computation. It measures how much saved-up time is available to pay for expensive operations. It is analogous to the bank balance in the banker's method. But interestingly, it depends only on the current state of the data structure, irrespective of the history of the computation that got it into that state.
We then define the amortized time of an operation as
c + Φ(h') − Φ(h),
where c is the actual cost of the operation and h and h' are the states of the data structure before and after the operation, respectively. Thus the amortized time is the actual time plus the change in potential. Ideally, Φ should be defined so that the amortized time of each operation is small.
Now consider a sequence of n operations taking actual times c_0, c_1, c_2, ..., c_{n−1} and producing data structures h_1, h_2, ..., h_n starting from h_0. The total amortized time is the sum of the individual amortized times:

    c_0 + (Φ(h_1) − Φ(h_0)) + c_1 + (Φ(h_2) − Φ(h_1)) + ... + c_{n−1} + (Φ(h_n) − Φ(h_{n−1}))
    = c_0 + c_1 + ... + c_{n−1} + Φ(h_n) − Φ(h_0)
    = c_0 + c_1 + ... + c_{n−1} + Φ(h_n).
Therefore the total amortized time overestimates the actual time by Φ(h_n), which by assumption is nonnegative. Thus the total amortized time is always an upper bound on the actual time.
For dynamically resizable arrays with resizing by doubling, we use the potential function
Φ(h) = 2n − m,
where n is the current number of elements and m is the current length of the array. If we start with an array of length 0 and allocate an array of length 1 when the first element is added, and thereafter double the array size whenever we need more space, we have Φ(h0) = 0 and Φ(ht) ≥ 0 for all t. The latter inequality holds because the number of elements is always at least half the size of the array.
Now we would like to show that adding an element takes amortized constant time. There are two cases.

If there is room for the new element (n < m), the actual cost is 1. The number of elements n increases by 1 and m is unchanged, so the potential increases by 2, and the amortized time is 1 + 2 = 3.

If the array is full (n = m), the insertion triggers a doubling, so the actual cost is n + 1: copying the n existing elements plus writing the new one. The potential before the operation is 2n − n = n, and afterward it is 2(n+1) − 2n = 2, a change of 2 − n. The amortized time is (n + 1) + (2 − n) = 3.
In both cases, the amortized time is O(1).
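The case analysis can be verified by simulation. The sketch below is illustrative and assumes the start-from-length-0 convention above; it tracks Φ = 2n − m across a sequence of insertions and records the largest amortized cost observed.

```java
// Physicist's method check: track Phi(h) = 2*size - capacity and verify
// that amortized cost = actual cost + (change in Phi) never exceeds 3.
class PotentialCheck {
    // Returns the maximum amortized cost over n insertions into an
    // initially empty (length-0) array that grows by doubling.
    static long maxAmortized(int n) {
        int size = 0, cap = 0;   // empty array of length 0
        long phi = 0, worst = 0;
        for (int i = 1; i <= n; i++) {
            long cost;
            if (size == cap) {                    // allocate or double
                cap = (cap == 0) ? 1 : 2 * cap;
                cost = size + 1;                  // copy `size` elements + insert
            } else {
                cost = 1;                         // just write the element
            }
            size++;
            long newPhi = 2L * size - cap;
            worst = Math.max(worst, cost + (newPhi - phi));
            phi = newPhi;
        }
        return worst;
    }
}
```

The very first insertion has amortized cost 2 (there is nothing to copy); every subsequent insertion has amortized cost exactly 3.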
The key to amortized analysis is to define the right potential function. The potential function needs to save up enough time to be used later when it is needed. But it cannot save so much time that it causes the amortized time of the current operation to be too high.