The claim that hash tables give O(1) performance
is based on the assumption that n = O(m), that is,
that the number of elements never grows more than proportionally
to the number of buckets. If a hash table has many elements inserted into it, n
may become much larger than m and violate
this assumption. The effect will be that the buckets will become large
enough that their bad asymptotic performance will show through. The solution to
this problem is relatively simple: when the load factor exceeds some constant
threshold, the array must be increased in size and all
the elements rehashed into the new buckets using an appropriate hash
function. Each resizing
operation therefore takes O(n) time,
where n is the number of elements in the hash table being
resized. As a result, the O(1) performance of
the hash table operations no longer holds in the case of add: its
worst-case performance is O(n).
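To make the policy concrete, here is a minimal sketch in Java of a chained hash table that resizes in this way. The class name SimpleHashSet and the threshold of 2.0 are illustrative assumptions, not a definitive implementation.

```java
import java.util.LinkedList;

// A sketch of load-factor-triggered resizing for a chained hash table.
class SimpleHashSet<E> {
    private static final double THRESHOLD = 2.0; // assumed load factor limit
    private LinkedList<E>[] buckets;
    private int size = 0;

    @SuppressWarnings("unchecked")
    SimpleHashSet() {
        buckets = (LinkedList<E>[]) new LinkedList[1]; // start with one bucket
    }

    // Map an element to a bucket index in an array of m buckets.
    private int indexFor(Object e, int m) {
        return Math.floorMod(e.hashCode(), m);
    }

    public void add(E e) {
        if ((double) (size + 1) / buckets.length > THRESHOLD) {
            resize(); // double the array and rehash every element: O(n)
        }
        int i = indexFor(e, buckets.length);
        if (buckets[i] == null) buckets[i] = new LinkedList<>();
        if (!buckets[i].contains(e)) {
            buckets[i].add(e);
            size++;
        }
    }

    @SuppressWarnings("unchecked")
    private void resize() {
        LinkedList<E>[] old = buckets;
        buckets = (LinkedList<E>[]) new LinkedList[old.length * 2];
        for (LinkedList<E> bucket : old) {
            if (bucket == null) continue;
            for (E e : bucket) { // each element is hashed again
                int i = indexFor(e, buckets.length);
                if (buckets[i] == null) buckets[i] = new LinkedList<>();
                buckets[i].add(e);
            }
        }
    }
}
```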
This isn't really as much of a problem as it might sound. If the bucket array is doubled in size every time it is needed, then the insertion of n elements in a row into an empty array takes only O(n) time, perhaps surprisingly. We say that add has O(1) amortized run time because the time required to insert an element is O(1) on the average even though some elements trigger a lengthy rehashing of all the elements of the hash table.
To see why this is, suppose we insert n elements into a hash table while doubling the number of buckets when the load factor crosses some threshold. A given element may be rehashed many times, but the total time to insert the n elements is still O(n). Consider inserting n = 2^k elements, and suppose that we hit the worst case, where the resizing occurs on the very last element. Since the bucket array is doubled at each rehashing, the rehashes must all occur at powers of two. The final rehash rehashes all n elements, the previous one rehashes n/2 elements, the one before that n/4 elements, and so on. So the total number of hashes computed is n hashes for the actual insertions of the elements, plus n + n/2 + n/4 + n/8 + ... = n(1 + 1/2 + 1/4 + 1/8 + ...) = 2n hashes for the rehashing, for a total of 3n hashing operations.
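The 3n figure is easy to check empirically. The following program is a sketch that counts hashing operations under two simplifying assumptions: each hash costs one unit, and the table doubles when the load factor reaches 1, which produces exactly the worst case just described.

```java
// Count hashing operations for n = 2^k insertions with array doubling.
public class HashCount {
    public static void main(String[] args) {
        int k = 20;
        long n = 1L << k;           // n = 2^k elements
        long capacity = 1, size = 0, hashes = 0;
        for (long i = 0; i < n; i++) {
            hashes++;               // the insertion's own hash
            size++;
            if (size == capacity) { // load factor reaches 1: double
                hashes += size;     // every element is rehashed
                capacity *= 2;
            }
        }
        // Prints 3n - 1: just under three hashes per element.
        System.out.println("n = " + n + ", hashes = " + hashes
                + ", per element = " + (double) hashes / n);
    }
}
```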
No matter how many elements
we add to the hash table, there will be at most three hashing operations performed
per element added. Therefore, add
takes amortized O(1)
time even if we start out with a bucket array of one element!
Another way to think about this is that the true cost of performing an add
is about triple the cost observed on a typical call to add
. The remaining 2/3 of
the cost is paid as the array is resized later. It is useful to think about this
in monetary terms. Suppose that a hashing operation costs $1 (that is, 1 unit of
time). Then a call to add
costs $3, but only $1 is required up
front for the initial hash. The remaining $2 is placed into the hash table
element just added and used to pay for future rehashing. Assume each time the
array is resized, all of the remaining money gets used up. At the next resizing,
there are n elements and n/2
of them have $2 on them; this is exactly enough to pay for the resizing. This is
really an argument by induction, so we'd better examine the base case: when
the array is resized from one bucket to two, there is $2 available, which is $1
more than needed to pay for the resizing. That extra $1 will stick around
indefinitely, so inserting n elements
starting from a 1-element array takes at most 3n-1
element hashes, which is O(n) time.
This kind of analysis, in which we precharge an operation for some time that
will be taken later, typifies amortized analysis of run time.
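The bookkeeping can also be simulated directly. In this sketch (the class name is illustrative), each call to add deposits $3 and every hash costs $1; resizing when the table is exactly full is an assumption chosen to match the worst case above. The bank never goes negative, and exactly $1 is left over at the end, as the argument predicts.

```java
// Simulate the accounting scheme: $3 charged per add, $1 per hash.
public class BankersMethod {
    public static void main(String[] args) {
        long bank = 0, capacity = 1, size = 0;
        long n = 1L << 20;
        for (long i = 0; i < n; i++) {
            bank += 3 - 1;          // deposit $3, spend $1 on the insert's hash
            size++;
            if (size == capacity) { // table full: double and rehash
                bank -= size;       // $1 per element rehashed
                capacity *= 2;
            }
            if (bank < 0) throw new AssertionError("credits ran out");
        }
        System.out.println("leftover credit: $" + bank); // prints 1
    }
}
```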
Notice that it was crucial that the array size grows geometrically (doubling). It is tempting to grow the array by a fixed increment (e.g., 100 elements at a time), but this causes each element to be rehashed O(n) times on average, resulting in O(n^2) asymptotic insertion time!
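The difference is easy to see by counting rehash operations under each growth strategy; this sketch reuses the fixed increment of 100 from the example above.

```java
// Compare total rehash operations: fixed-increment growth vs. doubling.
public class GrowthComparison {
    static long fixedIncrement(long n, long step) {
        long rehashes = 0, capacity = step, size = 0;
        for (long i = 0; i < n; i++) {
            size++;
            if (size == capacity) { rehashes += size; capacity += step; }
        }
        return rehashes;
    }

    static long doubling(long n) {
        long rehashes = 0, capacity = 1, size = 0;
        for (long i = 0; i < n; i++) {
            size++;
            if (size == capacity) { rehashes += size; capacity *= 2; }
        }
        return rehashes;
    }

    public static void main(String[] args) {
        long n = 1_000_000;
        System.out.println("grow by 100: " + fixedIncrement(n, 100)); // ~ n^2/200
        System.out.println("doubling:    " + doubling(n));            // < 2n
    }
}
```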
Any fixed threshold load factor is equally good from the standpoint of asymptotic run time, but a good rule of thumb is that rehashing should take place at α = 3. One might think that α = 1 is the right place to rehash, but in fact the best performance is seen (for buckets implemented as linked lists) when load factors are in the 1-2 range. When α < 1, the bucket array contains many empty entries, resulting in suboptimal performance of the computer's memory system. There are many other tricks that are important for getting the very best performance out of hash tables.