CS312 Lecture 24: Hash Tables

Data Sets

So far we have used several data structures to represent sets of values, such as lists and trees. The following table shows the running time of the basic operations on these structures:

Set type		Insert		Delete		Member
Linked list	O(1)		O(n)		O(n)
Red-black tree	O(log n)	O(log n)	O(log n)

We would like to improve on these results. To that end, we introduce a structure that supports all of the above operations in O(1) expected time.

Hash Tables

The basic idea is to define a map as a set of (key, value) pairs. A map is nothing more than a partial function from keys to values. When the keys are strings, we call the map a dictionary.

A mutable map is a map in which an element (a key-value pair) can be removed or changed after it has been inserted. A hash table is an efficient implementation of a mutable map.

The O(1) running time is obtained by exploiting the fact that arrays give O(1) access to any position. We define a bucket as a slot of this array in which we can store one element of the map.

Since we store the elements in an array, we need to compute an array index from the key of each element. This index determines where the element is stored in the array. The function that computes these indexes is called a hash function.

What if the hash function returns the same index for two different keys? This situation is called a conflict (or collision). It generally arises in one of the following situations:

The table is too small compared to the size of the set:
Let m be the number of buckets in the hash table and n the number of elements in the set. We define the load factor lf = n/m as the average number of elements per bucket. A large load factor makes conflicts more likely, and once n > m they are unavoidable.
The hash function is badly behaved:
A well-behaved hash function ideally spreads keys over the buckets as if uniformly at random. For instance, if the keys are strings, using the length of the string as the hash is a bad choice: many different strings share the same length, so it creates many conflicts.
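The two causes of conflicts can be seen in a small experiment. The following Python sketch is illustrative only: the names length_hash, poly_hash and bucket_sizes, and the choice of a base-31 polynomial hash, are assumptions made for this example and are not part of the lecture.

```python
from collections import Counter


def length_hash(key: str) -> int:
    """The badly behaved hash from the text: the length of the string."""
    return len(key)


def poly_hash(key: str) -> int:
    """A simple polynomial string hash (an illustrative choice, not from the lecture)."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) % 2**32
    return h


def bucket_sizes(keys, hash_fn, m):
    """Return a list whose i-th entry is the number of keys that land in bucket i."""
    counts = Counter(hash_fn(k) % m for k in keys)
    return [counts.get(i, 0) for i in range(m)]


keys = ["cat", "dog", "cow", "pig", "hen", "owl", "ant", "bee"]
m = 8                          # number of buckets
load_factor = len(keys) / m    # lf = n / m

length_buckets = bucket_sizes(keys, length_hash, m)
poly_buckets = bucket_sizes(keys, poly_hash, m)
```

Every key here has length 3, so length_hash sends all eight keys to bucket 3 even though the load factor is only 1. The polynomial hash depends on the characters themselves and spreads the same keys over more than one bucket.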

There are many ways to resolve conflicts. A simple approach is to store a list of elements in each bucket. However, if the load factor is too high, these lists grow long and the structure starts to behave like a linked list, losing the performance we were looking for.
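The list-per-bucket approach can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the course's implementation: the class name ChainedHashTable, its method names, and the use of Python's built-in hash are all choices made for this example.

```python
class ChainedHashTable:
    """A mutable map that resolves conflicts by keeping a list in each bucket."""

    def __init__(self, num_buckets=8):
        # One (initially empty) list of (key, value) pairs per bucket.
        self.buckets = [[] for _ in range(num_buckets)]
        self.size = 0

    def _index(self, key):
        # Assumption: Python's built-in hash stands in for the hash function.
        return hash(key) % len(self.buckets)

    def insert(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite its value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def member(self, key):
        return any(k == key for k, _ in self.buckets[self._index(key)])

    def lookup(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

    def delete(self, key):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.size -= 1
                return
        raise KeyError(key)

    def load_factor(self):
        return self.size / len(self.buckets)


table = ChainedHashTable(num_buckets=8)
table.insert(1, "one")
table.insert(9, "nine")    # 1 and 9 agree modulo 8, so they share a bucket
```

Each operation scans only the list in one bucket, so its cost is proportional to that list's length. With a well-behaved hash function the expected length is the load factor, which is why keeping the load factor bounded keeps the operations fast.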

Some hash functions are used frequently. For instance, modular hashing takes an integer key k and computes k mod m for a modulus m = 2^p. Multiplicative hashing takes an integer key k and computes (k*m / 2^p) mod 2^q, using integer division, for an appropriate choice of p and q; in this formula m acts as a fixed multiplier. In general these functions are well behaved.
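Both formulas are short enough to write out directly. The Python sketch below is an illustration under assumptions: the multiplier is named a here to avoid confusion with the table size, and its value (2654435769, the 32-bit integer closest to 2^32 divided by the golden ratio, a common choice) as well as p = 29 and q = 3 are picked for this example, not taken from the lecture.

```python
def modular_hash(k: int, p: int) -> int:
    """k mod m with m = 2**p, which keeps the p low-order bits of k."""
    return k % (2 ** p)


def multiplicative_hash(k: int, a: int, p: int, q: int) -> int:
    """floor(k * a / 2**p) mod 2**q, which keeps q bits of the product k * a,
    starting at bit p."""
    return ((k * a) >> p) % (2 ** q)


A = 2654435769   # assumed multiplier, chosen for this sketch
P, Q = 29, 3     # assumed parameters: 2**Q = 8 buckets

keys = [8, 16, 24, 32, 40]
modular_indexes = [modular_hash(k, Q) for k in keys]
multiplicative_indexes = [multiplicative_hash(k, A, P, Q) for k in keys]
```

The keys in this example are all multiples of 8, so modular hashing with m = 2^3 maps every one of them to bucket 0. Multiplicative hashing mixes the higher bits of the key into the index and sends these keys to more than one bucket.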

Practical use

An immediate use of a hash table is to represent the binding environment in our evaluator. We could implement the environment as a hash table whose keys are names and whose values are the values bound to those names.

This works for a top-level environment, but how do we model nested environments? For instance, suppose we have

val x = 2;
let val x = 4 in x end;

We could do this by keeping a list of values in each bucket, inserting and deleting values as we enter and exit the scope of local environments. The most recently inserted value for a name shadows the earlier ones until it is deleted.
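One way to picture this scheme is the following Python sketch. It is a simplified illustration, not the evaluator's actual code: the class Environment and its methods are hypothetical names, Python's built-in dict plays the role of the hash table, and a list of values is kept per name rather than per bucket.

```python
class Environment:
    """Nested scopes on top of a hash table: each name maps to a stack of values."""

    def __init__(self):
        self.table = {}      # name -> list of values, innermost binding last
        self.scopes = [[]]   # names bound in each open scope, top level first

    def enter_scope(self):
        self.scopes.append([])

    def bind(self, name, value):
        # Insert the new value so that it shadows any outer binding.
        self.table.setdefault(name, []).append(value)
        self.scopes[-1].append(name)

    def lookup(self, name):
        stack = self.table.get(name)
        if not stack:
            raise KeyError(name)
        return stack[-1]     # the innermost binding wins

    def exit_scope(self):
        if len(self.scopes) == 1:
            raise RuntimeError("cannot exit the top-level scope")
        # Delete the bindings made in this scope, most recent first.
        for name in reversed(self.scopes.pop()):
            stack = self.table[name]
            stack.pop()
            if not stack:
                del self.table[name]


env = Environment()
env.bind("x", 2)                 # val x = 2
env.enter_scope()
env.bind("x", 4)                 # the local binding of x
inner = env.lookup("x")          # value of x inside the local scope
env.exit_scope()
outer = env.lookup("x")          # value of x after leaving the local scope
```

Inside the local scope the lookup sees 4, and after the scope is exited it sees 2 again, which mirrors the example above.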

In the next lecture we will look closely at several implementations of this idea.

CS312 © 2002 Cornell University Computer Science