Hash Tables—Theory and Practice

The first time I heard about hash tables was after taking a compilers course during my BSc. The truth is, I was not able to understand and appreciate their usefulness fully back then. Now that I know more about hash tables, I decided to write about them so others will see their importance as well.

Hash tables can be implemented in any programming language, including Awk. However, the choice of programming language is not the most important thing compared to other critical choices. Hash tables are used in compilers, databases, caching, associative arrays and so on. Hash tables are one of the most important data structures in computer science.

The Problem

The problem that will serve as an example for this article is finding out how many words from one text file appear in another text file. All programs in this article use a text file (Pride and Prejudice) for populating the hash table. Another text file (The Adventures of Tom Sawyer) will be used for testing the performance of the hash table. You can download both text files from Project Gutenberg.

The following output shows how many words each file contains:


$ wc AofTS.txt
    9206   73845  421884 AofTS.txt
$ wc PandP.txt
   13426  124589  717573 PandP.txt

As you can see, both text files are relatively large, which is good for benchmarking. Your real-life hash tables may not be as big. In order to remove various control characters, as well as punctuation marks and numbers, both text files were processed further:


$ strings PandP.txt > temp.INPUT
$ awk '{for (i = 1; i <= NF; i++) print $i}' temp.INPUT > new.INPUT
$ cat new.INPUT |  tr -cd '![a-zA-Z]\n' > INPUT
$ strings AofTS.txt > temp.CHECK
$ awk '{for (i = 1; i <= NF; i++) print $i}' temp.CHECK > new.CHECK
$ cat new.CHECK |  tr -cd '![a-zA-Z]\n' > empty.CHECK
$ sed '/!/d' empty.CHECK > temp.CHECK
$ sed '/^\s*$/d' temp.CHECK > CHECK

The reason for simplifying both files is that some control characters made the C programs crash. As the purpose of this article is to showcase hash tables, I decided to simplify the input instead of spending time trying to figure out the problem and modifying the C code.

After constructing the hash table using the first file (INPUT) as input, the second one (CHECK) will be used for testing the hash table. This will be the actual use of the hash table.

Theory

Let me start with the definition of a hash table. A hash table is a data structure that stores one or more key and value pairs. A hash table can store keys of any type.

A hash table uses a hash function to compute an index into an array of buckets or slots, from which the correct value can be found. Ideally, the hash function will assign each key to a unique bucket. Unfortunately, this rarely happens. In practice, more than one of the keys will hash to the same bucket. The most important characteristic of a hash table is the number of buckets. The number of buckets is used by the hashing function. The second most important characteristic is the hash function used. The most crucial feature of the hash function is that it should produce a uniform distribution of the hash values.

You can say that the search time is now O(n/k), where n is the number of keys, and k is the size of the hash array. Although the improvement looks small, you should realize that for a hash array with 20 buckets, the search time is now 20 times smaller.

It is important for the hash function to behave consistently and output the same hash value for identical keys. A collision happens when two keys are hashing to the same index—that's not an unusual situation. There are many ways to deal with a collision.

A good solution is to use separate chaining. The hash table is an array of pointers, each one pointing to the next key with the same hash value. When a collision occurs, the key will be inserted in constant time to the head of a linked list. The problem now is that when you have to search a hash value for a given key, you will have to search the whole linked list for this key. In the worst case, you might need to traverse the entire linked list—that's the main reason the linked list should be moderately small, giving the requirement for a large number of buckets.

As you can imagine, resolving collisions involves some kind of linear search; therefore, you need a hash function that minimizes collisions as much as possible. Other techniques for resolving collisions include open addressing, Robin Hood hashing and 2-choice hashing.

Hash tables are good at the following:

  • In a hash table with the "correct" number of buckets, the average cost for each lookup is independent of the number of elements stored in the table.

  • Hash tables are particularly efficient when the maximum number of entries can be predicted in advance so that the bucket array can be allocated once with the optimum size and never resized.

  • If the set of key-value pairs is fixed and known ahead of time (so insertions and deletions are not allowed), you can reduce the average lookup cost by a careful choice of the hash function, bucket table size and internal data structures.

Hash tables also have some disadvantages:

  • They are not good at keeping sorted data. It is not efficient to use a hash table if you want your data sorted.

  • Hash tables are not effective when the number of entries is very small, because despite the fact that operations on a hash table take constant time on average, the cost of a good hash function can be significantly higher than the inner loop of the lookup algorithm for a sequential list or search tree.

  • For certain string processing applications, such as spell-checking, hash tables may be less efficient than trees or finite automata.

  • Although the average cost per operation is constant and fairly small, the cost of a single operation may be fairly high. In particular, if the hash table uses dynamic resizing, inserting or deleting a key may, once in a while, take time proportional to the number of entries. This can be a serious drawback in applications where you want to get results fast.

  • Hash tables become quite inefficient when there are many collisions.

As I'm sure you understand, not every problem can be solved equally well with the help of a hash table. You always should consider and examine all your options before deciding what to use.

Figure 1 shows a simple hash table with keys and values shown. The hash function is the modulo 10 function; therefore, ten buckets are needed because only ten results can come from a modulo 10 calculation. Having only ten buckets is not considered very good, especially if the number of values grows large, but it is fine for illustrative purposes.

Figure 1. A Simple Hash Table

To summarize, a hash table should follow these principles:

  • Do not have too many buckets, just as many as needed.

  • It is good for the hash function to take into account as much information provided by the key as possible. This is not a trivial task.

  • The hash function must be able to hash similar keys to different hash values.

  • Each bucket should have the same number of keys or at least as close to being equal as possible (this is a very desirable property).

  • Following some principles will make collisions less likely. First, you should use a prime number of buckets. Second, the bigger the size of the array, the smaller the probability of collisions. Finally, you should make sure that the hash function is smart enough to distribute its return values as evenly as possible.

______________________