Fast and Powerful Hashing Using Tabulation

Fast and Powerful Hashing Using Tabulation, illustration

Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here, we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees.

Simple tabulation hashing dates back to Zobrist (A new hashing method with application for game playing. Technical Report 88, Computer Sciences Department, University of Wisconsin). Keys are viewed as consisting of c characters and we have precomputed character tables h₁, . . ., h_c mapping characters to random hash values. A key x = (x₁, . . ., x_c) is hashed to h₁[x₁] ⊕ h₂[x₂]….. ⊕ h_c[x_c] This schemes is very fast with character tables in cache. Although simple tabulation is not even four-independent, it does provide many of the guarantees that are normally obtained via higher independence, for example, linear probing and Cuckoo hashing.

Next, we consider twisted tabulation where one input character is “twisted” in a simple way. The resulting hash function has powerful distributional properties: Chernoff-style tail bounds and a very small bias for minwise hashing. This is also yields an extremely fast pseudorandom number generator that is provably good for many classic randomized algorithms and data-structures.

Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Wegman and Carter.²⁶ In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space.

Although these tabulation schemes are all easy to implement and use, their analysis is not.

1. Introduction

A useful assumption in the design of randomized algorithms and data structures is the free availability of fully random hash functions, which can be computed in unit time. Removing this unrealistic assumption is the subject of a large body of work. To implement a hash-based algorithm, a concrete hash function has to be chosen. The space, time, and random choices made by this hash function affects the overall performance. The generic goal is therefore to provide efficient constructions of hash functions that for important randomized algorithms yield probabilistic guarantees similar to those obtained assuming fully random hashing.

To fully appreciate the significance of this program, we note that many randomized algorithms are very simple and popular in practice, but often they are implemented with too simple hash functions without the necessary guarantees. This may work very well in random tests, adding to their popularity, but the real world is full of structured data, for example, generated by computers, that could be bad for the hash function. This was illustrated in Ref.²¹ showing how simple common inputs made linear probing fail with popular hash functions, explaining its perceived unreliability in practice. The problems disappeared when sufficiently strong hash functions were used.

In this paper, we will survey recent results from Refs.^{6, 7, 8, 9, 21, 22, 25} showing how simple realistic hashing schemes based on tabulation provide unexpectedly strong guarantees for many popular randomized algorithms, for example, linear probing, Cuckoo hashing, minwise independence, treaps, planar partitions, power-of-two-choices, Chernoff-style concentration bounds, and even high independence. The survey is from a users perspective, explaining how these tabulation schemes can be applied. While these schemes are all very simple to describe and use, the analysis showing that they work is nontrivial. For this analysis, the reader is referred to the above papers. The reader is also referred to these papers for a historical account of previous work.

1.1. Background