Technical Perspective: Building a Better Hash Function

Hashing is everywhere. To Start, hash tables are one of the most widely used primitive data structures, with numerous variations (open address, chained, linear probing, multiple-choice, cuckoo, and so on). Hashing is also frequently used for sampling; hash all items and keep only those with certain hash values as the sample. Hashing further plays a major role in a variety of algorithms and data structures for data sketches for both streaming and non-streaming data, such as Bloom filters and approximate counting structures.

For much of the early history of hashing, there was a clear divide between theory and practice. The mathematical analysis of hashing and hashing algorithms was (and often still is) based on perfect randomness. You assume that for each input x, the hash value h(x) is uniformly distributed over all possible values it could take on, and that each value h(x) is independent of all other hash values h(y) for y≠x. Such perfect hash functions make mathematical analysis much simpler, as every new hash value looks completely random.

Of course, nobody actually uses perfect hash functions; they would take exponential space to store under any reasonable model of computation. Instead, in practice, people use various approaches to obtain pseudo-random hash functions. Knuth provides an early guide to various hashing methods of this type, such as multiplying by a constant and shifting to obtain the higher order bits.² Some people turn to cryptographic hash functions, although such functions may not actually be suitably random for many hashing purposes, and can be slow enough to become a bottleneck in systems that use them.

There are a large variety of real-world hash functions that come with no provable guarantees. Perhaps the biggest danger with such hash functions is that they may work deceptively well in a huge number of tests, creating a false sense of security, but they may fail miserably when faced with real data that is not random. They often seem to behave just as the analysis assuming perfect randomness predicts—that is, until they do not. An unfortunate, structured dataset can break such hash functions, in turn breaking the systems that rely on them.

Theoretical computer science has tried to develop frameworks that can provide the best of both worlds: practical hash functions along with provable guarantees. The key insight is that perfect randomness, while easier to analyze, is usually not necessary to guarantee the desired result. By taking more care in the analysis, we can often determine what we really need from our hash function, and tailor our choice of hash function to those needs. This line of work appears to have begun with the seminal work of Carter and Wegman,¹ who argued that for well-performing hash tables, hash values did not all need to be completely independent. Instead, suppose we choose a hash function randomly from a family of hash functions with range [0, B) so that for any two elements x and y, the probability they end up with the same hash value is only 1/B. This is enough to show that standard chained hash tables perform well. More generally, there are other settings where the analysis may only require we choose a hash function randomly from a family of hash functions so that any collection of k hash values are independent, for a small value of k. Fortunately, there are families of such k-independent hash functions that require only a small amount of space and computation time, each proportional to k, and these are suitable for many applications. Unfortunately, many other applications we care about appear to require logarithmic independence (or more), and for such applications the evaluation time for the hash function may become prohibitive. Still, these fundamental ideas have clarified that we can indeed obtain practical hash functions with provable guarantees in many situations, as long as we have a clear understanding of what exactly we need from our hash functions.

In the following paper, Mikkel Thorup describes another variation of simple but surprisingly effective and powerful hash functions based on using small tables of random hash values. The approach is referred to as tabular hashing. Tabular hashing actually dates back to the late 1960s, where it was used by Zobrist to create identifiers for board positions in computer games.³ For decades, its power remained essentially unnoticed, until Thorup (and his colleagues) revived the approach. He shows how tabular hashing provides the types of general concentration guarantees often needed in practice, as well as specific guarantees for certain key algorithms and data structures, including cuckoo hashing, linear probing, and bucket-based sampling. In some cases these results arise from simple tabular hashing, but for some problems he also shows how certain improvements can provide even stronger guarantees without too much of a price in space and running time. One of the "twists" he introduces is even called twisted tabular hashing.

In short, Thorup’s work has shown that tabular hashing provides great potential for more situations where we can have our cake and eat it too: that is, we can have the security of knowing our hashing offers theoretically sound guarantees, while also having the efficiency of a practical hash function that does not become a system bottleneck.

Footnotes

To view the accompanying paper, visit doi.acm.org/10.1145/3068772

Technical Perspective: Building a Better Hash Function

Technical Perspective: Building a Better Hash Function

DOI

July 2017 Issue

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

The Latest from CACM

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.

Technical Perspective: Building a Better Hash Function

DOI

July 2017 Issue

Related Reading

Join the Discussion (0)

Become a Member or Sign In to Post a Comment

Shape the Future of Computing

Communications of the ACM (CACM) is now a fully Open Access publication.