Computing Applications Research highlights

Theory and Applications of b-Bit Minwise Hashing

Posted Aug 1 2011

Abstract

Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is minwise hashing. This paper describes b-bit minwise hashing, which can provide order-of-magnitude improvements in storage requirements and computational overhead over the original scheme in practice.

We give both a theoretical characterization of the performance of the new algorithm and a practical evaluation on large real-life datasets, and show that the two match very closely. Moreover, we provide a detailed comparison with other important alternative techniques proposed for estimating set similarities. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme.

1. Introduction

With the advent of the Internet, many applications are faced with very large and inherently high-dimensional datasets. A common task on these is similarity search, that is, given a high-dimensional data point, the retrieval of data points that are close under a given distance function. In many scenarios, the storage and computational requirements for computing exact distances between all data points are prohibitive, making data representations that allow compact storage and efficient approximate distance computation necessary.

In this paper, we describe b-bit minwise hashing, which leverages properties common to many application scenarios to obtain order-of-magnitude improvements in the storage space and computational overhead required for a given level of accuracy over existing techniques. Moreover, while the theoretical analysis of these gains is technically challenging, the resulting algorithm is simple and easy to implement.

To describe our approach, we first consider the concrete task of Web page duplicate detection, which is of critical importance in the context of Web search and was one of the motivations for the development of the original minwise hashing algorithm by Broder et al.^{2, 4} Here, the task is to identify pairs of pages that are textually very similar. For this purpose, Web pages are modeled as “a set of shingles,” where a shingle corresponds to a string of w contiguous words occurring on the page. Now, given two such sets S₁, S₂ ⊆ Ω = {0, 1, …, D − 1}, the normalized similarity known as resemblance or Jaccard similarity, denoted by R, is

$$R = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = \frac{a}{f_1 + f_2 - a}, \quad \text{where } f_1 = |S_1|,\ f_2 = |S_2|,\ a = |S_1 \cap S_2|.$$

Duplicate detection now becomes the task of detecting pairs of pages for which R exceeds a threshold value. Here, w is a tuning parameter and was set to be w = 5 in several studies.^{2, 4, 7} Clearly, the total number of possible shingles is huge. Considering 10⁵ unique English words, the total number of possible 5-shingles should be D = (10⁵)⁵ = O(10²⁵). A prior study⁷ used D = 2⁶⁴ and even earlier studies^{2, 4} used D = 2⁴⁰. Due to the size of D and the number of pages crawled as part of Web search, computing the exact similarities for all pairs of pages may require prohibitive storage and computational overhead, leading to approximate techniques based on more compact data structures.

1.1. Minwise hashing

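To address this issue, Broder and his colleagues developed minwise hashing in their seminal work.^{2, 4} Here, we give a brief introduction to this algorithm. Suppose a random permutation Π is performed on Ω, that is,

$$\Pi: \Omega \to \Omega, \quad \text{where } \Omega = \{0, 1, \ldots, D - 1\}.$$

An elementary probability argument shows that

$$\Pr\left(\min(\Pi(S_1)) = \min(\Pi(S_2))\right) = \frac{|S_1 \cap S_2|}{|S_1 \cup S_2|} = R. \tag{1}$$

After k minwise independent permutations, Π₁, Π₂,…, Π_k, one can estimate R without bias, as a binomial probability:

$$\hat{R}_M = \frac{1}{k}\sum_{j=1}^{k} 1\{\min(\Pi_j(S_1)) = \min(\Pi_j(S_2))\}, \tag{2}$$

$$\mathrm{Var}(\hat{R}_M) = \frac{1}{k}\,R(1 - R). \tag{3}$$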

We will frequently use the terms “sample” and “sample size” (i.e., k). For minwise hashing, a sample is a hashed value, min(Π_j(S_i)), which may require, for example, 64 bits.⁷
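As a rough illustration of the estimator described above, the following Python sketch draws k explicit random permutations of a small universe and counts how often the two minima coincide. The function name `minwise_estimate`, the toy sets, and the use of `random.shuffle` as the source of permutations are assumptions made for this example; practical systems typically replace explicit permutations with hash functions.

```python
import random

def minwise_estimate(s1, s2, universe_size, k, seed=0):
    """Estimate the resemblance of s1 and s2 from k random permutations."""
    rng = random.Random(seed)
    perm = list(range(universe_size))
    matches = 0
    for _ in range(k):
        rng.shuffle(perm)                  # perm[t] is the image of element t
        z1 = min(perm[t] for t in s1)      # min(Pi_j(S1))
        z2 = min(perm[t] for t in s2)      # min(Pi_j(S2))
        matches += (z1 == z2)
    return matches / k

# Toy data (hypothetical): |S1 n S2| = 400 and |S1 u S2| = 800.
s1 = set(range(0, 600))
s2 = set(range(200, 800))
exact = len(s1 & s2) / len(s1 | s2)
estimate = minwise_estimate(s1, s2, universe_size=1000, k=500)
print(exact, estimate)
```

With k = 500 the standard deviation predicted by R(1 − R)/k is about 0.02, so the estimate should land close to the exact resemblance of 0.5.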

Since the original minwise hashing work,^{2, 4} there have been considerable theoretical and methodological developments.^{3, 5, 12, 14, 16, 17, 22}

Applications: As a general technique for estimating set similarity, minwise hashing has been applied to a wide range of applications, for example, content matching for online advertising,²³ detection of redundancy in enterprise file systems,⁸ syntactic similarity algorithms for enterprise information management,²¹ Web spam,²⁵ etc.

Many of the applications of minwise hashing are targeted at detecting duplicates or pairs of somewhat high similarity. By proposing an estimator that is particularly accurate for these scenarios, we can reduce the required storage and computational overhead dramatically. Here, the computational savings are a function of how the minwise hashes are used. For any technique that does compute the pairwise similarity for (a large subset of) all pairs, the computation is typically bound by the speed at which the samples can be brought into memory (as the computation itself is simple); hence, the space reduction our technique offers directly translates into order-of-magnitude speedup as well.

However, even with the data-size reduction, computing all pairwise similarities is prohibitively expensive in many scenarios. This has led to a number of approaches that avoid this computation by grouping (subsets of) the samples into buckets and only computing the pairwise similarities for items within the same (set of) buckets. This approach avoids the quadratic number of comparisons, at the cost of some loss in accuracy. Examples of such approaches are the supershingles⁴ or techniques based on locality sensitive hashing (LSH)^{1, 5, 13} (also see Chapter 3 of Rajaraman and Ullman²⁴ for an excellent detailed explanation of LSH and see Cohen et al.⁶ for nice applications of LSH ideas in mining associations).

1.2. b-Bit minwise hashing

In this paper, we establish a unified theoretical framework for b-bit minwise hashing. In our scheme, a sample consists of b bits only, as opposed to, for example, b = 64 bits⁷ in the original minwise hashing. Intuitively, using fewer bits per sample will increase the estimation variance, compared to (3), at the same sample size k. Thus, we will have to increase k to maintain the same accuracy. Interestingly, our theoretical results will demonstrate that, when resemblance is not too small (which is the case in many applications, e.g., consider R≥0.5, the threshold used in Broder et al.^{2, 4}), we do not have to increase k much. This means our proposed b-bit minwise hashing can be used to improve estimation accuracy and significantly reduce storage requirements at the same time.

For example, when b = 1 and R = 0.5, the estimation variance will increase at most by a factor of 3. In order not to lose accuracy, we have to increase the sample size by a factor of 3. If we originally stored each hashed value using 64 bits, the improvement from using b = 1 will be 64/3 ≈ 21.3-fold.

Algorithm 1 illustrates the procedure of b-bit minwise hashing, based on the theoretical results in Section 2.

1.3. Related work

Locality sensitive hashing (LSH)^{1, 5, 13} is a set of techniques for performing approximate search in high dimensions. In the context of estimating set intersections, there exist LSH families for estimating the resemblance, the arccosine, and the hamming distance. Our b-bit minwise hashing proposes a new construction of an LSH family (Section 7.4).

Algorithm 1 The b-bit minwise hashing algorithm, applied to estimating pairwise resemblances in a collection of N sets.

Input: Sets S_n ⊆ Ω = {0, 1, …, D − 1}, n = 1 to N.

Preprocessing

(1): Generate k random permutations Π_j: Ω → Ω, j = 1 to k.

(2): For each set S_n and each permutation Π_j, store the lowest b bits of min(Π_j(S_n)), denoted by e_{n,i,Π_j}, i = 1 to b.

Estimation: (Use two sets S₁ and S₂ as an example)

(1): Compute P̂_b = (1/k) Σ_{j=1}^{k} { ∏_{i=1}^{b} 1{e_{1,i,Π_j} = e_{2,i,Π_j}} }.

(2): Estimate the resemblance by R̂_b = (P̂_b − C_{1,b}) / (1 − C_{2,b}), where C_{1,b} and C_{2,b} are from Theorem 1 in Section 2.
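The two phases of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, explicit shuffles stand in for the random permutations, and the constants C_{1,b} and C_{2,b} are computed from the relative set sizes r₁ = f₁/D and r₂ = f₂/D in the form we understand Theorem 1 to give them (both sizes must lie in (0, 1]).

```python
import random

def a_const(r, b):
    """A_{.,b} for a set of relative size r = f / D (assumed form, Theorem 1)."""
    return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

def c_consts(r1, r2, b):
    a1, a2 = a_const(r1, b), a_const(r2, b)
    c1 = (a1 * r2 + a2 * r1) / (r1 + r2)
    c2 = (a1 * r1 + a2 * r2) / (r1 + r2)
    return c1, c2

def preprocess(sets, universe_size, k, b, seed=0):
    """For each set, keep only the lowest b bits of each of its k minima."""
    rng = random.Random(seed)
    perm = list(range(universe_size))
    mask = (1 << b) - 1
    signatures = [[] for _ in sets]
    for _ in range(k):
        rng.shuffle(perm)                       # one shared permutation per round
        for signature, s in zip(signatures, sets):
            signature.append(min(perm[t] for t in s) & mask)
    return signatures

def estimate_resemblance(sig1, sig2, r1, r2, b):
    """R_hat_b = (P_hat_b - C1) / (1 - C2), with P_hat_b the match fraction."""
    p_hat = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
    c1, c2 = c_consts(r1, r2, b)
    return (p_hat - c1) / (1 - c2)

# Toy data (hypothetical); the true resemblance is 400 / 800 = 0.5.
D, k, b = 1000, 500, 2
s1, s2 = set(range(0, 600)), set(range(200, 800))
sig1, sig2 = preprocess([s1, s2], D, k, b)
r_hat = estimate_resemblance(sig1, sig2, len(s1) / D, len(s2) / D, b)
print(r_hat)
```

Each stored sample here occupies b = 2 bits instead of a full hashed value, which is the storage saving the algorithm is designed to deliver.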

In Charikar⁵ and Gionis et al.,¹⁰ the authors describe hashing schemes that map objects to {0, 1}. The algorithms for the construction, however, are problem specific. Three discovered 1-bit schemes are (i) the simhash⁵ based on sign random projection,^{11, 18} (ii) the hamming distance algorithm based on simple random sampling,¹³ and (iii) the hamming distance algorithm based on a variant of random projection.¹⁵

Section 4 will compare our method with two hamming distance algorithms.^{13, 15} We also wrote a report (http://www.stat.cornell.edu/~li/b-bit-hashing/RP_minwise.pdf), which demonstrated that, unless the similarity is very low, b-bit minwise hashing outperforms sign random projections.

A related approach is conditional random sampling (CRS),^{16, 17} which uses only a single permutation and, instead of a single minimum, retains a set of the smallest hashed values. CRS provides more accurate (in some scenarios substantially so) estimators for binary data and naturally extends to real-valued data and dynamic streaming data; moreover, the same set of hashed values can be used to estimate a variety of summary statistics including histograms, l_p distances (for any p), number of distinct values, χ² distances, entropies, etc. However, we have not developed a b-bit scheme for CRS, which appears to be a challenging task.

2. The Fundamental Results

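Consider two sets S₁, S₂ ⊆ Ω = {0, 1, 2, …, D − 1} with f₁ = |S₁|, f₂ = |S₂|, and a = |S₁ ∩ S₂|. Apply a random permutation Π on S₁ and S₂, where Π: Ω → Ω. Define the minimum values under Π to be z₁ and z₂:

$$z_1 = \min(\Pi(S_1)), \qquad z_2 = \min(\Pi(S_2)).$$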

Define e_1,i = ith lowest bit of z₁, and e_2,i = ith lowest bit of z₂. Theorem 1 derives the main probability formula. Its proof assumes that D is large, which is virtually always satisfied in practice. This result is a good example of approaching a difficult problem by reasonable approximations.

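THEOREM 1. Assume D is large. Then

$$P_b = \Pr\left(\prod_{i=1}^{b} 1\{e_{1,i} = e_{2,i}\} = 1\right) = C_{1,b} + (1 - C_{2,b})\,R, \tag{5}$$

where

$$r_1 = \frac{f_1}{D}, \qquad r_2 = \frac{f_2}{D},$$

$$C_{1,b} = A_{1,b}\frac{r_2}{r_1 + r_2} + A_{2,b}\frac{r_1}{r_1 + r_2}, \qquad C_{2,b} = A_{1,b}\frac{r_1}{r_1 + r_2} + A_{2,b}\frac{r_2}{r_1 + r_2},$$

$$A_{1,b} = \frac{r_1 (1 - r_1)^{2^b - 1}}{1 - (1 - r_1)^{2^b}}, \qquad A_{2,b} = \frac{r_2 (1 - r_2)^{2^b - 1}}{1 - (1 - r_2)^{2^b}}.$$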

The intuition for the difference between (5) and the equivalent equation for minwise hashing (1) is that even when R = 0, the collision probability P_b (i.e., the probability that two minima agree on their last b bits) is not zero, but rather C_1,b. Having to account for this type of “false positives” makes the derivation more difficult, resulting in the additional terms in (5). Of course, as expected, if R = 1, then P_b = 1 (because in this case r₁ = r₂ and C_1,b = C_2,b).

Note that the only assumption needed in the proof of Theorem 1 is that D is large, which is virtually always satisfied in practice. Interestingly, (5) is remarkably accurate even for very small D. Figure 1 shows that when D = 20 (D = 500), the absolute error caused by using (5) is <0.01 (<0.0004).
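One way to get a feel for how well the large-D approximation holds is to compare the closed form with a simulation on a small universe. The Python sketch below does this for D = 200; the set sizes, the number of trials, and the function names are hypothetical choices, and the closed form is written as P_b = C_{1,b} + (1 − C_{2,b})R with the constants in the form we understand Theorem 1 to define them.

```python
import random

def collision_probability(R, r1, r2, b):
    """Closed-form P_b = C1 + (1 - C2) * R (assumed form of Theorem 1)."""
    def a_const(r):
        return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))
    a1, a2 = a_const(r1), a_const(r2)
    c1 = (a1 * r2 + a2 * r1) / (r1 + r2)
    c2 = (a1 * r1 + a2 * r2) / (r1 + r2)
    return c1 + (1 - c2) * R

def simulated_collision_probability(s1, s2, universe_size, b, trials, seed=0):
    """Fraction of random permutations whose two minima share their lowest b bits."""
    rng = random.Random(seed)
    perm = list(range(universe_size))
    mask = (1 << b) - 1
    hits = 0
    for _ in range(trials):
        rng.shuffle(perm)
        z1 = min(perm[t] for t in s1)
        z2 = min(perm[t] for t in s2)
        hits += ((z1 & mask) == (z2 & mask))
    return hits / trials

D, b = 200, 1
s1, s2 = set(range(0, 80)), set(range(50, 110))     # f1 = 80, f2 = 60, a = 30
R = len(s1 & s2) / len(s1 | s2)
formula = collision_probability(R, len(s1) / D, len(s2) / D, b)
simulated = simulated_collision_probability(s1, s2, D, b, trials=5000)
print(formula, simulated)
```

The two numbers should agree up to Monte Carlo noise, which for 5,000 trials is on the order of 0.01.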

2.1. The unbiased estimator

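Theorem 1 suggests an unbiased estimator R̂_b for R:

$$\hat{R}_b = \frac{\hat{P}_b - C_{1,b}}{1 - C_{2,b}}, \qquad \hat{P}_b = \frac{1}{k}\sum_{j=1}^{k}\left\{\prod_{i=1}^{b} 1\{e_{1,i,\Pi_j} = e_{2,i,\Pi_j}\}\right\}, \tag{9}$$

where e_{1,i,Π_j} (e_{2,i,Π_j}) denotes the ith lowest bit of z₁ (z₂), under the permutation Π_j. The variance is

$$\mathrm{Var}(\hat{R}_b) = \frac{\mathrm{Var}(\hat{P}_b)}{(1 - C_{2,b})^2} = \frac{1}{k}\,\frac{\left[C_{1,b} + (1 - C_{2,b})R\right]\left[1 - C_{1,b} - (1 - C_{2,b})R\right]}{(1 - C_{2,b})^2}. \tag{11}$$

For large b, Var(R̂_b) converges to the variance of R̂_M, the estimator for the original minwise hashing:

$$\lim_{b \to \infty} \mathrm{Var}(\hat{R}_b) = \frac{R(1 - R)}{k} = \mathrm{Var}(\hat{R}_M).$$

In fact, when b = 64, Var(R̂_b) and Var(R̂_M) are numerically indistinguishable for practical purposes.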

2.2. The variance-space trade-off

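As we decrease b, the space needed for storing each “sample” will be smaller; the estimation variance (11) at the same sample size k, however, will increase. This variance-space trade-off can be precisely quantified by B(b; R, r₁, r₂):

$$B(b; R, r_1, r_2) = b \times \mathrm{Var}(\hat{R}_b) \times k = \frac{b\left[C_{1,b} + (1 - C_{2,b})R\right]\left[1 - C_{1,b} - (1 - C_{2,b})R\right]}{(1 - C_{2,b})^2}. \tag{12}$$

Lower B(b) is better. The ratio, B(b₁; R, r₁, r₂)/B(b₂; R, r₁, r₂), measures the improvement of using b = b₂ (e.g., b₂ = 1) over using b = b₁ (e.g., b₁ = 64). Some algebra yields the following Lemma.

LEMMA 1. If r₁ = r₂ and b₁ > b₂, then

$$\frac{B(b_1; R, r_1, r_2)}{B(b_2; R, r_1, r_2)} = \frac{b_1}{b_2}\,\frac{A_{1,b_1}(1 - R) + R}{A_{1,b_2}(1 - R) + R}\,\frac{1 - A_{1,b_2}}{1 - A_{1,b_1}}$$

is a monotonically increasing function of R ∈ [0, 1].

If R → 1 (which implies r₁ → r₂), then

$$\frac{B(b_1; R, r_1, r_2)}{B(b_2; R, r_1, r_2)} \to \frac{b_1\,(1 - A_{1,b_2})}{b_2\,(1 - A_{1,b_1})}.$$

If r₁ = r₂, b₂ = 1, b₁ = 64 (hence we treat A_{1,b₁} = 0), then

$$\frac{B(64; R, r_1, r_2)}{B(1; R, r_1, r_2)} = 64\,\frac{R}{R + 1 - r_1}.$$

Suppose the original minwise hashing used b = 64; then the maximum improvement of b-bit minwise hashing would be 64-fold, attained when r₁ = r₂ = 1 and R = 1. In the least favorable situation, that is, r₁, r₂ → 0, the improvement will still be 64/3 ≈ 21.3-fold when R = 0.5.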

Figure 2 plots B(64)/B(b) to directly visualize the relative improvements, which are consistent with what Lemma 1 predicts. The plots show that, when R is very large (which is the case in many practical applications), it is always good to use b = 1. However, when R is small, using larger b may be better. The cut-off point depends on r₁, r₂, R. For example, when r₁ = r₂ and both are small, it would be better to use b = 2 than b = 1 if R < 0.4, as shown in Figure 2.
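The trade-off can also be explored numerically. The Python sketch below evaluates B(b; R, r₁, r₂) as b times k times the variance of the b-bit estimator and prints the storage ratio relative to 64 bits for a few settings. The function names and the sparse setting r₁ = r₂ = 10⁻⁴ are assumptions chosen for illustration, and the constants follow the form we understand Theorem 1 to give.

```python
def a_const(r, b):
    """A_{.,b} for relative set size r (assumed form, Theorem 1)."""
    return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))

def storage_factor(b, R, r1, r2):
    """B(b; R, r1, r2) = b * k * Var(R_hat_b); lower is better."""
    a1, a2 = a_const(r1, b), a_const(r2, b)
    c1 = (a1 * r2 + a2 * r1) / (r1 + r2)
    c2 = (a1 * r1 + a2 * r2) / (r1 + r2)
    p_b = c1 + (1 - c2) * R
    return b * p_b * (1 - p_b) / (1 - c2) ** 2

r = 1e-4    # sparse data: r1 = r2 close to 0, the least favorable case
for R in (0.3, 0.5, 0.9):
    for b in (1, 2, 4):
        gain = storage_factor(64, R, r, r) / storage_factor(b, R, r, r)
        print(f"R={R}, b={b}: B(64)/B({b}) = {gain:.1f}")

gain_at_half = storage_factor(64, 0.5, r, r) / storage_factor(1, 0.5, r, r)
```

At R = 0.5 the ratio for b = 1 comes out very close to 64/3, matching the worst-case figure quoted in the text.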

3. Experiments

In the following, we evaluate the accuracy of the theoretical derivation and the practical performance of our approach using two sets of experiments. Experiment 1 is a sanity check, to verify: (i) our proposed estimator in (9) is unbiased and (ii) its variance follows the prediction by our formula in (11). Experiment 2 is a duplicate detection task using a Microsoft proprietary collection of 1,000,000 news articles.

3.1. Experiment 1

The data, extracted from Microsoft Web crawls, consist of six pairs of sets. Each set corresponds to a word and consists of the IDs of the documents that contain that word at least once. We now use b-bit minwise hashing to estimate the similarities of these sets (i.e., we estimate the strength of the word associations).

Table 1 summarizes the data and provides the theoretical improvements. The words were selected to include highly frequent pairs (e.g., “OF-AND”), highly rare pairs (e.g., “GAMBIA-KIRIBATI”), highly unbalanced pairs (e.g., “A-TEST”), highly similar pairs (e.g., “KONG-HONG”), as well as pairs that are not quite similar (e.g., “LOW-PAY”).

We estimate the resemblance using the original minwise hashing estimator and the b-bit estimator (b = 1, 2, 3).

Figure 3 plots the empirical mean square errors (MSE = variance + bias²) in solid lines and the theoretical variances (11) in dashed lines for all word pairs. All dashed lines are invisible because they overlap with the corresponding solid curves. Thus, this experiment validates that the variance formula (11) is accurate and that R̂_b is indeed unbiased (otherwise, the MSE would differ from the variance).

3.2. Experiment 2

To illustrate the improvements by the use of b-bit minwise hashing on a real-life application, we conducted a duplicate detection experiment using a corpus of 10⁶ news documents. The dataset was crawled as part of the BLEWS project at Microsoft.⁹ We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R₀. We estimate the resemblances using R̂_b with b = 1, 2, 4 bits and the original minwise hashing estimator R̂_M. Figure 4 presents the precision and recall curves. The recall values (bottom two panels in Figure 4) are all very high and do not differentiate the estimators.

The precision curves for R̂_4 (using 4 bits per sample) and R̂_M (assuming 64 bits per sample) are almost indistinguishable, suggesting a 16-fold improvement in space using b = 4.

When using b = 1 or 2, the space improvements are normally around 20- to 40-fold, compared to R̂_M (assuming 64 bits per sample), especially for achieving high precision.

4. Comparisons with Hamming Distance Algorithms

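Closely related to the resemblance, the hamming distance H is another important similarity measure. In the context of hamming distance, a set S_i ⊆ Ω = {0, 1, …, D − 1} is mapped to a D-dimensional binary vector Y_i: Y_it = 1 if t ∈ S_i, and 0 otherwise. The hamming distance between Y₁ and Y₂ is

$$H = \sum_{t=0}^{D-1} 1\{Y_{1t} \neq Y_{2t}\} = |S_1 \cup S_2| - |S_1 \cap S_2| = f_1 + f_2 - 2a.$$

Since R = a/(f₁ + f₂ − a) gives a = R(f₁ + f₂)/(1 + R), one can apply b-bit minwise hashing to estimate H, by converting the estimated resemblance R̂_b (9) to Ĥ_b:

$$\hat{H}_b = \frac{1 - \hat{R}_b}{1 + \hat{R}_b}\,(f_1 + f_2).$$

The variance of Ĥ_b can be computed from Var(R̂_b) (11) by the “delta method” (i.e., Var(g(x)) ≈ [g′(E(x))]² Var(x)):

$$\mathrm{Var}(\hat{H}_b) = \mathrm{Var}(\hat{R}_b)\left(\frac{2(f_1 + f_2)}{(1 + R)^2}\right)^2 + O\!\left(\frac{1}{k^2}\right) = \frac{4(r_1 + r_2)^2 D^2}{(1 + R)^4}\,\mathrm{Var}(\hat{R}_b) + O\!\left(\frac{1}{k^2}\right). \tag{17}$$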

We will first compare with an algorithm based on simple random sampling¹³ and then with another algorithm based on a variant of random projection.¹⁵
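The conversion from resemblance to hamming distance is a one-line formula, sketched below in Python together with its first-order (delta-method) variance. The function names and toy sets are hypothetical; the identity H = (f₁ + f₂)(1 − R)/(1 + R) follows from R = a/(f₁ + f₂ − a) and H = f₁ + f₂ − 2a.

```python
def hamming_from_resemblance(r_hat, f1, f2):
    """H_hat = (1 - R_hat) / (1 + R_hat) * (f1 + f2)."""
    return (1 - r_hat) / (1 + r_hat) * (f1 + f2)

def hamming_variance(var_r, R, f1, f2):
    """First-order (delta-method) variance of the hamming estimate."""
    slope = -2 * (f1 + f2) / (1 + R) ** 2     # derivative of H with respect to R
    return slope ** 2 * var_r

# Sanity check on exact quantities (toy sets, hypothetical).
s1, s2 = set(range(0, 600)), set(range(200, 800))
R = len(s1 & s2) / len(s1 | s2)
H_exact = len(s1 ^ s2)                        # size of the symmetric difference
H_from_R = hamming_from_resemblance(R, len(s1), len(s2))
print(H_exact, H_from_R)
```

Plugging an estimated resemblance into `hamming_from_resemblance` instead of the exact R gives the estimator discussed in this section.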

4.1. Simple random sampling algorithm

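To reduce the storage, we can randomly sample k coordinates from the original data Y₁ and Y₂ in D-dimensions. The samples, denoted by h₁ and h₂, are k-dimensional bit vectors, from which we can estimate H:

$$\hat{H}_s = \frac{D}{k}\sum_{t=1}^{k} 1\{h_{1t} \neq h_{2t}\},$$

whose variance would be (assuming k ≪ D)

$$\mathrm{Var}(\hat{H}_s) = \frac{D^2}{k}\,\frac{H}{D}\left(1 - \frac{H}{D}\right). \tag{19}$$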
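A minimal Python sketch of this baseline is shown below: it samples k coordinates, counts how many of them differ between the two sets, and scales the count by D/k. The function names and toy data are hypothetical, and the variance function uses the binomial approximation that is reasonable when k is much smaller than D.

```python
import random

def sampled_hamming_estimate(s1, s2, universe_size, k, seed=0):
    """Estimate H from k randomly chosen coordinates of the binary vectors."""
    rng = random.Random(seed)
    coords = rng.sample(range(universe_size), k)      # sampled without replacement
    differing = sum((t in s1) != (t in s2) for t in coords)
    return universe_size / k * differing

def sampled_hamming_variance(H, universe_size, k):
    """Binomial approximation: (D^2 / k) * (H/D) * (1 - H/D)."""
    p = H / universe_size
    return universe_size ** 2 / k * p * (1 - p)

# Toy data (hypothetical); the exact hamming distance is 400.
s1, s2 = set(range(0, 600)), set(range(200, 800))
estimate = sampled_hamming_estimate(s1, s2, universe_size=1000, k=200)
print(estimate, sampled_hamming_variance(400, 1000, 200) ** 0.5)
```

The toy sets are dense, so the estimate is fairly stable; for sparse data most sampled coordinates are zero in both vectors, which is why this scheme compares poorly in that regime.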

Comparing the two variances, (17) and (19), we find that the variance of using simple random sampling, that is, Var(Ĥ_s), is substantially larger than the variance of using b-bit minwise hashing, that is, Var(Ĥ_b), especially when the data is sparse. We expect that in practice one will most likely implement the random sampling algorithm by storing only the original locations (coordinates) of the nonzeros in the samples. If we do so, the total bits on average will be (per set). This motivates us to define the following ratio:

to compare the storage costs. Recall each sample of b-bit minwise hashing requires b bits (i.e., bk bits per set). The following Lemma may help characterize the improvement:

LEMMA 2. If r₁, r₂ → 0, then G_s,b as defined in (20)

In other words, for small r₁, r₂, ; if R ≈ 0; and , if R ≈ 1. Figure 5 plots G_s,b = 1, verifying the substantial improvement of b-bit minwise hashing over simple random sampling (often 10- to 30-fold).

4.2. Random projection + modular arithmetic

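An interesting 1-bit scheme was developed in Kushilevitz et al.¹⁵ using random projection followed by modular arithmetic. A random matrix U ∈ ℝ^{D×k} is generated with entries being i.i.d. samples u_ij from a binomial distribution: u_ij = 1 with probability β and u_ij = 0 with probability 1 − β. Let ν₁ = Y₁ × U (mod 2) and ν₂ = Y₂ × U (mod 2). Kushilevitz et al.¹⁵ showed that, for each coordinate j,

$$E_\beta = \Pr(\nu_{1,j} \neq \nu_{2,j}) = \frac{1}{2}\left(1 - (1 - 2\beta)^H\right), \tag{22}$$

which allows us to estimate the hamming distance H by

$$\hat{H}_{rp,\beta} = \frac{\log(1 - 2\hat{E}_\beta)}{\log(1 - 2\beta)}, \tag{23}$$

where Ê_β is the observed fraction of the k coordinates in which ν₁ and ν₂ differ. We calculate the variance of Ĥ_rp,β to be

$$\mathrm{Var}(\hat{H}_{rp,\beta}) = \frac{1}{k}\,\frac{4E_\beta(1 - E_\beta)}{(1 - 2E_\beta)^2 \log^2(1 - 2\beta)} + O\!\left(\frac{1}{k^2}\right), \tag{24}$$

which suggests that the performance of this 1-bit scheme might be sensitive to β that must be predetermined for all sets at the processing time (i.e., it cannot be modified in the estimation phase for a particular pair). Figure 6 provides the “optimal” β (denoted by β*) values (as function of H) by numerically minimizing the variance (24).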
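The numerical minimization can be sketched in a few lines of Python. The sketch assumes that the probability (22) has the form E_β = (1 − (1 − 2β)^H)/2 and that the variance is the first-order delta-method expression for the estimator log(1 − 2Ê_β)/log(1 − 2β). Under those assumptions, substituting x = (1 − 2β)^H reduces the variance to H²(1 − x²)/(k x² log² x), so the optimal x does not depend on H. The function names and the ternary-search routine are illustrative choices.

```python
import math

def scaled_variance(x):
    """k * Var(H_hat_rp) / H**2, written in terms of x = (1 - 2*beta)**H."""
    return (1 - x * x) / (x * x * math.log(x) ** 2)

def argmin_unimodal(f, lo, hi, iterations=200):
    """Ternary search for the minimizer of a unimodal function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

x_star = argmin_unimodal(scaled_variance, 1e-3, 1 - 1e-3)
e_star = (1 - x_star) / 2             # value of E_beta at the optimal beta

def optimal_beta(H):
    """beta*(H), obtained by solving (1 - 2*beta)**H = x_star for beta."""
    return (1 - x_star ** (1 / H)) / 2

print(e_star)
print(optimal_beta(100), optimal_beta(10000))
```

Because β* shrinks as H grows, a single β fixed in advance can only be optimal for one value of H.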

It is interesting to compare this random projection-based 1-bit scheme with our b-bit minwise hashing using the following ratio of their variances:

Figure 7 shows that if it is possible to choose the optimal β* for random projection, one can achieve good performance, similar to (or even better than) b-bit minwise hashing.

The problem is that we must choose the same β for all sets. Figure 8 presents a typical example, which uses H*/D = 10⁻⁴ to compute the “optimal” β for a wide range of (r₁, r₂, s) values. The left bottom panel illustrates that when r₁ = 10⁻⁴ using this particular choice of β results in fairly good performance compared to b-bit minwise hashing. (Recall H/D = r₁ + r₂ – 2s.) As soon as the true H substantially deviates from the guessed H*, the performance of using random projection degrades dramatically.

There is one more issue. At the optimal β*(H), our calculations show that the probability (22) E_β* ≈ 0.2746. However, if the chosen β > β*(H), then E_β may approach 1/2. As Ê_β is random, it is likely that the observed Ê_β > 1/2, that is, log(1 − 2Ê_β) becomes undefined in (23). Thus, it is safer to “overestimate” H when choosing β. When we have a large collection of sets, this basically means the chosen β will be very different from its optimal value for most pairs.

Finally, Figure 9 provides an empirical study as a sanity check that the variance formula (24) is indeed accurate and that, if the guessed H for selecting β deviates from the true H, then the random projection estimator Ĥ_rp,β exhibits much larger errors than the b-bit hashing estimator Ĥ_b.

5. Improvement by Combining Bits

Our theoretical and empirical results have confirmed that, when the resemblance R is reasonably high, even a single bit per sample may contain sufficient information for accurately estimating the similarity. This naturally leads to the conjecture that, when R is close to 1, one might further improve the performance by looking at a combination of multiple bits (i.e., “b < 1”). One simple approach is to combine two bits from two permutations using XOR (⊕) operations.

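Recall e_{1,1,Π} denotes the lowest bit of the hashed value under Π. Theorem 1 has proved that

$$P_1 = \Pr\left(e_{1,1,\Pi} = e_{2,1,\Pi}\right) = C_{1,1} + (1 - C_{2,1})\,R.$$

Consider two permutations Π₁ and Π₂. We store

$$x_1 = e_{1,1,\Pi_1} \oplus e_{1,1,\Pi_2}, \qquad x_2 = e_{2,1,\Pi_1} \oplus e_{2,1,\Pi_2}.$$

Then x₁ = x₂ either when e_{1,1,Π₁} = e_{2,1,Π₁} and e_{1,1,Π₂} = e_{2,1,Π₂}, or when e_{1,1,Π₁} ≠ e_{2,1,Π₁} and e_{1,1,Π₂} ≠ e_{2,1,Π₂}. Thus,

$$P_{1/2} = \Pr(x_1 = x_2) = P_1^2 + (1 - P_1)^2,$$

which is a quadratic equation with a solution:

$$\hat{R}_{1/2} = \frac{\sqrt{\max\{2\hat{P}_{1/2} - 1,\ 0\}} + 1 - 2C_{1,1}}{2 - 2C_{2,1}}. \tag{27}$$

This estimator is slightly biased at small sample size k. We use R̂_{1/2} to indicate that two bits are combined into one (but each sample is still stored using 1 bit). The asymptotic variance of R̂_{1/2} can be derived to be

$$\mathrm{Var}(\hat{R}_{1/2}) = \frac{1}{k}\,\frac{P_{1/2}(1 - P_{1/2})}{4(1 - C_{2,1})^2\,(2P_{1/2} - 1)} + O\!\left(\frac{1}{k^2}\right). \tag{28}$$

Interestingly, as R → 1, R̂_{1/2} does twice as well as R̂_1:

$$\lim_{R \to 1} \frac{\mathrm{Var}(\hat{R}_1)}{\mathrm{Var}(\hat{R}_{1/2})} = 2.$$

On the other hand, R̂_{1/2} may not be good when R is not too large. For example, one can numerically show that

$$\mathrm{Var}(\hat{R}_{1/2}) < \mathrm{Var}(\hat{R}_1) \quad \text{if } R > 0.5774, \text{ assuming } r_1, r_2 \to 0.$$

Figure 10 plots the empirical MSEs for two word pairs in Experiment 1, for R̂_{1/2}, R̂_1, and R̂_M. For the highly similar pair, “KONG-HONG,” R̂_{1/2} exhibits superior performance compared to R̂_1. For “UNITED-STATES,” whose R = 0.591, R̂_{1/2} performs similarly to R̂_1.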

In summary, for applications which care about very high similarities, combining bits can reduce storage even further.
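The crossover between the 1-bit estimator and the XOR-combined estimator can be located numerically. The Python sketch below works in the sparse limit r₁, r₂ → 0, where C_{1,1} = C_{2,1} = 1/2, and assumes the first-order variances P₁(1 − P₁)/(1 − C_{2,1})² for the 1-bit estimator and P_{1/2}(1 − P_{1/2})/(4(1 − C_{2,1})²(2P_{1/2} − 1)) for the combined one, both scaled by k. Function names are hypothetical.

```python
import math

def var_one_bit(R, c1=0.5, c2=0.5):
    """k * Var(R_hat_1); c1 = c2 = 1/2 is the sparse limit r1, r2 -> 0."""
    p1 = c1 + (1 - c2) * R
    return p1 * (1 - p1) / (1 - c2) ** 2

def var_xor(R, c1=0.5, c2=0.5):
    """k * asymptotic Var(R_hat_1/2) for two 1-bit samples combined by XOR."""
    p1 = c1 + (1 - c2) * R
    p_half = p1 ** 2 + (1 - p1) ** 2          # both bits agree or both differ
    return p_half * (1 - p_half) / (4 * (1 - c2) ** 2 * (2 * p_half - 1))

def estimate_xor(p_half_hat, c1=0.5, c2=0.5):
    """Invert P_1/2 = P_1^2 + (1 - P_1)^2 for R, clipping at zero."""
    root = math.sqrt(max(2 * p_half_hat - 1, 0.0))
    return (root + 1 - 2 * c1) / (2 - 2 * c2)

# Bisection for the resemblance at which the two variances cross.
lo, hi = 0.3, 0.9
for _ in range(60):
    mid = (lo + hi) / 2
    if var_xor(mid) > var_one_bit(mid):
        lo = mid
    else:
        hi = mid
crossover = (lo + hi) / 2
print(crossover)
```

Above the crossover the combined estimator has the smaller variance at the same storage; below it, plain 1-bit samples are the better choice.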

6. Computational Improvements

When computing set similarity for large sets of samples, the key operation is determining the number of identical b-bit samples. While samples for values of b that are multiples of 16 bits can easily be compared using a single machine instruction, efficiently computing the overlap between b-bit samples for small b is less straightforward. In the following, we will describe techniques for computing the number of identical b-bit samples when these are packed into arrays A_l, l = 1, 2, of w-bit words. To compute the number of identical b-bit samples, we iterate through the arrays; for each offset h, we first compute ν = A₁[h] ⊕ A₂[h], where ⊕ denotes the bitwise XOR. Now, the number of b-bit blocks in ν that contain only 0s corresponds to the number of identical b-bit samples.

The case of b = 1 corresponds to the problem of counting the number of 0-bits in a word. We tested a number of different methods and found the fastest approach to be precomputing an array bits[1,…, 2¹⁶] such that bits[t] corresponds to the number of 0-bits in the binary representation of t and using lookups into this array. This approach extends to b > 1 as well.

To evaluate this approach we timed a tight loop computing the number of identical samples in two arrays of b-bit hashes covering a total of 1.8 billion 32-bit words (using a 64-bit Intel 6600 Processor). Here, the 1-bit hashing requires 1.67× the time that the 32-bit minwise hashing requires (1.73× when comparing to 64-bit minwise hashing). The results were essentially identical for b = 2, 4, 8. Given that, when R > 0.5, we can gain a storage reduction of 21.3-fold, we expect the resulting improvement in computational efficiency to be 21.3/1.67 = 12.8-fold in the above setup.
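The table-lookup idea can be sketched in Python as follows. This only illustrates the logic and is not the timed implementation, which would be written in a lower-level language. The 16-bit word size, the helper names, and the handling of padding in the last word are assumptions of this sketch, and b is assumed to divide 16.

```python
import random

W = 16  # word size in bits; matches a lookup table with 2**16 entries

def build_zero_block_table(b):
    """table[t] = number of all-zero b-bit blocks in the W-bit word t."""
    mask = (1 << b) - 1
    return [sum(((t >> shift) & mask) == 0 for shift in range(0, W, b))
            for t in range(1 << W)]

def pack(samples, b):
    """Pack b-bit samples into W-bit words, W // b samples per word."""
    per_word = W // b
    words = []
    for start in range(0, len(samples), per_word):
        word = 0
        for offset, value in enumerate(samples[start:start + per_word]):
            word |= value << (offset * b)
        words.append(word)
    return words

def count_identical(words1, words2, table, k, b):
    """Count equal samples: XOR the words, then look up all-zero blocks."""
    zero_blocks = sum(table[x ^ y] for x, y in zip(words1, words2))
    padding = len(words1) * (W // b) - k      # unused blocks in the last word
    return zero_blocks - padding

rng = random.Random(0)
b, k = 2, 37
samples1 = [rng.randrange(1 << b) for _ in range(k)]
samples2 = [rng.randrange(1 << b) for _ in range(k)]
table = build_zero_block_table(b)
packed_count = count_identical(pack(samples1, b), pack(samples2, b), table, k, b)
direct_count = sum(x == y for x, y in zip(samples1, samples2))
print(packed_count, direct_count)
```

Unused blocks in the final word are zero in both arrays and would otherwise be counted as matches, which is why the sketch subtracts them.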

7. Extensions and Applications

7.1. Three-way resemblance

Many applications in data mining or data cleaning require not only estimates of two-way, but also of multi-way similarities. The original minwise hashing naturally extends to multi-way resemblance. In Li et al.,¹⁹ we extended b-bit minwise hashing to estimate three-way resemblance. We developed a highly accurate, but complicated estimator, as well as a much simplified estimator suitable for sparse data. Interestingly, at least b = 2 bits are needed in order to estimate three-way resemblance. Similar to the two-way case, b-bit minwise hashing can result in an order-of-magnitude reduction in the storage space required for a given estimation accuracy when testing for moderate to high similarity.

7.2. Large-scale machine learning

A different category of applications for b-bit minwise hashing is machine learning on very large datasets. For example, one of our projects²⁰ focuses on linear support vector machines (SVM). We were able to show that the resemblance matrix, the minwise hashing matrix, and the b-bit minwise hashing matrix are all positive definite matrices (kernels), and we integrated b-bit minwise hashing with linear SVM. This allows us to significantly speed up training and testing times with almost no loss in classification accuracy for many practical scenarios. In addition, this provides an elegant solution to the problem of SVM training in scenarios where the training data cannot fit in memory.

Interestingly, the technique we used for linear SVM essentially provides a universal strategy for integrating b-bit minwise hashing with many other learning algorithms, for example, logistic regression.

7.3. Improving estimates by maximum likelihood estimators

While b-bit minwise hashing is particularly effective in applications which mainly concern sets of high similarities (e.g., R > 0.5), there are other important applications in which not just pairs of high similarities matter. For example, many learning algorithms require all pairwise similarities and it is expected that only a small fraction of the pairs are similar. Furthermore, many applications care more about containment (e.g., which fraction of one set is contained in another set) than the resemblance. In a recent technical report (http://www.stat.cornell.edu/~li/b-bit-hashing/AccurateHashing.pdf), we showed that the estimators for minwise hashing and b-bit minwise hashing used in the current practice can be systematically improved and the improvements are most significant for set pairs of low resemblance and high containment.

For minwise hashing, instead of only using Pr(z₁ = z₂), where z₁ and z₂ are two hashed values, we can combine it with Pr(z₁ < z₂) and Pr(z₁ > z₂) to form a three-cell multinomial estimation problem, whose maximum likelihood estimator (MLE) is the solution to a cubic equation. For b-bit minwise hashing, we formulate a 2^b × 2^b-cell multinomial problem, whose MLE requires a simple numerical procedure.

7.4. The new LSH family

Applications such as near neighbor search, similarity clustering, and data mining will significantly benefit from b-bit minwise hashing. It is clear that b-bit minwise hashing will significantly improve the efficiency of simple linear algorithms (for near neighbor search) or simple quadratic algorithms (for similarity clustering), when the key bottleneck is main-memory throughput.

Techniques based on LSH^{1, 5, 13} have been successfully used to achieve sub-linear (for near neighbor search) or sub-quadratic (for similarity clustering) performance. It is interesting that b-bit minwise hashing is a new family of LSH; hence, in this section, we would like to provide more theoretical properties in the context of LSH and approximate near neighbor search.

Consider a set S₁. Suppose there exists another set S₂ whose resemblance distance (1 − R) from S₁ is at most d₀, that is, 1 − R ≤ d₀. The goal of c-approximate d₀-near neighbor algorithms is to return sets (with high probability) whose resemblance distances from S₁ are at most c × d₀ with c > 1.

Recall z₁ and z₂ denote the minwise hashed values for sets S₁ and S₂, respectively. The performance of the LSH algorithm depends on the difference (gap) between the following P⁽¹⁾ and P⁽²⁾ (respectively corresponding to d₀ and cd₀):

$$P^{(1)} = \Pr(z_1 = z_2) = 1 - d_0 \ \text{ when } 1 - R = d_0, \qquad P^{(2)} = \Pr(z_1 = z_2) = 1 - c\,d_0 \ \text{ when } 1 - R = c\,d_0.$$

A larger gap between P⁽¹⁾ and P⁽²⁾ implies a more efficient LSH algorithm. The following “ρ” value (ρ_M for minwise hashing) characterizes the gap:

$$\rho_M = \frac{\log\left(1/P^{(1)}\right)}{\log\left(1/P^{(2)}\right)} = \frac{\log(1 - d_0)}{\log(1 - c\,d_0)}. \tag{30}$$

A smaller ρ (i.e., larger difference between P⁽¹⁾ and P⁽²⁾) leads to a more efficient LSH algorithm and is particularly desirable.^{1, 13} The general LSH theoretical result tells us that the query time for c-approximate d₀-near neighbor is dominated by O(N^ρ) distance evaluations, where N is the total number of sets in the collection.

Recall P_b, as defined in (5), denotes the collision probability for b-bit minwise hashing. The ρ_b value for c-approximate d₀-near neighbor search can be computed as follows:

$$\rho_b = \frac{\log\left(C_{1,b} + (1 - C_{2,b})(1 - d_0)\right)}{\log\left(C_{1,b} + (1 - C_{2,b})(1 - c\,d_0)\right)}. \tag{31}$$

Figure 11 suggests that b-bit minwise hashing can potentially achieve very similar ρ values compared to the original minwise hashing, when the applications care mostly about highly similar sets (e.g., d₀ = 0.1, the top panels of Figure 11), even using merely b = 1. If the applications concern sets that are not necessarily highly similar (e.g., d₀ = 0.5, the bottom panels), using b = 3 or 4 will still have similar ρ values as using the original minwise hashing.
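The ρ values are easy to tabulate. The Python sketch below assumes ρ = log P⁽¹⁾ / log P⁽²⁾, with collision probabilities 1 − d₀ and 1 − c·d₀ for the original minwise hashing and C_{1,b} + (1 − C_{2,b})R for b-bit hashing (the constants in the form we understand Theorem 1 to give). The function names and the sparse setting r₁ = r₂ = 10⁻⁴ are hypothetical, and c·d₀ must be below 1.

```python
import math

def rho_minwise(d0, c):
    """rho_M = log(1 - d0) / log(1 - c * d0)."""
    return math.log(1 - d0) / math.log(1 - c * d0)

def collision_probability(R, r1, r2, b):
    """P_b = C1 + (1 - C2) * R (assumed form of Theorem 1)."""
    def a_const(r):
        return r * (1 - r) ** (2 ** b - 1) / (1 - (1 - r) ** (2 ** b))
    a1, a2 = a_const(r1), a_const(r2)
    c1 = (a1 * r2 + a2 * r1) / (r1 + r2)
    c2 = (a1 * r1 + a2 * r2) / (r1 + r2)
    return c1 + (1 - c2) * R

def rho_b_bit(d0, c, b, r1, r2):
    """rho_b: same ratio of logs, using the b-bit collision probabilities."""
    p1 = collision_probability(1 - d0, r1, r2, b)
    p2 = collision_probability(1 - c * d0, r1, r2, b)
    return math.log(p1) / math.log(p2)

d0, c, r = 0.1, 2.0, 1e-4
rho_m = rho_minwise(d0, c)
rho_1 = rho_b_bit(d0, c, 1, r, r)
print(rho_m, rho_1)
```

For d₀ = 0.1 and c = 2 the two values differ only in the second decimal place, in line with the observation that b = 1 is nearly as good as the original scheme for highly similar sets.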

We expect that these theoretical properties regarding the ρ values will potentially be useful in future work. We are currently developing new variants of LSH algorithms for near neighbor search based on b-bit minwise hashing.

Subsequent documents will be made available at www.stat.cornell.edu/~li/b-bit-hashing, which is a repository for maintaining the papers and technical reports related to b-bit minwise hashing.

8. Conclusion

Minwise hashing is a standard technique for efficiently estimating set similarity in massive datasets. In this paper, we gave an overview of b-bit minwise hashing, which modifies the original scheme by storing the lowest b bits of each hashed value. We proved that, when the similarity is reasonably high (e.g., resemblance ≥ 0.5), using b = 1 bit per hashed value can, even in the worst case, gain a 21.3-fold improvement in storage space (at similar estimation accuracy), compared to storing each hashed value using 64 bits. As many applications are primarily interested in identifying duplicates of reasonably similar sets, these improvements can result in substantial reduction in storage (and consequently computational) overhead in practice.

We also compared our scheme to other approaches that map the hashed objects to single bits, both in theory as well as experimentally.

Our proposed method is simple and requires only minimal modification to the original minwise hashing algorithm. It can be used in the context of a number of different applications, such as duplicate detection, clustering, similarity search, and machine learning, and we expect that it will be adopted in practice.

Acknowledgment

This work is supported by NSF (DMS-0808864), ONR (YIP-N000140910911), and a grant from Microsoft. We thank the Board Members for suggesting a direct comparison with Kushilevitz et al.¹⁵

Figures

Figure 1. The absolute errors (approximate − exact) by using (5) are very small even for D = 20 (left panels) or D = 500 (right panels). The exact probability can be numerically computed for small D (from a probability matrix of size D × D). For each D, we selected three f₁ values. We always let f₂ = 2, 3,…, f₁ and a = 0, 1, 2,…, f₂.

Figure 2. B(64)/B(b), the relative storage improvement of using b = 1, 2, 3, 4 bits, compared to using 64 bits. B(b) is defined in (12).

Figure 3. Mean square errors (MSEs). “M” denotes the original minwise hashing. “Theor.” denotes the theoretical variances Var(R̂_b) (11) and Var(R̂_M) (3). The dashed curves, however, are invisible because the empirical MSEs overlap the theoretical variances. At the same k, Var(R̂_1) > Var(R̂_2) > Var(R̂_3) > Var(R̂_M). However, R̂_1/R̂_2/R̂_3 only requires 1/2/3 bits per sample, while R̂_M may require 64 bits.

Figure 4. The task is to retrieve news article pairs with resemblance R ≥ R₀. The recall curves cannot differentiate estimators. The precision curves are more interesting. When R₀ = 0.4, to achieve a precision = 0.80, the estimators R̂_M, R̂_4, R̂_2, and R̂_1 require k = 50, 50, 75, and 145, respectively, indicating R̂_4, R̂_2, and R̂_1, respectively, improve R̂_M (assuming 64 bits per sample) by 16-, 21.4-, and 22-fold. The improvement becomes larger as R₀ increases.

Figure 5. G_s,b=1 as defined in (20) for illustrating the improvement of b-bit minwise hashing over simple random sampling. We consider r₁ = 10⁻⁴, 0.1, 0.5, 0.9, r₂ ranging from 0.1 r₁ to 0.9 r₁ and s from 0 to r₂. Note that r₁ + r₂ – s ≤ 1 has to be satisfied.

Figure 6. Left panel: the β* values at which the smallest variances (24) are attained. Right panel: the corresponding optimal variances.

Figure 7. G_rp,b=1,β* as defined in (25) for comparing Ĥ_rp,β* with Ĥ_1. For each combination of r₁, r₂, s, we computed and used the optimal β* for the variance.

Figure 8. G_rp,b=1,β (25) computed by using the fixed β, which is the optimal β when H = H* = 10⁻⁴ D.

Figure 9. The exact H for this pair of sets is 0.0316. We use the optimal β at H*/D = 0.1 (left panel) and H*/D = 0.5 (right panel). Compared to Ĥ_1, the MSEs of Ĥ_rp,β (labeled by “rp”) are substantially larger. The theoretical variances (dashed lines), (24) and (17), essentially overlap the empirical MSEs.

Figure 10. MSEs for comparing R̂_{1/2} (27) with R̂_1 and R̂_M. Due to the bias of R̂_{1/2}, the theoretical variances Var(R̂_{1/2}), that is, (28), deviate from the empirical MSEs when k is small.

Figure 11. ρ_M and ρ_b defined in (30) and (31) for measuring the potential performance of LSH algorithms. “M” denotes the original minwise hashing.

Tables

Table 1. Six word pairs for Experiment 1.


Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

DOI

10.1145/1978542.1978566

August 2011 Issue

Published: August 1, 2011

Vol. 54 No. 8

Pages: 101-109
