
Theory and Applications of b-Bit Minwise Hashing


Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is minwise hashing. This paper describes b-bit minwise hashing, which in practice can provide an order-of-magnitude improvement in storage requirements and computational overhead over the original scheme.

We give both a theoretical characterization of the performance of the new algorithm and a practical evaluation on large real-life datasets, and show that the two match very closely. Moreover, we provide a detailed comparison with other important alternative techniques proposed for estimating set similarities. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme.


1. Introduction

With the advent of the Internet, many applications are faced with very large and inherently high-dimensional datasets. A common task on these is similarity search, that is, given a high-dimensional data point, the retrieval of data points that are close under a given distance function. In many scenarios, the storage and computational requirements for computing exact distances between all data points are prohibitive, making data representations that allow compact storage and efficient approximate distance computation necessary.

In this paper, we describe b-bit minwise hashing, which leverages properties common to many application scenarios to obtain order-of-magnitude improvements in the storage space and computational overhead required for a given level of accuracy over existing techniques. Moreover, while the theoretical analysis of these gains is technically challenging, the resulting algorithm is simple and easy to implement.

To describe our approach, we first consider the concrete task of Web page duplicate detection, which is of critical importance in the context of Web search and was one of the motivations for the development of the original minwise hashing algorithm by Broder et al.2, 4 Here, the task is to identify pairs of pages that are textually very similar. For this purpose, Web pages are modeled as "a set of shingles," where a shingle corresponds to a string of w contiguous words occurring on the page. Now, given two such sets S1, S2 ⊆ Ω = {0, 1, ..., D − 1}, the normalized similarity known as resemblance or Jaccard similarity, denoted by R, is

R = |S1 ∩ S2| / |S1 ∪ S2| = a / (f1 + f2 − a),   where f1 = |S1|, f2 = |S2|, a = |S1 ∩ S2|.

Duplicate detection now becomes the task of detecting pairs of pages for which R exceeds a threshold value. Here, w is a tuning parameter and was set to w = 5 in several studies.2, 4, 7 Clearly, the total number of possible shingles is huge. Considering 10⁵ unique English words, the total number of possible 5-shingles is D = (10⁵)⁵ = O(10²⁵). A prior study7 used D = 2⁶⁴ and even earlier studies2, 4 used D = 2⁴⁰. Due to the size of D and the number of pages crawled as part of Web search, computing the exact similarities for all pairs of pages may require prohibitive storage and computational overhead, leading to approximate techniques based on more compact data structures.
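
To make the definitions concrete, the following small Python sketch (ours, for illustration only) builds w-shingle sets from two short texts and computes the exact resemblance R; the example texts and the choice w = 5 follow the convention above but are otherwise arbitrary.

    def shingles(text, w=5):
        """The set of w-word shingles of a text."""
        words = text.split()
        return {" ".join(words[i:i + w]) for i in range(len(words) - w + 1)}

    def resemblance(s1, s2):
        """Jaccard similarity R = |S1 ∩ S2| / |S1 ∪ S2|."""
        return len(s1 & s2) / len(s1 | s2)

    doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
    doc_b = "the quick brown fox jumps over the lazy cat near the river bank"
    print("R =", round(resemblance(shingles(doc_a), shingles(doc_b)), 3))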

*  1.1. Minwise hashing

To address this issue, Broder and his colleagues developed minwise hashing in their seminal work.2, 4 Here, we give a brief introduction to this algorithm. Suppose a random permutation Π is performed on Ω, that is,

Π : Ω → Ω,   where Ω = {0, 1, ..., D − 1}.

An elementary probability argument shows that

Pr( min(Π(S1)) = min(Π(S2)) ) = |S1 ∩ S2| / |S1 ∪ S2| = R.   (1)

After k minwise independent permutations, Π1, Π2,…, Πk, one can estimate R without bias, as a binomial probability:

R̂M = (1/k) Σ_{j=1}^{k} 1{ min(Πj(S1)) = min(Πj(S2)) },   (2)

Var(R̂M) = R(1 − R) / k.   (3)

We will frequently use the terms “sample” and “sample size” (i.e., k). For minwise hashing, a sample is a hashed value, min(Πj(Si)), which may require, for example, 64 bits.7
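
The following toy simulation (ours, not from the original implementation) illustrates the estimator (2): it draws k random permutations of a small universe Ω and averages the indicator that the two minima collide; the universe size and example sets are made up for illustration.

    import random

    def minwise_estimate(S1, S2, D, k, seed=0):
        """Fraction of k random permutations of Omega under which the two minima collide."""
        rng = random.Random(seed)
        matches = 0
        for _ in range(k):
            perm = list(range(D))
            rng.shuffle(perm)                      # a random permutation of Omega
            z1 = min(perm[x] for x in S1)          # min(pi(S1))
            z2 = min(perm[x] for x in S2)          # min(pi(S2))
            matches += (z1 == z2)
        return matches / k

    D = 1000
    S1, S2 = set(range(0, 300)), set(range(100, 400))   # |S1 ∩ S2| = 200, |S1 ∪ S2| = 400
    print("true R      :", len(S1 & S2) / len(S1 | S2))
    print("estimated R :", minwise_estimate(S1, S2, D, k=500))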

Since the original minwise hashing work,2, 4 there have been considerable theoretical and methodological developments.3, 5, 12, 14, 16, 17, 22

Applications: As a general technique for estimating set similarity, minwise hashing has been applied to a wide range of applications, for example, content matching for online advertising,23 detection of redundancy in enterprise file systems,8 syntactic similarity algorithms for enterprise information management,21 Web spam,25 etc.

Many of the applications of minwise hashing are targeted at detecting duplicates or pairs of somewhat high similarity. By proposing an estimator that is particularly accurate for these scenarios, we can reduce the required storage and computational overhead dramatically. Here, the computational savings are a function of how the minwise hashes are used. For any technique that does compute the pairwise similarity for (a large subset of) all pairs, the computation is typically bounded by the speed at which the samples can be brought into memory (as the computation itself is simple); hence, the space reduction our technique offers directly translates into an order-of-magnitude speedup as well.

However, even with the data-size reduction, computing all pairwise similarities is prohibitively expensive in many scenarios. This has led to a number of approaches that avoid this computation by grouping (subsets of) the samples into buckets and only computing the pairwise similarities for items within the same (set of) buckets. This approach avoids the quadratic number of comparisons, at the cost of some loss in accuracy. Examples of such approaches are the supershingles4 or techniques based on locality-sensitive hashing (LSH)1, 5, 13 (also see Chapter 3 of Rajaraman and Ullman24 for an excellent detailed explanation of LSH and see Cohen et al.6 for nice applications of LSH ideas in mining associations).

*  1.2. b-Bit minwise hashing

In this paper, we establish a unified theoretical framework for b-bit minwise hashing. In our scheme, a sample consists of b bits only, as opposed to, for example, b = 64 bits7 in the original minwise hashing. Intuitively, using fewer bits per sample will increase the estimation variance, compared to (3), at the same sample size k. Thus, we will have to increase k to maintain the same accuracy. Interestingly, our theoretical results will demonstrate that, when resemblance is not too small (which is the case in many applications, e.g., consider R ≥ 0.5, the threshold used in Broder et al.2, 4), we do not have to increase k much. This means our proposed b-bit minwise hashing can be used to improve estimation accuracy and significantly reduce storage requirements at the same time.

For example, when b = 1 and R = 0.5, the estimation variance will increase at most by a factor of 3. In order not to lose accuracy, we have to increase the sample size by a factor of 3. If we originally stored each hashed value using 64 bits, the improvement from using b = 1 will therefore be 64/3 ≈ 21.3-fold.

Algorithm 1 illustrates the procedure of b-bit minwise hashing, based on the theoretical results in Section 2.

*  1.3. Related work

Locality sensitive hashing (LSH)1, 5, 13 is a set of techniques for performing approximate search in high dimensions. In the context of estimating set intersections, there exist LSH families for estimating the resemblance, the arccosine, and the hamming distance. Our b-bit minwise hashing proposes a new construction of an LSH family (Section 7.4).

Algorithm 1 The b-bit minwise hashing algorithm, applied to estimating pairwise resemblances in a collection of N sets.

Input: Sets Sn ⊆ Ω = {0, 1, ..., D − 1}, n = 1 to N.

Preprocessing

(1): Generate k random permutations Πj: Ω → Ω, j = 1 to k.

(2): For each set Sn and each permutation Πj, store the lowest b bits of min(Πj(Sn)), denoted by en,i,Πj, i = 1 to b.

Estimation: (Use two sets S1 and S2 as an example)

(1): Compute the fraction of the k permutations for which all b bits agree: P̂b = (1/k) Σ_{j=1}^{k} ∏_{i=1}^{b} 1{e1,i,Πj = e2,i,Πj}.

(2): Estimate the resemblance by R̂b = (P̂b − C1,b) / (1 − C2,b), where C1,b and C2,b are from Theorem 1 in Section 2.
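
A minimal Python sketch of Algorithm 1 (ours, not the authors' implementation), using the constants C1,b and C2,b as given by Theorem 1 in Section 2; the toy universe and sets are illustrative.

    import random

    def bbit_signatures(S, perms, b):
        """Keep only the lowest b bits of min(pi_j(S)) for each permutation pi_j."""
        mask = (1 << b) - 1
        return [min(p[x] for x in S) & mask for p in perms]

    def theorem1_constants(f1, f2, D, b):
        """C1,b and C2,b from Theorem 1 (Section 2)."""
        r1, r2 = f1 / D, f2 / D
        A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))
        A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
        C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
        C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
        return C1, C2

    def estimate_resemblance(sig1, sig2, f1, f2, D, b):
        """Estimation step of Algorithm 1: (P_hat - C1,b) / (1 - C2,b)."""
        P_hat = sum(u == v for u, v in zip(sig1, sig2)) / len(sig1)
        C1, C2 = theorem1_constants(f1, f2, D, b)
        return (P_hat - C1) / (1 - C2)

    rng = random.Random(1)
    D, k, b = 2000, 500, 1
    perms = [rng.sample(range(D), D) for _ in range(k)]      # k random permutations of Omega
    S1, S2 = set(range(0, 600)), set(range(200, 800))        # true R = 400/800 = 0.5
    sig1 = bbit_signatures(S1, perms, b)
    sig2 = bbit_signatures(S2, perms, b)
    print("true R         :", len(S1 & S2) / len(S1 | S2))
    print("b-bit estimate :", round(estimate_resemblance(sig1, sig2, len(S1), len(S2), D, b), 3))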

In Charikar5 and Gionis et al.,10 the authors describe hashing schemes that map objects to {0, 1}. The algorithms for the construction, however, are problem specific. Three known 1-bit schemes are (i) simhash5 based on sign random projection,11, 18 (ii) the hamming distance algorithm based on simple random sampling,13 and (iii) the hamming distance algorithm based on a variant of random projection.15

Section 4 will compare our method with two hamming distance algorithms.13, 15 We also wrote a report (http://www.stat.cornell.edu/~li/b-bit-hashing/RP_minwise.pdf), which demonstrated that, unless the similarity is very low, b-bit minwise hashing outperforms sign random projections.

A related approach is conditional random sampling (CRS),16, 17 which uses only a single permutation and, instead of a single minimum, retains a set of the smallest hashed values. CRS provides more accurate (in some scenarios substantially so) estimators for binary data and naturally extends to real-valued data and dynamic streaming data; moreover, the same set of hashed values can be used to estimate a variety of summary statistics including histograms, lp distances (for any p), number of distinct values, χ2 distances, entropies, etc. However, we have not developed a b-bit scheme for CRS, which appears to be a challenging task.


2. The Fundamental Results

Consider two sets S1, S2 ⊆ Ω = {0, 1, ..., D − 1}. Apply a random permutation Π on S1 and S2, where Π: Ω → Ω. Define the minimum values under Π to be z1 and z2:

z1 = min(Π(S1)),   z2 = min(Π(S2)).   (4)

Define e1,i = ith lowest bit of z1, and e2,i = ith lowest bit of z2. Theorem 1 derives the main probability formula. Its proof assumes that D is large, which is virtually always satisfied in practice. This result is a good example of approaching a difficult problem by reasonable approximations.

THEOREM 1. Assume D is large.

Pb = Pr( ∏_{i=1}^{b} 1{e1,i = e2,i} = 1 ) = C1,b + (1 − C2,b) R,   (5)

where

r1 = f1/D,   r2 = f2/D,   f1 = |S1|,   f2 = |S2|,   (6)

C1,b = A1,b r2/(r1 + r2) + A2,b r1/(r1 + r2),   C2,b = A1,b r1/(r1 + r2) + A2,b r2/(r1 + r2),   (7)

A1,b = r1 (1 − r1)^(2^b − 1) / (1 − (1 − r1)^(2^b)),   A2,b = r2 (1 − r2)^(2^b − 1) / (1 − (1 − r2)^(2^b)).   (8)

The intuition for the difference between (5) and the equivalent equation for minwise hashing (1) is that even when R = 0, the collision probability Pb (i.e., the probability that two minima agree on their last b bits) is not zero, but rather C1,b. Having to account for this type of “false positives” makes the derivation more difficult, resulting in the additional terms in (5). Of course, as expected, if R = 1, then Pb = 1 (because in this case r1 = r2 and C1,b = C2,b).

Note that the only assumption needed in the proof of Theorem 1 is that D is large, which is virtually always satisfied in practice. Interestingly, (5) is remarkably accurate even for very small D. Figure 1 shows that when D = 20 (D = 500), the absolute error caused by using (5) is <0.01 (<0.0004).
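
As a quick numerical sanity check of (5)–(8), in the spirit of Figure 1, the following sketch (ours) compares the theoretical collision probability Pb with a Monte Carlo estimate for a small universe; the parameter choices are arbitrary.

    import random

    def P_b_theory(f1, f2, a, D, b):
        """Theoretical collision probability Pb from (5)-(8)."""
        r1, r2 = f1 / D, f2 / D
        R = a / (f1 + f2 - a)
        A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))
        A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
        C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
        C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
        return C1 + (1 - C2) * R

    def P_b_montecarlo(S1, S2, D, b, trials=10000, seed=0):
        """Empirical frequency with which the lowest b bits of the two minima agree."""
        rng = random.Random(seed)
        mask = (1 << b) - 1
        hits = 0
        for _ in range(trials):
            perm = rng.sample(range(D), D)        # a fresh random permutation of Omega
            z1 = min(perm[x] for x in S1)
            z2 = min(perm[x] for x in S2)
            hits += (z1 & mask) == (z2 & mask)
        return hits / trials

    D, b = 500, 1
    S1, S2 = set(range(0, 120)), set(range(60, 200))   # f1 = 120, f2 = 140, a = 60
    print("theory      :", round(P_b_theory(120, 140, 60, D, b), 4))
    print("monte carlo :", round(P_b_montecarlo(S1, S2, D, b), 4))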

*  2.1. The unbiased estimator

Theorem 1 suggests an unbiased estimator R̂b for R:

R̂b = (P̂b − C1,b) / (1 − C2,b),   (9)

P̂b = (1/k) Σ_{j=1}^{k} ∏_{i=1}^{b} 1{e1,i,Πj = e2,i,Πj},   (10)

where e1,i,Πj (e2,i,Πj) denotes the ith lowest bit of z1 (z2), under the permutation Πj. The variance is

Var(R̂b) = Var(P̂b) / (1 − C2,b)^2 = Pb (1 − Pb) / ( k (1 − C2,b)^2 ).   (11)

For large b, Var(R̂b) converges to the variance of R̂M, the estimator for the original minwise hashing:

Var(R̂M) = R(1 − R) / k.

In fact, when b = 64, Var(R̂b) and Var(R̂M) are numerically indistinguishable for practical purposes.

*  2.2. The variance-space trade-off

As we decrease b, the space needed for storing each “sample” will be smaller; the estimation variance (11) at the same sample size k, however, will increase. This variance-space trade-off can be precisely quantified by B(b; R, r1, r2):

B(b; R, r1, r2) = b · k · Var(R̂b) = b · Pb (1 − Pb) / (1 − C2,b)^2.   (12)

Lower B(b) is better. The ratio, B(b1; R, r1, r2) / B(b2; R, r1, r2), measures the improvement of using b = b2 (e.g., b2 = 1) over using b = b1 (e.g., b1 = 64). Some algebra yields the following lemma.

LEMMA 1. If r1 = r2 and b1 > b2, then

B(b1; R, r1, r2) / B(b2; R, r1, r2)   (13)

is a monotonically increasing function of R ∈ [0, 1].

If R → 1 (which implies r1, r2 → 1), then

B(b1; R, r1, r2) / B(b2; R, r1, r2) → b1 / b2.   (14)

If r1 = r2, b2 = 1, b1 = 64 (hence we treat A1,64 = 0), then

B(64; R, r1, r2) / B(1; R, r1, r2) = 64 R / (1 + R − r1).   (15)

Suppose the original minwise hashing used b = 64; then the maximum improvement of b-bit minwise hashing would be 64-fold, attained when r1 = r2 = 1 and R = 1. In the least favorable situation, that is, r1, r2 → 0, the improvement will still be 64R/(1 + R) ≈ 21.3-fold when R = 0.5.

Figure 2 plots B(64)/B(b) to directly visualize the relative improvements, which are consistent with what Lemma 1 predicts. The plots show that, when R is very large (which is the case in many practical applications), it is always good to use b = 1. However, when R is small, using a larger b may be better. The cut-off point depends on r1, r2, and R. For example, when r1 = r2 and both are small, it would be better to use b = 2 than b = 1 if R < 0.4, as shown in Figure 2.
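
The following sketch (ours) tabulates the improvement ratio B(64)/B(1) for a few illustrative (R, r1 = r2) settings, using B(b) = b · k · Var(R̂b) as in (12); at R = 0.5 and small r1, r2 it reproduces the roughly 21.3-fold figure quoted above.

    def B(b, R, r1, r2):
        """Storage factor B(b; R, r1, r2) = b * k * Var(R_hat_b), as in (12)."""
        A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))
        A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
        C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
        C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
        P = C1 + (1 - C2) * R
        return b * P * (1 - P) / (1 - C2) ** 2

    for R in (0.3, 0.5, 0.8):
        for r in (1e-4, 0.5):
            print(f"R = {R:.1f}, r1 = r2 = {r:g}: B(64)/B(1) = {B(64, R, r, r) / B(1, R, r, r):5.1f}")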


3. Experiments

In the following, we evaluate the accuracy of the theoretical derivation and the practical performance of our approach using two sets of experiments. Experiment 1 is a sanity check, to verify: (i) our proposed estimator R̂b in (9) is unbiased and (ii) its variance follows the prediction of our formula in (11). Experiment 2 is a duplicate detection task using a Microsoft proprietary collection of 1,000,000 news articles.

*  3.1. Experiment 1

The data, extracted from Microsoft Web crawls, consist of six pairs of sets. Each set consists of the IDs of the documents that contain a given word at least once. We now use b-bit minwise hashing to estimate the similarities of these sets (i.e., we estimate the strength of the word associations).

Table 1 summarizes the data and provides the theoretical improvements B(64)/B(b). The words were selected to include highly frequent pairs (e.g., "OF-AND"), highly rare pairs (e.g., "GAMBIA-KIRIBATI"), highly unbalanced pairs (e.g., "A-TEST"), highly similar pairs (e.g., "KONG-HONG"), as well as pairs that are not quite similar (e.g., "LOW-PAY").

We estimate the resemblance using the original minwise hashing estimator R̂M and the b-bit estimator R̂b (b = 1, 2, 3).

Figure 3 plots the empirical mean square errors (MSE = variance + bias²) in solid lines and the theoretical variances (11) in dashed lines for all word pairs. All dashed lines are invisible because they overlap with the corresponding solid curves. Thus, this experiment validates that the variance formula (11) is accurate and that R̂b is indeed unbiased (otherwise, the MSE would differ from the variance).

*  3.2. Experiment 2

To illustrate the improvements from the use of b-bit minwise hashing in a real-life application, we conducted a duplicate detection experiment using a corpus of 10⁶ news documents. The dataset was crawled as part of the BLEWS project at Microsoft.9 We computed pairwise resemblances for all documents and retrieved document pairs with resemblance R larger than a threshold R0. We estimated the resemblances using R̂b with b = 1, 2, 4 bits and the original minwise hashing. Figure 4 presents the precision and recall curves. The recall values (bottom two panels in Figure 4) are all very high and do not differentiate the estimators.

The precision curves for R̂b with b = 4 (using 4 bits per sample) and R̂M (assuming 64 bits per sample) are almost indistinguishable, suggesting a 16-fold improvement in space using b = 4.

When using b = 1 or 2, the space improvements are normally around 20- to 40-fold, compared to R̂M (assuming 64 bits per sample), especially for achieving high precision.


4. Comparisons with Hamming Distance Algorithms

Closely related to the resemblance, the hamming distance H is another important similarity measure. In the context of the hamming distance, a set Si ⊆ Ω = {0, 1, ..., D − 1} is mapped to a D-dimensional binary vector Yi: Yit = 1 if t ∈ Si and 0 otherwise. The hamming distance between Y1 and Y2 is

H = Σ_{t=1}^{D} 1{Y1t ≠ Y2t} = f1 + f2 − 2a.

Thus, one can apply b-bit minwise hashing to estimate H, by converting the estimated resemblance R̂b (9) into Ĥb:

Ĥb = (f1 + f2) (1 − R̂b) / (1 + R̂b).   (16)

The variance of Ĥb can be computed from Var(R̂b) (11) by the "delta method" (i.e., Var(Ĥb) ≈ (∂H/∂R)² Var(R̂b)):

Var(Ĥb) = 4 (f1 + f2)^2 / (1 + R)^4 × Var(R̂b).   (17)
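
For instance, the conversion (16) is a one-liner; the numbers below (ours) are purely illustrative.

    def hamming_from_resemblance(R_hat, f1, f2):
        """Convert an estimated resemblance into an estimated hamming distance, per (16)."""
        return (f1 + f2) * (1 - R_hat) / (1 + R_hat)

    f1, f2 = 3000, 3000
    R_hat = 0.5                     # e.g., obtained from b-bit minwise hashing via (9)
    print("H_hat =", hamming_from_resemblance(R_hat, f1, f2))   # prints 2000.0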

We will first compare Ĥb with an algorithm based on simple random sampling13 and then with another algorithm based on a variant of random projection.15

*  4.1. Simple random sampling algorithm

To reduce the storage, we can randomly sample k coordinates from the original data Y1 and Y2 in D-dimensions. The samples, denoted by h1 and h2, are k-dimensional bit vectors, from which we can estimate H:

Ĥs = (D/k) Σ_{t=1}^{k} 1{h1,t ≠ h2,t},   (18)

whose variance would be (assuming k ≪ D)

Var(Ĥs) = (D^2 / k) (H/D) (1 − H/D).   (19)

Comparing the two variances, (17) and (19), we find that the variance of simple random sampling, that is, Var(Ĥs), is substantially larger than the variance of b-bit minwise hashing, that is, Var(Ĥb), especially when the data are sparse. We expect that, in practice, one would implement the random sampling algorithm by storing only the original locations (coordinates) of the nonzeros in the samples. If we do so, the expected storage is about k ri log2 D bits for set Si, that is, k (r1 + r2)/2 log2 D bits per set on average for the pair. This motivates us to define the following ratio:

Gs,b = [ Var(Ĥs) × (r1 + r2)/2 × log2 D ] / [ Var(Ĥb) × b ],   (20)

to compare the storage costs. Recall each sample of b-bit minwise hashing requires b bits (i.e., bk bits per set). The following Lemma may help characterize the improvement:

LEMMA 2. If r1, r2 → 0, then Gs,b as defined in (20) converges:

Gs,b → (1 + R)^3 (1 − 2^(−b)) log2 D / ( 8 b [ 2^(−b) + (1 − 2^(−b)) R ] ).   (21)

In other words, for small r1 and r2 (and b = 1), Gs,b ≈ log2(D)/8 if R ≈ 0, and Gs,b ≈ log2(D)/2 if R ≈ 1. Figure 5 plots Gs,b=1, verifying the substantial improvement of b-bit minwise hashing over simple random sampling (often 10- to 30-fold).
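
A small simulation (ours) of the coordinate-sampling estimator (18), checking its empirical variance against the formula (19); the dimensions and sets are made up, and the two numbers agree only approximately since (19) assumes k ≪ D.

    import random

    rng = random.Random(0)
    D, k = 20000, 500
    S1, S2 = set(range(0, 3000)), set(range(1000, 4000))
    Y1 = [1 if t in S1 else 0 for t in range(D)]
    Y2 = [1 if t in S2 else 0 for t in range(D)]
    H = sum(u != v for u, v in zip(Y1, Y2))                 # true hamming distance

    def sample_estimate():
        """One run of the estimator (18): sample k coordinates, scale up the mismatches."""
        idx = rng.sample(range(D), k)
        return (D / k) * sum(Y1[t] != Y2[t] for t in idx)

    runs = [sample_estimate() for _ in range(1000)]
    mean = sum(runs) / len(runs)
    emp_var = sum((e - mean) ** 2 for e in runs) / len(runs)
    print("true H             :", H)
    print("empirical variance :", round(emp_var))
    print("formula (19)       :", round((D ** 2 / k) * (H / D) * (1 - H / D)))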

*  4.2. Random projection + modular arithmetic

An interesting 1-bit scheme was developed in Kushilevitz et al.15 using random projection followed by modular arithmetic. A random matrix U ∈ {0, 1}^(D×k) is generated with i.i.d. entries uij from a Bernoulli distribution: uij = 1 with probability β and uij = 0 with probability 1 − β. Let ν1 = Y1 × U (mod 2) and ν2 = Y2 × U (mod 2). Kushilevitz et al.15 showed that

Eβ ≡ Pr(ν1,j ≠ ν2,j) = [1 − (1 − 2β)^H] / 2,   (22)

which allows us to estimate the hamming distance H by

Ĥrp = log(1 − 2Êβ) / log(1 − 2β),   where Êβ = (1/k) Σ_{j=1}^{k} 1{ν1,j ≠ ν2,j}.   (23)

We calculate the variance of Ĥrp to be

Var(Ĥrp) = [1 − (1 − 2β)^(2H)] / [ k (1 − 2β)^(2H) (log(1 − 2β))^2 ],   (24)

which suggests that the performance of this 1-bit scheme may be sensitive to β, which must be predetermined for all sets at processing time (i.e., it cannot be modified in the estimation phase for a particular pair). Figure 6 provides the "optimal" β (denoted by β*) values (as a function of H) obtained by numerically minimizing the variance (24).
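
The following toy simulation (ours) of the random-projection-plus-mod-2 scheme follows (22) and (23) as given above; the dimension, β, and sets are arbitrary, and the β used here is not tuned to the optimal β*.

    import math
    import random

    rng = random.Random(0)
    D, k, beta = 2000, 400, 0.005
    S1, S2 = set(range(0, 300)), set(range(100, 400))       # true hamming distance H = 200

    # One shared random matrix U: column j keeps each coordinate with probability beta.
    columns = [{t for t in range(D) if rng.random() < beta} for _ in range(k)]
    nu1 = [len(S1 & c) % 2 for c in columns]                 # nu_1 = Y1 x U (mod 2)
    nu2 = [len(S2 & c) % 2 for c in columns]

    E_hat = sum(u != v for u, v in zip(nu1, nu2)) / k        # empirical version of (22)
    if E_hat < 0.5:
        H_hat = math.log(1 - 2 * E_hat) / math.log(1 - 2 * beta)   # estimator (23)
        print("estimated H =", round(H_hat, 1), "(true H = 200)")
    else:
        print("E_hat >= 1/2: the estimator (23) is undefined, as discussed below")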

It is interesting to compare this random projection-based 1-bit scheme with our b-bit minwise hashing using the following ratio of their variances:

Grp,b,β = Var(Ĥrp) / [ b × Var(Ĥb) ].   (25)

Figure 7 shows that if it is possible to choose the optimal β* for random projection, one can achieve good performance, similar to (or even better than) b-bit minwise hashing.

The problem is that we must choose the same β for all sets. Figure 8 presents a typical example, which uses H*/D = 10⁻⁴ to compute the "optimal" β for a wide range of (r1, r2, s) values. The bottom left panel illustrates that, when r1 = 10⁻⁴, using this particular choice of β results in fairly good performance compared to b-bit minwise hashing. (Recall H/D = r1 + r2 − 2s.) As soon as the true H substantially deviates from the guessed H*, however, the performance of Ĥrp using random projection degrades dramatically.

There is one more issue. At the optimal β*(H), our calculations show that the probability (22) is Eβ* ≈ 0.2746. However, if the chosen β > β*(H), then Eβ may approach 1/2. As Êβ is random, it then becomes likely that the observed Êβ > 1/2, in which case log(1 − 2Êβ) in (23) is undefined. Thus, it is safer to "overestimate" H when choosing β. When we have a large collection of sets, this basically means the chosen β will be very different from its optimal value for most pairs.

Finally, Figure 9 provides an empirical study as a sanity check that the variance formula (24) is indeed accurate and that, if the guessed H used to select β deviates from the true H, the random projection estimator Ĥrp exhibits much larger errors than the b-bit hashing estimator Ĥb.


5. Improvement by Combining Bits

Our theoretical and empirical results have confirmed that, when the resemblance R is reasonably high, even a single bit per sample may contain sufficient information for accurately estimating the similarity. This naturally leads to the conjecture that, when R is close to 1, one might further improve the performance by looking at a combination of multiple bits (i.e., "b < 1"). One simple approach is to combine two bits from two permutations using an XOR (⊕) operation.

Recall e1,1,Π denotes the lowest bit of the hashed value under Π. Theorem 1 has proved that

Pr(e1,1,Π = e2,1,Π) = C1,1 + (1 − C2,1) R.

Consider two permutations Π1 and Π2. We store

x1 = e1,1,Π1 ⊕ e1,1,Π2,   x2 = e2,1,Π1 ⊕ e2,1,Π2.

Then x1 = x2 either when e1,1,Π1 = e2,1,Π1 and e1,1,Π2 = e2,1,Π2, or when e1,1,Π1 ≠ e2,1,Π1 and e1,1,Π2 ≠ e2,1,Π2. Thus,

Pr(x1 = x2) = P1^2 + (1 − P1)^2,   where P1 = C1,1 + (1 − C2,1) R,   (26)

which is a quadratic equation in R whose solution yields the estimator:

R̂1/2 = ( [1 + √(2T̂ − 1)] / 2 − C1,1 ) / (1 − C2,1),   where T̂ = (1/k) Σ_{j=1}^{k} 1{x1,j = x2,j}.   (27)

This estimator is slightly biased at small sample sizes k. We use R̂1/2 to indicate that two bits are combined into one (but each sample is still stored using 1 bit). The asymptotic variance of R̂1/2 can be derived to be

Var(R̂1/2) ≈ T(1 − T) / [ 4 k (1 − C2,1)^2 (2P1 − 1)^2 ],   where T = P1^2 + (1 − P1)^2.   (28)

Interestingly, as R → 1, R̂1/2 does twice as well as R̂b=1:

lim_{R→1} Var(R̂b=1) / Var(R̂1/2) = 2.   (29)

On the other hand, R̂1/2 may not be good when R is not too large. For example, one can numerically show that

Var(R̂1/2) > Var(R̂b=1),   if R < 1/√3 ≈ 0.5774 (in the limit r1, r2 → 0).

Figure 10 plots the empirical MSEs for two word pairs in Experiment 1, for R̂1/2, R̂b=1, and R̂M. For the highly similar pair, "KONG-HONG," R̂1/2 exhibits superior performance compared to R̂b=1. For "UNITED-STATES," whose R = 0.591, R̂1/2 performs similarly to R̂b=1.
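
A toy simulation (ours) of the bit-combining estimator: it XORs the lowest bits from pairs of permutations and inverts the quadratic relation (26)–(27) as reconstructed above; the data and the choice r1 = r2 are illustrative.

    import math
    import random

    def lowest_bits(S, perms):
        """The lowest bit of min(pi(S)) for each permutation pi."""
        return [min(p[x] for x in S) & 1 for p in perms]

    rng = random.Random(0)
    D, k = 2000, 500                                          # k stored (combined) bits
    S1, S2 = set(range(0, 700)), set(range(100, 800))         # true R = 600/800 = 0.75
    perms = [rng.sample(range(D), D) for _ in range(2 * k)]   # two permutations per stored bit

    bits1, bits2 = lowest_bits(S1, perms), lowest_bits(S2, perms)
    x1 = [bits1[2 * j] ^ bits1[2 * j + 1] for j in range(k)]  # XOR of two 1-bit samples
    x2 = [bits2[2 * j] ^ bits2[2 * j + 1] for j in range(k)]

    T_hat = sum(u == v for u, v in zip(x1, x2)) / k           # empirical Pr(x1 = x2), see (26)
    r = len(S1) / D                                           # here r1 = r2 = r
    A = (1 - r) / (2 - r)                                     # C1,1 = C2,1 when r1 = r2
    P_hat = (1 + math.sqrt(max(2 * T_hat - 1, 0))) / 2        # invert T = P^2 + (1 - P)^2
    print("estimated R =", round((P_hat - A) / (1 - A), 3), "(true R = 0.75)")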

In summary, for applications which care about very high similarities, combining bits can reduce storage even further.


6. Computational Improvements

When computing set similarity for large collections of samples, the key operation is determining the number of identical b-bit samples. While samples for values of b that are multiples of 16 bits can easily be compared using a single machine instruction, efficiently computing the overlap between b-bit samples for small b is less straightforward. In the following, we will describe techniques for computing the number of identical b-bit samples when these are packed into arrays Al, l = 1, 2, of w-bit words. To compute the number of identical b-bit samples, we iterate through the arrays; for each offset h, we first compute ν = A1[h] ⊕ A2[h]. Now, the number of b-bit blocks in ν that contain only 0s corresponds to the number of identical b-bit samples.

The case of b = 1 corresponds to the problem of counting the number of 0-bits in a word. We tested a number of different methods and found the fastest approach to be precomputing an array bits[1, …, 2¹⁶] such that bits[t] corresponds to the number of 0-bits in the binary representation of t, and then using lookups into this array. This approach extends to b > 1 as well.
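
The following Python sketch (ours; an actual implementation would be in a compiled language) illustrates the table-lookup idea: precompute, for every 16-bit value, the number of its all-zero b-bit blocks, then sum table lookups over the XORed words.

    def build_table(b, width=16):
        """table[t] = number of all-zero b-bit blocks in the width-bit value t."""
        blocks = width // b
        mask = (1 << b) - 1
        return [sum(((t >> (b * i)) & mask) == 0 for i in range(blocks))
                for t in range(1 << width)]

    def matching_samples(A1, A2, table):
        """Count identical b-bit samples between two arrays of packed 16-bit words."""
        return sum(table[w1 ^ w2] for w1, w2 in zip(A1, A2))

    b = 2
    table = build_table(b)
    A1 = [0b1100_0110_0001_1011, 0b0000_1111_0101_0000]   # samples packed 8 per word
    A2 = [0b1100_0110_0001_1000, 0b0000_1111_0101_0011]
    print("identical b-bit samples:", matching_samples(A1, A2, table))   # 14 of 16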

To evaluate this approach, we timed a tight loop computing the number of identical samples in two arrays of b-bit hashes covering a total of 1.8 billion 32-bit words (using a 64-bit Intel 6600 processor). Here, the 1-bit hashing requires 1.67× the time that the 32-bit minwise hashing requires (1.73× when comparing to 64-bit minwise hashing). The results were essentially identical for b = 2, 4, 8. Given that, when R > 0.5, we can gain a storage reduction of 21.3-fold, we expect the resulting improvement in computational efficiency to be 21.3/1.67 ≈ 12.8-fold in the above setup.


7. Extensions and Applications

*  7.1. Three-way resemblance

Many applications in data mining or data cleaning require estimates not only of two-way but also of multi-way similarities. The original minwise hashing naturally extends to multi-way resemblance. In Li et al.,19 we extended b-bit minwise hashing to estimate three-way resemblance. We developed a highly accurate, but complicated, estimator, as well as a much simplified estimator suitable for sparse data. Interestingly, b ≥ 2 bits are needed in order to estimate three-way resemblance. Similar to the two-way case, b-bit minwise hashing can result in an order-of-magnitude reduction in the storage space required for a given estimation accuracy when testing for moderate to high similarity.

*  7.2. Large-scale machine learning

A different category of applications for b-bit minwise hashing is machine learning on very large datasets. For example, one of our projects20 focuses on linear support vector machines (SVM). We were able to show that the resemblance matrix, the minwise hashing matrix, and the b-bit minwise hashing matrix are all positive definite matrices (kernels), and we integrated b-bit minwise hashing with linear SVM. This allows us to significantly speed up training and testing times with almost no loss in classification accuracy for many practical scenarios. In addition, this provides an elegant solution to the problem of SVM training in scenarios where the training data cannot fit in memory.

Interestingly, the technique we used for linear SVM essentially provides a universal strategy for integrating b-bit minwise hashing with many other learning algorithms, for example, logistic regression.

*  7.3. Improving estimates by maximum likelihood estimators

While b-bit minwise hashing is particularly effective in applications which mainly concern sets of high similarity (e.g., R > 0.5), there are other important applications in which not just pairs of high similarity matter. For example, many learning algorithms require all pairwise similarities, and it is expected that only a small fraction of the pairs are similar. Furthermore, many applications care more about containment (e.g., what fraction of one set is contained in another set) than about resemblance. In a recent technical report (http://www.stat.cornell.edu/~li/b-bit-hashing/AccurateHashing.pdf), we showed that the estimators for minwise hashing and b-bit minwise hashing used in current practice can be systematically improved, and that the improvements are most significant for set pairs of low resemblance and high containment.

For minwise hashing, instead of only using Pr(z1 = z2), where z1 and z2 are two hashed values, we can combine it with Pr(z1 < z2) and Pr(z1 > z2) to form a three-cell multinomial estimation problem, whose maximum likelihood estimator (MLE) is the solution to a cubic equation. For b-bit minwise hashing, we formulate a 2^b × 2^b-cell multinomial problem, whose MLE requires a simple numerical procedure.

*  7.4. The new LSH family

Applications such as near neighbor search, similarity clustering, and data mining will significantly benefit from b-bit minwise hashing. It is clear that b-bit minwise hashing will significantly improve the efficiency of simple linear algorithms (for near neighbor search) or simple quadratic algorithms (for similarity clustering), when the key bottleneck is main-memory throughput.

Techniques based on LSH1, 5, 13 have been successfully used to achieve sub-linear (for near neighbor search) or sub-quadratic (for similarity clustering) performance. It is interesting that b-bit minwise hashing is a new family of LSH; hence, in this section, we would like to provide more theoretical properties in the context of LSH and approximate near neighbor search.

Consider a set S1. Suppose there exists another set S2 whose resemblance distance (1 − R) from S1 is at most d0, that is, 1 − R ≤ d0. The goal of c-approximate d0-near neighbor algorithms is to return sets (with high probability) whose resemblance distances from S1 are at most c × d0 with c > 1.

Recall z1 and z2 denote the minwise hashed values for sets S1 and S2, respectively. The performance of the LSH algorithm depends on the difference (gap) between the following P(1) and P(2) (respectively corresponding to d0 and cd0):

P(1) = Pr(z1 = z2) ≥ 1 − d0   (when 1 − R ≤ d0),   P(2) = Pr(z1 = z2) ≤ 1 − c d0   (when 1 − R ≥ c d0).

A larger gap between P(1) and P(2) implies a more efficient LSH algorithm. The following “ρ” value (ρM for minwise hashing) characterizes the gap:

ρM = log P(1) / log P(2) = log(1 − d0) / log(1 − c d0).   (30)

A smaller ρ (i.e., a larger difference between P(1) and P(2)) leads to a more efficient LSH algorithm, and ρ < 1/c is particularly desirable.1, 13 The general LSH theoretical result tells us that the query time for c-approximate d0-near neighbor search is dominated by O(N^ρ) distance evaluations, where N is the total number of sets in the collection.

Recall Pb, as defined in (5), denotes the collision probability for b-bit minwise hashing. The ρb value for c-approximate d0-near neighbor search can be computed as follows:

ρb = log[ C1,b + (1 − C2,b)(1 − d0) ] / log[ C1,b + (1 − C2,b)(1 − c d0) ].   (31)

Figure 11 suggests that b-bit minwise hashing can achieve ρ values very similar to those of the original minwise hashing when the applications care mostly about highly similar sets (e.g., d0 = 0.1, the top panels of Figure 11), even using merely b = 1. If the applications concern sets that are not necessarily highly similar (e.g., d0 = 0.5, the bottom panels), using b = 3 or 4 still yields ρ values similar to those of the original minwise hashing.
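
The sketch below (ours) evaluates ρM and ρb from (30) and (31) for an illustrative choice of d0, c, and r1 = r2.

    import math

    def collision_prob(R, r1, r2, b):
        """Pb from (5)-(8): probability that the lowest b bits of the two minima agree."""
        A1 = r1 * (1 - r1) ** (2 ** b - 1) / (1 - (1 - r1) ** (2 ** b))
        A2 = r2 * (1 - r2) ** (2 ** b - 1) / (1 - (1 - r2) ** (2 ** b))
        C1 = A1 * r2 / (r1 + r2) + A2 * r1 / (r1 + r2)
        C2 = A1 * r1 / (r1 + r2) + A2 * r2 / (r1 + r2)
        return C1 + (1 - C2) * R

    d0, c, r = 0.1, 2.0, 0.1          # illustrative parameters, r1 = r2 = r
    print("rho_M =", round(math.log(1 - d0) / math.log(1 - c * d0), 3))
    for b in (1, 2, 4):
        P1 = collision_prob(1 - d0, r, r, b)       # collision probability at distance d0
        P2 = collision_prob(1 - c * d0, r, r, b)   # collision probability at distance c*d0
        print(f"rho_b (b={b}) =", round(math.log(P1) / math.log(P2), 3))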

We expect that these theoretical properties regarding the ρ values will potentially be useful in future work. We are currently developing new variants of LSH algorithms for near neighbor search based on b-bit minwise hashing.

Subsequent documents will be made available at www.stat.cornell.edu/~li/b-bit-hashing, which is a repository for maintaining the papers and technical reports related to b-bit minwise hashing.


8. Conclusion

Minwise hashing is a standard technique for efficiently estimating set similarity in massive datasets. In this paper, we gave an overview of b-bit minwise hashing, which modifies the original scheme by storing the lowest b bits of each hashed value. We proved that, when the similarity is reasonably high (e.g., resemblance ≥ 0.5), using b = 1 bit per hashed value can, even in the worst case, gain a 21.3-fold improvement in storage space (at similar estimation accuracy), compared to storing each hashed value using 64 bits. As many applications are primarily interested in identifying duplicates of reasonably similar sets, these improvements can result in substantial reduction in storage (and consequently computational) overhead in practice.

We also compared our scheme to other approaches that map the hashed objects to single bits, both in theory as well as experimentally.

Our proposed method is simple and requires only minimal modification to the original minwise hashing algorithm. It can be used in the context of a number of different applications, such as duplicate detection, clustering, similarity search, and machine learning, and we expect that it will be adopted in practice.

*  Acknowledgment

This work is supported by NSF (DMS-0808864), ONR (YIP-N000140910911), and a grant from Microsoft. We thank the Board Members for suggesting a direct comparison with Kushilevitz et al.15


Figures

F1 Figure 1. The absolute errors (approximate–exact) by using (5) are very small even for D = 20 (left panels) or D = 500 (right panels). The exact probability can be numerically computed for small D (from a probability matrix of size D × D). For each D, we selected three f1 values. We always let f2 = 2, 3,…, f1 and a = 0, 1, 2,…, f2.

F2 Figure 2. B(64)/B(b), the relative storage improvement of using b = 1, 2, 3, 4 bits, compared to using 64 bits. B(b) is defined in (12).

F3 Figure 3. Mean square errors (MSEs). "M" denotes the original minwise hashing. "Theor." denotes the theoretical variances Var(R̂b) (11) and Var(R̂M) (3). The dashed curves, however, are invisible because the empirical MSEs overlap the theoretical variances. At the same k, Var(R̂M) ≤ Var(R̂b=3) ≤ Var(R̂b=2) ≤ Var(R̂b=1). However, R̂b (b = 1, 2, 3) only requires 1/2/3 bits per sample, while R̂M may require 64 bits.

F4 Figure 4. The task is to retrieve news article pairs with resemblance R ≥ R0. The recall curves cannot differentiate the estimators. The precision curves are more interesting. When R0 = 0.4, to achieve a precision of 0.80, the estimators R̂M, R̂b=4, R̂b=2, and R̂b=1 require k = 50, 50, 75, and 145, respectively, indicating that R̂b=4, R̂b=2, and R̂b=1, respectively, improve upon R̂M (assuming 64 bits per sample) by 16-, 21.4-, and 22-fold. The improvement becomes larger as R0 increases.

F5 Figure 5. Gs,b=1 as defined in (20), illustrating the improvement of b-bit minwise hashing over simple random sampling. We consider r1 = 10⁻⁴, 0.1, 0.5, 0.9, with r2 ranging from 0.1 r1 to 0.9 r1 and s from 0 to r2. Note that r1 + r2 − s ≤ 1 has to be satisfied.

F6 Figure 6. Left panel: the β* values at which the smallest variances (24) are attained. Right panel: the corresponding optimal variances.

F7 Figure 7. Grp,b=1,β* as defined in (25), comparing Ĥb=1 with Ĥrp. For each combination of r1, r2, s, we computed and used the optimal β* for the variance.

F8 Figure 8. Grp,b=1,β (25) computed using a fixed β, namely the optimal β for H = H* = 10⁻⁴ D.

F9 Figure 9. The exact H for this pair of sets is 0.0316. We use the optimal β at H*/D = 0.1 (left panel) and H*/D = 0.5 (right panel). Compared to Ĥb=1, the MSEs of Ĥrp (labeled "rp") are substantially larger. The theoretical variances (dashed lines), (24) and (17), essentially overlap the empirical MSEs.

F10 Figure 10. MSEs comparing R̂1/2 (27) with R̂b=1 and R̂M. Due to the bias of R̂1/2, the theoretical variance Var(R̂1/2), that is, (28), deviates from the empirical MSEs when k is small.

F11 Figure 11. ρM and ρb defined in (30) and (31) for measuring the potential performance of LSH algorithms. “M” denotes the original minwise hashing.


Tables

T1 Table 1. Six word pairs for Experiment 1.

References

    1. Andoni, A., Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51 (2008), 117–122.

    2. Broder, A.Z. On the resemblance and containment of documents. In The Compression and Complexity of Sequences (Positano, Italy, 1997), 21–29.

    3. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M. Min-wise independent permutations. J. Comput. Syst. Sci. 60, 3 (2000), 630–659.

    4. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G. Syntactic clustering of the web. In WWW (Santa Clara, CA, 1997), 1157–1166.

    5. Charikar, M.S. Similarity estimation techniques from rounding algorithms. In STOC (Montreal, Quebec, Canada, 2002), 380–388.

    6. Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J.D., Yang, C. Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng. 13, 1 (2001), 64–78.

    7. Fetterly, D., Manasse, M., Najork, M., Wiener, J.L. A large-scale study of the evolution of web pages. In WWW (Budapest, Hungary, 2003), 669–678.

    8. Forman, G., Eshghi, K., Suermondt, J. Efficient detection of large-scale redundancy in enterprise file systems. SIGOPS Oper. Syst. Rev. 43, 1 (2009), 84–91.

    9. Gamon, M., Basu, S., Belenko, D., Fisher, D., Hurst, M., König, A.C. Blews: Using blogs to provide context for news articles. In AAAI Conference on Weblogs and Social Media (Redmond, WA, 2008).

    10. Gionis, A., Gunopulos, D., Koudas, N. Efficient and tunable similar set retrieval. In SIGMOD (Santa Barbara, CA, 2001), 247–258.

    11. Goemans, M.X., Williamson, D.P. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM 42, 6 (1995), 1115–1145.

    12. Indyk, P. A small approximately min-wise independent family of hash functions. J. Algorithms 38, 1 (2001), 84–90.

    13. Indyk, P., Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC (Dallas, TX, 1998), 604–613.

    14. Itoh, T., Takei, Y., Tarui, J. On the sample size of k-restricted min-wise independent permutations and other k-wise distributions. In STOC (San Diego, CA, 2003), 710–718.

    15. Kushilevitz, E., Ostrovsky, R., Rabani, Y. Efficient search for approximate nearest neighbor in high dimensional spaces. In STOC (Dallas, TX, 1998), 614–623.

    16. Li, P., Church, K.W. A sketch algorithm for estimating two-way and multi-way associations. Comput. Linguist. 33, 3 (2007), 305–354 (Preliminary results appeared in HLT/EMNLP 2005).

    17. Li, P., Church, K.W., Hastie, T.J. One sketch for all: Theory and applications of conditional random sampling. In NIPS (Vancouver, British Columbia, Canada, 2008) (Preliminary results appeared in NIPS 2006).

    18. Li, P., Hastie, T.J., Church, K.W. Improving random projections using marginal information. In COLT (Pittsburgh, PA, 2006), 635–649.

    19. Li, P., König, A.C., Gui, W. b-Bit minwise hashing for estimating three-way similarities. In NIPS (Vancouver, British Columbia, Canada, 2010).

    20. Li, P., Moore, J., König, A.C. b-Bit minwise hashing for large-scale linear SVM. Technical report, 2011. http://www.stat.cornell.edu/~li/b-bit-hashing/HashingSVM.pdf

    21. Cherkasova, L., Eshghi, K., Morrey III, C.B., Tucek, J., Veitch, A. Applying Syntactic similarity algorithms for enterprise information management. In KDD (Paris, France, 2009), 1087–1096.

    22. Manasse, M., McSherry, F., Talwar, K. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.

    23. Pandey, S., Broder, A., Chierichetti, F., Josifovski, V., Kumar, R., Vassilvitskii, S. Nearest-neighbor caching for content-match applications. In WWW (Madrid, Spain, 2009), 441–450.

    24. Rajaraman, A., Ullman, J. Mining of Massive Datasets. http://i.stanford.edu/ullman/mmds.html

    25. Urvoy, T., Chauveau, E., Filoche, P., Lavergne, T. Tracking web spam with html style similarities. ACM Trans. Web 2, 1 (2008), 1–28.

    The previous version of this paper, entitled "b-Bit Minwise Hashing for Estimating Three-way Similarities," was published in Proceedings of the Neural Information Processing Systems: NIPS 2010 (Vancouver, British Columbia, Canada).
