Home/Magazine Archive/August 2011 (Vol. 54, No. 8)/Theory and Applications of *b*-Bit Minwise Hashing/Full Text

Research highlights
## Theory and Applications of *b*-Bit Minwise Hashing

Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is *minwise hashing*. This paper describes *b-bit minwise hashing*, which can provide an order of magnitude improvements in storage requirements and computational overhead over the original scheme in practice.

We give both theoretical characterizations of the performance of the new algorithm as well as a practical evaluation on large real-life datasets and show that these match very closely. Moreover, we provide a detailed comparison with other important alternative techniques proposed for estimating set similarities. Our technique yields a very simple algorithm and can be realized with only minor modifications to the original minwise hashing scheme.

With the advent of the Internet, many applications are faced with very large and inherently high-dimensional datasets. A common task on these is *similarity search*, that is, given a high-dimensional data point, the retrieval of data points that are close under a given distance function. In many scenarios, the storage and computational requirements for computing exact distances between all data points are prohibitive, making data representations that allow compact storage and efficient approximate distance computation necessary.

In this paper, we describe *b-bit minwise hashing*, which leverages properties common to many application scenarios to obtain order-of-magnitude improvements in the storage space and computational overhead required for a given level of accuracy over existing techniques. Moreover, while the theoretical analysis of these gains is technically challenging, the resulting algorithm is simple and easy to implement.

To describe our approach, we first consider the concrete task of Web page duplicate detection, which is of critical importance in the context of Web search and was one of the motivations for the development of the original *minwise hashing* algorithm by Broder et al.^{2, 4} Here, the task is to identify pairs of pages that are textually very similar. For this purpose, Web pages are modeled as "a set of shingles," where a shingle corresponds to a string of *w* contiguous words occurring on the page. Now, given two such sets *S*_{1}, , the normalized similarity known as *resemblance* or *Jaccard similarity*, denoted by *R*, is

Duplicate detection now becomes the task of detecting pairs of pages for which *R* exceeds a threshold value. Here, *w* is a tuning parameter and was set to be *w* = 5 in several studies.^{2, 4, 7} Clearly, the total number of possible shingles is huge. Considering 10^{5} unique English words, the total number of possible 5-shingles should be *D* = (10^{5})^{5} = *O*(10^{25}). A prior study^{7} used *D* = 2^{64} and even earlier studies^{2, 4} used *D* = 2^{40}. Due to the size of *D* and the number of pages crawled as part of Web search, computing the exact similarities for all pairs of pages may require prohibitive storage and computational overhead, leading to approximate techniques based on more compact data structures.

**1.1. Minwise hashing**

To address this issue, Broder and his colleagues developed *minwise hashing* in their seminal work.^{2, 4} Here, we give a brief introduction to this algorithm. Suppose a random permutation is performed on , that is,

An elementary probability argument shows that

After *k* minwise independent permutations, _{1}, _{2},..., _{k}, one can estimate *R* without bias, as a binomial probability:

We will frequently use the terms "sample" and "sample size" (i.e., *k*). For minwise hashing, a sample is a hashed value, min(_{j}(*S*_{i})), which may require, for example, 64 bits.^{7}

Since the original minwise hashing work,^{2, 4} there have been considerable theoretical and methodological developments.^{3, 5, 12, 14, 16, 17, 22}

*Applications:* As a general technique for estimating set similarity, minwise hashing has been applied to a wide range of applications, for example, content matching for online advertising,^{23} detection of redundancy in enterprise file systems,^{8} syntactic similarity algorithms for enterprise information management,^{21} Web spam,^{25} etc.

Many of the applications of minwise hashing are targeted at detecting duplicates or pairs of somewhat high similarity. By proposing an estimator that is particularly accurate for these scenarios, we can reduce the required storage and computational overhead dramatically. Here, the computational savings are a function of how the min-wise hashes are used. For any technique that does compute the pairwise similarity for (a large subset of) all pairs, the computation is typically bound by the speed at which the samples can be brought into memory (as the computation itself is simple); hence, the space reduction our technique offers directly translates into order-of-magnitude speedup as well.

However, even with the data-size reduction, computing all pairwise similarities is prohibitively expensive in many scenarios. This has lead to a number of approaches that avoid this computation by grouping (subsets of) the samples into buckets and only computing the pairwise similarities for items within the same (set of) buckets. This approach avoids the quadratic number of comparisons, at the cost of some loss in accuracy. Examples of such approaches are the *supershingles*^{4} or techniques based on *locally sensitive hashing (LSH)*^{1, 5, 13} (also see Chapter 3 of Rajaraman and Ullman^{24} for an excellent detailed explanation of LSH and see Cohen et al.^{6} for nice applications of LSH ideas in mining associations).

**1.2. b-Bit minwise hashing**

In this paper, we establish a unified theoretical framework for *b-bit minwise hashing*. In our scheme, a *sample* consists of *b* bits only, as opposed to, for example, *b* = 64 bits^{7} in the original minwise hashing. Intuitively, using fewer bits per sample will increase the estimation variance, compared to (3), at the same *sample size k*. Thus, we will have to increase *k* to maintain the same accuracy. Interestingly, our theoretical results will demonstrate that, when resemblance is not too small (which is the case in many applications, e.g., consider *R*0.5, the threshold used in Broder et al.^{2, 4}), we do not have to increase *k* much. This means our proposed *b*-bit minwise hashing can be used to improve estimation accuracy and significantly reduce storage requirements at the same time.

For example, when *b* = 1 and *R* = 0.5, the estimation variance will increase at most by a factor of 3. In order not to lose accuracy, we have to increase the sample size by a factor of 3. If we originally stored each hashed value using 64 bits, the improvement by using *b* = 1 will be 64/3 = 21.3.

Algorithm 1 illustrates the procedure of *b*-bit minwise hashing, based on the theoretical results in Section 2.

**1.3. Related work**

*Locality sensitive hashing* (LSH)^{1, 5, 13} is a set of techniques for performing approximate search in high dimensions. In the context of estimating set intersections, there exist LSH families for estimating the *resemblance*, the *arccosine*, and the *hamming distance*. Our *b*-bit minwise hashing proposes a new construction of an LSH family (Section 7.4).

**Algorithm 1** The *b*-bit minwise hashing algorithm, applied to estimating pairwise resemblances in a collection of *N* sets.

**Input:** Sets , *n* = 1 to *N*.

**Preprocessing**

(1): Generate *k* random permutations _{j}: , *j* = 1 to *k*.

(2): For each set *S*_{n} and each permutation _{j}, store the lowest *b* bits of min (_{j}(*S*_{n})), denoted by *e*_{n,i,j}, *i* = 1 to *b*.

**Estimation:** (Use two sets *S*_{1} and *S*_{2} as an example)

(1): Compute .

(2): Estimate the resemblance by , where *C*_{1,b} and *C*_{2,b} are from Theorem 1 in Section 2.

In Charikar^{5} and Gionis et al.,^{10} the authors describe hashing schemes that map objects to {0, 1}. The algorithms for the construction, however, are problem specific. Three discovered 1-bit schemes are (i) the *simhash*^{5} based on *sign random projection*,^{11, 18} (ii) the hamming distance algorithm based on simple random sampling,^{13} and (iii) the hamming distance algorithm based on a variant of random projection.^{15}

Section 4 will compare our method with two *hamming distance* algorithms.^{13, 15} We also wrote a report (http://www.stat.cornell.edu/~li/b-bit-hashing/RP_minwise.pdf), which demonstrated that, unless the similarity is very low, *b*-bit minwise hashing outperforms sign random projections.

A related approach is *conditional random sampling* (CRS)^{16, 17} which uses only a single permutation and instead of a single minimum retains as set of the smallest hashed values. CRS provides more accurate (in some scenarios substantially so) estimators for binary data and naturally extends to real-value data and dynamic streaming data; moreover, the same set of hashed values can be used to estimate a variety of summary statistics including histograms, *l*_{p} distances (for any *p*), number of distinct values, ^{2} distances, entropies, etc. However, we have not developed a *b*-bit scheme for CRS, which appears to be a challenging task.

Consider two sets *S*_{1}, . Apply a random permutation on *S*_{1} and *S*_{2}, where : . Define the minimum values under to be *z*_{1} and *z*_{2}:

Define *e*_{1,i} = *i*th lowest bit of *z*_{1}, and *e*_{2,i} = *i*th lowest bit of *z*_{2}. Theorem 1 derives the main probability formula. Its proof assumes that *D* is large, which is virtually always satisfied in practice. This result is a good example of approaching a difficult problem by reasonable approximations.

THEOREM 1. *Assume D is large*.

The intuition for the difference between (5) and the equivalent equation for minwise hashing (1) is that even when *R* = 0, the collision probability *P*_{b} (i.e., the probability that two minima agree on their last *b* bits) is not zero, but rather *C*_{1,b}. Having to account for this type of "false positives" makes the derivation more difficult, resulting in the additional terms in (5). Of course, as expected, if *R* = 1, then *P*_{b} = 1 (because in this case *r*_{1} = *r*_{2} and *C*_{1,b} = *C*_{2,b}).

Note that the only assumption needed in the proof of Theorem 1 is that *D* is large, which is virtually always satisfied in practice. Interestingly, (5) is remarkably accurate even for very small *D.* Figure 1 shows that when *D* = 20 (*D* = 500), the absolute error caused by using (5) is <0.01 (<0.0004).

**2.1. The unbiased estimator**

Theorem 1 suggests an unbiased estimator for *R*:

where *e*_{1,i,j} (*e*_{2,i,j}) denotes the *i*th lowest bit of *z*_{1} (*z*_{2}), under the permutation _{j}. The variance is

For large *b*, Var () converges to the variance of , the estimator for the original minwise hashing:

In fact, when *b* = 64, Var () and Var () are numerically indistinguishable for practical purposes.

**2.2. The variance-space trade-off**

As we decrease *b*, the space needed for storing each "sample" will be smaller; the estimation variance (11) at the same sample size *k*, however, will increase. This variance-space trade-off can be precisely quantified by *B*(*b*; *R*, *r*_{1}, *r*_{2}):

Lower *B*(*b*) is better. The ratio, , measures the improvement of using *b* = *b*_{2} (e.g., *b*_{2} = 1) over using *b* = *b*_{1} (e.g., *b*_{1} = 64). Some algebra yields the following Lemma.

LEMMA 1. *If* *r*_{1} = *r*_{2} and *b*_{1} > *b*_{2}, then

*is a monotonically increasing function of R* [0, 1].

*If R* 1 (*which implies r*_{1}, *r*_{2} 1), *then*

*If r*_{1} = *r*_{2}, *b*_{2} = 1, *b*_{1} = 64 (*hence we treat A*_{1,b} = 0), *then*

Suppose the original minwise hashing used *b* = 64, then the maximum improvement of *b*-bit minwise hashing would be 64-fold, attained when *r*_{1} = *r*_{2} = 1 and *R* = 1. In the least favorable situation, that is, *r*_{1}, *r*_{2} 0, the improvement will still be when *R* = 0.5.

Figure 2 plots to directly visualize the relative improvements, which are consistent with what Lemma 1 predicts. The plots show that, when *R* is very large (which is the case in many practical applications), it is always good to use *b* = 1. However, when *R* is small, using larger *b* may be better. The cut-off point depends on *r*_{1}, *r*_{2}, *R.* For example, when *r*_{1} = *r*_{2} and both are small, it would be better to use *b* = 2 than *b* = 1 if *R* < 0.4, as shown in Figure 2.

In the following, we evaluate the accuracy of the theoretical derivation and the practical performance of our approach using two sets of experiments. *Experiment 1* is a sanity check, to verify: (i) our proposed estimator in (9) is unbiased and (ii) its variance follows the prediction by our formula in (11). *Experiment 2* is a duplicate detection task using a Microsoft proprietary collection of 1,000,000 news articles.

**3.1. Experiment 1**

The data, extracted from Microsoft Web crawls, consists of six pairs of sets. Each set consists of the document IDs, which contain the word at least once. We now use *b*-bit min-wise hashing to estimate the similarities of these sets (i.e., we estimate the strength of the word associations).

Table 1 summarizes the data and provides the theoretical improvements . The words were selected to include highly frequent pairs (e.g., "OF-AND"), highly rare pairs (e.g., "GAMBIA-KIRIBATI"), highly unbalanced pairs (e.g., "A-TEST"), highly similar pairs (e.g., "KONG-HONG"), as well as pairs that are not quite similar (e.g., "LOW-PAY").

We estimate the resemblance using the original minwise hashing estimator and the *b*-bit estimator (*b* = 1, 2, 3).

Figure 3 plots the empirical mean square errors (MSE = variance + bias^{2}) in solid lines and the theoretical variances (11) in dashed lines for all word pairs. All dashed lines are invisible because they overlap with the corresponding solid curves. Thus, this experiment validates that the variance formula (11) is accurate and is indeed unbiased (otherwise, the MSE will differ from the variance).

**3.2. Experiment 2**

To illustrate the improvements by the use of *b*-bit minwise hashing on a real-life application, we conducted a duplicate detection experiment using a corpus of 10^{6} news documents. The dataset was crawled as part of the BLEWS project at Microsoft.^{9} We computed pairwise resemblances for all documents and retrieved document pairs with resemblance *R* larger than a threshold *R*_{0}. We estimate the resemblances using with *b* = 1, 2, 4 bits and the original minwise hashing. Figure 4 presents the precision and recall curves. The recall values (bottom two panels in Figure 4) are all very high and do not differentiate the estimators.

The precision curves for (using 4 bits per sample) and (assuming 64 bits per sample) are almost indistinguishable, suggesting a 16-fold improvement in space using *b* = 4.

When using *b* = 1 or 2, the space improvements are normally around 20- to 40-fold, compared to (assuming 64 bits per sample), especially for achieving high precision.

Closely related to the resemblance, the *hamming distance H* is another important similarity measure. In the context of hamming distance, a set is mapped to a *D*-dimensional binary vector *Y*_{i}: *Y*_{it} = 1, if *t* *S*_{i} and 0 otherwise. The hamming distance between *Y*_{1} and *Y*_{2} is

Thus, one can apply *b*-bit minwise hashing to estimate *H*, by converting the estimated resemblance (9) to :

The variance of can be computed from Var () (11) by the "delta method" (i.e., :

We will first compare with an algorithm based on simple random sampling^{13} and then with another algorithm based on a variant of random projection.^{15}

**4.1. Simple random sampling algorithm**

To reduce the storage, we can randomly sample *k* coordinates from the original data *Y*_{1} and *Y*_{2} in *D*-dimensions. The samples, denoted by *h*_{1} and *h*_{2}, are *k*-dimensional bit vectors, from which we can estimate *H*:

whose variance would be (assuming *k* *D*)

Comparing the two variances, (17) and (19), we find that the variance of using simple random sampling, that is, Var , is substantially larger than the variance of using *b*-bit minwise hashing, that is, Var (), especially when the data is sparse. We consider in practice one will most likely implement the random sampling algorithm by storing only the original locations (coordinates) of the nonzeros in the samples. If we do so, the total bits on average will be (per set). This motivates us to define the following ratio:

to compare the storage costs. Recall each sample of *b*-bit minwise hashing requires *b* bits (i.e., *bk* bits per set). The following Lemma may help characterize the improvement:

LEMMA 2. *If r*_{1}, *r*_{2} 0, *then G*_{s,b} *as defined in* (20)

In other words, for small *r*_{1}, *r*_{2}, ; if *R* 0; and , if *R* 1. Figure 5 plots *G*_{s,b} = 1, verifying the substantial improvement of *b*-bit minwise hashing over simple random sampling (often 10- to 30-fold).

**4.2. Random projection + modular arithmetic**

An interesting 1-bit scheme was developed in Kushilevitz et al.^{15} using random projection followed by modular arithmetic. A random matrix is generated with entries being i.i.d. samples *u*_{ij} from a binomial distribution: *u*_{ij} = 1 with probability and *u*_{ij} = 0 with probability . Let _{1} = *Y*_{1} × *U* (mod 2) and _{2} = *Y*_{2} × *U* (mod 2). Kushilevitz et al.^{15} showed that

which allows us to estimate the hamming distance *H* by

We calculate the variance of to be

which suggests that the performance of this 1-bit scheme might be sensitive to that must be predetermined for all sets at the processing time (i.e., it cannot be modified in the estimation phrase for a particular pair). Figure 6 provides the "optimal" (denoted by *) values (as function of *H*) by numerically minimizing the variance (24).

It is interesting to compare this random projection-based 1-bit scheme with our *b*-bit minwise hashing using the following ratio of their variances:

Figure 7 shows that if it is possible to choose the optimal * for random projection, one can achieve good performance, similar to (or even better than) *b*-bit minwise hashing.

The problem is that we must choose the same for all sets. Figure 8 presents a typical example, which uses *H**/*D* = 10^{4} to compute the "optimal" for a wide range of (*r*_{1}, *r*_{2}, *s*) values. The left bottom panel illustrates that when *r*_{1} = 10^{4} using this particular choice of results in fairly good performance compared to *b*-bit minwise hashing. (Recall *H/D* = *r*_{1} + *r*_{2} - 2*s*.) As soon as the true *H* substantially deviates from the guessed *H**, the performance of using random projection degrades dramatically.

There is one more issue. At the optimal *(*H*), our calculations show that the probability (22) *E*_{*} 0.2746. However, if the chosen > *(*H*), then *E*_{} may approach 1/2. As is random, it is likely that the observed > 1/2, that is, log (1 2) becomes undefined in (23). Thus, it is safer to "overestimate" *H* when choosing . When we have a large collection of sets, this basically means the chosen will be very different from its optimal value for most pairs.

Finally, Figure 9 provides an empirical study as a sanity check that the variance formula (24) is indeed accurate and that, if the guessed *H* for selecting deviates from the true *H*, then the random projection estimator exhibits much larger errors than the *b*-bit hashing estimator .

Our theoretical and empirical results have confirmed that, when the resemblance *R* is reasonably high, even a single bit per sample may contain sufficient information for accurately estimating the similarity. This naturally leads to the conjecture that, when *R* is close to 1, one might further improve the performance by looking at a combination of multiple bits (i.e., "*b* < 1"). One simple approach is to combine two bits from two permutations using XOR () operations.

Recall *e*_{1,1,} denotes the lowest bit of the hashed value under . Theorem 1 has proved that

Consider two permutations _{1} and _{2}. We store

Then *x*_{1} = *x*_{2} either when and , or, when and . Thus,

which is a quadratic equation with a solution:

This estimator is slightly biased at small sample size *k.* We use to indicate that two bits are combined into one (but each sample is still stored using 1 bit). The asymptotic variance of can be derived to be

Interestingly, as *R* 1, does twice as well as :

On the other hand, may not be good when *R* is not too large. For example, one can numerically show that

Figure 10 plots the empirical MSEs for two-word pairs in Experiment 1, for , , and . For the highly similar pair, "KONG-HONG," exhibits superior performance compared to . For "UNITED-STATES," whose *R* = 0.591, performs similarly to .

In summary, for applications which care about very high similarities, combining bits can reduce storage even further.

When computing set similarity for large sets of samples, the key operation is determining the number of identical *b*-bit samples. While samples for values of *b* that are multiples of 16 bits can easily be compared using a single machine instruction, efficiently computing the overlap between *b*-bit samples for small *b* is less straightforward. In the following, we will describe techniques for computing the number of identical *b*-bit samples when these are packed into arrays , *l* = 1,2 of *w*-bit words. To compute the number of identical *b*-bit samples, we iterate through the arrays; for each offset *h*, we first compute = *A*_{1}[*h*] *A*_{2}[*h*]. Now, the number of *b*-bit blocks in *u* that contain only 0s corresponds to the number of identical *b*-bit samples.

The case of *b* = 1 corresponds to the problem of counting the number of 0-bits in a word. We tested a number of different methods and found the fastest approach to be precomputing an array *bits*[1,..., 2^{16}] such that *bits*[*t*] corresponds to the number of 0-bits in the binary representation of *t* and using lookups into this array. This approach extends to *b* > 1 as well.

To evaluate this approach we timed a tight loop computing the number of identical samples in two arrays of *b*-bit hashes covering a total of 1.8 billion 32-bit words (using a 64-bit Intel 6600 Processor). Here, the 1-bit hashing requires 1.67× the time that the 32-bit minwise hashing requires (1.73× when comparing to 64-bit minwise hashing). The results were essentially identical for *b* = 2, 4, 8. Given that, when *R* > 0.5, we can gain a storage reduction of 21.3-fold, we expect the resulting improvement in computational efficiency to be 21.3/1.67 = 12.8-fold in the above setup.

**7.1. Three-way resemblance**

Many applications in data mining or data cleaning require not only estimates of two-way, but also of multi-way similarities. The original minwise hashing naturally extends to multi-way resemblance. In Li et al.,^{19} we extended *b*-bit minwise hashing to estimate three-way resemblance. We developed a highly accurate, but complicated estimator, as well as a much simplified estimator suitable for sparse data. Interestingly, at least *b* 2 bits are needed in order to estimate three-way resemblance. Similar to the two-way case, *b*-bit minwise hashing can result in an order-of-magnitude reduction in the storage space required for a given estimation accuracy when testing for moderate to high similarity.

**7.2. Large-scale machine learning**

A different category of applications for *b*-bit minwise hashing is machine learning on very large datasets. For example, one of our projects^{20} focuses on linear support vector machines (SVM). We were able to show that the resemblance matrix, the minwise hashing matrix, and the *b*-bit minwise hashing matrix are all positive definite matrices (kernels), and we integrated *b*-bit minwise hashing with linear SVM. This allows us to significantly speed up training and testing times with almost no loss in classification accuracy for many practical scenarios. In addition, this provides an elegant solution to the problem of SVM training in scenarios where the training data cannot fit in memory.

Interestingly, the technique we used for linear SVM essentially provides a universal strategy for integrating *b*-bit minwise hashing with many other learning algorithms, for example, logistic regression.

**7.3. Improving estimates by maximum likelihood estimators**

While *b*-bit minwise hashing is particularly effective in applications which mainly concern sets of high similarities (e.g., *R* > 0.5), there are other important applications in which not just pairs of high similarities matter. For example, many learning algorithms require all pairwise similarities and it is expected that only a small fraction of the pairs are similar. Furthermore, many applications care more about *containment* (e.g., which fraction of one set is contained in another set) than the *resemblance*. In a recent technical report (http://www.stat.cornell.edu/~li/b-bit-hashing/AccurateHashing.pdf), we showed that the estimators for minwise hashing and *b*-bit minwise hashing used in the current practice can be systematically improved and the improvements are most significant for set pairs of low resemblance and high containment.

For minwise hashing, instead of only using **Pr**(*z*_{1} = *z*_{2}), where *z*_{1} and *z*_{2} are two hashed values, we can combine it with **Pr**(*z*_{1} < *z*_{2}) and **Pr**(*z*_{1} > *z*_{2}) to form a three-cell multinomial estimation problem, whose maximum likelihood estimator (MLE) is the solution to a cubic equation. For *b*-bit minwise hashing, we formulate a 2^{b} × 2^{b}-cell multinomial problem, whose MLE requires a simple numerical procedure.

**7.4. The new LSH family**

Applications such as near neighbor search, similarity clustering, and data mining will significantly benefit from *b*-bit minwise hashing. It is clear that *b*-bit minwise hashing will significantly improve the efficiency of simple linear algorithms (for near neighbor search) or simple quadratic algorithms (for similarity clustering), when the key bottleneck is main-memory throughput.

Techniques based on *LSH*^{1, 5, 13} have been successfully used to achieve sub-linear (for near neighbor search) or sub-quadratic (for similarity clustering) performance. It is interesting that *b*-bit minwise hashing is a new family of LSH; hence, in this section, we would like to provide more theoretical properties in the context of LSH and approximate near neighbor search.

Consider a set *S*_{1}. Suppose there exists another set *S*_{2} whose resemblance distance (1 *R*) from *S*_{1} is at most *d*_{0}, that is, 1 - *R* *d*_{0}. The goal of *c-approximate d*_{0}-*near neighbor* algorithms is to return sets (with high probability) whose resemblance distances from *S*_{1} are at most *c×d*_{0} with *c* > 1.

Recall *z*_{1} and *z*_{2} denote the minwise hashed values for sets *S*_{1} and *S*_{2}, respectively. The performance of the LSH algorithm depends on the difference (gap) between the following *P*^{(1)} and *P*^{(2)} (respectively corresponding to *d*_{0} and *cd*_{0}):

A larger gap between *P*^{(1)} and *P*^{(2)} implies a more efficient LSH algorithm. The following "" value (_{M} for minwise hashing) characterizes the gap:

A smaller (i.e., larger difference between *P*^{(1)} and *P*^{(2)} leads to a more efficient LSH algorithm and is particularly desirable.^{1, 13} The general LSH theoretical result tells us that the query time for *c*-approximate *d*_{0}-near neighbor is dominated by *O*(*N*^{}) distance evaluations, where *N* is the total number of sets in the collection.

Recall *P*_{b}, as defined in (5), denotes the collision probability for *b*-bit minwise hashing. The _{b} value for *c*-approximate *d*_{0}-near neighbor search can be computed as follows:

Figure 11 suggests that *b*-bit minwise hashing can potentially achieve very similar values compared to the original minwise hashing, when the applications care mostly about highly similar sets (e.g., *d*_{0} = 0.1, the top panels of Figure 11), even using merely *b* = 1. If the applications concern sets that are not necessarily highly similar (e.g., *d*_{0} = 0.5, the bottom panels), using *b* = 3 or 4 will still have similar values as using the original minwise hashing.

We expect that these theoretical properties regarding the values will potentially be useful in future work. We are currently developing new variants of LSH algorithms for near neighbor search based on *b*-bit minwise hashing.

Subsequent documents will be made available at www.stat.cornell.edu/~li/b-bit-hashing, which is a repository for maintaining the papers and technical reports related to b-bit minwise hashing.

*Minwise hashing* is a standard technique for efficiently estimating set similarity in massive datasets. In this paper, we gave an overview of *b-bit minwise hashing*, which modifies the original scheme by storing the lowest *b* bits of each hashed value. We proved that, when the similarity is reasonably high (e.g., resemblance 0.5), using *b* = 1 bit per hashed value can, even in the worst case, gain a 21.3-fold improvement in storage space (at similar estimation accuracy), compared to storing each hashed value using 64 bits. As many applications are primarily interested in identifying duplicates of reasonably similar sets, these improvements can result in substantial reduction in storage (and consequently computational) overhead in practice.

We also compared our scheme to other approaches that map the hashed objects to single bits, both in theory as well as experimentally.

Our proposed method is simple and requires only minimal modification to the original minwise hashing algorithm. It can be used in the context of a number of different applications, such as duplicate detection, clustering, similarity search, and machine learning, and we expect that it will be adopted in practice.

**Acknowledgment**

This work is supported by NSF (DMS-0808864), ONR (YIP-N000140910911), and a grant from Microsoft. We thank the Board Members for suggesting a direct comparison with Kushilevitz et al.^{15}

1. Andoni, A., Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. *Commun. ACM 51* (2008), 117122.

2. Broder, A.Z. On the resemblance and containment of documents. In *The Compression and Complexity of Sequences* (Positano, Italy, 1997), 2129.

3. Broder, A.Z., Charikar, M., Frieze, A.M., Mitzenmacher, M. Min-wise independent permutations. *J. Comput. Syst. Sci. 60*, 3 (2000), 630659.

4. Broder, A.Z., Glassman, S.C., Manasse, M.S., Zweig, G. Syntactic clustering of the web. In *WWW* (Santa Clara, CA, 1997), 11571166.

5. Charikar, M.S. Similarity estimation techniques from rounding algorithms. In *STOC* (Montreal, Quebec, Canada, 2002), 380388.

6. Cohen, E., Datar, M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J.D., Yang, C. Finding interesting associations without support pruning. *IEEE Trans. Knowl. Data Eng. 13*, 1 (2001), 6478.

7. Fetterly, D., Manasse, M., Najork, M., Wiener, J.L. A large-scale study of the evolution of web pages. In *WWW* (Budapest, Hungary, 2003), 669678.

8. Forman, G., Eshghi, K., Suermondt, J. Efficient detection of large-scale redundancy in enterprise file systems. *SIGOPS Oper. Syst. Rev. 43*, 1 (2009), 8491.

9. Gamon, M., Basu, S., Belenko, D., Fisher, D., Hurst, M., König, A.C. Blews: Using blogs to provide context for news articles. In *AAAI Conference on Weblogs and Social Media* (Redmond, WA, 2008).

10. Gionis, A., Gunopulos, D., Koudas, N. Efficient and tunable similar set retrieval. In *SIGMOD* (Santa Barbara, CA, 2001), 247258.

11. Goemans, M.X., Williamson, D.P. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. *J. ACM 42*, 6 (1995), 11151145.

12. Indyk, P. A small approximately min-wise independent family of hash functions. *J. Algorithms 38*, 1 (2001), 8490.

13. Indyk, P., Motwani, R. Approximate nearest neighbors: towards removing the curse of dimensionality. In *STOC* (Dallas, TX, 1998), 604613.

14. Itoh, T., Takei, Y., Tarui, J. On the sample size of k-restricted min-wise independent permutations and other k-wise distributions. In *STOC* (San Diego, CA, 2003), 710718.

15. Kushilevitz, E., Ostrovsky, R., Rabani, Y. Efficient search for approximate nearest neighbor in high dimensional spaces. In *STOC* (Dallas, TX, 1998), 614623.

16. Li, P., Church, K.W. A sketch algorithm for estimating two-way and multi-way associations. *Comput. Linguist. 33*, 3 (2007), 305354 (Preliminary results appeared in HLT/EMNLP 2005).

17. Li, P., Church, K.W., Hastie, T.J. One sketch for all: Theory and applications of conditional random sampling. In *NIPS* (Vancouver, British Columbia, Canada, 2008) (Preliminary results appeared in NIPS 2006).

18. Li, P., Hastie, T.J., Church, K.W. Improving random projections using marginal information. In *COLT* (Pittsburgh, PA, 2006), 635649.

19. Li, P., König, A.C., Gui, W. b-Bit minwise hashing for estimating three-way similarities. In *NIPS* (Vancouver, British Columbia, Canada, 2010).

20. Li, P., Moore, J., König, A.C. b-Bit minwise hashing for large-scale linear SVM. Technical report, 2011. http://www.stat.cornell.edu/~li/b-bit-hashing/HashingSVM.pdf

21. Cherkasova, L., Eshghi, K., Morrey III, C.B., Tucek, J., Veitch, A. Applying Syntactic similarity algorithms for enterprise information management. In *KDD* (Paris, France, 2009), 10871096.

22. Manasse, M., McSherry, F., Talwar, K. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.

23. Pandey, S., Broder, A., Chierichetti, F., Josifovski, V., Kumar, R., Vassilvitskii, S. Nearest-neighbor caching for content-match applications. In *WWW* (Madrid, Spain, 2009), 441450.

24. Rajaraman, A., Ullman, J. *Mining of Massive Datasets*. http://i.stanford.edu/ullman/mmds.html

25. Urvoy, T., Chauveau, E., Filoche, P., Lavergne, T. Tracking web spam with html style similarities. *ACM Trans. Web 2*, 1 (2008), 128.

The previous version of this paper, entitled "b-Bit Minwise Hashing for Estimating Three-way Similarities," was published in *Proceedings of the Neural Information Processing Systems: NIPS 2010* (Vancouver, British Columbia, Canada).

Figure 1. The absolute errors (approximateexact) by using (5) are very small even for *D* = 20 (left panels) or *D* = 500 (right panels). The exact probability can be numerically computed for small *D* (from a probability matrix of size *D* × *D*). For each *D*, we selected three *f*_{1} values. We always let *f*_{2} = 2, 3,..., *f*_{1} and *a* = 0, 1, 2,..., *f*_{2}.

Figure 2. , the relative storage improvement of using *b* = 1, 2, 3, 4 bits, compared to using 64 bits. *B*(*b*) is defined in (12).

Figure 3. Mean square errors (MSEs). "*M*" denotes the original minwise hashing. "Theor." denotes the theoretical variances Var () (11) and Var()(3). The dashed curves, however, are invisible because the empirical MSEs overlap the theoretical variances. At the same k, . However, only requires 1/2/3 bits per sample, while may require 64 bits.

Figure 4. The task is to retrieve news article pairs with resemblance *R* *R*_{0}. The recall curves cannot differentiate estimators. The precision curves are more interesting. When *R*_{0} = 0.4, to achieve a precision = 0.80, the estimators , and require *k* = 50, 50, 75, and 145, respectively, indicating , and , respectively, improve (assuming 64 bits per sample) by 16-, 21.4-, and 22-fold. The improvement becomes larger as *R*_{0} increases.

Figure 5. *G*_{s,b=1} as defined in (20) for illustrating the improvement of *b*-bit minwise hashing over simple random sampling. We consider *r*_{1} = 10^{4}, 0.1, 0.5, 0.9, *r*_{2} ranging from 0.1 *r*_{1} to 0.9 *r*_{1} and *s* from 0 to *r*_{2}. Note that *r*_{1} + *r*_{2} - *s* 1 has to be satisfied.

Figure 6. Left panel: the * values at which the smallest variances (24) are attained. Right panel: the corresponding optimal variances.

Figure 7. *G*_{rp,b=1,*} as defined in (25) for comparing with . For each combination of *r*_{1}, *r*_{2}, *s*, we computed and used the optimal * for the variance.

Figure 8. *G*_{rp,b=1,} (25) computed by using the fixed , which is the optimal when *H* = *H** = 10^{4} *D.*

Figure 9. The exact *H* for this pair of sets is 0.0316. We use the optimal at *H**/*D* = 0.1 (left panel) and *H**/*D* = 0.5 (right panel). Compared to , the MSEs of (labeled by "*rp*") are substantially larger. The theoretical variances (dashed lines) (24) and (17), essentially overlap the empirical MSEs.

Figure 10. MSEs for comparing (27) with and . Due to the bias of , the theoretical variances Var (), that is, (28), deviate from the empirical MSEs when *k* is small.

Figure 11. _{M} and _{b} defined in (30) and (31) for measuring the potential performance of LSH algorithms. "*M*" denotes the original minwise hashing.

**©2011 ACM 0001-0782/11/0800 $10.00**

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. Copyright for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from permissions@acm.org or fax (212) 869-0481.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2011 ACM, Inc.

Isn't there an error in the first formula, when the article describes the "resemblance or Jaccard similarity, denoted by R"?

The numerator and denominator of the division are the same number (cardinalty of the union of S1 and S2), the result should be 1.

Regards.

Thanks, Joan

You are correct about the first formula; instead, please refer to equation (1), which has the correct definition of the Jaccard-Overlap.

We just checked and our final submission really did not have that error in the first formula. This error must have occurred later and we did not catch it during the review of the page proofs.

Hopefully the online version will be corrected.

Regards,

Ping and Christian

Displaying **all 2** comments