Efficient (approximate) computation of set similarity in very large datasets is a common task with many applications in information retrieval and data management. One common approach for this task is minwise hashing.
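As a rough illustration of the idea, the sketch below computes the exact resemblance R = |S1 ∩ S2| / |S1 ∪ S2| and a minwise-hashing estimate of it. It is not the article's method. The function names, the number of hash functions `k`, and the use of random linear hash functions modulo a prime are assumptions made for this example. Linear hashes are only approximately min-wise independent, and the sets are assumed to be non-empty and to contain non-negative integers smaller than the prime.

```python
import random

PRIME = 2_147_483_647  # 2^31 - 1; elements are assumed to be smaller than this


def jaccard(s1, s2):
    """Exact resemblance: |S1 ∩ S2| / |S1 ∪ S2| (0.0 if both sets are empty)."""
    union = s1 | s2
    if not union:
        return 0.0
    return len(s1 & s2) / len(union)


def minhash_signature(s, params):
    """For each hash function h(x) = (a*x + b) mod PRIME, keep the minimum over s."""
    return [min((a * x + b) % PRIME for x in s) for a, b in params]


def estimate_resemblance(s1, s2, k=512, seed=0):
    """Estimate R as the fraction of the k hash functions whose minima agree."""
    rng = random.Random(seed)
    # a != 0 makes each hash a bijection on {0, ..., PRIME - 1}
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]
    sig1 = minhash_signature(s1, params)
    sig2 = minhash_signature(s2, params)
    return sum(h1 == h2 for h1, h2 in zip(sig1, sig2)) / k
```

For example, `jaccard({1, 2, 3}, {2, 3, 4})` has an intersection of size 2 and a union of size 4, so the exact resemblance is 0.5. The estimate from `estimate_resemblance` approaches the exact value as `k` grows, at the cost of more hash evaluations.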
Isn't there an error in the first formula, when the article describes the "resemblance or Jaccard similarity, denoted by R"?
The numerator and denominator of the division are the same number (the cardinality of the union of S1 and S2), so the result should always be 1.
You are correct about the first formula; instead, please refer to equation (1), which has the correct definition of the Jaccard-Overlap.
We just checked and our final submission really did not have that error in the first formula. This error must have occurred later and we did not catch it during the review of the page proofs.
Hopefully the online version will be corrected.
Ping and Christian