Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points S and a radius parameter r > 0, the r-near neighbor (r-NN) problem asks for a data structure that, given any query point q, returns a point p within distance at most r from q. In this paper, we study the r-NN problem in the light of individual fairness and providing equal opportunities: all points that are within distance r from the query should have the same probability to be returned. The problem is of special interest in high dimensions, where Locality Sensitive Hashing (LSH), the theoretically leading approach to similarity search, does not provide any fairness guarantee. In this work, we show that LSH-based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also carried out an experimental evaluation that highlights the inherent unfairness of existing NN data structures.
In recent years, following a growing concern about the fairness of algorithms and their bias toward a specific population or feature, there has been an increasing interest in building algorithms that achieve (appropriately defined) fairness.14 The goal is to remove, or at least minimize, unethical behavior such as discrimination and bias in algorithmic decision making, as nowadays many important decisions, such as college admissions, offering home loans, or estimating the likelihood of recidivism, rely on machine learning algorithms. While algorithms are not inherently biased, they may nevertheless introduce bias through careless design, or by amplifying biases already present in the data.
There is no unique definition of fairness (see Hardt et al.18 and references therein), but different formulations that depend on the computational problem at hand, and on the ethical goals we aim for. Fairness goals are often defined in the political context of socio-technical systems and have to be seen in an interdisciplinary spectrum covering many fields outside computer science. In particular, researchers have studied both group fairness (also known as statistical fairness), where demographics of the population are preserved in the outcome,12 and individual fairness, where the goal is to treat individuals with similar conditions similarly.14 The latter concept of "equal opportunity" requires that people who can achieve a certain advantaged outcome, such as finishing a university degree, or paying back a loan, have an equal opportunity of being able to get access to it in the first place.
Bias in the data used for training machine learning algorithms is a monumental challenge in creating fair algorithms. Here, we are interested in a somewhat different problem: handling the bias introduced by the data structures used by such algorithms. Specifically, data structures may introduce bias in the data stored in them and in the way they answer queries, because of how the data is stored and accessed. It is also possible that some techniques for boosting performance, like randomization and approximation, which result in non-deterministic behavior, add to the overall algorithmic bias. For instance, some database indexes for fast search might give an (unexpected) advantage to some portions of the input data. Such a defect leads to selection bias by the algorithms using such data structures. It is thus natural to want data structures that do not introduce a selection bias into the data when handling queries. To this end, imagine a data structure that can return, as an answer to a query, an item out of a set of acceptable answers. The purpose is then to return a uniformly random item out of the set of acceptable outcomes, without explicitly computing the whole set of acceptable answers (which might be prohibitively expensive).
The Near Neighbor Problem
In this work, we study similarity search, and in particular the near neighbor problem, from the perspective of individual fairness. Similarity search is an important primitive in many applications in computer science such as machine learning, recommender systems, data mining, computer vision, and many others (see, e.g., Andoni and Indyk5 for an overview). One of the most common formulations of similarity search is the r-near neighbor (r-NN) problem, formally defined as follows. Let (X, D) be a metric space where the distance function D(·, ·) reflects the (dis)similarity between two data points. Given a set S ⊆ X of n points and a radius parameter r, the goal of the r-NN problem is to preprocess S and construct a data structure such that for a query point q ∈ X, one can report a point p ∈ S with D(p, q) ≤ r, if such a point exists. As all the existing algorithms for the exact variant of the problem have either space or query time that depends exponentially on the ambient dimension of X, people have considered the approximate variant of the problem. In the c-approximate near neighbor (ANN) problem, the algorithm is allowed to report a point p whose distance to the query is at most cr if a point within distance r of the query exists, for some prespecified constant c > 1.
Fair Near Neighbor
As detailed below, common existing data structures for similarity search have a behavior that introduces bias in the output. Our goal is to capture this bias and remove it algorithmically; that is, to develop a data structure for the r-near neighbor problem that provides fairness among all the points in the neighborhood: every point within distance r from the given query should have the same probability of being returned. We introduce and study the fair near neighbor problem: if BS(q, r) is the ball of input points at distance at most r from a query q, we would like each point in BS(q, r) to be returned as a near neighbor of q with the uniform probability 1/n(q, r), where n(q, r) = |BS(q, r)|.
Locality Sensitive Hashing
Perhaps the most prominent approach to get an ANN data structure is via Locality Sensitive Hashing (LSH), as proposed by Indyk and Motwani,20 which leads to sublinear query time and sub-quadratic space. In particular, for X = ℝ^d, by using LSH one can get a query time of n^{ρ+o(1)} and space n^{1+ρ+o(1)}, where for the L1 distance metric ρ = 1/c,16 and for the L2 distance metric ρ = 1/c² + o_c(1).5 In the LSH framework, the idea is to hash all points using several hash functions that are chosen randomly, with the property that the collision probability between two points is a decreasing function of their distance. Therefore, points closer to a query have a higher probability of falling into a bucket being probed than far points. Thus, reporting a random point from a random bucket computed for the query produces a distribution that is biased by the distance to the query: points closer to the query have a higher probability of being chosen. On the other hand, the uniformity property required in Fair NN can be trivially achieved by finding all r-near neighbors of a query and then randomly selecting one of them. However, this is computationally inefficient since the query time is a function of the size of the neighborhood. One contribution of this paper is the description of much more efficient data structures that still use LSH in a black-box way.
When Random Nearby Is Better than Nearest
The bias mentioned above toward nearer points is usually a good property, but is not always desirable. Indeed, consider the following scenarios:
To the best of our knowledge, previous results focused on exact near neighbor sampling in the Euclidean space up to three dimensions.2, 19, 24 Although these results might be extended to ℝ^d for any d > 1, they suffer from the curse of dimensionality as the query time increases exponentially with the dimension, making the data structures too expensive in moderately high dimensions. These bounds are unlikely to be significantly improved since several conditional lower bounds show that an exponential dependency on d in query time or space is unavoidable for exact near neighbor search.4
1.1. An example
Is a standard LSH approach really biased? As an example, we used the MinHash LSH scheme10 to sample similar users from the Last.FM dataset used in the HetRec challenge (http://ir.ii.uam.es/hetrec2011). We associate each user with their top-20 artists and use Jaccard Similarity as similarity measure. We select one user at random as query, and repeatedly sample a random point from a random bucket and keep it if its similarity is above 0.2. Figure 1 reports on the ratio between the frequencies observed via this sampling approach from LSH buckets against an unbiased sample. We see a large discrepancy: the higher the similarity, the more biased the LSH is in reporting these points as near neighbors. This would strongly affect statistics such as estimating the average similarity of a neighbor.
Figure 1. Bias introduced by uniform sampling from LSH buckets on the Last.FM dataset. The task is to (repeatedly) retrieve a uniform user among all users with similarity at least 0.2 to a fixed user. The result is split up into four buckets by rounding down the similarity to the first decimal. Error bars show the standard deviation. Compared to an unbiased sample, user vectors with small similarity are underrepresented, and users with high similarity are, by a factor of approximately 4 on average, overrepresented.
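To make the bias concrete, here is a small self-contained sketch (our own toy illustration, not the Last.FM experiment; the planted data, parameters, and helper names are all assumptions) that mimics the procedure described above: bucket points by MinHash, repeatedly draw a random point from the query's bucket in a random table, and keep it only if its Jaccard similarity to the query is at least 0.2.

```python
import random

random.seed(42)
UNIVERSE = range(1000)
query = set(random.sample(UNIVERSE, 20))          # the fixed query's "top artists"

def perturbed(base, keep):
    """A set sharing exactly `keep` elements with `base`; the rest are fresh."""
    kept = random.sample(sorted(base), keep)
    fresh = random.sample(sorted(set(UNIVERSE) - base), len(base) - keep)
    return frozenset(kept + fresh)

def jaccard(a, b):
    return len(a & b) / len(a | b)

def minhash(s, seed):
    """A single MinHash value; two sets collide with probability = Jaccard similarity."""
    return min(hash((seed, x)) for x in s)

# Planted near neighbors at several similarity levels, plus unrelated points.
points = [perturbed(query, keep) for keep in (8, 10, 12, 14, 16) for _ in range(20)]
points += [frozenset(random.sample(sorted(set(UNIVERSE) - query), 20)) for _ in range(200)]

# L independent single-value MinHash tables.
L = 30
tables = [dict() for _ in range(L)]
for p in points:
    for i in range(L):
        tables[i].setdefault(minhash(p, i), []).append(p)

# Biased sampling as described above: pick a random table, a random point from
# the query's bucket, and keep it only if its similarity is at least 0.2.
counts = {p: 0 for p in points}
for _ in range(50_000):
    i = random.randrange(L)
    bucket = tables[i].get(minhash(query, i), [])
    if bucket:
        p = random.choice(bucket)
        if jaccard(p, query) >= 0.2:
            counts[p] += 1

# Higher-similarity points collide with the query in more tables, so they are
# reported far more often than under a uniform sample of the 0.2-neighbors.
for keep in (8, 10, 12, 14, 16):
    freq = sum(counts[p] for p in points if len(p & query) == keep)
    print(f"Jaccard {keep / (40 - keep):.2f}: sampled {freq} times")
```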
1.2. Problem formulations
Here, we formally define the variants of the fair NN problem that we consider. For all the constructions presented in this article, these guarantees fail with probability at most δ for some prespecified small δ > 0.
DEFINITION 1 (r-NNIS OR FAIR NN). Let S ⊆ X be a set of n points in a metric space (X, D). The r-near neighbor independent sampling (r-NNIS), or simply the Fair NN problem, asks to construct a data structure for S that for any sequence of up to n queries q1, q2, …, qn satisfies the following properties with probability at least 1 – δ: (i) for each query qi, every point p ∈ BS(qi, r) is returned as the near neighbor of qi with probability 1/n(qi, r), where n(qi, r) = |BS(qi, r)|; and (ii) the point returned for qi is independent of the points returned for q1, …, qi-1.
In the low-dimensional setting,2,19 the r-near neighbor independent sampling problem is usually known as independent range sampling (IRS) problem. Next, motivated by applications, we define two approximate variants of the problem that we study in this work. More precisely, we slightly relax the fairness constraint, allowing the probabilities of reporting a neighbor to be an "almost uniform" distribution.
DEFINITION 2 (APPROXIMATELY FAIR NN). Consider a set S ⊆ X of n points in a metric space (X, D). The Approximately Fair NN problem asks to construct a data structure for S that for any query q, returns each point p ∈ BS(q, r) with probability μp, where μ is an approximately uniform probability distribution:
ℙ(q, r)/(1 + ε) ≤ μp ≤ (1 + ε) · ℙ(q, r),
where ℙ(q, r) = 1/n(q, r). We require the same independence guarantee as in Definition 1, that is, the result for query qi must be independent of the results for q1, …, qi-1, with i ∈ {2, …, n}.
Furthermore, similar in spirit to the behavior of ANN, we allow the algorithm to report an almost uniform distribution from an approximate neighborhood of the query.
DEFINITION 3 (APPROXIMATELY FAIR ANN). Consider a set S ⊆ X of n points in a metric space (X, D). The Approximately Fair ANN problem asks to construct a data structure for S that for any query q, returns each point p ∈ S′ with probability μp, where (1/|S′|)/(1 + ε) ≤ μp ≤ (1 + ε)/|S′|, and S′ is a point set such that BS(q, r) ⊆ S′ ⊆ BS(q, cr). As before, the same independence guarantee as in Definition 1 is needed, that is, the result for query qi must be independent of the results for q1, …, qi-1, with i ∈ {2, …, n}.
1.3. Our results
We propose several solutions to the different variants of the Fair NN problem. Our solutions build upon the LSH data structure.16 Let S(n, c) and Q(n, c) denote space and query time, respectively, of an LSH data structure that solves the c-ANN problem in the space (X, D).
The dependence of our algorithms on ε in the approximate variant is only O(log(1/ε)). While we omitted the exact poly-logarithmic factors in the list above, they are generally lower for the approximate versions. Furthermore, these methods can be embedded into existing LSH methods to achieve unbiased query results in a straightforward way. On the other hand, the exact methods will have higher logarithmic factors and use additional data structures.
A more exhaustive presentation of our results and further solutions for the Fair NN problem can be found in the full version of the paper.8 Preliminary versions of our results were published independently in Har-Peled and Mahabadi,17 Aumüller et al.9 and then jointly in Aumüller et al.7
1.4. Sampling from a sub-collection of sets
In order to obtain our results, we first study a more generic problem in Section 2: given a collection F of sets from a universe of n elements, a query is a sub-collection G ⊆ F of these sets and the goal is to sample (almost) uniformly from the union of the sets in this sub-collection. We also show how to modify the data structure to handle outliers in Section 3. This is useful for LSH, as the sampling algorithm needs to ignore such points once they are reported as a sample. This setup allows us to derive most of the results concerning variants of Fair NN in Section 4 as corollaries from these more abstract data structures.
Some examples of applications of a data structure that provides uniform samples from a union of sets are as follows:
Being unaware of any previous work on this problem, we believe this data structure is of independent interest.
The problem. Assume you are given a data structure that contains a large collection F of sets of objects. In total, there are n = |∪F| objects. The sets in F are not necessarily disjoint. The task is to preprocess the data structure, such that given a sub-collection G ⊆ F of the sets, one can quickly pick uniformly at random an object from the set ∪G := ∪_{A∈G} A.
Naive solution. The naive solution is to take the sets under consideration (in G), compute their union, and sample directly from the union set ∪G. Our purpose is to do (much) better—in particular, the goal is to get a query time that depends logarithmically on the total size of all the sets in G.
Parameters. The query is a family G ⊆ F; define m = ||G|| := Σ_{A∈G} |A| (which should be distinguished from g = |G| and from N = |∪G|).
Preprocessing. For each set A ∈ F, we build a set representation such that for a given element, we can decide if the element is in A in constant time. In addition, we assume that each set is stored in a data structure that enables easy random access or uniform sampling on this set (for example, store each set in its own array).
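For illustration, a minimal sketch of this preprocessing (our own code, with hypothetical names): each set is kept both as an array, for constant-time random access and uniform sampling, and as a hash set, for constant-time membership tests.

```python
import random

class SetStore:
    """Holds one set A ∈ F with O(1) membership tests and uniform sampling."""
    def __init__(self, elements):
        self.items = list(elements)      # array: random access / uniform sampling
        self.members = set(self.items)   # hash set: O(1) "is x in A?" queries

    def __contains__(self, x):
        return x in self.members

    def sample(self):
        return random.choice(self.items)

# Preprocessing: wrap every set of the family F once, in linear total time.
F = [SetStore(s) for s in [{1, 2, 3}, {3, 4}, {4, 5, 6, 7}]]
```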
Variants. As in Section 1.2, we consider problem variants where sample probabilities are either exact or approximate.
2.1. Almost uniform sampling
The query is a family G ⊆ F. The degree of an element x ∈ ∪G is the number of sets of G that contain it—that is, dG(x) = |DG(x)|, where DG(x) = {A ∈ G | x ∈ A}. We start with an algorithm (similar to the algorithm of Section 4 in Karp and Luby21) that repeatedly does the following: (I) pick a set A ∈ G with probability |A|/m; (II) pick an element x ∈ A uniformly at random; (III) output x with probability 1/dG(x), and otherwise repeat.
Since computing dG(x) exactly to be used in Step (III) is costly, our goal is instead to simulate a process that accepts x with probability approximately 1/dG(x). We start with the process described in the following lemma.
LEMMA 1. Assume we have g urns, and exactly d > 0 of them are non-empty. Furthermore, assume that we can check if a specific urn is empty in constant time. Then there is a randomized algorithm that outputs a number Y ≥ 0, such that 𝔼[Y] = 1/d. The expected running time of the algorithm is O(g/d).
PROOF. The algorithm repeatedly probes urns (uniformly at random), until it finds a non-empty urn. Assume it found a non-empty urn in the ith probe. The algorithm outputs the value i/g and stops.
Set p = d/g, and let Y be the output of the algorithm. We have that
𝔼[Y] = Σ_{i≥1} ℙ[first non-empty urn found at probe i] · (i/g) = Σ_{i≥1} (1 − p)^{i−1} p · (i/g) = (p/g) · Σ_{i≥1} i(1 − p)^{i−1} = (p/g) · (1/p²) = 1/(gp) = 1/d,
using the formula Σ_{i≥1} i·x^{i−1} = 1/(1 − x)². The expected number of probes performed by the algorithm until it finds a non-empty urn is 1/p = g/d, which implies that the expected running time of the algorithm is O(g/d).
The natural way to deploy Lemma 1 is to run its algorithm to get a number y and then return 1 with probability y. The problem is that y can be strictly larger than 1, which is meaningless for probabilities. Instead, we back off by using the value y/Δ, for some parameter Δ. If the returned value is larger than 1, we just treat it as zero. If the zeroing never happened, the algorithm would return one with probability 1/(dG(x)Δ). The probability of success is going to be slightly smaller, but fortunately, the loss can be made arbitrarily small by taking Δ to be sufficiently large.
LEMMA 2. There are g urns, and exactly d > 0 of them are not empty. Furthermore, assume one can check if a specific urn is empty in constant time. Let γ ∈ (0, 1) be a parameter. Then one can output a number Z ≥ 0, such that Z ∈ [0, 1] and 𝔼[Z] ∈ I := [1/(dΔ) − γ, 1/(dΔ)], where Δ = ⌈ln γ^{−1}⌉ + 4 = Θ(log γ^{−1}). The expected running time of the algorithm is O(g/d). Alternatively, the algorithm can output a bit X, such that ℙ[X = 1] ∈ I.
PROOF. We modify the algorithm of Lemma 1, so that it outputs i/(gΔ) instead of i/g. If the algorithm does not stop in the first gΔ iterations, then the algorithm stops and outputs 0. Observe that the probability that the algorithm fails to stop in the first gΔ iterations, for p = d/g, is (1 − p)^{gΔ} ≤ e^{−pgΔ} = e^{−dΔ} ≤ e^{−Δ} ≤ γ.
Let Z be the random variable that is the number output by the algorithm. Arguing as in Lemma 1, we have that 𝔼[Z] ≤ 1/(dΔ). More precisely, we have
𝔼[Z] = Σ_{i=1}^{gΔ} (i/(gΔ)) · p(1 − p)^{i−1}.
Easy calculations show that
𝔼[Z] = 1/(dΔ) − β, where β := Σ_{i>gΔ} (i/(gΔ)) · p(1 − p)^{i−1}
is the loss caused by truncating the process after gΔ probes. To bound β, split the tail into blocks of g consecutive probe indices: block j, for j ≥ Δ, consists of the indices i ∈ {jg + 1, …, (j + 1)g}. Every index i in block j satisfies i/(gΔ) ≤ (j + 1)/Δ and (1 − p)^{i−1} ≤ (1 − p)^{jg} ≤ e^{−pgj} = e^{−dj}. Furthermore, for j ≥ Δ, we have that the total contribution of block j to β is at most ((j + 1)/Δ) · gp · e^{−dj} = ((j + 1)/Δ) · d·e^{−dj} ≤ ((j + 1)/Δ) · e^{−j}, since t·e^{−tj} is non-increasing in t for t ≥ 1/j and d ≥ 1. As such, we have that
β ≤ (1/Δ) · Σ_{j≥Δ} (j + 1)e^{−j} ≤ 3e^{−Δ} ≤ 3e^{−4}·γ ≤ γ,
by the choice of value for Δ. This implies that 𝔼[Z] ≥ 1/(dΔ) − β ≥ 1/(dΔ) − γ, as desired.
The alternative algorithm takes the output Z, and returns 1 with probability Z, and zero otherwise.
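A minimal sketch of the Lemma 2 procedure (our own illustration; `is_empty` is the assumed constant-time emptiness test for a given urn index):

```python
import math
import random

def estimate_inverse_degree(g, is_empty, gamma):
    """Return a number Z in [0, 1] whose expectation lies in
    [1/(d*Delta) - gamma, 1/(d*Delta)], where d >= 1 is the (unknown) number
    of non-empty urns among 0..g-1 and Delta = ceil(ln(1/gamma)) + 4."""
    delta = math.ceil(math.log(1.0 / gamma)) + 4
    for i in range(1, g * delta + 1):             # probe at most g*Delta urns
        if not is_empty(random.randrange(g)):     # non-empty urn found at probe i
            return i / (g * delta)
    return 0.0                                    # truncated: output zero

def accept_with_inverse_degree(g, is_empty, gamma):
    """The 'bit' version of Lemma 2: returns True with probability ~ 1/(d*Delta)."""
    return random.random() < estimate_inverse_degree(g, is_empty, gamma)
```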
LEMMA 3. The input is a family of sets F that one preprocesses in linear time. Let G ⊆ F be a sub-family and let N = |∪G|, g = |G|, and let ε ∈ (0, 1) be a parameter. One can sample an element x ∈ ∪G with an almost uniform probability distribution. Specifically, the probability p of an element being output satisfies (1/N)/(1 + ε) ≤ p ≤ (1 + ε)(1/N). After linear time preprocessing, the query time is O(g log(g/ε)), in expectation, and the query succeeds with high probability (in g).
PROOF. The algorithm repeatedly samples an element x using steps (I) and (II). The algorithm returns x if the algorithm of Lemma 2, invoked with γ = (ε/g)^{O(1)}, returns 1. We have that Δ = Θ(log(g/ε)). Let α = 1/(dG(x)Δ). The algorithm returns x in this iteration with probability p, where p ∈ [α – γ, α]. Observe that α ≥ 1/(gΔ), which implies that γ ≤ (ε/4)·α; it follows that α/(1 + ε) ≤ p ≤ (1 + ε)·α, as desired. The expected running time of each round is O(g/dG(x)).
We defer the full runtime analysis of the algorithm to the full version of the paper. In short, the above argument implies that each round takes O(Ng/m) time in expectation, where m = ||G||. Further, the expected number of rounds is O(Δm/N). Together, this implies that the expected running time of the algorithm is O(gΔ) = O(g log(g/ε)).
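Combining the pieces, a self-contained sketch of the Lemma 3 query (our own illustration; the choice of γ and the per-call list construction are simplifications, since in the actual data structure the arrays are built once at preprocessing time). The urns of Lemma 2 are the g sets of G, and an urn is non-empty exactly when the corresponding set contains the sampled element x.

```python
import math
import random

def almost_uniform_sample(G, eps):
    """Sample x ∈ ∪G with probability within a (1 + eps) factor of uniform.
    G is a list of Python sets; element lists are kept for uniform sampling."""
    arrays = [list(A) for A in G]                 # would be precomputed in practice
    sizes = [len(a) for a in arrays]
    g = len(G)
    gamma = (eps / (4 * g)) ** 2                  # assumed small enough, so that
    delta = math.ceil(math.log(1 / gamma)) + 4    # Delta = Theta(log(g/eps))
    while True:
        # Steps (I)-(II): pick a set with probability |A|/m, then x uniform in it.
        i = random.choices(range(g), weights=sizes)[0]
        x = random.choice(arrays[i])
        # Step (III), approximated as in Lemma 2: probe random "urns" (= sets of G)
        # and accept x with probability roughly 1/(d_G(x) * Delta).
        for probe in range(1, g * delta + 1):
            if x in G[random.randrange(g)]:       # found a set containing x
                if random.random() < probe / (g * delta):
                    return x
                break                              # rejected: start a new round
```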
REMARK 1. The query time of Lemma 3 can be made to work with high probability with an additional logarithmic factor. Specifically, with high probability, the query time is O(g log(g/ε) log N).
2.2. Uniform sampling
In this section, we present a data structure that samples an element uniformly at random from ∪G. The data structure uses rejection sampling as seen before but splits up all data points using random ranks. Instead of picking an element from a weighted sample of the sets, it will pick a random segment among these ranks and consider only elements whose rank is in the selected range. Let Λ be the sequence of the n = |∪ F| input elements after a random permutation; the rank of an element is its position in Λ. We first highlight the main idea of the query procedure.
Let k ≥ 1 be a suitable value that depends on the collection G and assume that Λ is split into k segments Λi, with i ∈ {0, …, k – 1}. (We assume for simplicity that n and k are powers of two.) Each segment Λi contains the n/k elements in Λ with rank in [i · n/k, (i + 1) · n/k). We denote by λG,i the number of elements from ∪G in Λi, and by λ ≥ maxi λG,i an upper bound on the number of these elements in each segment. By the initial random permutation, we have that each segment contains at most λ = Θ((N/k) log n) elements from ∪G with probability at least 1 – 1/n². (Of course, N is not known at query time.)
The query algorithm works in the following steps in which all random choices are independent.
Since each object in ∪G has probability 1/(kλ) of being returned in Step (C), the result is a uniform sample of ∪G. Note that the main iteration in Step (B) works for all values k, but a good choice has to depend on G for the following reasons. On the one hand, the segments should be small, because otherwise Step (IV) will take too long. On the other hand, they have to contain at least one element from ∪G, otherwise we sample many "empty" segments in Step (II). We will see that the number k of segments should be roughly set to N to balance the trade-off. However, the number N of distinct elements in ∪G is not known. Thus, we use the naive upper bound of k = n. To compute λG,h efficiently, we assume that, at construction time, the elements in each set in F are sorted by their rank.
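A minimal sketch of a single sampling attempt for a fixed number k of segments (our own illustration; the full query loops over decreasing values of k and, for efficiency, uses binary searches over rank-sorted sets as described in the proof of Lemma 4 below):

```python
import random

def sample_from_segment(G, rank, n, k, lam):
    """One attempt of the segment-based sampler: pick a segment uniformly,
    gather the elements of ∪G whose rank falls in it, and accept with
    probability |Λ_h ∩ ∪G| / lam. Each element of ∪G is returned with
    probability exactly 1/(k * lam); None means "try again".
    G: list of sets; rank: dict mapping each element to its rank in Λ."""
    h = random.randrange(k)                          # segment chosen uniformly
    lo, hi = h * (n // k), (h + 1) * (n // k)        # rank range of segment Λ_h
    seg = sorted({x for A in G for x in A if lo <= rank[x] < hi})
    if not seg or random.random() >= len(seg) / lam:
        return None                                   # empty segment or rejected
    return random.choice(seg)                         # uniform element of Λ_h ∩ ∪G
```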
LEMMA 4. Let N = |∪G|, g = |G|, m = Σ_{A∈G} |A|, and n = |∪F|. With probability at least 1 – 1/n², the algorithm described above returns an element x ∈ ∪G according to the uniform distribution. With high probability, the algorithm has a running time of O(g log⁴ n).
PROOF. We start by bounding the initial failure probability of the data structure. By a union bound, we have that the following two events hold simultaneously with probability at least 1 – 1/n²:
From now on assume that these events are true.
As noted earlier, each element has probability of 1/(kλ) of being returned as output, and thus, elements are equally likely to be sampled. Note also that the guarantees are independent of the initial random permutation as soon as the two events above hold. This means that the data structure returns a uniform sample from a union-of-sets.
For the running time, first focus on the round where k = 2^{⌈log N⌉}. In this round, we carry out Θ(log² n) iterations. In Step (IV), λG,h is computed by iterating through the g sets and collecting points using a range query on segment Λh. Since elements in each set are sorted by their rank, the range query can be carried out by searching for rank hn/k using a binary search in O(log n) time and then enumerating all elements with rank smaller than (h + 1)n/k. This takes time O(log n + o) for each set, where o is the output size. Since each segment contains O(log n) elements from ∪G with high probability, one iteration of Step (IV) takes time O(g log n).
The time to carry out all Θ(log² n) iterations is thus bounded by O(g log³ n). Observe that for all the rounds carried out before, k is only larger and thus the segments are smaller. This means that we may multiply our upper bound by log n, which completes the proof.
Using count distinct sketches to find a good choice for the number of segments k, the running time can be decreased to O(g log³ n); we refer to the full version8 for more details.
Imagine a situation where we have a marked set of outliers O. We are interested in sampling from ∪G \ O. We assume that the total degree of the outliers in the query is at most mO for some prespecified parameter mO. More precisely, we have dG(O) = Σ_{x∈O} dG(x) ≤ mO. We get the following results by running the original algorithms from the previous section and removing outliers once we encounter them. If we encounter more than mO outliers, we report that the number of outliers exceeds mO.
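A simplified sketch of this outlier handling (our own illustration; it merely remembers and skips encountered outliers rather than deleting them from the underlying structure, which affects only the running-time argument, not the output distribution over ∪G \ O):

```python
def sample_excluding_outliers(draw_attempt, is_outlier, m_o):
    """Wrap any of the samplers above so that marked outliers are never
    reported. draw_attempt() returns a sampled element or None on a failed
    attempt; if more than m_o distinct outliers are encountered, the query
    reports failure, as described in the text."""
    outliers_seen = set()
    while True:
        x = draw_attempt()
        if x is None or x in outliers_seen:
            continue                       # failed attempt or known outlier
        if not is_outlier(x):
            return x                       # a sample from ∪G \ O
        outliers_seen.add(x)
        if len(outliers_seen) > m_o:
            raise RuntimeError("number of outliers exceeds m_o")
```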
Running the algorithm described in Section 2.1 provides the guarantees summarized in the following lemma.
LEMMA 5. The input is a family of sets F that one can preprocess in linear time. A query is a sub-family G ⊆ F, a set of outliers O, a parameter mO, and a parameter ε ∈ (0, 1). One can either: (i) report a point x ∈ ∪G \ O with an almost uniform probability distribution, namely with probability μx satisfying (1/N′)/(1 + ε) ≤ μx ≤ (1 + ε)/N′ for N′ = |∪G \ O|; or (ii) report that the total degree of the outliers exceeds mO.
The expected query time is O(mO + g log(n/ε)), and the query succeeds with high probability, where g = |G|, and n = ||F||.
Running the algorithm described in Section 2.2 and keeping track of outliers has the following guarantees.
LEMMA 6. The input is a family of sets F that one can preprocess in linear time. A query is a sub-family G ⊆ F, a set of outliers O, and a parameter mO. With high probability, one can either: (i) report a point x ∈ ∪G \ O chosen uniformly at random; or (ii) report that the number of outliers encountered exceeds mO.
The expected query time is O((g + mO) log4 n).
In this section, we employ the data structures developed in the previous sections to show the results on fair near neighbor search listed in Section 1.3.
First, let us briefly give some preliminaries on LSH. We refer the reader to Har-Peled et al.16 for further details. Throughout the section, we assume our metric space (X, D) admits an LSH data structure.
4.1. Background on LSH
Locality Sensitive Hashing (LSH) is a common tool for solving the ANN problem and was introduced in Har-Peled et al.16
DEFINITION 4. A distribution H over maps h: X → U, for a suitable set U, is (r, c·r, p1, p2)-sensitive if the following holds for any x, y ∈ X: if D(x, y) ≤ r, then ℙ_{h∼H}[h(x) = h(y)] ≥ p1; and if D(x, y) ≥ c·r, then ℙ_{h∼H}[h(x) = h(y)] ≤ p2, where p1 > p2.
The distribution H is an LSH family, and has quality ρ = log(1/p1)/log(1/p2).
For the sake of simplicity, we assume that p2 ≤ 1/n: if p2 > 1/n, then it suffices to create a new LSH family H^K obtained by concatenating K = Θ(log_{1/p2} n) independent and identically distributed hash functions from H. The new family is (r, c·r, p1^K, p2^K)-sensitive, and ρ does not change.
The standard approach to (c, r)-ANN using LSH functions is the following. Let D denote the data structure constructed by LSH, and let c denote the approximation parameter of LSH. D consists of L = n^ρ hash functions ℓ1, …, ℓL randomly and uniformly selected from H, and of L hash tables H1, …, HL: each hash table Hi contains the input set S and uses the hash function ℓi to split the point set into buckets. For each query q, we iterate over the L hash tables: for each hash function, compute ℓi(q) and compute, using Hi, the set Hi(q) = {p ∈ S : ℓi(p) = ℓi(q)} of points in S with the same hash value; then, compute the distance D(q, p) for each point p ∈ Hi(q). The procedure stops as soon as a (c, r)-near point is found. It stops and returns ⊥ if there are no remaining points to check or if it has found more than 3L far points. We summarize the guarantees in the following lemma.16
LEMMA 7. For a given query point q, let Sq = ∪iHi(q). Then for any point p ∈ BS(q, r), with probability at least 1 – 1/e – 1/3, we have (i) p ∈ Sq and (ii) |Sq \ BS(q, cr)| ≤ 3L, that is, the number of outliers is at most 3L. Moreover, the expected number of outliers in any single bucket Hi(q) is at most 1.
By repeating the construction O(log n) times, we guarantee that with high probability BS(q, r) ⊆ Sq.
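A schematic version of this standard query loop (our own sketch; `tables`, `ell`, and `dist` are assumed inputs: the L hash tables, the hash functions ℓ1, …, ℓL, and the distance function):

```python
def lsh_query(q, tables, ell, dist, r, c, L):
    """Sketch of the standard (c, r)-ANN query described above: scan the
    query's L buckets, return the first point within distance c*r, and give
    up (return None, i.e. ⊥) after inspecting more than 3L far points."""
    far_seen = 0
    for i in range(L):
        bucket = tables[i].get(ell[i](q), [])     # the bucket H_i(q)
        for p in bucket:
            if dist(p, q) <= c * r:
                return p                          # a (c, r)-near point
            far_seen += 1
            if far_seen > 3 * L:
                return None                       # too many far points
    return None                                   # no remaining points to check
```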
4.2. Approximately Fair ANN
For t = O(log n), let D1, …, Dt be data structures constructed by LSH. Let F be the set of all buckets in all data structures, that is, the buckets of the L hash tables of each Dj, for j ∈ {1, …, t}. For a query point q, consider the family G of all buckets containing the query, that is, the bucket that q hashes to in each of the tL hash tables, and thus |G| = O(L log n). Moreover, we let O be the set of outliers, that is, the points that are farther than cr from q. Note that, as mentioned in Lemma 7, the expected number of outliers in each bucket of the LSH is at most 1. Therefore, by Lemma 5, we immediately get the following result.
LEMMA 8. Given a set S of n points and a parameter r, we can preprocess it such that given a query q, one can report a point p ∈ S′ with probability μp, where (1/|S′|)/(1 + ε) ≤ μp ≤ (1 + ε)/|S′|, and S′ is a point set such that BS(q, r) ⊆ S′ ⊆ BS(q, cr). The algorithm uses space O(L log n) and its expected query time is O(L log n log(n/ε)).
REMARK 2. By repeatedly calling the query procedure and disregarding points at distance larger than r, the algorithm described above solves the Approximately Fair NN Problem (Definition 2). The probability that the algorithm succeeds in a round is ρ = n(q, r)/n(q, cr), and as such the expected number of rounds is 1/ρ. Thus, this approach has expected query time O((n(q, cr)/n(q, r)) · L log n log(n/ε)).
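A sketch of the reduction in Remark 2 (our own illustration; `fair_ann_sample` stands for the Lemma 8 query and is an assumed callable):

```python
def fair_nn_by_rejection(q, fair_ann_sample, dist, r, max_rounds=10_000):
    """Turn an Approximately Fair ANN sampler into an Approximately Fair NN
    one by rejection: fair_ann_sample(q) is assumed to return an (almost)
    uniform point of a set S' with B(q, r) ⊆ S' ⊆ B(q, cr), or None."""
    for _ in range(max_rounds):
        p = fair_ann_sample(q)
        if p is not None and dist(p, q) <= r:
            return p          # accepted: an (almost) uniform r-near neighbor
    return None               # give up (e.g., no r-near neighbor exists)
```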
4.3. Fair NN
We use the same setup as in the previous section and build t = O(log n) data structures D1, …, Dt using LSH. We use the algorithm described in Section 2.2 with all points at distance more than r from the query marked as outliers. By the properties of the LSH and the random ranks, we expect to see only few points at distance at least r. This allows us to obtain the following result.
LEMMA 9. Given a set S of n points and a parameter r, we can preprocess it such that given a query q, one can report each point p ∈ BS(q, r) with probability 1/n(q, r). The algorithm uses space O(L log n) and has expected query time O((L + n(q, cr) − n(q, r)) · polylog(n)).
The example provided in Section 1.1 already showed the bias of sampling naively from the LSH buckets. In this section, we consider the influence of the approximate variants discussed here and provide a brief overview of the running time differences. A detailed experimental evaluation can be found in the full paper.8
For concreteness, we take the MNIST dataset of handwritten digits available at http://yann.lecun.com/exdb/mnist/. We use the Euclidean space LSH from Datar et al.,13 set a distance threshold of 1250, and initialize the LSH with L = 100 repetitions, k = 15, and w = 3750. These parameter settings provide a false negative rate of around 10%. We take 50 points as queries and test the following four different sampling strategies on the LSH buckets:
Each method removes non-close points that might be selected from the bucket. We remark that the variant Uniform/Uniform most closely resembles a standard LSH approach. Weighted/Uniform takes the different bucket sizes into account but disregards the individual frequency of a point; thus, its output is not expected to be uniform, but it might be closer to the uniform distribution.
Output Distribution. For each query q, we compute the set of near neighbors M(q) of q in the LSH buckets. For each sampling strategy, we carry out the query 100|M(q)| times. The sampling results give rise to a distribution μ on M(q), and we compare this distribution to the uniform distribution in which each point is sampled with probability 1/|M(q)|. Figure 2 reports on the total variation distance between the uniform distribution and the observed distribution, that is, (1/2) Σ_{p∈M(q)} |μ(p) − 1/|M(q)||. As in our introductory example, we see that uniformly picking an LSH bucket results in a heavily biased distribution. Taking the size of the buckets into account in the weighted case helps a bit, but still results in a heavily biased distribution. Even with the easiest approximation strategy for the degree, we see an improvement and achieve a total variation distance of around 0.08, with the optimal algorithm achieving around 0.04.
Figure 2. Total variation distance of different approaches on the MNIST dataset.
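For reference, the reported total variation distance can be computed from per-point sample counts as follows (a small snippet of ours, not the authors' evaluation code):

```python
def total_variation_from_counts(counts, neighborhood_size):
    """counts[p] = how often near neighbor p was sampled; compares the
    empirical distribution with the uniform one over the |M(q)| neighbors."""
    total = sum(counts.values())
    uniform = 1.0 / neighborhood_size
    never_sampled = neighborhood_size - len(counts)  # each contributes |0 - uniform|
    return 0.5 * (sum(abs(c / total - uniform) for c in counts.values())
                  + never_sampled * uniform)
```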
Differences in Running Time. Compared to a naïve approach of collecting all colliding points in the buckets and selecting a near neighbor at random, the methods presented here provide a speed-up of more than two orders of magnitude in our experiments. The fair methods based on rejection sampling are approximately a factor of 10 slower than their biased counterparts that just pick a (weighted) point at random. Finally, the approximate degree sampling provides running times that are approximately two times faster than an exact computation of the degree.
In this paper, we have investigated a possible definition of fairness in similarity search by connecting the notion of "equal opportunity" to independent range sampling. An interesting open question is to investigate the applicability of our data structures for problems such as discrimination discovery,27 diversity in recommender systems,1 privacy preserving similarity search,26 and estimation of kernel density.11 Moreover, it would be interesting to investigate techniques for providing incentives (that is, reverse discrimination27) to prevent discrimination: an idea could be to merge the data structures in this paper with the distance-sensitive hashing functions of Aumüller et al.,6 which allow implementing hashing schemes where the collision probability is an (almost) arbitrary function of the distance. Finally, the techniques presented here require a manual trade-off between the performance of the LSH part and the additional running time contribution from finding the near points among the non-far points. From a user point of view, we would much rather prefer a parameterless version of our data structure that finds the best trade-off with small overhead, as discussed in Ahle et al.3 in another setting.
S. Har-Peled was partially supported by a NSF AF award CCF-1907400. R. Pagh is part of BARC, supported by the VILLUM Foundation grant 16582. F. Silvestri was partially supported by PRIN Project n. 20174LF3T8 AHeAD.
1. Adomavicius, G., Kwon, Y. Optimization-based approaches for maximizing aggregate recommendation diversity. INFORMS J. Comput 26, 2 (2014), 351–369.
2. Afshani, P., Phillips, J.M. Independent range sampling, revisited again. In G. Barequet, and Y. Wang, eds. Proc. 35th Int. Symposium on Computational Geometry (SoCG), volume 129 of LIPIcs (2019), 4:1–4:13.
3. Ahle, T.D., Aumüller, M., Pagh, R. Parameter-free locality sensitive hashing for spherical range reporting. In Proc. 28th ACM-SIAM Symposium on Discrete Algorithms (SODA) (2017), 239–256.
4. Alman, J., Williams, R. Probabilistic polynomials and hamming nearest neighbors. In Proc. IEEE 56th Annual Symposium on Foundations of Computer Science (FOCS) (2015), 136–150.
5. Andoni, A., Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM 51, 1 (2008), 117–122.
6. Aumüller, M., Christiani, T., Pagh, R., Silvestri, F. Distance-sensitive hashing. In Proc. 37th ACM Symposium on Principles of Database Systems (PODS) (2018).
7. Aumüller, M., Har-Peled, S., Mahabadi, S., Pagh, R., Silvestri, F. Fair near neighbor search via sampling. SIGMOD Rec 50, 1 (2021), 42–49.
8. Aumüller, M., Har-Peled, S., Mahabadi, S., Pagh, R., Silvestri, F. Sampling a near neighbor in high dimensions—Who is the fairest of them all? To appear in ACM Transactions on Database Systems (2022).
9. Aumüller, M., Pagh, R., Silvestri, F. Fair near neighbor search: Independent range sampling in high dimensions. In Proc. 39th ACM Symposium on Principles of Database Systems (PODS) (2020).
10. Broder, A.Z. On the resemblance and containment of documents. In Proc. Compression and Complexity of Sequences (1997), 21–29.
11. Charikar, M., Siminelakis, P. Hashing-based-estimators for kernel density in high dimensions. In C. Umans, ed. Proc. 58th IEEE Symposium on Foundations of Computer Science (FOCS) (2017), 1032–1043.
12. Chouldechova, A. Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data 5, 2 (2017), 153–163.
13. Datar, M., Immorlica, N., Indyk, P., Mirrokni, V.S. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. 20th Symposium on Computational Geometry (SoCG) (2004), 253–262.
14. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R. Fairness through awareness. In Proc. 3rd Innovations in Theoretical Computer Science Conference (ITCS) (2012), 214–226.
15. Everitt, B.S., Landau, S., Leese, M. Cluster Analysis. Wiley Publishing, 2009.
16. Har-Peled, S., Indyk, P., Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. Theory Comput 8 (2012), 321–350. Special issue in honor of Rajeev Motwani.
17. Har-Peled, S., Mahabadi, S. Near neighbor: Who is the fairest of them all? In Proc. 32nd Neural Info. Proc. Sys. (NeurIPS) (2019), 13176–13187.
18. Hardt, M., Price, E., Srebro, N. Equality of opportunity in supervised learning. In Neural Info. Proc. Sys. (NIPS) (2016), 3315–3323.
19. Hu, X., Qiao, M., Tao, Y. Independent range sampling. In Proc. 33rd ACM Symposium on Principles of Database Systems (PODS) (2014), 246–255.
20. Indyk, P., Motwani, R. Approximate nearest neighbors: Towards removing the curse of dimensionality. In Proc. 30th Annu. ACM Sympos. Theory Comput. (STOC) (1998), 604–613.
21. Karp, R.M., Luby, M. Monte-Carlo algorithms for enumeration and reliability problems. In 24th Symposium on Foundations of Computer Science (SFCS), IEEE Computer Society, 1983, 56–64.
22. Keeling, M.J., Eames, K.T. Networks and epidemic models. J. R. Soc. Interface 2, 4 (Sep. 2005), 295–307.
23. Kung, Y.-H., Lin, P.-S., Kao, C.-H. An optimal k-nearest neighbor for density estimation. Stat. Probab. Lett 82, 10 (2012), 1786–1791.
24. Olken, F., Rotem, D. Sampling from spatial databases. Stat. Comput 5, 1 (Mar 1995), 43–57.
25. Qi, Y., Atallah, M.J. Efficient privacy-preserving k-nearest neighbor search. In Proc. 28th International Conference on Distributed Computing Systems (ICDCS) (2008), 311–319.
26. Riazi, M.S., Chen, B., Shrivastava, A., Wallach, D.S., Koushanfar, F. Sublinear privacy-preserving near-neighbor search with untrusted server on large-scale datasets. arXiv:1612.01835 (2016).
27. Thanh, B.L., Ruggieri, S., Turini, F. k-nn as an implementation of situation testing for discrimination discovery and prevention. In Proc. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2011), 502–510.
To view the accompanying Technical Perspective, visit doi.acm.org/10.1145/3543843
A version of this paper, entitled "Fair Near Neighbor Search via Sampling," was published in SIGMOD Record 50, 1 (Mar. 2021).