### Abstract

Similarity search is a fundamental algorithmic primitive, widely used in many computer science disciplines. Given a set of points *S* and a radius parameter *r* > 0, the *r*-near neighbor (*r*-NN) problem asks for a data structure that, given any query point *q*, returns a point *p* within distance at most *r* from *q*. In this paper, we study the *r*-NN problem in light of individual fairness and providing equal opportunities: all points that are within distance *r* from the query should have the same probability of being returned. The problem is of special interest in high dimensions, where *Locality Sensitive Hashing* (LSH), the theoretically leading approach to similarity search, does not provide any fairness guarantee. In this work, we show that LSH-based algorithms can be made fair, without a significant loss in efficiency. We propose several efficient data structures for the exact and approximate variants of the fair NN problem. Our approach works more generally for sampling uniformly from a sub-collection of sets of a given collection and can be used in a few other applications. We also carried out an experimental evaluation that highlights the inherent unfairness of existing NN data structures.

### 1. Introduction

In recent years, following a growing concern about the fairness of algorithms and their bias toward a specific population or feature, there has been an increasing interest in building algorithms that achieve (appropriately defined) *fairness*.^{14} The goal is to remove, or at least minimize, unethical behavior such as discrimination and bias in algorithmic decision making, as nowadays many important decisions, such as college admissions, offering home loans, or estimating the likelihood of recidivism, rely on machine learning algorithms. While algorithms are not inherently biased, they may nevertheless introduce bias through careless design, or by amplifying biases already present in the data.

There is no unique definition of fairness (see Hardt et al.^{18} and references therein), but different formulations that depend on the computational problem at hand, and on the ethical goals we aim for. Fairness goals are often defined in the political context of socio-technical systems and have to be seen in an interdisciplinary spectrum covering many fields outside computer science. In particular, researchers have studied both *group fairness* (also known as statistical fairness), where demographics of the population are preserved in the outcome,^{12} and *individual fairness*, where the goal is to treat individuals with similar conditions similarly.^{14} The latter concept of “equal opportunity” requires that people who can achieve a certain advantaged outcome, such as finishing a university degree, or paying back a loan, have an equal opportunity of being able to get access to it in the first place.

Bias in the data used for training machine learning algorithms is a monumental challenge in creating fair algorithms. Here, we are interested in a somewhat different problem: handling the bias introduced by the data structures used by such algorithms. Specifically, data structures may introduce bias into the data stored in them and into the way they answer queries, because of the way the data is stored and accessed. It is also possible that some techniques for boosting performance, like randomization and approximation, which result in non-deterministic behavior, add to the overall algorithmic bias. For instance, some database indexes for fast search might give an (unexpected) advantage to some portions of the input data. Such a defect leads to selection bias by the algorithms using such data structures. It is thus natural to want data structures that do not introduce a selection bias into the data when handling queries. To this end, imagine a data structure that can return, as an answer to a query, an item out of a set of acceptable answers. The purpose is then to return a uniformly random item out of the set of acceptable outcomes, without explicitly computing the whole set of acceptable answers (which might be prohibitively expensive).

In this work, we study similarity search, and in particular the near neighbor problem, from the perspective of individual fairness. Similarity search is an important primitive in many applications in computer science such as machine learning, recommender systems, data mining, computer vision, and many others (see, e.g., Andoni and Indyk^{5} for an overview). One of the most common formulations of similarity search is the *r*-near neighbor (*r*-NN) problem, formally defined as follows. Let (*X*, *D*) be a metric space where the distance function *D*(·, ·) reflects the (dis)similarity between two data points. Given a set *S* ⊆ *X* of *n* points and a radius parameter *r*, the goal of the *r*-NN problem is to preprocess *S* and construct a data structure, such that for a query point **q** ∈ *X*, one can report a point **p** ∈ *S* such that *D*(**p**, **q**) ≤ *r*, if such a point exists. As all the existing algorithms for the *exact* variant of the problem have either space or query time that depends exponentially on the ambient dimension of *X*, people have considered the approximate variant of the problem. In the *c*-*approximate near neighbor* (ANN) problem, the algorithm is allowed to report a point **p** whose distance to the query is at most *cr* if a point within distance *r* of the query exists, for some prespecified constant *c* > 1.
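To make the problem concrete, here is a minimal brute-force sketch of the *r*-NN primitive as a linear scan; the function name and the Euclidean metric are illustrative choices, and the data structures discussed in this paper exist precisely to avoid this linear cost in high dimensions.

```python
import math

def r_near_neighbor(S, q, r, dist=math.dist):
    """Exact r-NN by linear scan: return some p in S with dist(p, q) <= r,
    or None if no such point exists."""
    for p in S:
        if dist(p, q) <= r:
            return p
    return None
```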

As detailed below, common existing data structures for similarity search exhibit behavior that introduces bias in the output. Our goal is to capture and algorithmically remove this bias: we develop a data structure for the *r*-near neighbor problem that provides fairness among *all* the points in the neighborhood, that is, all points within distance *r* from the given query have the same probability of being returned. We introduce and study the *fair near neighbor* problem: if *B _{S}*(**q**, *r*) is the ball of input points at distance at most *r* from a query **q**, we would like each point in *B _{S}*(**q**, *r*) to be returned as a near neighbor of **q** with uniform probability 1/*n*(**q**, *r*), where *n*(**q**, *r*) = |*B _{S}*(**q**, *r*)|.

Perhaps the most prominent approach to get an ANN data structure is *via* Locality Sensitive Hashing (LSH), as proposed by Indyk and Motwani,^{20} which leads to sublinear query time and sub-quadratic space. In particular, for *X* = ℝ^{d}, by using LSH one can get a query time of *n*^{ρ+o(1)} and space *n*^{1+ρ+o(1)}, where for the *L*_{1} distance metric ρ = 1/*c*,^{16} and for the *L*_{2} distance metric ρ = 1/*c*^{2}+*o _{c}*(1).^{5}

In the LSH framework, the idea is to hash all points using several randomly chosen hash functions, with the property that the collision probability between two points is a decreasing function of their distance. Therefore, points closer to a query have a higher probability of falling into a bucket being probed than far points. Thus, reporting a random point from a random bucket computed for the query produces a distribution that is biased by the distance to the query: points closer to the query have a higher probability of being chosen. On the other hand, the uniformity property required in fair NN can be trivially achieved by finding *all* *r*-near neighbors of a query and then randomly selecting one of them. However, this is computationally inefficient, since the query time is a function of the size of the neighborhood. One contribution of this paper is the description of much more efficient data structures that still use LSH in a black-box way.
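The trivial fair baseline just mentioned can be sketched as follows (a hypothetical helper; query time is Θ(*n*) per query, which is exactly the cost the LSH-based constructions below avoid):

```python
import math, random

def fair_nn_naive(S, q, r, rng=random.Random(0), dist=math.dist):
    """Materialize the whole neighborhood B_S(q, r), then sample uniformly.
    Fair by construction, but the scan costs time linear in |S|."""
    ball = [p for p in S if dist(p, q) <= r]
    return rng.choice(ball) if ball else None
```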

**When Random Nearby Is Better than Nearest**

The bias mentioned above toward nearer points is usually a good property, but is not always desirable. Indeed, consider the following scenarios:

- The nearest neighbor might not be the best if the input is noisy, and the closest point might be viewed as an unrepresentative outlier. Any point in the neighborhood might then be considered equally beneficial. This is to some extent why *k*-NN classification^{15} is so effective in reducing the effect of noise. Furthermore, *k*-NN works better in many cases if *k* is large, but computing the *k* nearest neighbors is quite expensive if *k* is large. Fortunately, quickly computing a random nearby neighbor can significantly speed up such classification.
- If one wants to estimate the number of items with a desired property within the neighborhood, then the easiest way to do it is *via* uniform random sampling from the neighborhood, for instance for density estimation^{23} or discrimination discovery in existing databases.^{27} This can be seen as a special case of query sampling in databases,^{24} where the goal is to return a random sample of the output of a given query, for efficiently providing statistics on the query.
- We are interested in anonymizing the query: returning a random near neighbor might serve as the first line of defense in trying to make it harder to recover the query. Similarly, one might want to anonymize the nearest neighbor,^{25} for applications where we are interested in a “typical” data item close to the query, without identifying the nearest item.
- Popular recommender systems based on matrix factorization give recommendations by computing the inner product similarity of a user feature vector with all item feature vectors, using some efficient similarity search algorithm. It is common practice to recommend those items that have the largest inner product with the user’s vector. However, in general it is not clear that it is desirable to recommend the “closest” articles. Indeed, it might be desirable to recommend articles that are on the same topic but are not *too* aligned with the user feature vector and may provide a different perspective. As described in Adomavicius and Kwon,^{1} recommendations can be made more diverse by sampling *k* items from a larger top-*ℓ* list of recommendations at random. Our data structures could replace the final near neighbor search routine employed in such systems.

To the best of our knowledge, previous results focused on exact near neighbor sampling in the Euclidean space up to three dimensions.^{2, 19, 24} Although these results might be extended to ℝ^{d} for any *d* > 1, they suffer from the *curse of dimensionality* as the query time increases exponentially with the dimension, making the data structures too expensive in moderately high dimensions. These bounds are unlikely to be significantly improved since several conditional lower bounds show that an exponential dependency on *d* in query time or space is unavoidable for *exact* near neighbor search.^{4}

Is a standard LSH approach really biased? As an example, we used the MinHash LSH scheme^{10} to sample similar users from the Last.FM dataset used in the HetRec challenge (http://ir.ii.uam.es/hetrec2011). We associate each user with their top-20 artists and use Jaccard similarity as the similarity measure. We select one user at random as the query, and repeatedly sample a random point from a random bucket, keeping it if its similarity is above 0.2. Figure 1 reports the ratio between the frequencies observed *via* this sampling approach from LSH buckets and an unbiased sample. We see a large discrepancy: the higher the similarity, the more the LSH is biased toward reporting these points as near neighbors. This would strongly affect statistics such as estimating the average similarity of a neighbor.
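The bias can be reproduced in a few lines of code. The following self-contained sketch (synthetic sets standing in for users' artist lists, not the actual experiment) builds MinHash signatures with universal hashing and repeatedly samples from a random bucket of the query; the high-similarity candidate is returned far more often than the low-similarity one, although both are valid 0.2-near neighbors.

```python
import random

P = (1 << 61) - 1  # Mersenne prime for universal hashing

def make_hash(rng):
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: (a * x + b) % P

rng = random.Random(7)
hashes = [make_hash(rng) for _ in range(200)]

def signature(s):
    # MinHash: one minimum per hash function; P[sig_i(A) == sig_i(B)] ~ Jaccard(A, B)
    return [min(h(x) for x in s) for h in hashes]

q    = set(range(20))
high = set(range(18)) | {100, 101}            # Jaccard(q, high) = 18/22 ~ 0.82
low  = set(range(8)) | set(range(200, 212))   # Jaccard(q, low)  =  8/32 = 0.25

sig_q = signature(q)
sigs = {'high': signature(high), 'low': signature(low)}

counts = {'high': 0, 'low': 0}
for _ in range(4000):
    i = rng.randrange(len(hashes))            # probe a random hash table
    bucket = [name for name, s in sigs.items() if s[i] == sig_q[i]]
    if bucket:                                # uniform pick from the query's bucket
        counts[rng.choice(bucket)] += 1
```

The high-similarity user ends up in the probed bucket far more often, so sampling a random point from a random bucket heavily overrepresents it.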

**Figure 1. Bias introduced by uniform sampling from LSH buckets on the Last.FM dataset. The task is to (repeatedly) retrieve a uniform user among all users with similarity at least 0.2 to a fixed user. The result is split up into four buckets by rounding down the similarity to the first decimal. Error bars show the standard deviation. Compared to an unbiased sample, user vectors with small similarity are underrepresented, and users with high similarity are, by a factor of approximately 4 on average, overrepresented.**

Here, we formally define the variants of the fair NN problem that we consider. For all the constructions presented in this article, these guarantees fail with probability at most δ for some prespecified small δ > 0.

DEFINITION 1 (*r*-NNIS OR FAIR NN). *Let S* ⊆ *X be a set of n points in a metric space* (*X*, *D*). The *r*-near neighbor independent sampling *(r-NNIS), or simply the* Fair NN *problem, asks to construct a data structure for S that for any sequence of up to n queries* **q**_{1}, **q**_{2}, …, **q**_{n} *satisfies the following properties with probability at least* 1 – δ*:*

- *For each query* **q**_{i}, *it returns a point* OUT_{i,qi} *uniformly sampled from B _{S}*(**q**_{i}, *r*).
- *The point returned for query* **q**_{i}, *with i* > 1, *is independent of previous query results. That is, for any* **p** ∈ *B _{S}*(**q**_{i}, *r*) *and any sequence* **p**_{1}, …, **p**_{i-1}, *we have* Pr[OUT_{i,qi} = **p** | ∀*j* ∈ [*i*-1]: OUT_{j,qj} = **p**_{j}] = 1/*n*(**q**_{i}, *r*).

In the low-dimensional setting,^{2,19} the *r*-near neighbor independent sampling problem is usually known as *independent range sampling* (IRS) problem. Next, motivated by applications, we define two approximate variants of the problem that we study in this work. More precisely, we slightly relax the fairness constraint, allowing the probabilities of reporting a neighbor to be an “almost uniform” distribution.

DEFINITION 2 (APPROXIMATELY FAIR NN). *Consider a set S* ⊆ *X of n points in a metric space* (*X*, *D*). *The* Approximately Fair NN *problem asks to construct a data structure for S that for any query* **q**, *returns each point* **p** ∈ *B _{S}*(**q**, *r*) *with probability* μ_{p}, *where* μ *is an approximately uniform probability distribution:*

ℙ(**q**, *r*)/(1+ε) ≤ μ_{p} ≤ (1+ε) ℙ(**q**, *r*),

*where* ℙ(**q**, *r*) = 1/*n*(**q**, *r*). *We require the same independence guarantee as in Definition 1, that is, the result for query* **q**_{i} *must be independent of the results for* **q**_{1}, …, **q**_{i-1}, *with i* ∈ {2, …, *n*}.

Furthermore, similar in spirit to the behavior of ANN, we allow the algorithm to report an almost uniform distribution from an *approximate* neighborhood of the query.

DEFINITION 3 (APPROXIMATELY FAIR ANN). *Consider a set S* ⊆ *X of n points in a metric space* (*X*, *D*). *The* Approximately Fair ANN *problem asks to construct a data structure for S that for any query* **q**, *returns each point* **p** ∈ *S’ with probability* μ_{p}, *where* ϕ/(1+ε) ≤ μ_{p} ≤ (1+ε)ϕ, *where S’ is a point set such that B _{S}*(**q**, *r*) ⊆ *S’* ⊆ *B _{S}*(**q**, *cr*), *and* ϕ = 1/|*S’*|. *As before, the same independence guarantee as in Definition 1 is needed, that is, the result for query* **q**_{i} *must be independent of the results for* **q**_{1}, …, **q**_{i-1}, *with i* ∈ {2, …, *n*}.

We propose several solutions to the different variants of the Fair NN problem. Our solutions build upon the LSH data structure.^{16} Let *S*(*n*, *c*) and Q(*n*, *c*) denote space and query time, respectively, of an LSH data structure that solves the *c*-ANN problem in the space (*X*, *D*).

- In Section 4.2, we provide a data structure for Approximately Fair ANN that uses space *S*(*n*, *c*) and whose query time is *Õ*(Q(*n*, *c*)) *in expectation.* See Lemma 8 for the exact statement.
- Section 4.3 shows how to solve the Fair NN problem with space usage *O*(*S*(*n*, *c*)); see Lemma 9 for the exact statement of the expected query time.

The dependence of our algorithms on ε in the approximate variant is only *O*(log(1/ε)). While we omitted the exact poly-logarithmic factors in the list above, they are generally lower for the approximate versions. Furthermore, these methods can be embedded into existing LSH methods to achieve unbiased query results in a straightforward way. On the other hand, the exact methods will have higher logarithmic factors and use additional data structures.

A more exhaustive presentation of our results and further solutions for the Fair NN problem can be found in the full version of the paper.^{8} Preliminary versions of our results were published independently in Har-Peled and Mahabadi,^{17} Aumüller et al.^{9} and then jointly in Aumüller et al.^{7}

**1.4. Sampling from a sub-collection of sets**

In order to obtain our results, we first study a more generic problem in Section 2: given a collection *F* of sets from a universe of *n* elements, a query is a sub-collection *G* ⊆ *F* of these sets, and the goal is to sample (almost) uniformly from the union of the sets in this sub-collection. We also show how to modify the data structure to handle outliers in Section 3. This is useful for LSH, as the sampling algorithm needs to ignore such points once they are reported as a sample. This setup allows us to derive most of the results concerning variants of Fair NN in Section 4 as corollaries from these more abstract data structures.

Some examples of applications of a data structure that provides uniform samples from a union of sets are as follows:

- Given a subset *A* of vertices in a graph, randomly pick (with uniform distribution) a neighbor of one of the vertices of *A*. This can be used in simulating disease spread.^{22}
- As shown in this work, we use variants of the data structure to implement Fair NN.
- Uniform sampling for range searching.^{2, 19} Indeed, consider a set of points, stored in a data structure for range queries. Using the above, we can support sampling from the points reported by several queries, even if the reported answers are not disjoint.

We are unaware of any previous work on this problem and believe this data structure is of independent interest.

### 2. Sampling from a Union of Sets

**The problem.** Assume you are given a data structure that contains a large collection *F* of sets of objects. In total, there are *n* = |∪*F*| objects. The sets in *F* are not necessarily disjoint. The task is to preprocess the data structure, such that given a sub-collection *G* ⊆ *F* of the sets, one can quickly pick uniformly at random an object from the set ∪*G*:=∪_{A∈G}*A*.

**Naive solution.** The naive solution is to take the sets under consideration (in *G*), compute their union, and sample directly from the union set ∪*G*. Our purpose is to do (much) better—in particular, the goal is to get a query time that depends logarithmically on the total size of all the sets in *G*.

**Parameters.** The query is a family *G* ⊆ *F*, and define *m* = ||*G*||:= Σ_{A∈G}|*A*| (which should be distinguished from *g* = |*G*| and from *N* = |∪*G*|).

**Preprocessing.** For each set *A* ∈ *F*, we build a set representation such that for a given element, we can decide if the element is in *A* in constant time. In addition, we assume that each set is stored in a data structure that enables easy random access or uniform sampling on this set (for example, store each set in its own array).

**Variants.** As in Section 1.2, we consider problem variants where sample probabilities are either *exact* or *approximate.*

The query is a family *G* ⊆ *F*. The **degree** of an element *x* ∈ ∪*G* is the number of sets of *G* that contain it, that is, d_{G}(*x*) = |D_{G}(*x*)|, where D_{G}(*x*) = {*A* ∈ *G* | *x* ∈ *A*}. We start with an algorithm (similar to the algorithm of Section 4 in Karp and Luby^{21}) that repeatedly does the following:

- (I) Pick one set from *G* with probability proportional to its size. That is, a set *A* ∈ *G* is picked with probability |*A*|/*m*.
- (II) Pick an element *x* ∈ *A* uniformly at random.
- (III) Output *x* and stop with probability 1/d_{G}(*x*). Otherwise, continue to the next iteration.
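A direct implementation of steps (I)-(III), with the degree d_{G}(*x*) computed exactly (the costly part that the lemmas below avoid), might look as follows; names are illustrative:

```python
import random

def sample_union_exact(G, rng=random.Random(0)):
    """Uniform sample from the union of the sets in G via rejection:
    the bias of steps (I)-(II) toward high-degree elements is exactly
    cancelled by accepting with probability 1/d_G(x)."""
    sizes = [len(A) for A in G]
    while True:
        A = rng.choices(G, weights=sizes)[0]   # (I): pick A w.p. |A|/m
        x = rng.choice(tuple(A))               # (II): uniform x in A
        deg = sum(1 for B in G if x in B)      # d_G(x), computed naively
        if rng.random() < 1.0 / deg:           # (III): accept w.p. 1/d_G(x)
            return x
```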

Since computing d_{G}(*x*) exactly for use in Step (III) is costly, our goal is instead to simulate a process that accepts *x* with probability *approximately* 1/d_{G}(*x*). We start with the process described in the following lemma.

LEMMA 1. *Assume we have g urns, and exactly* d > 0 *of them are non-empty. Furthermore, assume that we can check if a specific urn is empty in constant time. Then, there is a randomized algorithm that outputs a number Y* ≥ 0 *such that* 𝔼[*Y*] = 1/d. *The expected running time of the algorithm is O*(*g*/d).

PROOF. The algorithm repeatedly probes urns (uniformly at random), until it finds a non-empty urn. Assume it found a non-empty urn in the *i*^{th} probe. The algorithm outputs the value *i/g* and stops.

Set *p* = d/*g*, and let *Y* be the output of the algorithm. We have that

𝔼[*Y*] = Σ_{i≥1} (*i*/*g*)(1 – *p*)^{i-1}*p* = (*p*/*g*) Σ_{i≥1} *i*(1 – *p*)^{i-1} = (*p*/*g*) · (1/*p*^{2}) = 1/(*gp*) = 1/d,

using the formula Σ_{i≥1} *i x*^{i-1} = 1/(1 – *x*)^{2} for 0 < *x* < 1. The expected number of probes performed by the algorithm until it finds a non-empty urn is 1/*p* = *g*/d, which implies that the expected running time of the algorithm is *O*(*g*/d). ◻
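A sketch of this urn-probing estimator (the urn predicate is supplied as a callable; names are illustrative):

```python
import random

def inverse_degree_estimate(is_empty, g, rng):
    """Probe urns 0..g-1 uniformly until one is non-empty; if that happens
    on the i-th probe, output i/g. Over the randomness, E[output] = 1/d."""
    i = 0
    while True:
        i += 1
        if not is_empty(rng.randrange(g)):
            return i / g
```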

The natural way to deploy Lemma 1 is to run its algorithm to get a number *y* and then return 1 with probability *y*. The problem is that *y* can be strictly larger than 1, which is meaningless as a probability. Instead, we back off by using the value *y*/Δ, for some parameter Δ; if this value is larger than 1, we treat it as zero. If the zeroing never happened, the algorithm would return 1 with probability 1/(d_{G}(*x*)Δ). The probability of success is going to be slightly smaller, but fortunately, the loss can be made arbitrarily small by taking Δ to be sufficiently large.

LEMMA 2. *There are g urns, and exactly* d > 0 *of them are not empty. Furthermore, assume one can check if a specific urn is empty in constant time. Let* γ ∈ (0, 1) *be a parameter. Then one can output a number Z* ∈ [0, 1] *such that* 𝔼[*Z*] ∈ *I* := [1/(dΔ) – γ, 1/(dΔ)], *where* Δ = ⌈ln γ^{-1}⌉ + 4 = Θ(log γ^{-1}). *The expected running time of the algorithm is O*(*g*/d). *Alternatively, the algorithm can output a bit X, such that* ℙ[*X* = 1] ∈ *I*.

PROOF. We modify the algorithm of Lemma 1, so that it outputs *i*/(*g*Δ) instead of *i/g*. If the algorithm does not stop in the first *g*Δ iterations, then the algorithm stops and outputs 0. Observe that the probability that the algorithm fails to stop in the first *g*Δ iterations, for *p* = d/*g*, is (1 – *p*)^{gΔ} ≤ exp(–*pg*Δ) = exp(–dΔ) ≤ exp(–Δ).

Let *Z* be the random variable that is the number output by the algorithm. Arguing as in Lemma 1, we have that 𝔼[*Z*] ≤ 1/(dΔ). More precisely, we have

𝔼[*Z*] = Σ_{i=1}^{gΔ} (*i*/(*g*Δ))(1 – *p*)^{i-1}*p* = 1/(dΔ) – β, *where* β = Σ_{i>gΔ} (*i*/(*g*Δ))(1 – *p*)^{i-1}*p*.

Furthermore, for *i* > *g*Δ we have (1 – *p*)^{i-1} ≤ exp(–*p*(*i* – 1)) ≤ exp(–d(*i* – 1)/*g*), and summing this geometrically decaying tail yields β ≤ γ, by the choice of value for Δ. This implies that 𝔼[*Z*] ≥ 1/(dΔ) – β ≥ 1/(dΔ) – γ, as desired.

The alternative algorithm takes the output *Z*, and returns 1 with probability *Z*, and zero otherwise. ◻
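A sketch of this capped variant, with Δ = ⌈ln γ^{-1}⌉ + 4 as in the lemma (the output is *i*/(*g*Δ), truncated to 0 if no non-empty urn is found within *g*Δ probes):

```python
import math, random

def capped_estimate(is_empty, g, gamma, rng):
    """Lemma-2-style estimator: returns Z in [0, 1] with
    E[Z] in [1/(d*Delta) - gamma, 1/(d*Delta)]."""
    delta = math.ceil(math.log(1.0 / gamma)) + 4
    for i in range(1, g * delta + 1):
        if not is_empty(rng.randrange(g)):
            return i / (g * delta)
    return 0.0  # did not stop within g*Delta probes: output zero
```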

LEMMA 3. *The input is a family of sets F that one pre-processes in linear time. Let* *G* ⊆ *F be a sub-family and let N* = |∪*G*|, *g* = |*G*|, *and let* ε ∈ (0, 1) *be a parameter. One can sample an element x* ∈ ∪*G* *with an almost uniform probability distribution. Specifically, the probability p of an element to be output is* (1/*N*)/(1+ε) ≤ *p* ≤ (1+ε)(1/*N*). *After linear time preprocessing, the query time is O*(*g* log(*g*/ε)), *in expectation, and the query succeeds, with high probability (in g).*

PROOF. The algorithm repeatedly samples an element *x* using steps (I) and (II). The algorithm returns *x* if the algorithm of Lemma 2, invoked with γ = (ε/*g*)^{O(1)}, returns 1. We have that Δ = Θ(log(*g*/ε)). Let α = 1/(d_{G}(*x*)Δ). The algorithm returns *x* in this iteration with probability *p*, where *p* ∈ [α – γ, α]. Observe that α ≥ 1/(*g*Δ), which implies that γ ≤ (ε/4)α. It follows that α/(1 + ε) ≤ *p* ≤ (1 + ε)α, as desired. The expected running time of each round is *O*(*g*/d_{G}(*x*)).

We prove the runtime analysis of the algorithm in the full version of the paper. In short, the above argument implies that each round takes *O*(*Ng/m*) time in expectation, where *m* = ||*G*||. Further, the expected number of rounds is *O*(Δ*m/N*). Together, this implies that the expected running time of the algorithm is *O*(*g*Δ) = *O*(*g* log(*g*/ε)). ◻
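The proof above translates into code almost directly. The sketch below combines steps (I)-(II) with the Lemma 2 bit: the urns are the sets of *G*, and urn *j* is "non-empty" for *x* iff *x* ∈ *G*[*j*]. The particular choice of γ is one concrete instance of γ = (ε/*g*)^{O(1)}, an assumption of this sketch:

```python
import math, random

def almost_uniform_sample(G, eps, rng=random.Random(1)):
    """Sample x from union(G) with probabilities uniform up to roughly a
    (1+eps) factor, never computing d_G(x) exactly."""
    g = len(G)
    gamma = (eps / (4 * g)) ** 2               # gamma = (eps/g)^{O(1)}
    delta = math.ceil(math.log(1.0 / gamma)) + 4
    sizes = [len(A) for A in G]
    while True:
        A = rng.choices(G, weights=sizes)[0]   # step (I)
        x = rng.choice(tuple(A))               # step (II)
        # Lemma 2 bit: accept with probability ~ 1/(d_G(x) * delta)
        for i in range(1, g * delta + 1):
            if x in G[rng.randrange(g)]:       # probe a random "urn"
                if rng.random() < i / (g * delta):
                    return x
                break                          # rejected; resample x
```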

REMARK 1. *The query time of Lemma 3 can be made to work with high probability with an additional logarithmic factor. Specifically, with high probability, the query time is O*(*g* log(*g*/ε) log *N*).

In this section, we present a data structure that samples an element uniformly at random from ∪*G*. The data structure uses rejection sampling as seen before but splits up all data points using random ranks. Instead of picking an element from a weighted sample of the sets, it will pick a random segment among these ranks and consider only elements whose rank is in the selected range. Let Λ be the sequence of the *n* = |∪ *F*| input elements after a random permutation; the rank of an element is its position in Λ. We first highlight the main idea of the query procedure.

Let *k* ≥ 1 be a suitable value that depends on the collection *G*, and assume that Λ is split into *k* segments Λ_{i}, with *i* ∈ {0, …, *k* – 1}. (We assume for simplicity that *n* and *k* are powers of two.) Each segment Λ_{i} contains the *n/k* elements in Λ with rank in [*i* · *n/k*, (*i* + 1) · *n/k*). We denote with λ_{G,i} the number of elements from ∪*G* in Λ_{i}, and with λ ≥ max_{i} λ_{G,i} an upper bound on the number of these elements in each segment. By the initial random permutation, each segment contains at most λ = Θ((*N/k*) log *n*) elements from ∪*G* with probability at least 1 – 1/*n*^{2}. (Of course, *N* is *not* known at query time.)

The query algorithm works in the following steps in which all random choices are independent.

- (A) Set *k* = *n*, and let λ = Θ(log *n*), σ_{fail} = 0 and Σ = Θ(log^{2} *n*).
- (B) Repeat the following steps until successful or *k* < 2:
  - (I) Assume the input sequence Λ to be split into *k* segments Λ_{i} of size *n/k*, where Λ_{i} contains the points in ∪*F* with ranks in [*i* · *n/k*, (*i*+1) · *n/k*).
  - (II) Select an integer *h* in {0, …, *k* – 1} uniformly at random (i.e., select a segment Λ_{h}).
  - (III) Increment σ_{fail}. If σ_{fail} = Σ, then set *k* = *k*/2 and σ_{fail} = 0.
  - (IV) Compute λ_{G,h}, and with probability λ_{G,h}/λ, declare success.
- (C) If the previous loop ended with success, return an element uniformly sampled among the elements in ∪*G* in Λ_{h}; otherwise return ⊥.

Since each object in ∪*G* has probability 1/(*k*λ) of being returned in Step (C), the result is a uniform sample of ∪*G*. Note that the main iteration in Step (B) works for all values *k*, but a good choice has to depend on *G* for the following reasons. On the one hand, the segments should be small, because otherwise Step (IV) will take too long. On the other hand, they have to contain at least one element from ∪*G*, otherwise we sample many “empty” segments in Step (II). We will see that the number *k* of segments should be roughly set to *N* to balance the trade-off. However, the number *N* of distinct elements in ∪*G* is not known. Thus, we use the naive upper bound of *k* = *n.* To compute λ_{G,h} efficiently, we assume that, at construction time, the elements in each set in *F* are sorted by their rank.
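Putting the pieces together, here is a compact and simplified sketch of the rank-segment sampler. The constants for λ and Σ are illustrative choices, and each set is stored as a rank-sorted list so that λ_{G,h} can be computed by binary search, as described above:

```python
import math, random
from bisect import bisect_left

class UnionSampler:
    def __init__(self, F, rng=random.Random(42)):
        self.rng = rng
        universe = sorted(set().union(*F))
        perm = universe[:]
        rng.shuffle(perm)                      # random ranks
        self.rank = {x: i for i, x in enumerate(perm)}
        self.inv = {i: x for x, i in self.rank.items()}
        self.n = 1 << max(1, (len(universe) - 1).bit_length())  # pad to a power of two
        # each set stored as a sorted list of ranks, enabling range queries
        self.sets = [sorted(self.rank[x] for x in A) for A in F]

    def _segment(self, G, lo, hi):
        """Ranks of the elements of union(G) falling in [lo, hi)."""
        found = set()
        for idx in G:
            A = self.sets[idx]
            j = bisect_left(A, lo)
            while j < len(A) and A[j] < hi:
                found.add(A[j])
                j += 1
        return found

    def sample(self, G):
        """G: list of indices into F. Assumes lambda upper-bounds every
        segment load (holds w.h.p. after the random permutation)."""
        n = self.n
        lam = 3 * (int(math.log2(n)) + 1)              # lambda = Theta(log n)
        big_sigma = 6 * (int(math.log2(n)) + 1) ** 2   # Sigma = Theta(log^2 n)
        k = n
        while k >= 2:
            for _ in range(big_sigma):
                h = self.rng.randrange(k)              # (II): random segment
                seg = self._segment(G, h * n // k, (h + 1) * n // k)
                # (IV): accept the segment w.p. lambda_{G,h} / lambda ...
                if seg and self.rng.random() < len(seg) / lam:
                    # (C): ... then return a uniform element of it
                    return self.inv[self.rng.choice(sorted(seg))]
            k //= 2                                    # (III): halve k
        return None                                    # corresponds to reporting ⊥
```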

LEMMA 4. *Let N* = |∪*G*|, *g* = |*G*|, *m* = Σ_{A∈G} |*A*|, *and n* = |∪*F*|. *With probability at least* 1 – 1/*n*^{2}, *the algorithm described above returns an element x* ∈ ∪*G* *according to the uniform distribution. With high probability, the algorithm has a running time of O*(*g* log^{4} *n*).

PROOF. We start by bounding the initial failure probability of the data structure. By a union bound, we have that the following two events hold simultaneously with probability at least 1 – 1/*n*^{2}:

- Every segment of size *n/k* contains no more than λ = Θ(log *n*) elements from ∪*G*, for all *k* = 2^{i} where *i* ∈ {1, …, log *n*}. Since elements are initially randomly permuted, the claim holds with probability at least 1 – 1/(2*n*^{2}) by suitably setting the constant in λ = Θ(log *n*).
- The algorithm does not report ⊥. The probability of this event is upper bounded by the probability *p'* that no element is returned in the Σ iterations where *k* = 2^{⌈log N⌉} (the actual probability is even lower, since an element can be returned in an iteration where *k* > 2^{⌈log N⌉}). In each such iteration, success is declared with probability Σ_{h} (1/*k*) · λ_{G,h}/λ = *N*/(*k*λ) = Ω(1/λ) = Ω(1/log *n*), so by suitably setting the constants in Σ = Θ(log^{2} *n*) we get *p'* ≤ (1 – Ω(1/log *n*))^{Σ} ≤ 1/(2*n*^{2}).
From now on assume that these events are true.

As noted earlier, each element has probability of 1/(*k*λ) of being returned as output, and thus, elements are equally likely to be sampled. Note also that the guarantees are independent of the initial random permutation as soon as the two events above hold. This means that the data structure returns a uniform sample from a union-of-sets.

For the running time, first focus on the round where *k* = 2^{⌈log N⌉}. In this round, we carry out Θ(log^{2} *n*) iterations. In Step (IV), λ_{G,h} is computed by iterating through the *g* sets and collecting points using a range query on segment Λ_{h}. Since elements in each set are sorted by their rank, the range query can be carried out by searching for rank *hn/k* using a binary search in *O*(log *n*) time and then enumerating all elements with rank smaller than (*h*+1)*n/k*. This takes time *O*(log *n* + *o*) for each set, where *o* is the output size. Since each segment contains *O*(log *n*) elements from ∪*G* with high probability, one iteration of Step (IV) takes time *O*(*g* log *n*).

The time to carry out all Σ = Θ(log^{2} *n*) iterations is thus bounded by *O*(*g* log^{3} *n*). Observe that in all the rounds carried out before, *k* is only larger and thus the segments are smaller. This means that we may multiply our upper bound by the number of rounds, which is at most log *n*; this completes the proof. ◻

Using count distinct sketches to find a good choice for the number of segments *k*, the running time can be decreased to *O*(*g* log^{3} *n*); we refer to the full version^{8} for more details.

### 3. Handling Outliers

Imagine a situation where we have a marked set of outliers *O*. We are interested in sampling from ∪*G*\*O*. We assume that the total degree of the outliers in the query is at most *m*_{O} for some prespecified parameter *m*_{O}. More precisely, we have d_{G}(*O*) = Σ_{x∈O} d_{G}(*x*) ≤ *m*_{O}. We get the following results by running the original algorithms from the previous section and removing outliers once we encounter them. If we encounter more than *m*_{O} outliers, we report that the number of outliers exceeds *m*_{O}.
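The outlier-handling wrapper can be sketched as follows, here over the exact-degree union sampler for brevity; the budget check on distinct outliers seen is a simplification of the d_{G}(*O*) ≤ *m*_{O} accounting:

```python
import random

def sample_union(G, rng):
    """Uniform sample from union(G) (exact-degree rejection sampling)."""
    sizes = [len(A) for A in G]
    while True:
        A = rng.choices(G, weights=sizes)[0]
        x = rng.choice(tuple(A))
        if rng.random() * sum(1 for B in G if x in B) < 1.0:
            return x

def sample_without_outliers(G, outliers, m_O, rng=random.Random(3)):
    """Uniform sample from union(G) \\ outliers; give up once more than
    m_O distinct outliers have been encountered."""
    seen = set()
    while True:
        x = sample_union(G, rng)
        if x not in outliers:
            return x
        seen.add(x)
        if len(seen) > m_O:
            return None     # report that the outlier budget is exceeded
```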

Running the algorithm described in Section 2.1 provides the guarantees summarized in the following lemma.

LEMMA 5. *The input is a family of sets F that one can preprocess in linear time. A query is a sub-family* *G* ⊆ *F*, *a set of outliers O, a parameter m*_{O}, *and a parameter* ε ∈ (0, 1). *One can either:*

- *Sample an element x* ∈ ∪*G*\*O* *with an ε-approximate uniform distribution: specifically, the probability* μ_{x} *of x to be output is* ϕ/(1 + ε) ≤ μ_{x} ≤ (1 + ε)ϕ, *with* ϕ = 1/|∪*G*\*O*|.
- *Alternatively, report that* d_{G}(*O*) > *m*_{O}.

*The expected query time is O*(*m*_{O} + *g* log(*n*/ε)), *and the query succeeds with high probability, where g* = |*G*|, *and n* = ||*F*||.

Running the algorithm described in Section 2.2 and keeping track of outliers has the following guarantees.

LEMMA 6. *The input is a family of sets F that one can preprocess in linear time. A query is a sub-family* *G* ⊆ *F*, *a set of outliers* *O*, *and a parameter m*_{O}. *With high probability, one can either:*

- *Sample a uniform element x* ∈ ∪*G*\*O*, *or*
- *Report that* d_{G}(*O*) > *m*_{O}.

*The expected query time is O*((*g* + *m*_{O}) log^{4} *n*).

### 4. Finding a Fair Near Neighbor

In this section, we employ the data structures developed in the previous sections to show the results on fair near neighbor search listed in Section 1.3.

First, let us briefly give some preliminaries on LSH. We refer the reader to Har-Peled et al.^{16} for further details. Throughout the section, we assume our metric space (*X*, *D*) admits an LSH data structure.

Locality Sensitive Hashing (LSH) is a common tool for solving the ANN problem and was introduced by Indyk and Motwani^{20} (see also the journal version, Har-Peled et al.^{16}).

DEFINITION 4. *A distribution H over maps h*: *X* → *U, for a suitable set U, is* (*r*, *c*·*r*, *p*_{1}, *p*_{2})-sensitive *if the following holds for any* **x**, **y** ∈ *X:*

- *if* *D*(**x**, **y**) ≤ *r*, *then* Pr_{h}[*h*(**x**) = *h*(**y**)] ≥ *p*_{1};
- *if* *D*(**x**, **y**) > *c*·*r*, *then* Pr_{h}[*h*(**x**) = *h*(**y**)] ≤ *p*_{2}.

*The distribution H is an* LSH family, *and has quality* ρ = log(1/*p*_{1})/log(1/*p*_{2}).

For the sake of simplicity, we assume that *p*_{2} ≤ 1/*n*: if *p*_{2} > 1/*n*, then it suffices to create a new LSH family *H*_{K} obtained by concatenating *K* = Θ(log_{p_{2}}(1/*n*)) independent and identically distributed hash functions from *H*. The new family is (*r*, *c*·*r*, *p*_{1}^{K}, *p*_{2}^{K})-sensitive, and ρ does not change.
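The concatenation step is easy to illustrate: a draw from *H*_{K} is a tuple of *K* independent draws from *H*, so two points collide under the tuple exactly when they collide under every coordinate, and both collision probabilities are raised to the *K*-th power while the quality ρ is unchanged. A small illustrative helper (names are ours, not from the paper):

```python
import random

def concat_hash(make_hash, k, seed=0):
    """Concatenate k i.i.d. draws from an LSH family into one function
    of the powered family H_K.  `make_hash(rng)` returns a fresh hash
    function h: X -> U.  (Helper names are ours.)"""
    rng = random.Random(seed)
    hs = [make_hash(rng) for _ in range(k)]
    # Two points collide under the tuple iff they collide under every
    # h_i, so p1 and p2 become p1**k and p2**k, while rho is unchanged.
    return lambda x: tuple(h(x) for h in hs)
```

With the Euclidean LSH of Datar et al.,^{13} for instance, `make_hash` would draw a random projection and offset; choosing *k* so that *p*_{2}^{k} ≤ 1/*n* yields the assumption used above.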

The standard approach to (*c*, *r*)-ANN using LSH functions is the following. Let *D* denote the data structure constructed by LSH, and let *c* denote the approximation parameter of LSH. *D* consists of L = *n*^{ρ} hash functions *ℓ*_{1}, …, *ℓ*_{L} randomly and uniformly selected from *H*, and it contains *L* hash tables *H*_{1}, …, *H*_{L}: each hash table *H*_{i} stores the input set *S*, using the hash function *ℓ*_{i} to split the point set into buckets. For a query **q**, we iterate over the L hash tables: for each *i*, we compute *ℓ*_{i}(**q**) and retrieve from *H*_{i} the bucket *H*_{i}(**q**) of points in *S* with the same hash value; then, we compute the distance *D*(**q**, **p**) for each point **p** ∈ *H*_{i}(**q**). The procedure stops as soon as a (*c*, *r*)-near point is found. It stops and returns ⊥ if there are no remaining points to check or if it has inspected more than 3L far points. We summarize the guarantees, from Har-Peled et al.,^{16} in the following lemma.
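The query procedure above can be sketched directly, assuming each `tables[i]` is a dict from hash values to lists of points (helper names are ours):

```python
def lsh_query(q, tables, hash_funcs, dist, r, c):
    """Standard (c, r)-ANN query with LSH: scan the L buckets that q
    hashes to, return the first point within distance c*r, and give up
    after inspecting more than 3L far points.  A sketch of the textbook
    procedure; tables[i] maps hash values to lists of points."""
    L = len(tables)
    far_seen = 0
    for H_i, l_i in zip(tables, hash_funcs):
        for p in H_i.get(l_i(q), []):
            if dist(q, p) <= c * r:
                return p              # found a (c, r)-near point
            far_seen += 1
            if far_seen > 3 * L:      # too many far points: return ⊥
                return None
    return None                       # no remaining points to check
```

Note that the procedure returns the *first* (*c*, *r*)-near point it encounters, which is exactly the source of the unfairness studied in this paper.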

LEMMA 7. *For a given query point* **q**, *let S*_{q} = ∪_{i}*H*_{i}(**q**). *Then, for any point* **p** ∈ *B*_{S}(**q**, *r*), *with probability at least* 1 − 1/*e* − 1/3 *we have (i)* **p** ∈ *S*_{q} *and (ii)* |*S*_{q} \ *B*_{S}(**q**, *cr*)| ≤ 3L, *that is, the number of outliers is at most* 3L. *Moreover, the expected number of outliers in any single bucket H*_{i}(**q**) *is at most* 1.

By repeating the construction *O*(log *n*) times, we guarantee that, with high probability, *B*_{S}(**q**, *r*) ⊆ *S*_{q}.

For *t* = *O*(log *n*), let *D*_{1}, …, *D*_{t} be data structures constructed by LSH. Let *F* be the collection of all buckets appearing in the hash tables of *D*_{1}, …, *D*_{t}. For a query point **q**, consider the family *G* ⊆ *F* of all buckets containing the query, that is, the buckets *H*_{i}^{(j)}(**q**) for *i* ≤ L and *j* ≤ *t*, and thus |*G*| = *O*(L log *n*). Moreover, we let *O* be the set of outliers, that is, the points that are farther than *cr* from **q**. Note that, as stated in Lemma 7, the expected number of outliers in each bucket of LSH is at most 1. Therefore, by Lemma 5, we immediately get the following result.

LEMMA 8. *Given a set S of n points and a parameter r, we can preprocess it such that, given a query* **q**, *one can report a point* **p** ∈ *S′ with probability* μ_{p}, *where* ϕ/(1 + ε) ≤ μ_{p} ≤ (1 + ε)ϕ, *S′ is a point set satisfying B*_{S}(**q**, *r*) ⊆ *S′* ⊆ *B*_{S}(**q**, *cr*), *and* ϕ = 1/|*S′*|. *The algorithm uses space O*(L log *n*), *and its expected query time is O*(L log *n* log(*n*/ε)).

REMARK 2. *By repeatedly calling the query procedure and disregarding points at distance larger than r, the algorithm described above solves the Approximately Fair NN Problem (Definition 2). The probability that the algorithm succeeds in a round is* ρ = *n*(**q**, *r*)/*n*(**q**, *cr*), *and as such the expected number of rounds is* 1/ρ. *Thus, this approach has expected query time O*((L log *n* log(*n*/ε))/ρ).

We use the same setup as in the previous section and build *t* = *O*(log *n*) data structures *D*_{1}, …, *D*_{t} using LSH. We then run the algorithm described in Section 2.2, with all points at distance more than *r* from the query marked as outliers. By the properties of the LSH and the random ranks, we expect to encounter only a few points at distance at least *r*. This allows us to obtain the following result.

LEMMA 9. *Given a set S of n points and a parameter r, we can preprocess it such that, given a query* **q**, *one can report each point* **p** ∈ *B*_{S}(**q**, *r*) *with probability* 1/*n*(**q**, *r*). *The algorithm uses space O*(L log *n*) *and has expected query time O*(L log^{5} *n*).

### 5. Experimental Evaluation

The example provided in Section 1.1 already showed the bias of sampling naively from the LSH buckets. In this section, we consider the influence of the approximate variants discussed here and give a brief overview of the running time differences. A detailed experimental evaluation can be found in the full paper.^{8}

For concreteness, we take the MNIST dataset of handwritten digits available at http://yann.lecun.com/exdb/mnist/. We use the Euclidean space LSH from Datar et al.,^{13} set a distance threshold of 1250, and initialize the LSH with L = 100 repetitions, *k* = 15, and *w* = 3750. These parameter settings provide a false negative rate of around 10%. We take 50 points as queries and test the following four different sampling strategies on the LSH buckets:

- **Uniform/uniform:** picks a bucket uniformly at random and picks a random point in the bucket.
- **Weighted/uniform:** picks a bucket according to its size, and picks a uniformly random point inside the bucket.
- **Degree approximation:** picks a bucket according to its size, and picks a uniformly random point **p** inside the bucket. It approximates **p**'s degree (using Lemma 1) and rejects **p** with probability 1 − 1/deg′(**p**). This is the approach discussed in Remark 2.
- **Optimal:** picks a bucket according to its size, and picks a uniformly random point **p** inside the bucket. It then computes **p**'s degree *exactly* and rejects **p** with probability 1 − 1/deg(**p**). This is the approach discussed in Remark 2, but with exact degree computation, solving Fair NN.

Each method removes non-close points that might be selected from the bucket. We remark that the variant Uniform/uniform most closely resembles a standard LSH approach. Weighted/uniform takes the different bucket sizes into account but disregards the individual frequency of a point; its output is therefore *not expected* to be uniform, though it might be closer to the uniform distribution.
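As an illustration, the Optimal strategy (weighted bucket choice plus exact-degree rejection) can be sketched in a few lines, assuming buckets are plain lists of points and degrees are computed by scanning all buckets; names are ours:

```python
import random

def optimal_sample(buckets, q, dist, r, rng=random):
    """Sketch of the 'Optimal' strategy: pick a bucket with probability
    proportional to its size, pick a uniform point p inside it, discard
    non-close points, and reject p with probability 1 - 1/deg(p), where
    deg(p) is the number of buckets containing p.  The accepted output
    is uniform over the near neighbors present in the buckets."""
    while True:
        bucket = rng.choices(buckets,
                             weights=[len(b) for b in buckets], k=1)[0]
        p = rng.choice(bucket)
        if dist(q, p) > r:
            continue  # non-close point: discard and retry
        deg = sum(1 for b in buckets if p in b)
        if rng.random() < 1.0 / deg:
            return p
```

The Degree-approximation strategy is identical except that `deg` is replaced by the estimate deg′(**p**) from Lemma 1 rather than an exact scan.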

**Output Distribution.** For each query **q**, we compute the set of near neighbors *M*(**q**) of **q** in the LSH buckets. For each sampling strategy, we carry out the query 100|*M*(**q**)| times. The sampling results give rise to a distribution μ on *M*(**q**), and we compare this distribution to the uniform distribution in which each point is sampled with probability 1/|*M*(**q**)|. Figure 2 reports the total variation distance between the uniform distribution and the observed distribution, that is, ½ Σ_{**p**∈*M*(**q**)} |μ(**p**) − 1/|*M*(**q**)||. As in our introductory example, we see that uniformly picking an LSH bucket results in a heavily biased distribution. Taking the size of the buckets into account in the weighted case helps a bit, but still results in a heavily biased distribution. Even with the simplest approximation strategy for the degree, we see an improvement and achieve a total variation distance of around 0.08, with the optimal algorithm achieving around 0.04.
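The measurement itself is straightforward to reproduce from the empirical sample counts; a small sketch, with names of our choosing:

```python
from collections import Counter

def total_variation(samples, support):
    """Empirical total variation distance between the distribution of
    `samples` and the uniform distribution on `support`."""
    counts = Counter(samples)
    n, u = len(samples), 1.0 / len(support)
    return 0.5 * sum(abs(counts[p] / n - u) for p in support)
```

Here `samples` would hold the 100|*M*(**q**)| query results and `support` the near-neighbor set *M*(**q**).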

**Figure 2. Total variation distance of different approaches on the MNIST dataset.**

**Differences in Running Time.** Compared to a naïve approach of collecting all colliding points in the buckets and selecting a near neighbor at random, the methods presented here provide a speed-up of more than two orders of magnitude in our experiments. The fair methods based on rejection sampling are approximately a factor of 10 slower than their biased counterparts that just pick a (weighted) point at random. Finally, the approximate degree sampling provides running times that are approximately two times faster than an exact computation of the degree.

### 6. Conclusion and Future Work

In this paper, we have investigated a possible definition of fairness in similarity search by connecting the notion of “equal opportunity” to independent range sampling. An interesting open question is to investigate the applicability of our data structures for problems such as discrimination discovery,^{27} diversity in recommender systems,^{1} privacy preserving similarity search,^{26} and estimation of kernel density.^{11} Moreover, it would be interesting to investigate techniques for providing incentives (that is, reverse discrimination^{27}) to prevent discrimination: An idea could be to merge the data structures in this paper with distance-sensitive hashing functions in Aumüller et al.,^{6} which allow to implement hashing schemes where the collision probability is an (almost) arbitrary function of the distance. Finally, the techniques presented here require a manual trade-off between the performance of the LSH part and the additional running time contribution from finding the near points among the non-far points. From a user point of view, we would much rather prefer a parameterless version of our data structure that finds the best trade-off with small overhead, as discussed in Ahle et al.^{3} in another setting.

### Acknowledgments

S. Har-Peled was partially supported by a NSF AF award CCF-1907400. R. Pagh is part of BARC, supported by the VILLUM Foundation grant 16582. F. Silvestri was partially supported by PRIN Project n. 20174LF3T8 AHeAD.
