acm-header
Sign In

Communications of the ACM

Contributed articles

Private Information Retrieval


data pipes

Credit: Jean-Francois Podevin

The ubiquity of the Internet means a plethora of online public databases and an indispensable resource for retrieving up-to-date information. But it also poses a significant risk to user privacy, since a malicious database owner may monitor user queries and infer what the user is after. Indeed, in cases where user intentions are to be kept secret, users are often cautious about accessing public databases. For example, investors querying a stock-market database for the current market value of certain stocks might prefer not to reveal their interest in the stocks because it could inadvertently influence their price. Alternatively, companies might want to search for certain patents without revealing the patents' identities.

Private information retrieval (PIR) schemes are cryptographic protocols designed to safeguard the privacy of database users. They allow clients to retrieve records from public databases while completely hiding the identity of the retrieved records from database owners. The possibility of retrieving database records without revealing their identities to the owner of the database may seem beyond hope. Note, however, that a trivial solution is available: When users want a single record, they can ask for a copy of the whole database. This solution involves enormous communication overhead and is likely to be unacceptable. It turns out7 that for users who want to keep their privacy fully protected (in the "information-theoretic" sense), this trivial solution is optimal.

Fortunately, the negative result applies only to databases stored on a single server, rather than those replicated across several servers. In 1995, Chor et al.7 came up with PIR schemes that enable private retrieval of records from replicated databases, with a non-trivially small amount of communication. In such protocols, users query each server holding the database. The protocol ensures that each individual server (by observing only the query it receives) gets no information about the identity of the items of user interest. Chor et al's seminal paper7 triggered an important and extensive body of follow-up work.1, 3, 4, 8, 12, 18, 21, 23, 24

Computational PIR. PIR schemes discussed here are often called "information theoretic" because they provide an absolute guarantee that each server participating in protocol execution gets no information about what users are after. PIR has also been studied in a computational setting.6, 10, 15, 16 Computational PIR schemes are similar to their information-theoretic counterparts but provide a weaker guarantee. Specifically, they ensure only that a server cannot get any information about user intent unless it solves a certain computationally hard problem (such as factoring a large random integer). Providing privacy or security guarantees based on computational hardness assumptions is common in modern cryptography. In contrast to information-theoretic PIR schemes, computational PIR protocols with low communication overhead exist (under standard assumptions), even when a database is stored on a single server (see the figure here). However, for typical real-life parameters, known computational protocols are less efficient than known information-theoretic ones.

Locally decodable codes. PIR schemes are intimately related to a special class of error-correcting codes called "locally decodable codes," or LDCs, that, on their own, are objects of interest. All recent constructions of information-theoretic PIR schemes work by first constructing LDCs, then converting them to PIRs.

Error-correcting codes help ensure reliable transmission of information over noisy channels, as well as reliable storage of information on a medium that may be partially corrupted over time or whose reading device is subject to errors. Such codes allow one to add redundancy, or bit strings, to messages, encoding them into longer bit strings, called "codewords," in a way that the message can still be recovered even if a certain fraction of the codeword bits are corrupted. In typical applications of error-correcting codes the message is first partitioned into small blocks, each of which is then encoded separately. This encoding strategy allows efficient random-access retrieval of the information, since one must decode only the portion of data in which one is interested. Unfortunately, this strategy yields poor noise resilience, since, when even a single block (out of possibly tens of thousands) is completely corrupted, some information is lost. In view of this limitation it would seem preferable to encode the whole message into a single codeword of an error-correcting code. Such a solution improves the robustness to noise but is hardly satisfactory, since one needs to look at the whole codeword in order to recover any particular bit of the message (at least when using classical error-correcting codes). Such decoding complexity is prohibitive for today's massive data sets.

LDCs simultaneously provide efficient random-access retrieval and high noise resilience by allowing reliable reconstruction of an arbitrary bit of the message from looking at only a small number of randomly chosen codeword bits. Local decodability comes at the price of certain loss in terms of code efficiency. Specifically, LDCs require longer codeword lengths than their classical counterparts. Though LDCs were discussed in the literature2, 19 in the early 1990s, the first formal definition of LDCs was given in 2000 by Katz and Trevisan.13 Further work on the efficiency of LDCs was covered in.3, 4, 8, 12, 14, 18, 24

Outline. Computational and information theoretic PIR schemes rely on different sets of techniques. The main focus in this article is information-theoretic schemes, reserving the term PIR for information-theoretic protocols. In the following section we make the notions of information-theoretic PIR and LDCs more concrete, explaining the relationship between them. Later, we demonstrate how nontrivial PIR is possible, offering an exposition of one of the simplest and earliest schemes. We then present the most basic PIR protocol of the current generation. Finally, we discuss computational PIR protocols and an early single-server scheme due to Kushilevitz and Ostrovsky.15

Back to Top

Preliminaries

Beginning with a (slightly informal) definition of PIR schemes, we model the database as an n-bit string x that is replicated between a small number k of non-communicating servers S1,...,Sk. The user holds an index i (an integer between 1 and n) and is interested in obtaining the value of the bit xi. To achieve this goal, the user tosses several random coins, queries each server, and receives replies from which the desired bit xi is computed. The query to each server is distributed independently of i; therefore, each server gets no information about what the user is after.

Note that user queries are not necessarily requests for particular individual bits or sets of database bits. Rather, they specify functions computed by the servers; for example, a query may specify a set of indices between 1 and n, and the server's response may be the XOR of the database bits stored at these indices.

Historically, the main parameter of interest in PIR schemes was communication complexity, or a function of n measuring the total number of bits communicated between user and servers, maximized over all choices of x in {0, 1}n, i in 1 to n, and the user's random coin tosses.

The most efficient two-server PIR protocols available today have communication complexity of O(n1/3). These protocols, which are due to Chor et al.,7 have never been improved upon. However, PIR schemes involving three or more servers have seen improvement.1, 4, 7, 8, 12, 24 The best-known are due to Efremenko8 and Itoh and Suzuki.12 Efremenko's three-server PIR schemes require f(n) bits of communication, where f(n) is a certain function that grows slower than any polynomial in n but faster than any power of the logarithm of n. The best lower bound for the communication complexity of two-server PIR is 5 log n due to Wehner and de Wolf.21 Closing the gap between upper and lower bounds for PIR communication complexity is a major open problem.

Now we address a more detailed treatment of LDCs that have three parameters: k, , and . A (k, , )-LDC encodes n-bit messages x to N-bit codewords C(x), such that for every i between 1 and n the bit xi can be recovered with probability 1 by a randomized decoding procedure that reads only k codeword bits, even after the codeword C(x) is adversarially corrupted in up to N locations. Here, the probability is only over the decoder's random coin tosses.

We illustrate the notion of LDCs by presenting a (2, , 2)-(Hadamard) LDC that encodes n-bit messages to 2n-bit codewords. In what follows, let [n] denote the set {1,...,n}.

Hadamard code. Every coordinate in the Hadamard code corresponds to one (of 2n) subsets of [n] and stores the XOR of the corresponding bits of the message x. Let y be an (adversarially corrupted) encoding of x. Given an index i jukcy_u.gif [n] and y, the Hadamard decoder picks a set S in [n] uniformly at random and outputs the XOR of the two coordinates of y corresponding to sets S and S {i}. (Here, denotes the symmetric difference of sets (such as {1, 4, 5} {4} = {1, 5}, and {1, 4, 5} {2} = {1, 2, 4, 5}). It's not difficult to verify that both decoders' queries go to uncorrupted locations if y differs from the correct encoding of x in at most fraction of coordinates than with probability 1 2. In such cases, the decoder correctly recovers the i-th bit of x.

The Hadamard code allows for a super-fast recovery of the message bits (such as, given a codeword corrupted in 0.1 fraction of coordinates, one is able to recover any bit of the message with probability 0.8 by reading only two codeword bits) at a price of very large codeword length. Designing LDCs with optimal, or smallest possible, codeword length is a major challenge.

The definition of LDCs implies that every bit xi of the message can be recovered from many different k-tuples of codeword bits. Note that such k-tuples cannot all belong to a small (o(N)-size) subset S of codeword coordinates, since corrupting the coordinates in S could lead to the high probability of a decoding error. Thus the distribution of every individual query of the decoder must be somewhat close to a uniform distribution on the codeword bits. Indeed, many known LDCs have decoders whose individual queries are distributed perfectly uniformly; such LDCs are called "perfectly smooth."


The protocol ensures that every individual server (by observing only the query it receives) gets no information about the identity of the items of user interest.


There is a strong relationship between LDCs and PIR schemes. Short LDCs yield efficient PIR schemes and vice versa. Here, we demonstrate the flavor of this relationship, presenting a general procedure that obtains a k-server PIR scheme from any perfectly smooth k-query LDC.

Let C be a perfectly smooth LDC encoding n-bit messages to N-bit codewords. At the preprocessing stage, servers S1,...,Sk encode the n-bit database x with the code C. Next, a user interested in obtaining the i-th bit of x tosses random coins and generates a k-tuple of queries (q1,...,qk), such that xi can be computed from C(x)q1,...,C(x)qk. For every j in [k], the user sends the query qj to the server Sj. Each server Sj responds with a one-bit answer C(x)qj. The user combines the servers' responses to obtain xi.

Verifying that the protocol is private is straightforward, since for every j in [k] the query qj is uniformly distributed over the set of codeword coordinates; the total communication is given by k(logN + 1).

Back to Top

Early PIR Scheme

Let d be a small integer. Our goal here is to demonstrate how nontrivial PIR is possible by presenting a simple (d+1)-server scheme with O(n1/d) communication to access an n-bit database. The key idea behind this scheme is polynomial interpolation in a finite-field setting.

We begin with some technical background. Let p > d be a prime. It is well known that addition and multiplication of numbers {0,..., p1} modulo p satisfy the standard identities that one is used to over the real numbers. That is, numbers {0,..., p1} form a finite field with respect to these operations. This field is denoted by Fp. In what follows we deal with polynomials defined over finite fields. Such polynomials have all algebraic properties that one is used to with polynomials over the real numbers. Specifically, a univariate polynomial over Fp of degree d is uniquely determined by its values at any d + 1 points.

Let m be a large integer. Let E1,...,En be a certain collection of n vectors over Fp of dimension m. The collection is fixed and independent of the n-bit database x. We assume the collection is known to both the servers and the user. On the preprocessing stage of the PIR protocol, each of (d + 1) servers represents the database x by the same degree d polynomial f in m variables. The key property of such a polynomial is that for every i in [n]: f(Ei) = xi. In order to ensure that such a polynomial f exists we choose m to be reasonably large compared to n. Setting m = O(n1/d) suffices.

Now suppose the user wants to retrieve the i-th bit of the database and knows the collection of vectors E1,...,En. The user's goal is thus to recover the value of the polynomial f (held by the servers) at Ei. Obviously, the user cannot explicitly request the value of f at Ei from any of the servers, since such a request would ruin the privacy of the protocol; that is, some server will get to know which database bit the user is after. Instead, the user obtains the value of f(Ei) indirectly, relying on the rich structure of local dependencies between the evaluations of a low-degree polynomial f at multiple points. Specifically, the user generates a randomized collection of m-dimensional vectors P1,...,Pd+1 over Fp such that:

  1. Each of the vectors P is individually uniformly random and thus provides no information about Ei; and
  2. The values of any degree d polynomial (including the polynomial f) at P1,...,Pd+1 determine the value of the polynomial at Ei.

The user sends each server one of the vectors P1,...,Pd+1. The servers then evaluate the polynomial f at the vectors they receive and return the values they obtain back to the user. The user combines the values f(P1),...,f(Pd+1) to get the desired value f(Ei). The protocol is perfectly private, and the communication amounts to sending (d + 1) vectors of dimension m to the servers and a single value back to the user. Here, we elaborate on how vectors P1,...,Pd+1 are chosen.

The user picks an m-dimensional vector V uniformly at random and for every between 1 and d+1 sets P = Ei+V. Clearly, every individual vector P is uniformly random. To see that values of f(P1),...,f(Pd+1) determine the value of f(Ei) consider a univariate polynomial g() = f(Ei + V). Note that the degree of g is at most d. Therefore the values of g at d + 1 points determine the polynomial g uniquely. It remains to notice that g() = f(P) and g(0) = f(Ei) = xi.

This PIR scheme is a classical example of PIR schemes. The ideas behind it can be traced back to Babai et al.2 The idea of obtaining PIR schemes and LDCs through polynomials was used extensively thereafter.1, 3, 7 The most powerful constructions in this framework are due to Beimel et al.4; see also Woodruff and Yekhanin.23

Back to Top

Modern LDCs and PIRs

In 2007, the author of the current article proposed a new approach to designing LDCs and PIR schemes,24 leading to development of today's most effective codes and schemes.8, 12, 18 The new constructions are not based on polynomials; their key technical ingredient is a design of a large family of sets with restricted intersections. Here, we offer a bird's-eye view of the construction of LDCs. Such codes are easily turned into PIR schemes.

Let k be a small integer. Here, we design a binary k-query LDC C that encodes n-bit messages to codewords. This construction involves two steps: The first is a reduction to a problem of constructing a certain family of sets with restricted intersections; the second is an algebraic construction of the desired family.

Step 1. Construct an F2-linear code C, so:

  • C is an F2-linear map. For any two messages x1, x2 in Fn2 we have C(x1 + x2) = C(x1) + C(x2), where the sums of vectors are computed modulo two in each coordinate; and
  • The decoding algorithm proceeds by tossing random coins, reading a certain k-tuple of coordinates of the corrupted codeword and outputting the XOR of the values in these coordinates.

For i in [n], let ei denote a binary n-dimensional vector, whose unique non-zero coordinate is i. Observe that every linear LDC admits a combinatorial description. That is, to define a linear LDC it is sufficient to specify for every i in [n]:

  • A set Ti of coordinates of C(ei) that are set to 1. Such sets completely specify the encoding, since for any message x, C(x) = i:xi=1 C(ei); and
  • A family Qi of k-size subsets of codeword coordinates that can be read by a decoding algorithm while reconstructing the i-th message bit.

Not every collection of sets Ti and families Qi yields an LDC. Certain combinatorial constraints must be satisfied. The basic rationale for these constraints is the following:

  1. The decoding must be correct in case no codeword bits have been corrupted. This implies that for every i, j in [n] and any k-set S in Qi the size of S Tj must be odd if i = j and even otherwise; and
  2. The distribution of individual queries of the decoding algorithm must be close to uniform. This implies that for every i in [n], the union of the k-sets in Qi must be large relative to the number of codeword coordinates.

Step 2. Design a collection of sets Ti and families Qi that satisfy these constraints. This is where most of the technical work occurs. The construction is supported by a strong geometric intuition. We consider a bijection between the set of codeword coordinates and a set of m-dimensional vectors over a finite field of cardinality k. We choose sets Ti to be unions of certain parallel hyperplanes in the m-dimensional linear space over Fk and families Qi to be certain families of lines. (Note that lines over Fk have only k points.) We use basic algebra to argue about the intersection sizes.

Back to Top

Computational PIR

Computational PIR schemes are attractive because they avoid the need to maintain replicated copies of a database and do not compromise user privacy against colluding servers. Here, we share a high-level overview (the basic version) of the Kushilevitz and Ostrovsky scheme that relies on the standard hardness assumption15the "quadratic residuosity assumption."

Let m be a positive integer. A number a is called a "quadratic residue," or QR, modulo m, if there exists an integer x such that x2 = a mod m. Otherwise, a is called a "quadratic non-residue," or QNR, modulo m. The QR assumption states that it is computationally hard to distinguish numbers that are QRs modulo m from those that are not, unless one knows the factorization of m.

The protocol. The server stores the n-bit database x in a square matrix of size s by s for cacm5304_g.gif.

ueq01.gif

Suppose the database user is interested in obtaining the value xij for some i, j between 1 and s. The user would select a large integer m together with its factorization at random and generate s - 1 random QRs a1,...,ai1, ai+1,...,as modulo m, as well as a single random QNR bi. The user would then send the string a1,...,ai1, bi, ai+1,...,as and the integer m to the server.

The QR assumption implies that the server cannot distinguish bi from integers {al} and thus observes only a string of s random-looking integers u1,...,us modulo m (one per each row of the database). The server responds with s integers 1,..., s (one per each column of the database). Here each h is equal to a product of integers ul for all l such that xlh = 1; formally, h = l | xlh=1 ul mod m.

Verifying that the value of j is going to be a QR modulo m if xij = 0 and is going to be a QNR modulo m is not hard if xij = 1, since a product of two QRs is a QR and the product of a QR with a QNR is a QNR. The user need only check whether j is a QR, which is easy, since the user knows the factorization of m.

The total amount of communication in this PIR scheme is roughly cacm5304_h.gif. Better protocols are available; see Gentry and Ramzan10 and Lipmaa16 for the most efficient computational PIR schemes.

Back to Top

Conclusion

With nearly 15 years of development, the PIR field has grown large and deep, with many subareas and connections to other fields. For more details, see Ostrovsky and Sketch17 for a survey of computational PIR and Trevisan20 and Yekhanin25 for a survey of information theoretic PIR; see also Gasarch.9

Here, we have concentrated on a single (the most studied) aspect of PIR schemestheir communications complexity. Another important aspect of PIR schemes is the amount of computation servers must perform in order to respond to user queries.5, 11, 23 We are hopeful that further progress will lead to the wide practical deployment of these schemes.

Back to Top

Acknowledgments

We would like to thank Martin Abadi, Alex Andoni, Mihai Budiu, Yuval Ishai, Sumit Nain, and Madhu Sudan and the anonymous referees for helpful comments regarding this article.

Back to Top

References

1. Ambainis, A. Upper bound on the communication complexity of private information retrieval. In Proceedings of the Automata, Languages and Programming, 24th International Colloquium LNCS 1256 (Bologna, Italy, July 711). Springer, 1997, 401407.

2. Babai, L., Fortnow, L., Levin, L., and Szegedy, M. Checking computations in polylogarithmic time. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing (New Orleans, LA, May 58). ACM Press, New York, 1991, 2131.

3. Beimel, A., Ishai, Y., and Kushilevitz, E. General constructions for information-theoretic private information retrieval. Journal of Computer and System Sciences 71, 2 (2005), 213247.

4. Beimel, A., Ishai, Y., Kushilevitz, E., and Raymond, J.F. Breaking the O (n1/(2k1)) barrier for information-theoretic private information retrieval. In Proceedings of the 43rd Symposium on Foundations of Computer Science (Vancouver, B.C., Nov. 1619). IEEE Computer Society, 2002, 261270.

5. Beimel, A., Ishai, Y., and Malkin, T. Reducing the servers' computation in private information retrieval: PIR with preprocessing. In Proceedings of the 20th Annual International Cryptology Conference LNCS 1880 (Santa Barbara, CA, Aug. 2024). Springer, 2000, 5573.

6. Cachin, C., Micali, S., and Stadler, M. Computationally private information retrieval with polylogarithmic communication. In Proceedings of the International Conference on the Theory and Application of Cryptographic Techniques LNCS 1592 (Prague, Czech Republic, May 26). Springer, 1999, 402414.

7. Chor, B., Goldreich, O., Kushilevitz, E., and Sudan, M. Private information retrieval. Journal of the ACM 45, 6 (1998), 965981.

8. Efremenko, K. 3-query locally decodable codes of subexponential length. In Proceedings of the 41st Annual ACM Symposium on Theory of Computing (Bethesda, MD, May 31June 2). ACM Press, New York, 2009, 3944.

9. Gasarch, W. Web page on private information retrieval; http://www.cs.umd.edu/gasarch/pir/pir.html

10. Gentry, C. and Ramzan, Z. Single-database private information retrieval with constant communication rate. In Proceedings of the 32nd International Colloquium on Automata, Languages and Programming LNCS 3580 (Lisbon, Portugal, July 1115). Springer, 2005, 803815.

11. Ishai, Y., Kushilevitz, E., Ostrovsky, R., and Sahai, A. Batch codes and their applications. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing (Chicago, June 1316). ACM Press, New York, 2004, 262271.

12. Itoh, T. and Suzuki, Y. New constructions for query-efficient locally decodable codes of subexponential length. IEICE Transactions on Information and Systems E93-D, 2 (2010), 263270.

13. Katz, J. and Trevisan, L. On the efficiency of local decoding procedures for error-correcting codes. In Proceedings of the 32nd Annual ACM Symposium on Theory of Computing (Portland, OR, May 2123). ACM Press, New York, 2000, 8086.

14. Kerenidis, I. and de Wolf, R. Exponential lower bound for 2-query locally decodable codes via a quantum argument. Journal of Computer and System Sciences 69, 3 (2004), 395420.

15. Kushilevitz, E. and Ostrovsky, R. Replication is not needed: Single-database computationally private information. In Proceedings of the 38th Annual Symposium on Foundations of Computer Science (Miami Beach, FL, Oct. 1922). IEEE Computer Society, 1997, 364373.

16. Lipmaa, H. An oblivious transfer protocol with log-squared communication. In Proceedings of the 11th International Conference on Information Security LNCS 5222 (Taipei, Sept. 1518). Springer, 2008, 314328.

17. Ostrovsky, R. and Sketch, W.E. III A survey of single database private information retrieval: Techniques and applications. In Proceedings of the 10th International Conference on Practice and Theory in Public-Key Cryptography LNCS 4450 (Beijing, Apr. 1620). Springer, 2007, 393411.

18. Raghavendra, P. A Note on Yekhanin's Locally Decodable Codes Technical Report TR07-016. Electronic Colloquium on Computational Complexity, 2007.

19. Sudan, M. Efficient Checking of Polynomials and Proofs and the Hardness of Approximation Problems. Ph.D. thesis, University of California, Berkeley, 1992.

20. Trevisan, L. Some applications of coding theory in computational complexity. Quaderni di Matematica 13 (2004), 347424.

21. Wehner, S. and de Wolf, R. Improved lower bounds for locally decodable codes and private information retrieval. In Proceedings of the 32nd International Colloquium on Automata, Languages, and Programming LNCS 3580 (Lisbon, Portugal, July 1115). Springer, 2005, 14241436.

22. Woodruff, D. New Lower Bounds for General Locally Decodable Codes Technical Report TR07-006. Electronic Colloquium on Computational Complexity, 2007.

23. Woodruff, D and Yekhanin, S. A geometric approach to information theoretic private information retrieval. SIAM Journal on Computing 47, 4 (2007), 10461056.

24. Yekhanin, S. Towards 3-query locally decodable codes of subexponential length. Journal of the ACM 55, 1 (2008), 114.

25. Yekhanin, S. Locally Decodable Codes and Private Information Retrieval Schemes. Ph.D. thesis. Massachusetts Institute of Technology, Cambridge, MA, 2007; http://research.microsoft.com/en-us/um/people/yekhanin/Papers/phdthesis.pdf

Back to Top

Author

Sergey Yekhanin (yekhanin@microsoft.com) is a researcher in Microsoft Research Silicon Valley, Mountain View, CA.

Back to Top

Footnotes

DOI: http://doi.acm.org/10.1145/1721654.1721674

Back to Top

Figures

UF1Figure. Two-server information-theoretic PIR scheme. To retrieve a database record, the user queries two servers, each of which stores a copy of the database, but individual queries carry no information about what the user is after. The desired record is obtained from the servers' combined responses.

Back to Top


©2010 ACM  0001-0782/10/0400  $10.00

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2010 ACM, Inc.


 

No entries found