The traditional approach to statistical disclosure control (SDC) for privacy protection is utility-first. Since the 1970s, national statistical institutes have been using anonymization methods with heuristic parameter choice and suitable utility preservation properties to protect data before release. Their goal is to publish analytically useful data that cannot be linked to specific respondents or leak confidential information on them.
In the late 1990s, the computer science community took another angle and proposed privacy-first data protection. In this approach a privacy model specifying an ex ante privacy condition is enforced using one or several SDC methods, such as noise addition, generalization, or microaggregation. The parameters of the SDC methods depend on the privacy model parameters, and too strict a choice of the latter may result in poor utility. The first widely accepted privacy model was k-anonymity, whereas differential privacy (DP) is the model that currently attracts the most attention.
DP was originally proposed for interactive statistical queries to a database.5 A randomized query function k (that returns the query answer plus some noise) satisfies ε-DP if for all datasets D1 and D2 that differ in one record and all S ⊆ Range(k), it holds that Pr(k(D1) ∈ S) ≤ exp(ε) × Pr(k(D2) ∈ S). In other words, the presence or absence of any single record must not be noticeable from the query answers, up to a multiplicative factor of exp(ε). The smaller ε, the higher the protection. The noise to be added to the answer to enforce a certain ε depends on the global sensitivity of the query to the presence or absence of any single record. Mild noise may suffice for statistical queries such as the mean, whereas very large noise is needed for identity queries returning the contents of a specific record.
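To make the sensitivity argument concrete, here is a minimal Python sketch of the standard Laplace mechanism applied to a mean query and to an identity query. The toy dataset, the assumed public upper bound on the values, and the function names are illustrative choices, not taken from the cited sources.

```python
import numpy as np

rng = np.random.default_rng(42)

def laplace_mechanism(true_answer, sensitivity, epsilon):
    """Return the query answer plus Laplace noise with scale sensitivity / epsilon."""
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Toy dataset of incomes (illustrative values only).
incomes = np.array([21_000.0, 35_000.0, 48_000.0, 52_000.0, 90_000.0])
epsilon = 1.0
upper_bound = 100_000.0   # assumed public bound on any single income

# Mean query: adding or removing one record shifts the mean by at most
# roughly upper_bound / n, so modest noise suffices.
mean_sensitivity = upper_bound / len(incomes)
noisy_mean = laplace_mechanism(incomes.mean(), mean_sensitivity, epsilon)

# Identity query ("return the contents of record i"): the answer can change by the
# full value range, so the calibrated noise is as large as the data themselves.
identity_sensitivity = upper_bound
noisy_record = laplace_mechanism(incomes[2], identity_sensitivity, epsilon)

print(f"true mean {incomes.mean():.0f}, noisy mean {noisy_mean:.0f}")
print(f"true record {incomes[2]:.0f}, noisy record {noisy_record:.0f}")
```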
DP offers a very neat privacy guarantee and, unlike privacy models in the k-anonymity family, does not make assumptions on the intruder's background knowledge (although it assumes all records in the database are independent).3,10 For this reason, DP was rapidly adopted by the research community, to the point that previous approaches tend to be regarded as obsolete. Researchers and practitioners have extended the use of DP beyond the interactive setting it was designed for. Extended uses include data release, where privacy of respondents versus data analysts is the goal, and collection of personal information, where privacy of respondents versus the data collector is claimed. Google, Apple, and Facebook have seen the chance to collect or release microdata (individual respondent data) from their users under the privacy pledge "don't worry, whatever you tell us will be DP-protected."
However, applying DP to record-level data release or collection (which is equivalent to answering identity queries) requires a large amount of noise to enforce a safe enough ε. As a result, if ε ≤ 1 is used, as recommended in Dwork and Roth5 to obtain a meaningful privacy guarantee, the analytical utility of DP outputs is likely to be very poor.2,7 This problem arose as soon as DP was moved outside the interactive setting. A straightforward way to mitigate the utility problem is to use an unreasonably large ε.
Let us look at data collection. Apple reportedly uses ε = 6 in macOS and ε = 14 in iOS 10 (with some beta versions even using ε = 43).9 In its RAPPOR technology, Google uses ε up to 9. According to Frank McSherry, one of the co-inventors of DP, using ε values as high as 14 is pointless in terms of privacy.9 Indeed, the privacy guarantee of DP completely fades away for such large values of ε.5
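To see why, it helps to spell out what the exp(ε) factor in the definition becomes for these values. The following short computation only illustrates the bound in the definition; it does not model any specific Apple or Google mechanism.

```python
import math

# epsilon-DP bounds the ratio Pr[k(D1) in S] / Pr[k(D2) in S] by exp(epsilon).
# For the epsilon values reported above, this bound quickly becomes vacuous.
for eps in (1, 6, 9, 14, 43):
    print(f"epsilon = {eps:>2}: output probabilities may differ by a factor of up to {math.exp(eps):,.0f}")
```

For ε = 1 the factor is about 3; for ε = 14 it is already above one million, so the definition no longer constrains the output distributions in any meaningful way.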
As to data release, Facebook has recently released DP-protected datasets for social science research, but it is unclear which ε value was used. As pointed out in Mervis,12 this makes it difficult both to understand the privacy guarantees being offered and to assess the trustworthiness of the results obtained on the data. The first released version of this DP dataset had all demographic information about respondents and most of the time and location information removed, and noise with σ = 200 had been added to event counts.6 The second version looks better, but it is still analytically poorer than the initial version released in 2018 under utility-first anonymization based on removal of identifiers and data aggregation. In fact, Facebook researchers have acknowledged the difficulties of implementing (and verifying) DP in real-world applications.11
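As a rough illustration of what noise of that magnitude means for analysis, the sketch below assumes zero-mean Gaussian noise with σ = 200 (the exact mechanism used by Facebook may differ) and shows the relative error it induces on event counts of different sizes.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 200  # noise scale reported for the released event counts

# Relative error induced by zero-mean Gaussian noise of this scale on counts of
# different magnitudes: small counts become essentially unusable.
for true_count in (50, 500, 5_000, 500_000):
    noisy = true_count + rng.normal(0.0, sigma, size=10_000)
    rel_error = np.mean(np.abs(noisy - true_count)) / true_count
    print(f"count {true_count:>7}: mean relative error ~ {rel_error:.1%}")
```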
The U.S. Census Bureau has also announced the use of DP to disseminate Census 2020 results.8 A forecast of the negative impact of DP on the utility of the current Census is given in Santos-Lozada et al.14 Additionally, Ruggles et al.13 have remarked that DP "is a radical departure from established Census Bureau confidentiality laws and precedents": the Census Bureau must prevent respondent re-identification, but masking respondent characteristics, as DP does, is not required. A more fundamental objection in Ruggles et al.13 is against the very idea of DP-protected microdata. As noted previously, publishing useful record-level microdata under DP is exceedingly difficult. This is only logical: releasing DP microdata, that is, individual-level data derived from real people, contradicts the core idea of DP, namely that the presence or absence of any individual should not be noticeable from the DP output.
A further shortcoming arises when trying to use DP to protect continuous data collection, as Apple and Google do. DP is subject to sequential composition: if a dataset collected at time t1 is DP-protected with ε1 and a dataset collected at time t2 on a non-disjoint group of respondents is DP-protected with ε2, the dataset obtained by composing the two collected datasets is DP-protected only with ε1 + ε2. Therefore, to enforce a certain ε after n data collections on the same set of individuals, each collection should be DP-protected with ε/n, thereby very substantially reducing the utility of the collected data. Strictly speaking, it is impossible to collect DP-protected data from a community of respondents an indefinite number of times with a meaningful privacy guarantee. As a remedy, Apple made the simplification that sequential composition only applies to the data collected on an individual within the same day, but not across different days.9 Google took a different way out: it applies sequential composition only to values that have not changed since the previous collection.4 Both fixes are severely flawed: the data collected from an individual on consecutive days, or values that may have changed between collections, still refer to the same individual rather than to disjoint individuals. Ignoring this and conducting systematic data collection over long periods increases the effective ε, and thus reduces the effective level of protection, by several orders of magnitude. This issue is significantly more privacy-harming than the previously mentioned large values of ε declared by Apple and Google.
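The arithmetic of sequential composition can be sketched as follows. The per-collection budget used here is a hypothetical value chosen for illustration, not Apple's or Google's actual setting.

```python
# Sequential composition over repeated collections on the same respondents:
# the effective epsilon is the sum of the per-collection budgets.
target_epsilon = 1.0      # overall budget regarded as meaningful (Dwork and Roth)
per_collection = 1.0      # hypothetical per-collection budget for daily telemetry

for n in (1, 7, 30, 365):
    effective = n * per_collection
    needed_per_collection = target_epsilon / n
    print(f"{n:>3} collections: effective epsilon = {effective:6.1f}; "
          f"staying at epsilon = {target_epsilon} would allow only {needed_per_collection:.4f} per collection")
```

After a year of daily collection, the effective ε is in the hundreds unless each collection is given a budget so small that the collected data are nearly useless.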
Machine learning (ML) has also seen applications of DP. In Abadi et al.,1 DP is used to ensure deep learning models do not expose private information contained in the datasets they have been trained on. Such a privacy guarantee is interesting to facilitate crowdsourcing of representative training data from individual respondents. The paper describes the impact of sequential composition over ML training epochs (which can be viewed as continuous data collection) on the effective ε: after 350 epochs, a very large ε = 20 is attained. To obtain usable results without resorting to such a large ε, the authors use the (ε,δ) relaxation of DP, which keeps ε between 2 and 8. Employing relaxations of DP to avoid the "bad press" of large ε while keeping data usable is a common workaround in recent literature.1,15 However, DP relaxations are not a free lunch: relaxed DP is not DP anymore. For example, with (ε,δ)-DP, "δ values on the order of 1/|D| (where |D| is the size of the dataset) are very dangerous: they permit preserving privacy by publishing the complete records of a small number of database participants."5 The value of δ employed in Abadi et al.1 and Triastcyn and Faltings15 is in fact on the order of 1/|D|, thereby incurring severe risk of disclosure. Nonetheless, despite the use of relaxations and large ε and δ, the impact of DP on data utility remains significant: in Abadi et al.1 the deep learning algorithm without privacy protection achieves 86% accuracy on the CIFAR-10 dataset, but accuracy falls to 73% for ε = 8 and to 67% for ε = 2.
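The danger of δ on the order of 1/|D| can be illustrated with a toy mechanism in the spirit of the Dwork and Roth remark just quoted. This is only a sketch of the kind of behavior such a δ fails to rule out; it is not a formal analysis, and it is not the mechanism used in the cited papers.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy "database": one made-up record per participant.
database = [f"participant_{i}: value={int(v)}" for i, v in enumerate(rng.integers(0, 100, size=1_000))]
delta = 1.0 / len(database)   # delta on the order of 1/|D|, the regime criticized above

def leaky_mechanism(db):
    # With probability delta, output one participant's record verbatim; otherwise
    # output nothing. Per the Dwork and Roth remark quoted above, delta of this
    # magnitude does not rule out mechanisms of this kind (illustration only).
    if rng.random() < delta:
        return str(rng.choice(db))
    return None

runs = 100_000
leaks = [out for _ in range(runs) if (out := leaky_mechanism(database)) is not None]
print(f"{len(leaks)} complete records published verbatim in {runs:,} runs")
```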
Very recently, DP has also been proposed for a decentralized form of ML called federated learning. Federated learning allows a model manager to learn an ML model based on data that are privately stored by a set of clients: in each epoch, the model manager sends the current model to the clients, who return to the manager a model update computed on their respective private datasets. This does not require the clients to surrender their private data to the model manager and saves computation for the latter. However, model updates might leak information on the clients' private data unless properly protected. To prevent such leakage, DP is applied to model updates15,16 (a sketch of this idea follows the list below). However, in addition to distorting model updates, using DP raises the following issues:
- Since model updates are protected in each epoch, and in successive epochs they are computed on the same (or, at least, not completely disjoint) client data, sequential composition applies. This means the effective ε grows with the number of epochs, so the exp(ε) bound on disclosure risk grows exponentially and the effective protection rapidly vanishes. Therefore, reasonably useful models can only be obtained for meaninglessly large ε (such as 50–100).16
- The original definition of DP assumes a dataset in which each record contains the answer of a different respondent. DP then ensures that the record contributed by any single respondent is unnoticeable in the released DP-protected output, which protects the privacy of each respondent. However, when DP is used to protect the model update submitted by a client, all records in the client's dataset belong to the client. Making any single record unnoticeable is not sufficient to protect the client's privacy when all records in the client's private dataset are about the client, as happens, for example, when the client's private data contain her health-related or fitness measurements. Thus, the DP guarantee loses its significance in this case.
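As referenced above, here is a minimal sketch of the kind of protection applied to model updates: the client clips its update in norm and adds Gaussian noise before sending it to the model manager. The clipping bound, noise multiplier, and function names are illustrative assumptions rather than the exact mechanisms of the cited works, and the two issues just listed apply regardless of the particular calibration.

```python
import numpy as np

rng = np.random.default_rng(1)

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.0):
    """Clip a client's model update in L2 norm and add Gaussian noise before upload.

    Generic sketch of DP-protected federated updates; the cited works differ in how
    they calibrate the noise and account for the privacy budget across epochs.
    """
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))   # bound the update's sensitivity
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise

# One client's (hypothetical) raw update and the noisy version sent to the model manager.
raw_update = np.array([0.8, -1.5, 0.3, 2.1])
print(privatize_update(raw_update))
```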
Conclusion
Differential privacy is a neat privacy definition that can co-exist with certain well-defined data uses in the context of interactive queries. However, DP is neither a silver bullet for all privacy problems nor a replacement for all previous privacy models.3 In fact, extreme care should be exercised when trying to extend its use beyond the setting it was designed for. As we have highlighted, fundamental misunderstandings and blatantly flawed implementations pervade the application of DP to data releases, data collection, and machine learning. These misconceptions have serious consequences in terms of poor privacy or poor utility, and they are driven by the insistence on twisting DP in ways that contradict its own core idea: to make the data of any single individual unnoticeable.