Big data is all the rage; using large datasets promises to give us new insights into questions that have been difficult or impossible to answer in the past. This is especially true in fields such as medicine and the social sciences, where large amounts of data can be gathered and mined to find insightful relationships among variables. Data in such fields involves humans, however, and thus raises issues of privacy that are not faced by fields such as physics or astronomy.
Such privacy issues become more pronounced when researchers try to share their data with others. Data sharing is a core feature of big-data science, allowing others to verify research that has been done and to pursue other lines of inquiry the original researchers may not have attempted. But sharing data about human subjects triggers a number of regulatory regimes designed to protect the privacy of those subjects. Sharing medical data, for example, requires adherence to HIPAA (Health Insurance Portability and Accountability Act); sharing educational data triggers the requirements of FERPA (Family Educational Rights and Privacy Act). These laws require that, to share data generally, the data be de-identified or anonymized (note that, for the purposes of this article, these terms are interchangeable). While FERPA and HIPAA define the notion of de-identification slightly differently, the core idea is that if a dataset has certain values removed, the individuals whose data is in the set cannot be identified, and their privacy will be preserved.
Previous research has looked at how well these requirements protect the identities of those whose data is in a dataset.2 Violations of privacy, like re-identification, generally work by linking data from a de-identified dataset with outside data sources. It is often surprising how little information is needed to re-identify a subject.
More recent research has shown a different, and perhaps more troubling, aspect of de-identification. These studies have shown that the conclusions one can draw from a de-identified dataset are significantly different from those that would be drawn when the original dataset is used.1 Indeed, it appears the process of de-identification makes it difficult or impossible to use a de-identified (and therefore easily sharable) version of a dataset either to verify conclusions drawn from the original dataset or to do new science that will be meaningful. This would seem to put big-data social science in the uncomfortable position of having either to reject notions of privacy or to accept that data cannot be easily shared, neither of which is a tenable position.
This article looks at a particular dataset, generated by the massive open online courses (MOOCs) offered through the edX platform by Harvard University and the Massachusetts Institute of Technology during the first year of those offerings. It examines which aspects of the de-identification process for that dataset caused it to change significantly, and it presents a different approach to de-identification that shows promise to allow both sharing and privacy.
The first step in de-identifying a dataset is determining the anonymization requirements for that set. The notion of privacy that was used throughout the de-identification of this particular dataset was guided by FERPA, which requires that personally identifiable information, such as name, address, Social Security number, and mother's maiden name, be removed. FERPA also requires that other information, alone or in combination, not enable identification of any student with "reasonable certainty."
To meet these privacy specifications, the HarvardX and MITx research team (guided by the general counsel for the two institutions) opted for a k-anonymization framework, which requires every individual in the dataset to have the same combination of identity-revealing traits as at least k-1 other individuals in the dataset. Identity-revealing traits, termed quasi-identifiers, are those that allow linking to other datasets; information that is meaningful within only a single dataset is not of concern.
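The k-anonymity condition can be read as a simple check: group the records by their quasi-identifier values and confirm that every group has at least k members. The following is a minimal Python sketch under that reading. The field names (`yob`, `gender`, `country`), the toy records, and the helper names are illustrative assumptions, not the HarvardX and MITx team's code.

```python
from collections import Counter

def quasi_key(record, quasi_identifiers):
    """Return the tuple of quasi-identifier values for one record."""
    return tuple(record[q] for q in quasi_identifiers)

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(quasi_key(r, quasi_identifiers) for r in records)
    return all(n >= k for n in counts.values())

def violating_records(records, quasi_identifiers, k):
    """Return the records whose combination occurs fewer than k times."""
    counts = Counter(quasi_key(r, quasi_identifiers) for r in records)
    return [r for r in records if counts[quasi_key(r, quasi_identifiers)] < k]

# Hypothetical toy data: five records share one combination, one is unique.
QUASI_IDENTIFIERS = ["yob", "gender", "country"]
records = [
    {"yob": 1990, "gender": "f", "country": "US", "grade": 0.1 * i}
    for i in range(5)
]
records.append({"yob": 1950, "gender": "m", "country": "FR", "grade": 0.9})

# The unique record breaks 5-anonymity; the grade field is not a
# quasi-identifier, so it plays no part in the check.
print(is_k_anonymous(records, QUASI_IDENTIFIERS, k=5))
print(violating_records(records, QUASI_IDENTIFIERS, k=5))
```

Only the quasi-identifier columns enter the grouping key, which mirrors the point that information meaningful within a single dataset is not of concern.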
Anonymizing a dataset with regard to quasi-identifiers is important in order to prevent the re-identification of individuals that would be made possible if these traits were linked with external data that shares the same traits. The example in Figure 1 illustrates how two datasets could be combined in such a way that allows re-identification.2
In the edX dataset, the quasi-identifiers were course ID, level of education, year of birth, gender, country, and number of forum posts. The number of forum posts is considered a quasi-identifier because the forum was a publicly accessible website that could be scraped in order to link user IDs with their number of forum posts. Course ID is considered a quasi-identifier because unique combinations of courses could conceivably enable linking personally identifiable information that a student posts in a forum with the edX dataset.
The required value of k within k-anonymization was set to 5 in this context, based on the U.S. Department of Education's Privacy Technical Assistance Center's claim that "statisticians consider a cell size of 3 to be the absolute minimum" and that values of 5 to 10 are even safer. A higher value of k corresponds to a stricter privacy standard, because more individuals are required to have a given combination of identity-revealing traits.3
Note this is not a claim that de-identifying the dataset to a privacy standard of k = 5 assures no one in the dataset can be re-identified. Rather, this privacy standard was chosen to allow legal sharing of the data.
There are two techniques to achieve a k-anonymous dataset: generalization and suppression. Generalization occurs when granular values are combined to create a broader category that will contain more records. This can be achieved both for numerical variables (for example, combining ages 20, 21, and 22 into a broader category of 20–22) and for categorical variables (for example, generalizing location data from "Boston" to "Massachusetts"). Suppression occurs when a record that violates anonymity standards is deleted from the dataset entirely.
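Both techniques can be sketched in a few lines of Python. The record layout, the `CITY_TO_STATE` lookup, and the function names below are hypothetical and chosen only to mirror the examples in the text (ages 20 to 22 combined into one range, "Boston" generalized to "Massachusetts").

```python
from collections import Counter

# Hypothetical lookup used to generalize a categorical value to a broader one.
CITY_TO_STATE = {"Boston": "Massachusetts", "Cambridge": "Massachusetts"}

def generalize_age(age, width=3, start=20):
    """Map an age to a range label, e.g. 20, 21, 22 -> '20-22' for width 3."""
    low = start + ((age - start) // width) * width
    return f"{low}-{low + width - 1}"

def generalize(record):
    """Return a copy of the record with both quasi-identifiers generalized."""
    out = dict(record)
    out["age"] = generalize_age(record["age"])
    out["location"] = CITY_TO_STATE.get(record["location"], record["location"])
    return out

def suppress(records, quasi_identifiers, k):
    """Delete every record whose quasi-identifier combination occurs < k times."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)
    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]

raw = [
    {"age": 20, "location": "Boston"},
    {"age": 21, "location": "Cambridge"},
    {"age": 22, "location": "Boston"},
    {"age": 40, "location": "Boston"},
]

# Generalize first, then suppress whatever is still not k-anonymous.
generalized = [generalize(r) for r in raw]
released = suppress(generalized, ["age", "location"], k=3)
print(len(raw), len(released))
```

In this toy case, suppressing the raw records directly at k = 3 would delete all four of them, while generalizing first leaves only the single outlying record to suppress.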
Generalization and suppression techniques introduce differing kinds and degrees of distortion during the anonymization process. Relying on suppression can mean a large number of records in the dataset will be removed. Suppression-only de-identification also skews the integrity of a dataset when values are eliminated disproportionately to the original distribution of the data, causing distortion in resulting analyses.
On the other hand, generalized values are often less powerful than granular values—it may be difficult, for example, to fit a linear regression line on generalized numeric attributes. Further, while generalization-only de-identification leaves non-quasi-identifier fields intact, quasi-identifiers may become generalized to a point where few conclusions can be drawn about their relationship with other fields. Finally, since generalization is applied to whole columns, it decreases the quality of the entire dataset, whereas suppression decreases the quality of the dataset on a record-by-record basis.
The anonymization process used to de-identify edX data for public release in 2014 employed a "suppression-emphasis" approach toward k-anonymization. In this approach, the names of the countries were first generalized to region or continent names, then date-time stamps were transformed into date stamps, and finally any existing records that were not k-anonymous after these generalizations were suppressed. In the process, records that claimed a birth date before 1931 (which seemed unlikely to be correct) were automatically suppressed.
Daries et al.'s 2014 study of edX data confirmed that a suppression-emphasis approach tended to distort mean values of de-identified columns, whereas a generalization-emphasis approach tended to distort correlations between de-identified columns.1
Daries et al. showed that de-identification distorted measures of class participation by suppressing records of rare (generally higher) levels of participation. We set out to investigate where distortion of summary statistics was being introduced into the dataset. Intuitively, distortion is introduced whenever a row becomes generalized or suppressed. Under k-anonymity, this occurs only when a row's combination of quasi-identifier values occurs fewer than k times. If rare quasi-identifier values tend to be associated with high grades or participation levels, then the de-identified dataset would be expected to have a lower mean grade or participation level than the original dataset.
We did, in fact, find that a quasi-identifier characteristic whose frequency of occurrence is correlated with a numeric attribute is most likely to create distortion in that numeric attribute. Specifically, we confirmed this hypothesis in three ways, using the edX data:
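The intuition can be illustrated with a toy example. The numbers below are invented for illustration and are not edX data: a few students with rare, high forum-post counts also have high grades, so suppressing the rare rows pulls the mean grade down.

```python
from collections import Counter
from statistics import mean

def suppress(records, quasi_identifier, k):
    """Keep only records whose quasi-identifier value occurs at least k times."""
    counts = Counter(r[quasi_identifier] for r in records)
    return [r for r in records if counts[r[quasi_identifier]] >= k]

# Hypothetical data: common post counts go with low grades,
# rare post counts go with high grades.
records = (
    [{"posts": 0, "grade": 0.0}] * 10
    + [{"posts": 1, "grade": 0.1}] * 6
    + [{"posts": 7, "grade": 0.9}] * 2
    + [{"posts": 12, "grade": 1.0}] * 1
)

original_mean = mean(r["grade"] for r in records)
released = suppress(records, "posts", k=5)
released_mean = mean(r["grade"] for r in released)

# The three rare, high-grade rows are gone, so the released mean is lower.
print(len(records), len(released))
print(original_mean > released_mean)
```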
What methods may alleviate distortion introduced by de-identification? The analyses here indicate associations between quasi-identifier traits and numeric attributes may introduce distortion of means by suppression during de-identification. We therefore consider a prospective role for generalization in alleviating distortion during de-identification.
Since the number of forum posts is the quasi-identifier whose frequency of values is most correlated with grade, we first explore the effect of generalizing this attribute. As the bin size increases (for example, from individual values of 0, 1, 2, 3 to bins of 0–1, 2–3, and so on), the number of rows requiring suppression decreases, as shown in Figure 3. Further, the mean grade approaches the true value (of 0.045) as bin size increases, suggesting that generalization may alleviate distortion by preventing records associated with rarer quasi-identifier values from becoming suppressed.
Generalization, however, can make it difficult to draw statistical conclusions from a dataset. Certain statistical properties of a column, like its mean, can be maintained after generalization by recording the mean of the pregeneralized values within each bin. The average of these bin means, with each bin weighted by the number of records it contains, will be equal to the true mean of the pregeneralized values.
Such a solution, however, cannot easily preserve two-dimensional relationships among generalized values. Table 1 illustrates that the correlation of the number of forum posts with various numeric attributes becomes increasingly distorted with increasing forum post bin size.
This is the fundamental trade-off between generalization and suppression discussed earlier: although an approach emphasizing suppression may introduce bias into an attribute where a correlation exists between quasi-identifier frequency and numeric attributes, generalization may also distort correlational and other multidimensional relationships inherent within datasets.
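A short sketch makes the bookkeeping concrete. It assumes each bin stores its record count alongside the mean of its pregeneralized values; the post counts and function names are made up for illustration. Weighting each bin mean by its count recovers the true mean, while a plain average of bin means generally does not.

```python
from collections import defaultdict

def bin_summaries(values, width):
    """Group values into fixed-width bins; return {(low, high): (count, mean)}."""
    groups = defaultdict(list)
    for v in values:
        low = (v // width) * width
        groups[(low, low + width - 1)].append(v)
    return {b: (len(vs), sum(vs) / len(vs)) for b, vs in groups.items()}

def weighted_mean(summaries):
    """Average of bin means, each weighted by the number of records in its bin."""
    total = sum(count for count, _ in summaries.values())
    return sum(count * m for count, m in summaries.values()) / total

def unweighted_mean(summaries):
    """Plain average of bin means, ignoring how many records each bin holds."""
    return sum(m for _, m in summaries.values()) / len(summaries)

# Hypothetical forum-post counts, skewed toward zero.
posts = [0, 0, 0, 0, 1, 1, 2, 5]
summaries = bin_summaries(posts, width=2)
true_mean = sum(posts) / len(posts)

print(true_mean, weighted_mean(summaries), unweighted_mean(summaries))
```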
Decreasing distortion introduced by generalization. One potential improvement to generalization may be to distribute the number of records more evenly within each bin, using small bucket sizes for values that are well represented and larger bucket sizes for less-well-represented values.
When the number of forum posts is generalized into groups of five for values greater than 10 (for example, 1,2,3, ..., 11–15, 16–20, and so on), the correlations between the number of forum posts and other characteristics become less distorted than with generalization schemes that use constant bin widths. This suggests optimizing for equal numbers of records within each bin may enable a compromise between the loss of utility and the distortions caused in numeric analysis, such as correlations between different variables. Using this framework for generalization, let's now explore its relationship to suppression in more detail.
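The variable-width scheme described here can be written as a small labeling function. This is a sketch under the stated reading (exact values up to 10, groups of five above that); the function and parameter names are hypothetical.

```python
from collections import Counter

def generalize_posts(n, exact_up_to=10, width=5):
    """Keep small post counts exact; group larger ones into fixed-width ranges.

    With the defaults: 0..10 stay as-is, then 11-15, 16-20, 21-25, and so on.
    """
    if n <= exact_up_to:
        return str(n)
    first = exact_up_to + 1
    low = first + ((n - first) // width) * width
    return f"{low}-{low + width - 1}"

# Hypothetical post counts: well-represented small values, sparse large ones.
posts = [0, 0, 0, 1, 1, 2, 3, 10, 11, 14, 15, 16, 19, 23]
labels = Counter(generalize_posts(n) for n in posts)
print(labels)
```

The well-represented low values keep their full resolution, and only the sparse upper tail is coarsened.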
To reach a compromise between the distortions introduced by suppression and by generalization, we first want to quantify the relationship between suppression and generalization. As generalization is increased, how much suppression is prevented, and does this change at a constant rate as generalization is increased?
Each of the quasi-identifiers was individually binned to ensure a minimum number of records in each bin, termed bin capacity. An increase in bin capacity from 1,000 to 5,000 drastically decreases the number of records that have to be suppressed, but this improvement drops off as bin capacity continues to increase. Furthermore, in Figure 4, the decreasing slope of the lines as the bin capacity increases suggests that the larger the chosen bin capacities, the smaller the marginal cost of a greater degree of anonymity.
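The text does not spell out how bins were formed for a given bin capacity. One plausible approach, shown here purely as an assumption, is a greedy pass over the sorted distinct values that closes a bin as soon as it holds at least the required number of records and folds any undersized remainder into the last bin.

```python
from collections import Counter

def bins_with_capacity(values, capacity):
    """Greedily build (low, high) bins that each hold at least `capacity` records.

    Assumed scheme, not taken from the article: walk the sorted distinct
    values, close a bin once it is large enough, and merge any leftover
    tail into the previous bin.
    """
    counts = Counter(values)
    bins, current, size = [], [], 0
    for v in sorted(counts):
        current.append(v)
        size += counts[v]
        if size >= capacity:
            bins.append((current[0], current[-1]))
            current, size = [], 0
    if current:
        if bins:
            low, _ = bins.pop()
            bins.append((low, current[-1]))
        else:
            bins.append((current[0], current[-1]))
    return bins

def label_for(value, bins):
    """Return the (low, high) bin containing `value`."""
    for low, high in bins:
        if low <= value <= high:
            return (low, high)
    raise ValueError(f"{value} is outside all bins")

# Hypothetical forum-post counts with a long, sparse tail.
posts = [0] * 6 + [1] * 3 + [2] * 2 + [3] + [7]
print(bins_with_capacity(posts, capacity=3))
print(bins_with_capacity(posts, capacity=5))
```

Raising the capacity yields fewer, wider bins, which is the knob being traded against suppression in this section.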
We then quantify the distortion that was introduced under each choice of bin capacity. Concentrating on sets that were 5-anonymous with bin capacities of 3k, 5k, and 10k, we compare the resulting de-identified datasets with the original set on the percentage of students who simply registered for the course; those who registered and viewed (defined as looking at less than half of the material); those who explored (defined as looking at more than half of the material but not completing the course); and those who were certified (completed the material). This comparison shows the greatest disparity in the de-identification scheme that favors suppression; the results are skewed by as much as 20% with the suppression-emphasis de-identification approach.
A generalization scheme using bin capacities of 3,000 entries, as shown in Figure 5, produces a distribution of participation that is somewhat closer to the original distribution than the suppression-only approach. While in some categories the distortion is large (such as the certification rates for MITx/7.00x during the Spring semester), others are much closer to the original values.
The situation gets considerably better by using bins with a minimum of 5,000 entries, as shown in Figure 6. The distribution of participation is nearly the same in the de-identified set as in the original dataset. The maximum difference between the measures is less than three percentage points; most are within one percentage point.
Moving to a bin capacity of 10,000 gives even better results, as shown in Figure 7. While there are one or two cases of results differing by almost three percentage points, in most cases the difference is a fractional percentage.
As expected, the decrease in the distortion of the mean of certain attributes is accompanied by an increase in the distortion of the correlation between quasi-identifier fields and numeric attributes as bin capacity increases. The table in Figure 8 shows the correlations between the number of forum posts and numeric attributes under various bin capacities. The column corresponding to a bin capacity of 1 represents a suppression-only approach.
Encouragingly, a bin capacity of 3,000 produces a dataset whose correlations are close to those of the original, non-de-identified dataset, as shown in Figure 8. Even though a bin capacity of 3,000 did not produce optimal results in terms of minimization of class participation distortion, these results may signal the existence of a bin capacity that produces an acceptable balance of distortion between single- and multidimensional relationships.
Given these results, the question naturally arises whether bin capacities may be chosen differently for each quasi-identifier in order to minimize distortion further.
The edX dataset contains two numeric, generalizable quasi-identifier values: year of birth and number of forum posts. Experimentation with different bin capacity combinations yielded the results shown in Table 2. This table illustrates the number of records that must be suppressed with the respective amounts of generalization. It is particularly noteworthy that generalization of each quasi-identifier has uneven effects: the required number of suppressed values drops off much more quickly as the bin capacity for number of forum posts increases, as compared with the bin capacity for year of birth.
Such an analysis of the trade-offs between generalization and suppression becomes exponentially harder as the number of quasi-identifier fields increases. A brute-force method of calculating the number of suppressed records would demand excessive computation time with datasets like edX's that contain six quasi-identifier fields. The development of approximation algorithms for these calculations would enable researchers to quickly determine a near-optimal generalization scheme that strikes an ideal balance between the distortions introduced by generalization and those introduced by suppression. This is an area where further research is needed.
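To make the cost of the brute-force approach concrete, the sketch below enumerates every combination of candidate bin capacities for two numeric quasi-identifiers and counts the records that would still need suppression. The greedy binning helper, the field names (`yob`, `posts`), and the toy records are all assumptions for illustration. With c candidate capacities and q quasi-identifier fields there are c^q combinations, each requiring a full pass over the data.

```python
from collections import Counter
from itertools import product

def capacity_bins(values, capacity):
    """Map each distinct value to a (low, high) bin of at least `capacity` records.

    Assumed greedy scheme: close a bin once it is large enough and merge
    any undersized tail into the last bin.
    """
    counts = Counter(values)
    groups, current, size = [], [], 0
    for v in sorted(counts):
        current.append(v)
        size += counts[v]
        if size >= capacity:
            groups.append(current)
            current, size = [], 0
    if current:
        if groups:
            groups[-1].extend(current)
        else:
            groups.append(current)
    return {v: (g[0], g[-1]) for g in groups for v in g}

def suppressed_count(records, capacities, k):
    """Count records that are not k-anonymous after binning each field."""
    maps = {f: capacity_bins([r[f] for r in records], c)
            for f, c in capacities.items()}
    keys = [tuple(maps[f][r[f]] for f in capacities) for r in records]
    counts = Counter(keys)
    return sum(1 for key in keys if counts[key] < k)

def brute_force(records, fields, options, k):
    """Try every combination of capacities: len(options) ** len(fields) passes."""
    return {combo: suppressed_count(records, dict(zip(fields, combo)), k)
            for combo in product(options, repeat=len(fields))}

# Hypothetical records: (year of birth, forum posts).
rows = [(1980, 0)] * 3 + [(1981, 0)] * 3 + [(1980, 1), (1981, 2), (1990, 5)]
records = [{"yob": y, "posts": p} for y, p in rows]

results = brute_force(records, ["yob", "posts"], options=[1, 3, 9], k=3)
for combo, n in sorted(results.items()):
    print(combo, n)
```

Even in this toy case the two fields behave differently: coarsening `posts` alone removes more of the suppression than coarsening `yob` alone, which is the kind of unevenness the text reports for Table 2.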
De-identification techniques will continue to be important as long as the regulations around big datasets involving human subjects require a level of anonymity before those sets can be shared. While there is some indication regulators may be rethinking the tie between de-identification and ensuring privacy, there is no indication the regulations will be changed anytime soon. For now, sharing will require de-identification.
But de-identification is hard. We have known for some time it is difficult to ensure the dataset does not allow subsequent re-identification of individuals, but we now find it is also difficult to de-identify datasets without introducing bias into those sets that can lead to spurious results.
A combination of record suppression and data generalization offers a promising path to solving the second of these problems, but there seems to be no magic bullet here; our best results were obtained by trying a number of different combinations of generalization, sizing, and record suppression. There is further work to be done, such as investigating the possibility of choosing different bin capacities for different quasi-identifiers, which may mitigate some of the distortions introduced by anonymity. We are more confident than we were a year ago that some form of de-identification may allow sharing of datasets without distorting the analyses done on those shared sets beyond the point of usefulness, but there is much left to investigate.
Privacy, Anonymity, and Big Data in the Social Sciences
Jon P. Daries et al.
Broadcast Messaging: Messaging to the Masses
Modeling People and Places with Internet Photo Collections
David Crandall, Noah Snavely
The Digital Library is published by the Association for Computing Machinery. Copyright © 2015 ACM, Inc.