Security breaches and the public exposure of internal databases, customer records, and user information are now unfortunately frequent occurrences. The data exposed by these breaches presents an ethical dilemma for computing researchers. Such datasets are potentially valuable as descriptions of actual behavior and events that are otherwise hidden from outside observers. However, researchers must confront the fact that this data was obtained unethically: corporate confidentiality has been broken, and the privacy of those described in the data has been compromised.
There are a variety of contexts where laws and professional norms prohibit using unethically obtained information. In law, there are exclusion rules against using illegally obtained evidence in criminal cases. Trade secrecy law prohibits businesses from using confidential information obtained from their competitors. The 'Common Rule' in U.S. scientific research regulation and the World Medical Association's Declaration of Helsinki require human research subjects to give informed consent to participate in research.8,9
Nonetheless, in some research fields there is still debate over whether researchers should use data obtained unethically by others. Medical researchers have debated whether to use the results of deadly and inhumane experimentation conducted by Nazi doctors on concentration camp prisoners and prisoners of war during the Second World War.6 That researchers would seriously consider using data obtained in such grossly immoral ways shows that the dilemma of using unethically obtained data is not as simple as it initially appears. The arguments for and against using such data in research can still guide computing researchers concerned about using data exposed by a data breach.
Computing researchers are not usually concerned with using data obtained by inflicting physical or psychological harm on human subjects. However, computing and Internet research may have the potential to cause harm: data analysis and computing research can be used to invade privacy and reveal information about individuals that might be used against them. Unlike in medical research, the appropriate uses of big data analytics and the appropriate sources of data are not well established.4 In their review of published research that uses unethically obtained data, Daniel R. Thomas and his colleagues found that the discussion of the ethical issues associated with using such data was inconsistent.7 While the ACM Code of Ethics lists 'avoiding harm' as a general principle,1 the possible harms of using the data are not always clear.
There are various kinds of unethically obtained data that might interest computing researchers. The datasets resulting from security breaches may be password dumps, databases of internal message boards, financial and personal data, or classified information.7 Such information may be released onto the Internet by whistleblowers, through deliberate infiltration of a secure network by outsiders, or through accidental disclosure caused by weak security practices.
We should distinguish between data unethically obtained by the researchers themselves and data unethically obtained and released by third parties. The first case is straightforward: researchers should always reject using unethical methods. Institutional Review Boards (IRBs) and research ethics committees should be consulted if there are concerns about the methods of collecting data. Legal privacy protections also limit what researchers can collect and the methods they can employ. It is less straightforward, though, if a third party has collected data using unethical (and potentially illegal) methods and then released it publicly. Consider a whistleblower releasing confidential documents that reveal wrongdoing by governments, companies, or institutions. While it may be illegal for the whistleblower to release these documents, it is not clear that researchers should ignore their contents if they are of significant research value and public interest. A blanket prohibition against using unethically obtained data may prevent socially beneficial research from occurring. The arguments for and against using unethically obtained data should therefore be considered for each individual case.
The most straightforward justification for using data exposed by a security breach is that the potential benefits to society from utilizing that data outweigh the harms caused by obtaining it. The researchers must therefore offer a compelling justification for how using the data will benefit society. If the data describes illegal or harmful activity, the obvious justification is that research using it may help prevent or limit such activity in the future. This recognizes that the means used to obtain the data were wrong but defends the researchers' use of it as a means to prevent or reduce another form of wrongdoing.
This argument's effectiveness depends on both the seriousness of the unethical methods used to obtain the data and the potential significance of the research's benefits to society. The researchers should also attempt to minimize any further harm that may occur from publishing research using such data. For example, unethically obtained data is likely to include personal information that would otherwise have been removed or anonymized. Researchers should ensure any information that allows individuals to be identified is removed when they clean the data for analysis and publication.
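As a rough illustration of the cleaning step described above, the following Python sketch drops fields that identify individuals directly and replaces a linking field with a keyed hash, so that records from the same person can still be grouped during analysis. The field names (`name`, `email`, `ip_address`, `password`, `user_id`) and both helper functions are hypothetical and chosen only for illustration. Removing direct identifiers in this way does not by itself guarantee anonymity, since combinations of the remaining fields may still identify someone.

```python
import hashlib
import hmac
import secrets

# Hypothetical field names. A real dataset needs its own review of which
# fields identify people, either directly or in combination.
DIRECT_IDENTIFIERS = {"name", "email", "ip_address", "password"}
PSEUDONYMIZED_FIELDS = {"user_id"}


def pseudonymize(value: str, key: bytes) -> str:
    """Replace a value with a keyed hash (HMAC-SHA256).

    The same value and key always give the same pseudonym, so records
    remain linkable without exposing the original value.
    """
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def clean_record(record: dict, key: bytes) -> dict:
    """Drop direct identifiers and pseudonymize linking fields."""
    cleaned = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue  # remove the field outright
        if field in PSEUDONYMIZED_FIELDS:
            cleaned[field] = pseudonymize(str(value), key)
        else:
            cleaned[field] = value
    return cleaned


if __name__ == "__main__":
    # The key must stay secret; anyone holding it can test guesses
    # against the pseudonyms.
    key = secrets.token_bytes(32)
    record = {"user_id": 1017, "email": "a@example.com", "posts": 12}
    print(clean_record(record, key))
```

A keyed hash is used here rather than a plain hash because identifiers such as account numbers are often easy to enumerate, which would let anyone reverse an unkeyed hash by guessing.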
Another argument is that since the data is already publicly accessible, it can be used the same way as any other publicly accessible data.5 The methods and motives behind the data's release are only relevant for evaluating its quality. Given the likelihood that illegal methods were used to obtain the data, the source will frequently attempt to remain anonymous to avoid punishment. The source is therefore unaccountable for the data's quality. This creates the possibility that the data may have been altered or falsified for the source's own purposes.
This uncertainty about the data's authenticity justifies at least performing a preliminary analysis to determine whether it is genuine. Since appeals to the public benefit of using data exposed by a security breach depend on the data's accuracy, the researchers must explain how they established the data's authenticity and the likelihood it is genuine. Even if the researchers refuse to use the data in their own work, establishing whether it is likely to be genuine is useful for confirming a security breach has occurred.
However, the fact that data from a security breach is publicly accessible does not mean that using it in research creates no additional risks to those it describes. For instance, publicly available information may be used to harass or threaten individuals. Jacob Metcalf rightly states that the risk to individuals from using research data depends more on the dataset's contents and the research's usage of it than on whether the data is public, private, or deanonymized.3 This holds for both legitimately acquired and unethically obtained data. While this might appear to downplay the significance of how the data was obtained and released, the uncertainty about data quality imposes an additional burden on using unethically obtained data. This burden is itself a potential reason to avoid using such data.
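The major arguments against using data exposed by a security breach are that unethically obtained data 'taints' the research that uses it, and that using such data implies the researchers condone the methods used to obtain it.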
The claim that unethically obtained data 'taints' research using it is both symbolic and methodological: using the data symbolically re-enacts the harm caused by obtaining it, and the unethical means of gathering it also suggest the data may be of poor quality.2 For victims of security breaches, research using the exposed data may reinforce the feelings of violation and humiliation from when they discovered their data had been exposed. Methodologically, since the researchers were not involved in collecting the data, they must also confirm the data is genuine and has not been manipulated. This is part of the previously mentioned burden of having to authenticate and clean unethically obtained data.
Part of the unease associated with using unethically obtained data is that it implies the researchers themselves condone the methods used to obtain it. Although this is unlikely, researchers need to explicitly distance themselves from those who performed the security breach in publications using this data. There may be legitimate reasons for conducting research with such data, but researchers must make clear to the readers of their published findings that they do not endorse or condone the methods used to obtain it.
An even stronger rejection of security breaches as a data source is refusing to use such data in research at all. Such a refusal makes a clear statement about the proper methods of conducting research and obtaining data. It also deters future researchers from using such data, since it signals that the research community will shun their work. However, adopting this position risks neglecting valuable data that may otherwise be inaccessible. If a security breach exposes data about illegal activity, this data might be useful for gaining a better understanding of how to combat it.
There are a few general conclusions to be derived from this summary of the arguments for and against using data exposed by a security breach. The risks to those described in the data and the additional burdens such data imposes on researchers mean that data from security breaches should only be used as a last resort. However, there are cases where obtaining data ethically is impossible, and compelling public interest justifications exist for analyzing it. In these cases, the burden of explaining the public interest in analyzing unethically obtained data falls on the researchers. IRBs and ethics committees should assist researchers in determining whether a compelling benefit to society justifies using such data.
Given the risks of using data from a security breach (both from the uncertainty of the data quality and the risks of causing further harm), researchers must ensure that their use of such data minimizes these risks. The general ethical principles 1.2 ("Avoid harm") and 1.6 ("Respect privacy") of the ACM Code of Ethics apply to these cases, as they do to all computing research.1 Published research using data from a security breach should include an ethics section where the researchers present their justifications for using the data and state which IRBs and/or ethics committees reviewed and approved it.7 IRBs and ethics committees should also be aware of the specific concerns raised by unethically gathered data. Whether the data used is publicly accessible should not determine on its own whether its use poses a minimal risk to those described within it.
1. ACM Code of Ethics and Professional Conduct. Association for Computing Machinery. ACM, New York, NY, USA; 2018; https://www.acm.org/code-of-ethics.
2. Douglas, D.M. Should Internet researchers use ill-gotten information? Science and Engineering Ethics 24, 4 (Aug. 2018), 1221-1240; https://doi.org/10.1007/s11948-017-9935-x.
3. Metcalf, J. Big data analytics and revision of the common rule. Commun. ACM 59, 7 (July 2016), 31-33; https://doi.org/10.1145/2935882.
4. Metcalf, J. and Crawford, K. Where are human subjects in big data research? The emerging ethics divide. Big Data and Society 3, 1 (June 1, 2016); https://doi.org/10.1177/2053951716650211.
5. Poor, N. and Davidson, R. Case study: The ethics of using hacked data: Patreon's data hack and academic data standards. Council for Big Data, Ethics, and Society (Apr. 6, 2016); http://bit.ly/2NnkscM.
7. Thomas, D.R. et al. Ethical issues in research using datasets of illicit origin. In Proceedings of the 2017 Internet Measurement Conference, 445-462. IMC '17. ACM, New York, NY, USA; https://doi.org/10.1145/3131365.3131389.
8. U.S. Department of Health and Human Services. 45 CFR 46. (July 19, 2018); http://bit.ly/2NianOq.
9. World Medical Association. Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects. (Oct. 19, 2013); http://bit.ly/2lJFEQw.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2019 ACM, Inc.