In recent years, many popular data sets have been identified by the machine-learning community as having an alarming number of potential legal and ethical problems—representational harms, effects of bias, privacy infringement, and unclear or dubious downstream use. This has led several data sets to be taken down or heavily redacted. In practice, however, they continue to be available and widely used, either in their original form, such as via online torrents, or in derivative form, as subsets or modifications of the original data set or models pretrained on the deprecated data set.
Moving forward, a fundamental change in data set culture is necessary. Harm mitigation and stewardship are required throughout the data set's life cycle, while creators must monitor the use of their data sets, update licenses and documentation, and limit access when necessary.
View Full Article
No entries found