The Ethics of Datasets: Moving Forward Requires Stepping Back

Arvind Narayanan

2021

Machine learning research culture is driven by benchmark datasets to a greater degree than most other research fields. But the centrality of datasets also amplifies the harms associated with data, including privacy violation and underrepresentation or erasure of some populations. This has stirred a much-needed debate on the ethical responsibilities of dataset creators and users. I argue that clarity on this debate requires taking a step back to better understand the benefits of the dataset-driven approach. I show that benchmark datasets play at least six different roles and that the potential harms depend on the roles a dataset plays. By understanding this relationship, we can mitigate the harms while preserving what is scientifically valuable about the prevailing approach.

Keywords: Computer science; Benchmark (computing); Degree; Centrality; Data science; Clarity; Erasure
