The Ethics of Datasets: Moving Forward Requires Stepping Back

2021 
Machine learning research culture is driven by benchmark datasets to a greater degree than most other research fields. But the centrality of datasets also amplifies the harms associated with data, including privacy violations and the underrepresentation or erasure of some populations. This has stirred a much-needed debate on the ethical responsibilities of dataset creators and users. I argue that clarity in this debate requires taking a step back to better understand the benefits of the dataset-driven approach. I show that benchmark datasets play at least six different roles and that the potential harms depend on the roles a dataset plays. By understanding this relationship, we can mitigate the harms while preserving what is scientifically valuable about the prevailing approach.