An Empirical Investigation of Learning from Biased Toxicity Labels

Neel Nanda, Jonathan Uesato, Sven Gowal

2021

Collecting annotations from human raters often results in a trade-off between the quantity of labels one wishes to gather and the quality of these labels. As such, it is often only possible to gather a small number of high-quality labels. In this paper, we study how different training strategies can leverage a small dataset of human-annotated labels and a large but noisy dataset of synthetically generated labels (which exhibit bias against identity groups) for predicting toxicity of online comments. We evaluate the accuracy and fairness properties of these approaches, and trade-offs between the two. While we find that initial training on all of the data and fine-tuning on clean data produces models with the highest AUC, we find that no single strategy performs best across all fairness metrics.
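
The strategy named in the abstract (initial training on all of the data, then fine-tuning on the clean data) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy data generator `make_data`, the three-feature logistic-regression model, the learning rates, and the way label bias is simulated (a synthetic labeler that over-flags comments mentioning an identity term). The abstract does not specify the models or datasets used.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def score(w, x):
    # Linear score passed through the sigmoid.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def train(weights, data, epochs, lr):
    """Logistic-regression SGD; returns an updated copy of `weights`."""
    w = list(weights)
    for _ in range(epochs):
        for x, y in data:
            p = score(w, x)
            for i, xi in enumerate(x):
                w[i] -= lr * (p - y) * xi
    return w


def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_data(n, rng, biased):
    # Hypothetical toy features: [bias term, toxicity signal, identity flag].
    data = []
    for _ in range(n):
        toxic = rng.random() < 0.5
        identity = rng.random() < 0.3
        signal = rng.gauss(1.0 if toxic else -1.0, 1.0)
        label = int(toxic)
        if biased and identity and rng.random() < 0.5:
            label = 1  # simulated labeler over-flags identity mentions
        data.append(([1.0, signal, float(identity)], label))
    return data


rng = random.Random(0)
noisy = make_data(2000, rng, biased=True)   # large, biased synthetic labels
clean = make_data(100, rng, biased=False)   # small, human-quality labels
test = make_data(500, rng, biased=False)    # held-out clean evaluation set

# Stage 1: initial training on all of the data (noisy + clean).
pretrained = train([0.0, 0.0, 0.0], noisy + clean, epochs=5, lr=0.05)
# Stage 2: fine-tune on the clean data only, with a smaller step size.
finetuned = train(pretrained, clean, epochs=5, lr=0.01)


def evaluate(w):
    labels = [y for _, y in test]
    return auc([score(w, x) for x, _ in test], labels)


print("AUC after initial training:", round(evaluate(pretrained), 3))
print("AUC after fine-tuning:     ", round(evaluate(finetuned), 3))
```

The third weight in each model shows how strongly the identity flag alone pushes predictions toward "toxic", which gives a rough handle on the bias the fine-tuning stage is meant to reduce. Numbers from this toy setup say nothing about the paper's reported results.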

Keywords:

quality
identity
Artificial intelligence
Leverage (statistics)
Machine learning
Computer science
initial training
