Annotations and Subjective Machines: Of Annotators, Embodied Agents, Users, and Other Humans

2008 
The usual practice in assessing whether a multimodal annotated corpus is fit for purpose is to calculate the level of inter-annotator agreement; when it exceeds a certain fixed threshold, the data is considered to be of tolerable quality. There are two problems with this approach. First, it rests on the assumption that disagreement in the data is not systematic, an assumption that is not always warranted. Second, the approach is ill suited to annotations that are subjective to some degree. In that case annotator disagreement is (partly) an inherent property of the annotation, reflecting the degree of intersubjectivity between annotators in how they interpret certain communicative behavior, as opposed to the amount of idiosyncrasy in their judgements of that behavior. This thesis addresses both problems.

In the theoretical part, it is shown that when disagreement is systematic, reaching a given level of inter-annotator agreement is no guarantee that the data is fit for purpose. Simulations are used to investigate how systematic disagreement affects the relation between the level of inter-annotator agreement and the validity of machine-learning results obtained on the data.

In the practical part, two new methods are explored for working with data that has been annotated with a low level of inter-annotator agreement. The first method aims to find a subset of the annotations that has been annotated more reliably, in a way that makes it possible to determine for new, unseen data whether it belongs to this subset, and therefore whether a classifier trained on the more reliable subset is qualified to make a judgement for the new data. The second method uses machine learning to explicitly model the overlap and disjunctions in the subjective judgements of different annotators. Together, the two methods should in principle make it possible to build classifiers that, when deployed in a practical application, yield decisions that make sense to the human end users of the application, who may well have their own way of interpreting the communicative behavior that is submitted to the classifier.
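To make the conventional threshold check concrete: for two annotators, agreement is typically computed as a chance-corrected coefficient such as Cohen's kappa. The sketch below is a minimal illustration of that standard practice, not code from the thesis; the example labels are invented, and the 0.67/0.8 cut-offs are the commonly cited guideline values, used here purely as an assumed example.

```python
# Minimal sketch of the conventional agreement check: compare two
# annotators via Cohen's kappa and accept the corpus if kappa clears a
# fixed threshold. Labels and thresholds here are illustrative only.

from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label the same.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the annotators labeled independently,
    # each with their own observed label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

annotator_1 = ["smile", "smile", "frown", "nod", "smile", "nod"]
annotator_2 = ["smile", "frown", "frown", "nod", "smile", "smile"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
# Conventional practice: accept the data if kappa exceeds a fixed
# threshold (often 0.67 or 0.8) -- exactly the step the thesis questions.
```

On this toy sample kappa comes out at roughly 0.48, below either guideline threshold, so under the conventional practice these annotations would be flagged as unreliable regardless of whether the disagreement is random noise or meaningful subjectivity.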
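The theoretical argument can also be illustrated in miniature. In the toy simulation below (an illustrative assumption, not the thesis's actual experimental setup), two sets of labels disagree with a gold standard at the same rate, but in one the disagreement is random noise and in the other it is systematic, concentrated in one region of the feature space. A simple classifier trained on the randomly perturbed labels recovers the gold decision boundary, while the same learner trained on the systematically distorted labels learns a shifted boundary, so equal agreement does not imply equal validity of the machine-learning results.

```python
# Toy illustration (not the thesis's simulation setup) of why the same
# disagreement rate can be harmless or harmful for downstream learning.

import random

random.seed(0)

N = 10_000
items = [random.random() for _ in range(N)]   # one feature in [0, 1]
gold = [x > 0.5 for x in items]               # "true" binary label

NOISE = 0.2  # both label sets disagree with gold on ~20% of the items

# Random disagreement: flip a random 20% of the labels.
random_ann = [not y if random.random() < NOISE else y for y in gold]

# Systematic disagreement: flip every item in one feature region
# (roughly 20% of the data), e.g. a behaviour read differently by
# this annotator.
systematic_ann = [not y if 0.5 < x < 0.7 else y
                  for x, y in zip(items, gold)]

def best_threshold(xs, ys):
    """Fit a 1-D threshold classifier by exhaustive search."""
    candidates = [t / 100 for t in range(101)]
    return max(candidates,
               key=lambda t: sum((x > t) == y for x, y in zip(xs, ys)))

for name, labels in [("random", random_ann), ("systematic", systematic_ann)]:
    agreement = sum(a == g for a, g in zip(labels, gold)) / N
    t = best_threshold(items, labels)
    validity = sum((x > t) == g for x, g in zip(items, gold)) / N
    print(f"{name:10s}  agreement with gold = {agreement:.2f}, "
          f"learned threshold = {t:.2f}, accuracy vs gold = {validity:.2f}")
```

With this setup both label sets agree with the gold standard on roughly 80% of the items, yet the classifier trained on the randomly perturbed labels scores close to 1.0 against gold, while the one trained on the systematically distorted labels settles on a threshold near 0.7 and stays around 0.8, mirroring the thesis's point that the level of agreement alone does not predict validity.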