Risk-Averse Biased Human Policies with a Robot Assistant in Multi-Armed Bandit Settings

2021 
In the assistive multi-armed bandit problem, an autonomous system observes and intercepts the repeated actions of a human, estimates the true utility of the different actions, and potentially chooses a different arm than the human in order to improve the overall return. This setting can be used to model team situations between a human and an autonomous system such as a domestic service robot. Previous work deals with human policies in human-robot teams that are (noisily) rational or in some way communicative about the rewards. However, empirically demonstrated human biases, such as the risk aversion described by Cumulative Prospect Theory, shift the perceived action utilities in such a way that previous methods only learn to repeat the bias. Therefore, the assistive multi-armed bandit setting is extended with observable reward classes whose utility values remain unknown. This makes it possible to derive an algorithm that leverages knowledge of the risk-averse human model in order to correct the human bias in a human-robot team. A brief evaluation indicates that arbitrary discrete reward functions can be handled.
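The abstract does not spell out the bias model, so the following is only a minimal sketch: assuming a simplified Tversky-Kahneman value function and per-outcome probability weighting (the parameters alpha, gamma, lam and the helper names cpt_value, cpt_weight, perceived_utility are illustrative, not taken from the paper), it shows how a CPT-style distortion can reverse the preference between two arms that share the same observable reward classes, which is the situation the proposed assistant is meant to correct.

```python
import numpy as np

def cpt_value(x, alpha=0.88, lam=2.25):
    """Prospect-theory value function: concave for gains, steeper (loss-averse)
    for losses. alpha and lam are the classic Tversky-Kahneman estimates."""
    x = np.asarray(x, dtype=float)
    gains = np.clip(x, 0.0, None) ** alpha
    losses = -lam * np.clip(-x, 0.0, None) ** alpha
    return np.where(x >= 0.0, gains, losses)

def cpt_weight(p, gamma=0.61):
    """Inverse-S probability weighting (simplified per-outcome form, not the
    full rank-dependent CPT weighting)."""
    p = np.asarray(p, dtype=float)
    return p ** gamma / (p ** gamma + (1.0 - p) ** gamma) ** (1.0 / gamma)

def perceived_utility(rewards, probs):
    """Biased (CPT-distorted) utility of a discrete reward distribution."""
    return float(np.sum(cpt_weight(probs) * cpt_value(rewards)))

# Two arms over the same observable reward classes {0, 1, 3}:
# arm 0 pays 1 for sure, arm 1 pays 3 with probability 0.4.
arms = [
    {"rewards": [0, 1, 3], "probs": [0.0, 1.0, 0.0]},  # true mean 1.0
    {"rewards": [0, 1, 3], "probs": [0.6, 0.0, 0.4]},  # true mean 1.2
]

for i, arm in enumerate(arms):
    true_mean = float(np.dot(arm["rewards"], arm["probs"]))
    biased = perceived_utility(arm["rewards"], arm["probs"])
    print(f"arm {i}: true mean {true_mean:.2f}, CPT-perceived {biased:.2f}")

# Output (roughly): arm 0 -> true 1.00 / perceived 1.00, arm 1 -> 1.20 / 0.97.
# The biased human prefers the safe arm 0; an assistant that observes the
# reward classes and assumes a CPT-like human model can invert the distortion
# and recommend arm 1, which has the higher true expected reward.
```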