No Free Lunch: Overcoming Reward Gaming in AI Safety Gridworlds

2021 
We present two heuristics for tackling the problem of reward gaming by self-modification in Reinforcement Learning agents. Reward gaming occurs when the agent's reward function is mis-specified and the agent can achieve a high reward by altering or fooling its sensors in some way, rather than by performing the desired actions. Our first heuristic tracks the rewards encountered in the environment and converts abnormally high rewards, those that fall outside the distribution of previously observed rewards, into penalties. Our second heuristic relies on the existence of some validation action that the agent can take to check a reward. In this heuristic, on encountering an abnormally high reward, the agent performs a validation step before either accepting the reward as it is or converting it into a penalty. We evaluate the performance of these heuristics on variants of the tomato watering problem from the AI Safety Gridworlds suite.
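The following is a minimal sketch of how the two heuristics could be realised, not the paper's implementation: it assumes a z-score outlier test over previously observed rewards (maintained with Welford's online algorithm) and a hypothetical `validate` callable standing in for the validation action; the threshold and penalty values are illustrative.

```python
import math

class RewardGamingFilter:
    """Tracks observed rewards and penalises statistical outliers.

    Assumptions (not from the paper): outliers are detected with a
    running-mean/standard-deviation z-score test, and the validation
    action is modelled as a callable returning True/False.
    """

    def __init__(self, z_threshold=3.0, penalty=-1.0):
        self.z_threshold = z_threshold
        self.penalty = penalty
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford)

    def _update(self, reward):
        # Welford's online update of mean and variance statistics.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    def _is_outlier(self, reward):
        # A reward is suspicious if it lies far above the observed distribution.
        if self.count < 2:
            return False
        std = math.sqrt(self.m2 / (self.count - 1))
        if std == 0.0:
            return False
        return (reward - self.mean) / std > self.z_threshold

    def filter(self, reward):
        # Heuristic 1: convert abnormally high rewards into penalties.
        shaped = self.penalty if self._is_outlier(reward) else reward
        self._update(reward)
        return shaped

    def filter_with_validation(self, reward, validate):
        # Heuristic 2: on a suspicious reward, perform the validation
        # action first; only penalise if the check fails.
        if self._is_outlier(reward) and not validate():
            shaped = self.penalty
        else:
            shaped = reward
        self._update(reward)
        return shaped
```

In use, the agent would wrap the environment's reward signal with `filter` (or `filter_with_validation`) before learning from it, so that self-modified or sensor-tampered rewards are turned into penalties rather than reinforced.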