No Free Lunch: Overcoming Reward Gaming in AI Safety Gridworlds
2021
We present two heuristics for tackling the problem of reward gaming by self-modification in Reinforcement Learning agents. Reward gaming occurs when the agent’s reward function is mis-specified and the agent can achieve a high reward by altering or fooling, in some way, its sensors rather than by performing the desired actions. Our first heuristic tracks the rewards encountered in the environment and converts high rewards that fall outside the normal distribution into penalities. Our second heuristic relies on the existence of some validation action that an agent can take to check the reward. In this heuristic, on encountering an abnormally high reward, the agent performs a validation step before either accepting the reward as it is, or converting it into a penalty. We evaluate the performance of these heuristics on variants of the tomato watering problem from the AI Safety Gridworlds suite.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
10
References
0
Citations
NaN
KQI