RL agents Implicitly Learning Human Preferences.

Nevan Wichers

2020

In the real world, RL agents should be rewarded for fulfilling human preferences. We show that RL agents implicitly learn the preferences of humans in their environment. A classifier trained on the activations of an RL agent's neural network to predict whether a simulated human's preferences are fulfilled achieves 0.93 AUC. The same classifier trained on the raw environment state achieves only 0.8 AUC. Training the classifier on the RL agent's activations also performs much better than training it on activations from an autoencoder. The human preference classifier can be used as the reward function of an RL agent to make RL agents more beneficial for humans.
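The probing comparison described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data is synthetic, the "activations" are a hand-built stand-in for a learned representation (the raw state plus one nonlinear feature), and the names `make_dataset`, `train_logistic_probe`, `probe_reward`, and `run_comparison` are hypothetical. It is not the paper's environment, network, or classifier; it only shows how a probe trained on richer features can reach a higher AUC than the same probe trained on raw state, and how the probe's output could then be read as a reward signal.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def roc_auc(labels, scores):
    # AUC as the probability that a random positive outscores a random
    # negative, with ties counted as one half.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_dataset(n, rng):
    # Synthetic stand-in: the label ("preference fulfilled") depends on the
    # raw state nonlinearly. The fake activations expose that structure.
    raw, acts, labels = [], [], []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        raw.append([x1, x2])
        acts.append([x1, x2, x1 * x2])  # assumed "learned" feature
        labels.append(1 if x1 * x2 > 0 else 0)
    return raw, acts, labels


def train_logistic_probe(features, labels, lr=0.5, epochs=200):
    # Linear probe fitted with batch gradient descent on the log loss.
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_reward(w, b, features):
    # Hypothetical reward: the probe's predicted probability that the
    # human's preference is fulfilled.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)


def run_comparison(seed=0):
    rng = random.Random(seed)
    raw_tr, act_tr, y_tr = make_dataset(400, rng)
    raw_te, act_te, y_te = make_dataset(200, rng)
    aucs = []
    for train_x, test_x in ((raw_tr, raw_te), (act_tr, act_te)):
        w, b = train_logistic_probe(train_x, y_tr)
        aucs.append(roc_auc(y_te, [probe_reward(w, b, x) for x in test_x]))
    return tuple(aucs)  # (AUC on raw state, AUC on stand-in activations)


auc_raw, auc_act = run_comparison()
print(f"raw state AUC: {auc_raw:.2f}, activation AUC: {auc_act:.2f}")
```

In this toy setup the gap comes entirely from the hand-added product feature, so the numbers say nothing about the 0.93 and 0.8 AUC figures reported above. The sketch only makes the evaluation procedure concrete: same probe, same labels, different input representation, compared by held-out AUC.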

Keywords:

Artificial neural network
Artificial intelligence
Computer science
Autoencoder
Machine learning
Classifier (linguistics)

