Deep Mind and Open AI collaborated on an interesting project where they discovered how to use human feedback to help a deep learning algorithm learn by providing the reward feedback. The goal is to help improve AI safety by correcting wrong behavior through human intervention.
An example is a to help a robot perform a backflip.
It learns through reinforcement learning, and sometimes it asks a human which alternative is the best one, and the humans choice is used to train a reward predictor, which it uses in the reinforcement learning process.
The idea was that the algorithm tried different methods and presented alternatives to the human, where the human could choose which one was the best one to reach its goal of performing a backflip. It would continue and generate its own reward estimates, continue learning and later check in with the human to see how it had improved and which alternative now was the best one. To train a robot to perform a backflip, 900 such inputs were needed.
This method is helpful for situations where it is difficult to create a reward function.
This iterative approach to learning means that a human can spot and correct any undesired behaviours, a crucial part of any safety system. The design also does not put an onerous burden on the human operator, who only has to review around 0.1% of the agent’s behaviour to get it to to do what they want. However, this can mean reviewing several hundred to several thousand pairs of clips, something that will need to be reduced to make it applicable to real world problems.