Research

Abstract: A simple way to improve classification performance is to average the predictions of a large ensemble of different classifiers. This is great for winning competitions but requires too much computation at test time for practical applications such as speech recognition. In a widely ignored paper in 2006, Caruana and his collaborators showed that the knowledge in the ensemble could be transferred to a single, efficient model by training the single model to mimic the log probabilities of the ensemble average. This technique works because most of the knowledge in the learned ensemble is in the relative probabilities of extremely improbable wrong answers. For example, the ensemble may give a BMW a probability of one in a billion of being a garbage truck but this is still far greater (in the log domain) than its probability of being a carrot. This “dark knowledge”, which is practically invisible in the class probabilities, defines a similarity metric over the classes that makes it much easier to learn a good classifier. I will describe a new variation of this technique called “distillation” and will show some surprising examples in which good classifiers over all of the classes can be learned from data in which some of the classes are entirely absent, provided the targets come from an ensemble that has been trained on all of the classes. I will also show how this technique can be used to improve a state-of-the-art acoustic model and will discuss its application to learning large sets of specialist models without overfitting. This is joint work with Oriol Vinyals and Jeff Dean.

Video: https://www.youtube.com/watch?v=EK61htlw8hY&t=3m14s
Lecture notes: http://www.ttic.edu/dl/dark14.pdf
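
To make the abstract more concrete, here is a minimal sketch of the idea of training a small model on an ensemble's "soft" targets. The abstract does not give the details, so this assumes the commonly described formulation of distillation, where both teacher and student logits are divided by a temperature before the softmax. The logits, the class list and the temperature values are made up for illustration.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis, softened by a temperature (assumed formulation)."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Cross-entropy between the teacher's softened probabilities and the student's."""
    soft_targets = softmax(teacher_logits, temperature)
    student_log_probs = np.log(softmax(student_logits, temperature))
    return float(-(soft_targets * student_log_probs).sum(axis=-1).mean())

# Hypothetical ensemble logits for one image over [BMW, garbage truck, carrot].
teacher_logits = np.array([[12.0, -8.0, -20.0]])

# At temperature 1 the wrong classes are practically invisible in the probabilities,
# but "garbage truck" is still far more likely than "carrot".
print(softmax(teacher_logits, temperature=1.0))

# A higher temperature makes those relative probabilities visible to the student.
print(softmax(teacher_logits, temperature=10.0))

# The loss is smaller when the student agrees with the teacher than when it is uniform.
print(distillation_loss(teacher_logits, teacher_logits))
print(distillation_loss(np.zeros((1, 3)), teacher_logits))
```

In a real setup the student logits would come from a trainable network and this loss would be minimized with gradient descent; the sketch only shows how the targets and the loss are formed.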

DeepMind and OpenAI collaborated on an interesting project in which human feedback supplies the reward signal for a deep reinforcement learning algorithm. The goal is to improve AI safety by letting humans step in and correct wrong behavior.

One example is helping a robot learn to perform a backflip.
It learns through reinforcement learning, and every so often it asks a human which of two alternatives is better. The human's choice is used to train a reward predictor, which then provides the reward in the reinforcement learning process.

The idea is that the algorithm tries different behaviors and presents alternatives to the human, who chooses the one that comes closest to the goal of performing a backflip. The algorithm then generates its own reward estimates, keeps learning, and later checks in with the human again to see which of its new alternatives is best. Training the robot to perform a backflip took 900 such inputs.
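
The reward predictor step can be sketched as follows. The post does not spell out the model or the loss, so this assumes a linear reward over made-up per-step features and a logistic model of the human's choice between two clips. The function names, the toy data and the simulated "human" are all hypothetical.

```python
import numpy as np

def segment_return(w, clip):
    """Predicted total reward of a clip: sum of (features . w) over its time steps."""
    return float(np.sum(clip @ w))

def preference_prob(w, clip_a, clip_b):
    """Probability that the human prefers clip_a, under an assumed logistic model."""
    diff = segment_return(w, clip_a) - segment_return(w, clip_b)
    return 1.0 / (1.0 + np.exp(-diff))

def train_reward_predictor(comparisons, n_features, lr=0.1, epochs=200):
    """Fit w by gradient descent on the cross-entropy of the human's choices.

    comparisons: list of (clip_a, clip_b, label), label 1.0 if the human chose clip_a.
    """
    w = np.zeros(n_features)
    for _ in range(epochs):
        grad = np.zeros(n_features)
        for clip_a, clip_b, label in comparisons:
            p = preference_prob(w, clip_a, clip_b)
            # Gradient of the logistic loss with respect to w.
            grad += (p - label) * (clip_a.sum(axis=0) - clip_b.sum(axis=0))
        w -= lr * grad / len(comparisons)
    return w

# Toy data: a simulated human who prefers clips with a larger total first feature.
rng = np.random.default_rng(0)
comparisons = []
for _ in range(50):
    a, b = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
    label = 1.0 if a[:, 0].sum() > b[:, 0].sum() else 0.0
    comparisons.append((a, b, label))

w = train_reward_predictor(comparisons, n_features=2)
print("learned reward weights:", w)
```

The learned weights can then score new behavior, so the agent gets a reward estimate at every step even though the human only compared a small number of clip pairs.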

This method is helpful for situations where it is difficult to create a reward function.

This iterative approach to learning means that a human can spot and correct any undesired behaviours, a crucial part of any safety system. The design also does not put an onerous burden on the human operator, who only has to review around 0.1% of the agent’s behaviour to get it to do what they want. However, this can mean reviewing several hundred to several thousand pairs of clips, something that will need to be reduced to make it applicable to real-world problems.

Read about it here:
The Paper
A blog post from DeepMind
OpenAI’s blog post

Dark Knowledge – Geoffrey Hinton

Feedback from humans to help machines learn