Lasso regression

To avoid overfitting a regression with too many features, while still keeping enough features to minimize the sum of squared errors and get an accurate fit on the test data, you need to regularize the regression.

This can be done with Lasso regression, where you minimize the sum of squared errors plus a penalty parameter times the absolute values of the regression coefficients (which effectively limits how many features get non-zero weight):

minimize SSE + λ|β|
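
As a hedged illustration, here is a minimal sketch of the same idea using scikit-learn's Lasso class, where the alpha argument plays the role of λ and the data is made up for the example:

from sklearn.linear_model import Lasso
import numpy as np

# Toy data: 100 samples with 10 features, but only the first two
# actually influence the target.
rng = np.random.RandomState(0)
X = rng.randn(100, 10)
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.randn(100)

# alpha is the penalty strength (the lambda in the formula above).
reg = Lasso(alpha=0.1)
reg.fit(X, y)

# Coefficients of the irrelevant features are driven to (or near) zero.
print(reg.coef_)

With a larger alpha, more coefficients are pushed to exactly zero, which is how Lasso reduces the number of features that are actually used.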

 

Deep Learning School

Here you can watch lectures from the 2016 Deep Learning Summer School in Montreal.

Course excerpt:

Deep neural networks that learn to represent data in multiple layers of increasing abstraction have dramatically improved the state-of-the-art for speech recognition, object recognition, object detection, predicting the activity of drug molecules, and many other tasks. Deep learning discovers intricate structure in large datasets by building distributed representations, either via supervised, unsupervised or reinforcement learning.

The Deep Learning Summer School 2016 is aimed at graduate students and industrial engineers and researchers who already have some basic knowledge of machine learning (and possibly but not necessarily of deep learning) and wish to learn more about this rapidly growing field of research.

 

 

Here is the schedule in which the presentations were given:


Days: 01/08/2016 to 07/08/2016. Within each time slot below, the sessions are listed in order over the week.

9:00-10:30 (lectures): Doina Precup; Rob Fergus; Yoshua Bengio; Kyunghyun Cho; Joelle Pineau; Ruslan Salakhutdinov; Bruno Olshausen (Neuro I)

10:30-11:00: Coffee break (every day)

11:00-12:30 (lectures): Hugo Larochelle; Antonio Torralba; Sumit Chopra; Edward Grefenstette; Pieter Abbeel; Shakir Mohamed; Surya Ganguli (and Deep Learning Theory)

12:30-14:30: Lunch (every day; the WiDL event was held during lunch on 03/08)

14:30-16:00: Hugo Larochelle: Neural Networks II (click on part II); Alex Wiltschko: Torch I; Jeff Dean: TensorFlow; Julie Bernauer (NVIDIA): GPU programming with CUDA; Joelle, Pieter & Doina: Advanced Topics in RL; Contributed talks, Session 4; Contributed talks, Session 4

16:00-16:30: Coffee break (every day)

16:30-18:00: Pascal Lamblin: Practical Session; Alex Wiltschko (Torch); Frédéric Bastien (Theano); Jeff Dean: TensorFlow (click on part II); Contributed talks, Session 1; Contributed talks, Session 2; Contributed posters, Session 1; Contributed posters, Session 2

Evening: Opening Reception (18:00-20:30), by Imagia; Happy Hour (18:45-22:30, buses at 18:30), by Maluuba; Happy Hour (18:30-20:30), by Creative Destruction Lab

(or you can just follow them in consecutive order at http://videolectures.net/deeplearning2016_montreal/ since they seem to be in the order they were presented.)

Contributed talks:

12:55 Chains of Reasoning over Entities, Relations, and Text using Recurrent Neural Networks
Rajarshi Das

14:29 Professor Forcing: A New Algorithm for Training Recurrent Networks
Anirudh Goyal

10:59 Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
Tegan Maharaj

18:58 Deep multi-view representation learning of brain responses to natural stimuli
Leila Wehbe

14:49 Learning to Communicate with Deep Multi­-Agent Reinforcement Learning
Jakob Foerster

13:57 Model-Based Relative Entropy Stochastic Search
Abbas Abdolmaleki

16:33 Learning Nash Equilibrium for General-Sum Markov Games from Batch Data
Julien Pérolat

20:30 A Network-based End-to-End Trainable Task-oriented Dialogue System
Tsung-Hsien Wen

15:28 Inference Learning
Patrick Putzky

16:45 Variational Autoencoders with PixelCNN Decoders
Ishaan Gulrajani

13:33 An Infinite Restricted Boltzmann Machine
Marc-Alexandre Côté

15:15 Deep siamese neural network for prediction of long-range interactions in chromatin
Davide Chicco

14:09 Beam Search Message Passing in Bidirectional RNNs: Applications to Fill-in-the-Blank Image Captioning
Qing Sun

18:40 Analyzing the Behavior of Deep Visual Question Answering Models
Aishwarya Agrawal

13:55 Recombinator Networks: Learning Coarse-to-Fine Feature Aggregation
Sina Honari

Outlier Rejection

To detect and get rid of outliers in a dataset (which may for instance have been caused by sensor errors or data entry errors), you first train on your data, then remove the points with the highest residual errors (roughly the worst 10%), and then train again.

Otherwise erroneous data entries may give you an incorrect regression line.
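
As a rough sketch of that clean-then-retrain loop (assuming scikit-learn's LinearRegression and NumPy arrays; the function name and drop fraction are made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

def fit_with_outlier_rejection(features, targets, drop_fraction=0.1):
    # First pass: fit on all of the data.
    reg = LinearRegression()
    reg.fit(features, targets)

    # Rank the points by residual error and keep the ones with the
    # smallest errors (drop roughly the worst 10%).
    residuals = np.abs(targets - reg.predict(features))
    keep = residuals.argsort()[: int(len(targets) * (1 - drop_fraction))]

    # Second pass: retrain on the cleaned data.
    reg.fit(features[keep], targets[keep])
    return reg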

Classification vs. Regression

Two closely related concepts in supervised machine learning are classification and regression.
With supervised classification you get a discrete output (a label or a boolean value), while in regression the output is continuous (i.e. a number).
What you are trying to find also differs: a decision boundary in classification and a best-fit line in regression. You evaluate the former with its accuracy, and the latter using the sum of squared errors or r².
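
As a small, hedged example of the two kinds of evaluation (the numbers are made up), scikit-learn provides accuracy_score for classification and r2_score for regression:

from sklearn.metrics import accuracy_score, r2_score

# Classification: discrete labels, evaluated with accuracy.
true_labels = [0, 1, 1, 0]
predicted_labels = [0, 1, 0, 0]
print(accuracy_score(true_labels, predicted_labels))  # 3 of 4 correct -> 0.75

# Regression: continuous outputs, evaluated with r^2 (or the sum of squared errors).
true_values = [1.0, 2.0, 3.0, 4.0]
predicted_values = [1.1, 1.9, 3.2, 3.8]
print(r2_score(true_values, predicted_values))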

Want to refresh your Linear Algebra?

In machine learning, and especially in deep learning, you will need linear algebra. If you have not used your linear algebra in some time, this chapter of the Deep Learning Book will refresh you on the parts of linear algebra that are essential to machine learning.

k-Nearest Neighbors

One of the simplest algorithms in machine learning is k-Nearest Neighbors. It is considered a “lazy learning” algorithm, where all the calculations are deferred until classification.
It works like this:
You have your training data with its features and labels. Then, upon classifying or testing a new point, you take its k nearest neighbors in the training data and assign the label chosen by the majority vote among those neighbors. If k=1, the point’s label will simply be that of its closest neighbor.

The Dataset

The 1NN classification map

The 5NN classification map

(Images from: wikipedia)
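
As a hedged sketch of the majority-vote idea, here is how it could look with scikit-learn's KNeighborsClassifier on some toy data:

from sklearn.neighbors import KNeighborsClassifier

# Toy training data: one cluster around (0, 0) and one around (5, 5).
features_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
labels_train = [0, 0, 0, 1, 1, 1]

# k=5: each prediction is the majority vote of the 5 nearest neighbours.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(features_train, labels_train)

print(clf.predict([[1, 1], [6, 6]]))  # -> [0 1]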

Decision Trees

When you have to deal with non-linear decision making, you can use decision trees to break the problem down into a series of linear decisions.

Let’s say we have a buddy who goes wakeboarding if the weather is sunny but not windy. Whenever the sun is out he considers going wakeboarding, but if it is too windy there will be too many waves on the lake and it is not as much fun as when the surface is still. This data is not linearly separable, as shown in the following image: we can’t separate it with one line.

Decision trees enable you to make several linear decisions one after another. When we look at the data we can clearly see that it follows a pattern: if it is windy, he will not go wakeboarding, regardless of whether it is sunny or not. So by asking ourselves “is it windy?” we get a definite answer in one case: if it is windy, we will not go wakeboarding.

If the answer was yes, we already know whether we are going wakeboarding. If it is not windy, we ask another question: is it sunny? If not, we won’t go wakeboarding, but if it is sunny, we’ll head to the lake.

This way, by stepping through a decision tree, we turn a data set that is not linearly separable into a series of linearly separable decisions.

 

If you then look at slightly more complex data, you can see that after a certain threshold on x the data behaves differently. Here, too, the data is not linearly separable.

This can be made into linear decisions using a decision tree.

Decision trees are easy to understand and interpret since they can be visualized as a tree structure. The data does not need much preparation, whereas other methods may require normalisation. Just as we saw earlier that Naive Bayes is good for classifying text, a decision tree is good when it comes to numerical and categorical data. They are, however, prone to overfitting. If some classes dominate in the training data, the generated tree may become biased; in that case the dataset should be balanced before training.

You want to find split points and variables that separate the dataset into subsets that are as pure as possible, meaning that each subset preferably contains examples of only one class. This is then done recursively on the generated subsets until you have classified the data.

Entropy (a measure of impurity) is what a decision tree uses to determine where to split the data when constructing the tree. Entropy is the opposite of purity: if all examples in a sample are of the same class, the entropy is 0. If the examples are evenly split between the available classes, the entropy is at its maximum (1 in the two-class case).

Decision trees use the entropy of a node to decide what the split shall be:

entropy = − Σ_i p_i log2(p_i)

where p_i is the fraction of examples in class i.

 

The way a decision tree places its boundaries (i.e. chooses which feature to split on) using entropy is by maximizing something called information gain.

Information gain = entropy(parent) – [weighted average]entropy(children)
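
As a small, hedged sketch (with made-up labels), computing the entropy and the information gain of one candidate split could look like this:

import numpy as np

def entropy(labels):
    # entropy = -sum_i p_i * log2(p_i), over the classes present in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, left, right):
    # information gain = entropy(parent) - weighted average entropy(children)
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = [0, 0, 1, 1]                            # evenly split -> entropy 1.0
print(entropy(parent))
print(information_gain(parent, [0, 0], [1, 1]))  # pure children -> gain 1.0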

 

See the example found in this video for an explanation about using information gain to choose which feature to use to split data in a decision tree.

 

 

Here you can find more information about decision trees: http://scikit-learn.org/stable/modules/tree

If you use the DecisionTreeClassifier in scikit-learn, you can tinker a bit with the parameters to set the criteria for how it behaves when splitting the training data into tree branches.
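
For instance, a minimal sketch with toy data could pass two of those parameters: criterion="entropy" uses the information gain described above, and min_samples_split controls how small a node may get before it stops splitting.

from sklearn.tree import DecisionTreeClassifier

# Toy data: the point (1, 1) is the only one in class 1.
features_train = [[0, 0], [0, 1], [1, 0], [1, 1]]
labels_train = [0, 0, 0, 1]

clf = DecisionTreeClassifier(criterion="entropy", min_samples_split=2)
clf.fit(features_train, labels_train)

print(clf.predict([[1, 1]]))  # -> [1]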

 

Support Vector Machines

A Support Vector Machine is an algorithm that outputs a line separating two classes in a set of data.

An SVM will try to maximise the margin from the line to the nearest data points of the two classes.

If you use the scikit-learn library for training and predicting with the SVM algorithm, you use a similar syntax as for GaussianNB and, for that matter, the other machine learning algorithms.

In general when doing machine learning, you should reserve a part of your data, about 10%, to use as testing data, and it should not be used for training. Otherwise you will not get a realistic estimate of your algorithm’s accuracy.

So we should split our training/testing data into 4 variables: one set of features for training (an array, and in the case of SVM, an array containing arrays of feature values), one set of labels for those features (the correct answers), and similar variables to be used for testing.

from sklearn import svm

# Training data: features_train is an array of feature arrays,
# labels_train holds the correct answer for each of them.
features_train = [[0, 0], [1, 1]]
labels_train = [0, 1]

# Create a classifier with a linear kernel, train it, and predict
# the label of a previously unseen point.
clf = svm.SVC(kernel="linear")
clf.fit(features_train, labels_train)
pred = clf.predict([[2., 2.]])
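
A hedged sketch of the train/test split described above, using scikit-learn's train_test_split with 10% held out for testing (the features and labels below are just placeholders):

from sklearn import svm
from sklearn.model_selection import train_test_split

# Placeholder data: features is an array of feature arrays,
# labels holds the corresponding correct answers.
features = [[0, 0], [0, 1], [1, 0], [1, 1]] * 5
labels = [0, 0, 1, 1] * 5

features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)

clf = svm.SVC(kernel="linear")
clf.fit(features_train, labels_train)
print(clf.score(features_test, labels_test))  # accuracy on the held-out 10%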

 

If your data, however, is not linearly separable, let’s say the data looks like this:

The data cannot be separated by a line, but if you for instance add a new feature |x|, i.e. the absolute value of x, you can separate the data with a line.
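
As a hedged sketch of that trick (with made-up points where one class sits in a band around x = 0 and the other lies far out on both sides), adding |x| as an extra feature lets a linear SVM separate the classes:

import numpy as np
from sklearn import svm

# Class 1 lies near x = 0, class 0 lies far out on both sides.
x = np.array([-4.0, -3.0, -2.5, 2.5, 3.0, 4.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5])
y = np.array([1.0, 2.0, 1.5, 2.0, 1.0, 1.5, 1.0, 2.0, 1.5, 2.0, 1.0, 1.5])
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

# In (x, y) no single line separates the classes, but adding |x|
# as a third feature makes them linearly separable.
features = np.column_stack([x, y, np.abs(x)])

clf = svm.SVC(kernel="linear")
clf.fit(features, labels)
print(clf.score(features, labels))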

A very powerful feature of support vector machines is that when a set of two-dimensional data is not linearly separable, they can use something called the kernel trick to transform the features into a higher-dimensional space where the data is separable, in order to get a solution.

In the svm library you can also provide different kernels, or define your own. For instance, in the code example above you see that I use a linear kernel.

In the image below you see two decision boundaries produced on the same data but with two different kernels and another parameter called gamma: a linear kernel to the left and an rbf kernel to the right (with a gamma parameter of 1000).

The parameters are set when you create your SVM classifier, i.e. before you train it.

Another parameter you can use is the C parameter. It controls the trade-off between a smooth decision boundary between the classes and a classifier that classifies all the training points correctly. The catch is that even though a high C can give you a classifier that is very accurate on your training data, it might not be very good at generalizing to your test data.

The gamma parameter tells how far the influence of a single data point reaches when the decision boundary is drawn. A low value means a far reach, and a high value means a narrow reach. With a high value you therefore get a very squiggly boundary, because individual points close to the line pull it strongly towards themselves, which may make the classifier work poorly on the test data. With a low value the boundary is straighter, since a single data point no longer adjusts the line on its own, and points further away from the boundary are also taken into account when deciding where it should be placed.
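
A hedged sketch of how those parameters are passed when the classifier is created (the data is just a toy placeholder):

from sklearn import svm

features_train = [[0, 0], [1, 1], [3, 0], [4, 1]]
labels_train = [0, 0, 1, 1]

# A linear kernel gives a straight decision boundary.
linear_clf = svm.SVC(kernel="linear")

# An rbf kernel with a large gamma and a large C: each point has a very
# short reach and training mistakes are punished hard, so the boundary
# can get very squiggly and the model may overfit.
rbf_clf = svm.SVC(kernel="rbf", gamma=1000, C=10000)

for clf in (linear_clf, rbf_clf):
    clf.fit(features_train, labels_train)
    print(clf.predict([[2.0, 0.5]]))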

 

Overfitting

One thing we must always be aware of and try to avoid in machine learning is overfitting. In the example below you see that the line to the left correctly classifies the green dots, but is very complicated. This happens when you take the data too literally. If your algorithm produces a boundary similar to the one on the left instead of the one to the right, you are overfitting.

One way of reducing overfitting is through the use of your algorithm’s parameters.

Support Vector Machines work very well in complicated domains where there is a clear margin of separation. They do not perform efficiently on very large data sets, since the training time grows roughly cubically with the data size. They also do not work very well when there is a lot of noise in the data and the classes overlap (in that case a Naive Bayes classifier would be better).