When a decision is non-linear, you can use a decision tree to break it down into a sequence of linear decisions.
Let’s say we have a buddy who goes wakeboarding if the weather is sunny but not windy. Whenever he sees that the sun is up he considers going wakeboarding, but if it is too windy there will be too many waves on the lake, and that is not as fun as when the surface is still. This data is not linearly separable, as shown in the following image: we can’t separate it with one line.
Decision trees enable you to make several linearly separable decisions one after another. In this case, when we look at the data we can clearly see that it follows a pattern. First, if it is windy he will not go wakeboarding, regardless of whether it is sunny or not. So by asking "is it windy?" we get a definite answer whenever it is windy: no wakeboarding.
If the answer is yes, we already know that we will not go wakeboarding. If it is not windy, we ask another question: is it sunny? If not, we won’t go wakeboarding, but if it is sunny, we’ll head to the lake.
This way, we turn a data set that is not linearly separable into a series of linear decisions by stepping through a decision tree.
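The two questions above can be written down directly as nested checks. This is only a minimal sketch of the wakeboarding tree; the function name goes_wakeboarding and the boolean inputs are made up for illustration.

```python
def goes_wakeboarding(windy: bool, sunny: bool) -> bool:
    """Walk the two-question tree from the wakeboarding example."""
    # First question: is it windy? If so, the answer is final.
    if windy:
        return False
    # Second question, asked only on calm days: is it sunny?
    return sunny


# Print the decision for every combination of the two inputs.
for windy in (False, True):
    for sunny in (False, True):
        print(f"windy={windy}, sunny={sunny} -> go={goes_wakeboarding(windy, sunny)}")
```

Each check looks at a single variable, which is what makes every individual step a simple linear decision.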
Now look at slightly more complex data. You can see that beyond a certain threshold on x, the data behaves differently. Here, too, the data is not linearly separable.
This can be made into linear decisions using a decision tree.
Decision trees are easy to understand and interpret since they can be visualized as a tree structure. The data needs little preparation, whereas other methods may require normalisation. Just as we saw earlier that Naive Bayes is good for classifying text, a decision tree is good when it comes to numerical and categorical data. They are, however, prone to overfitting. If some classes dominate in the training data, the generated tree may become biased. In that case the training data needs to be balanced before fitting the tree.
You want to find split points and variables that separate the dataset into subsets that are as pure as possible, meaning that each subset preferably contains examples of only one class. This is then done recursively on the generated subsets until you have classified the data.
Entropy (a measure of impurity) is what a decision tree uses to determine where to split the data when constructing the tree. Entropy is the opposite of purity: if all examples in a sample are of the same class, the entropy is 0. If the examples are evenly split between two classes, the entropy is 1. With more than two classes and a base-2 logarithm, an even split gives a higher value, namely log2 of the number of classes.
Decision trees use the entropy of a node to calculate what the split should be:
entropy = – Σ_i p_i · log2(p_i)
Here p_i is the fraction of examples in the node that belong to class i.
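As a rough sketch, the entropy of a node can be computed from its class labels by turning the counts into fractions p_i and summing p_i · log2(1 / p_i), which is the same as –Σ p_i · log2(p_i). The helper name entropy and the toy label lists are assumptions made for this illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    # n / total is p_i; log2(total / n) equals -log2(p_i).
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


pure = ["go", "go", "go", "go"]        # only one class
even_two = ["go", "stay"] * 2          # two classes, evenly split
even_four = ["a", "b", "c", "d"]       # four classes, evenly split

print(entropy(pure), entropy(even_two), entropy(even_four))
```

The pure list evaluates to 0 and the even two-class list to 1, while the even four-class list evaluates to 2, since the maximum grows with the number of classes.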
A decision tree uses entropy to choose its boundaries (that is, which feature to split on) by maximizing something called information gain.
Information gain = entropy(parent) – [weighted average]entropy(children)
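To make the formula concrete, here is a small sketch that computes the information gain of splitting on each feature and picks the larger one. The eight observed days are invented for this illustration, and the names information_gain, days and best_feature are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((n / total) * log2(total / n) for n in Counter(labels).values())


def information_gain(rows, feature, target="go"):
    """entropy(parent) minus the size-weighted average entropy of the children."""
    parent = [row[target] for row in rows]
    # Group the target labels by the value of the feature we split on.
    children = {}
    for row in rows:
        children.setdefault(row[feature], []).append(row[target])
    weighted = sum(len(c) / len(rows) * entropy(c) for c in children.values())
    return entropy(parent) - weighted


# Made-up observations: 4 calm sunny days (go), 1 calm cloudy day,
# 2 windy sunny days and 1 windy cloudy day (all stay home).
days = (
    [{"windy": False, "sunny": True, "go": True}] * 4
    + [{"windy": False, "sunny": False, "go": False}]
    + [{"windy": True, "sunny": True, "go": False}] * 2
    + [{"windy": True, "sunny": False, "go": False}]
)

gains = {feature: information_gain(days, feature) for feature in ("windy", "sunny")}
best_feature = max(gains, key=gains.get)
print(gains, best_feature)
```

On this toy data the split on windy has the higher gain (about 0.55 versus about 0.31 for sunny), so it would be chosen as the first question.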
See the example found in this video for an explanation about using information gain to choose which feature to use to split data in a decision tree.
Here you can find more information about decision trees: http://scikit-learn.org/stable/modules/tree
If you use the DecisionTreeClassifier in Scikit-learn, you can tinker with its parameters to set the criteria for how it splits the training data into tree branches.
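As a hedged example of what such tinkering might look like, the sketch below fits a DecisionTreeClassifier on a tiny 0/1 encoding of the wakeboarding example. The data and the chosen parameter values are assumptions for illustration, not recommended settings. criterion="entropy" asks for entropy-based splits instead of the default Gini impurity, while max_depth and min_samples_split limit how far the tree grows, which is one way to counter overfitting.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Columns: [windy, sunny]; label: 1 = goes wakeboarding, 0 = stays home.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 0, 0]

clf = DecisionTreeClassifier(
    criterion="entropy",    # choose splits by information gain
    max_depth=2,            # at most two questions in a row
    min_samples_split=2,    # a node needs at least 2 samples to be split
    random_state=0,         # make tie-breaking between equal splits repeatable
)
clf.fit(X, y)

# Show the learned questions as text, then classify two days.
print(export_text(clf, feature_names=["windy", "sunny"]))
print(clf.predict([[0, 1], [1, 1]]))  # calm and sunny, windy and sunny
```

Which feature ends up at the root can vary here, because on these four rows both splits give the same information gain.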