Random Forests
Random Forest models are supervised machine learning algorithms used for classification and regression problems. A Random Forest is a collection of Decision Trees that improves prediction over a single Decision Tree.
Decision Tree
A tree is a directed acyclic data structure made of nodes and edges that connect the nodes.
A Decision Tree is a tree whose nodes represent deterministic decisions based on variables and whose edges represent the path to the next node or to a leaf node, chosen according to the decision. A leaf node, or terminal node, of the tree represents a class label that is output as the prediction.
A Decision Tree is built by following three steps:
Step 1: Build the root node with the variable of most importance.
Step 2: Build a decision based on the split with the highest information gain.
Step 3: Recursively construct the remaining nodes and decisions using steps one and two until no informative split remains at a node.
When a large number of variables is involved, the primary challenge in building a Decision Tree is finding the variable, or combination of variables, of most importance at each node.
This variable selection, also called attribute selection, is generally performed using the information gain or Gini impurity criterion.
Information gain is used when the variables are categorical, i.e., when the variable values fall into classes or categories that have no logical order, for example, types of fruit.
Gini impurity is used when the variable values are continuous, i.e., numerical, for example, the age of a person.
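As a concrete sketch (assuming scikit-learn and its bundled Iris dataset; both are assumptions, not part of the text above), the two trees below are grown with the two attribute selection criteria just described:

```python
# A minimal sketch, assuming scikit-learn and its Iris dataset are available,
# of how the attribute selection criterion is chosen when growing a Decision Tree:
# "entropy" corresponds to information gain, "gini" to Gini impurity.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Tree grown with information gain (entropy) as the split criterion.
tree_entropy = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Tree grown with Gini impurity as the split criterion (scikit-learn's default).
tree_gini = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

print(tree_entropy.get_depth(), tree_gini.get_depth())
```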
Information Gain: the amount of Entropy removed by a split, i.e., Information Gain = Entropy(parent) − weighted average Entropy(children).
This makes sense: higher Information Gain means more Entropy removed, which is what we want. In the perfect case, each branch would contain only one class after the split, which would be zero entropy.
Entropy: A Decision Tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). The ID3 algorithm uses entropy, −Σ p_i log2(p_i) computed over the class proportions p_i, to measure the homogeneity of a sample.
If the sample is completely homogeneous the entropy is zero, and if the sample is equally divided between two classes the entropy is one.
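A minimal sketch of these quantities in plain NumPy, using a hypothetical two-class label array, is shown below; a perfect split recovers the full entropy of one as information gain:

```python
# A minimal sketch (plain NumPy, hypothetical binary labels) of the entropy and
# information gain computations described above.
import numpy as np

def entropy(labels):
    # Entropy = -sum(p_i * log2(p_i)) over the class proportions p_i.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    # Information gain = entropy removed by the split:
    # Entropy(parent) minus the size-weighted entropy of the children.
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children

parent = np.array([0, 0, 0, 1, 1, 1])                     # equally divided -> entropy 1.0
print(entropy(parent))                                     # 1.0
print(information_gain(parent, parent[:3], parent[3:]))    # perfect split -> gain 1.0
```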
Gini Impurity:
Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labelled if it were labelled randomly according to the distribution of labels in the set.
Higher Gini impurity means a higher chance of misclassification.
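A corresponding sketch of the Gini impurity computation, again in plain NumPy with hypothetical labels:

```python
# A minimal sketch (plain NumPy) of Gini impurity: 1 - sum(p_i^2), the chance
# that a randomly chosen element is mislabelled when labels are assigned
# according to the class proportions p_i.
import numpy as np

def gini_impurity(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))  # homogeneous sample -> 0.0
print(gini_impurity([0, 0, 1, 1]))  # equally divided two classes -> 0.5
```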
Cons of Decision Trees:
In a Decision Tree, the choice at each node is made greedily, to optimize the decision at that node alone. Choosing the best split at each step does not necessarily lead to the globally optimal tree.
Decision Trees are also prone to overfitting, as illustrated in the sketch below.
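As a rough illustration (assuming scikit-learn and its bundled breast cancer dataset; dataset and exact scores are assumptions), a fully grown tree fits the training data almost perfectly but scores noticeably lower on held-out data:

```python
# A sketch of the overfitting tendency of a single, fully grown Decision Tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("train accuracy:", tree.score(X_train, y_train))  # typically ~1.0
print("test accuracy:", tree.score(X_test, y_test))     # typically lower
```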
Random Forests
Random Forests are collections of Decision Trees. A Random Forest averages multiple Decision Trees built from different parts of the training set, with the goal of reducing the overfitting of a single Decision Tree.
The increased performance comes at the cost of some loss of interpretability and of an exact fit to the training data. Random Forests use bagging to construct the multiple Decision Trees.
Bagging, or bootstrap aggregation, is a technique that samples a subset of the training data with replacement and constructs a model on the sampled training set, as sketched below.
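A minimal sketch of bootstrap sampling, using plain NumPy with a hypothetical ten-row training set:

```python
# A minimal sketch of bootstrap aggregation: each bag is drawn from the
# training set with replacement and has the same size as the original set,
# so some rows repeat and others are left out.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
X = np.arange(n_samples)          # stand-in for the training rows

bag_indices = rng.integers(0, n_samples, size=n_samples)  # sample with replacement
bag = X[bag_indices]
out_of_bag = np.setdiff1d(X, bag)  # rows never drawn into this bag

print("bag:", bag)
print("out of bag:", out_of_bag)
```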
In addition to sampling training examples, Random Forests grow each Decision Tree on a random subset of the features of the input dataset D rather than on all of them. This process is also called feature bagging.
Without feature bagging, if one of the features is a very strong predictor, that feature will be selected by many different trees. This causes the trees to be correlated with one another, which reduces the benefit of averaging their predictions and lowers the accuracy.
Feature Bagging:
For each tree, perform feature bagging, i.e., randomly choose n/3 of the features, and use only those features to create the Decision Tree for the data in that bag.
Use cross validation to decide which feature bag is most apt for that specific data bag, and use this feature bag to create the Decision Tree for the data.
So each data bag will have a different set of features chosen for creating its Decision Tree; a sketch of this procedure follows.
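The sketch below, in plain NumPy with a hypothetical dataset shape, illustrates drawing a data bag and an n/3 feature bag per tree; the tree-growing step itself is only indicated by a comment:

```python
# A sketch of feature bagging as described above: each data bag gets its own
# random subset of about n/3 of the features, and only those columns are used
# to grow that bag's tree.
import numpy as np

rng = np.random.default_rng(0)
n_rows, n_features = 100, 12
X = rng.normal(size=(n_rows, n_features))   # stand-in training data

n_trees = 5
for tree_id in range(n_trees):
    row_idx = rng.integers(0, n_rows, size=n_rows)           # data bag (with replacement)
    feat_idx = rng.choice(n_features, size=n_features // 3,  # feature bag: n/3 features
                          replace=False)
    X_bag = X[row_idx][:, feat_idx]
    # ...grow a Decision Tree on X_bag here...
    print(f"tree {tree_id}: features {sorted(feat_idx)}")
```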
Cross Validation in Random Forests
In Random Forests, cross validation is estimated internally during the run because of the bagging procedure. For each bag, because sampling is done with replacement, only about 63% of the unique samples from the original dataset appear in the bag.
Hence, about a third of the samples are not present in the construction of the ith tree. These left-out data are used for cross validation, to identify the most useful feature combination among the multiple randomly chosen feature bags for that data bag.
The proportion of samples from the cross validation set (for that data bag) that are classified incorrectly, over all classifications of the cross validation set, is the error estimate of the system.
This error estimate is also referred to as the out-of-bag (OOB) error estimate. The OOB error is the mean prediction error on each training sample j, computed using only the trees that did not have sample j in their bootstrap sample.
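In scikit-learn (an assumption here, along with the bundled breast cancer dataset), this internal estimate is exposed when the forest is fitted with oob_score enabled:

```python
# A sketch of the out-of-bag estimate: with oob_score=True each sample is
# scored only by the trees whose bootstrap sample did not contain it.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0).fit(X, y)

print("OOB accuracy:", forest.oob_score_)   # internal estimate of test accuracy
print("OOB error:", 1.0 - forest.oob_score_)
```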
Disadvantages of Random Forests:
Random Forests can overfit the data when there is a large number of high-cardinality categorical variables.
Random Forests also provide discrete models for prediction. If the response variable is continuous, only a finite number of distinct predicted values is possible, because each prediction is an average of training responses stored in the leaves (see the sketch after this list of disadvantages). Other regression mechanisms provide a model that can predict arbitrary continuous values.
Random Forests, unlike Decision Trees, are not easy to interpret. Because the result comes from an ensemble of trees, interpreting it intuitively is very hard.
Random Forests do not support incremental learning very well. With new data they have to be retrained from scratch.
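To make the discrete-prediction limitation concrete, here is a sketch (assuming scikit-learn; the linear toy data is an assumption) of a Random Forest regressor whose predictions cannot exceed the range of training responses and take only finitely many distinct values:

```python
# A sketch of the discrete-prediction limitation: a Random Forest regressor
# outputs averages of training responses, so it cannot extrapolate beyond the
# range seen during training.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.arange(0, 10, 0.1).reshape(-1, 1)
y = 2.0 * X.ravel()                                  # simple linear response

forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(forest.predict([[20.0]]))            # stays near max(y) ~= 19.8, not the linear value 40
print(len(np.unique(forest.predict(X))))   # only a finite set of distinct predictions
```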
The advantages of Random Forests include:
Their predictive performance can compete with the best supervised learning algorithms.
They provide a reliable feature importance estimate (see the sketch after this list).
They offer an efficient estimate of the test error, the out-of-bag error, without incurring the cost of repeated model training associated with cross-validation.
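As a sketch of the feature importance estimate (assuming scikit-learn and its bundled Iris dataset):

```python
# A sketch of the feature importance estimate exposed by a fitted forest.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(data.data, data.target)

for name, importance in zip(data.feature_names, forest.feature_importances_):
    print(f"{name}: {importance:.3f}")
```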