Random Forests Simplified
This article explains the concepts behind Random Forests in a simplified way. Machine learning as a domain is not as volatile as it was over the past decade; things have settled down and some industry-accepted best practices have emerged. It feels as though much of the 90s was spent developing theory rather than testing it out (quite possibly due to the lack of available computing power), and the courseware developed during that period still holds weight in the minds of current practitioners.
A lot of data and compute power have become available over this time span, and one classic algorithm that has taken to these newfound resources and emerged as a clear winner in the machine learning community is Random Forests.
Though they are extremely powerful, Random Forests are still something of a black box in their implementations, and to grasp why they work so well we need to look at decision trees first. Decision trees are highly interpretable and easy to grasp; make no mistake, they are powerful in their own right, but they have some limitations that can be solved using an ensemble learning algorithm.
Decision trees are available in the scikit-learn library and can be imported using:
from sklearn.tree import DecisionTreeClassifier
Let us take the fairly common iris dataset as an example to show how decision trees work and how to interpret their output:
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data[:, 2:]  # petal length and width
y = iris.target
tree_clf = DecisionTreeClassifier(max_depth=3)
tree_clf.fit(x, y)
We can visualize the decision tree by calling the export_graphviz() function to create a graph definition file called iris_tree.dot:
from sklearn.tree import export_graphviz

export_graphviz(
    tree_clf,
    out_file=image_path('iris_tree.dot'),  # image_path() is a helper returning the output file path
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True,
)
To visualize the dot file, we can convert it into a variety of formats, most notably PDF or PNG.
$ dot -Tpng iris_tree.dot -o iris_tree.png
Our decision tree looks like this:
The above diagram is a fairly intuitive representation of what a decision tree looks for when making predictions. We can see that if the petal length is less than 2.45 cm, the tree deduces it to be setosa. Since there are no further child nodes attached to this node, its gini value, or impurity index, is 0, which is indicative of a clean classification.
If the petal length is greater than or equal to 2.45 cm, the tree then looks at the petal width, and based on that further classifications are made; the nodes with a gini value of 0 are leaf nodes that classify an iris class cleanly.
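We can check this behaviour in code by asking the fitted tree for its prediction and class probabilities. A minimal sketch, refitting the same tree for self-containment (the random_state value is an added assumption, not from the article):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_clf.fit(x, y)

# A flower with petal length 1.4 cm (< 2.45 cm) lands in the pure
# setosa leaf (gini = 0), so the setosa probability is 1.0.
print(tree_clf.predict([[1.4, 0.2]]))        # predicted class index
print(tree_clf.predict_proba([[1.4, 0.2]]))  # per-class probabilities
```

Because the leaf is pure, the tree is certain: the probability vector puts all its mass on setosa.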
Below we can visualize how the decision tree forms boundaries to classify the data, using various depths of tree growth.
As we can see, the greater the depth, the finer the classification, but one needs to be wary of how deep the algorithm should go on splitting, because unnecessary splits lead to overfitting and ungeneralizable models.
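One way to see this trade-off is to compare an unrestricted tree with a depth-limited one on a held-out split. A sketch under stated assumptions (the split ratio and random_state values are arbitrary choices, not from the article):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data[:, 2:], iris.target, test_size=0.3, random_state=42)

# Unrestricted tree: keeps splitting until leaves are (nearly) pure.
deep = DecisionTreeClassifier(random_state=42).fit(x_train, y_train)
# Regularized tree: growth capped at depth 3.
shallow = DecisionTreeClassifier(max_depth=3, random_state=42).fit(x_train, y_train)

print("deep    train/test:", deep.score(x_train, y_train), deep.score(x_test, y_test))
print("shallow train/test:", shallow.score(x_train, y_train), shallow.score(x_test, y_test))
```

The unrestricted tree fits the training set at least as well as the capped one; whether that extra depth helps on the test set is exactly the overfitting question.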
CART Training Algorithm
Scikit-learn uses the Classification and Regression Tree (CART) algorithm to train decision trees. At each node, it searches for the feature and threshold that produce the purest subsets (weighted by their size). Once it has made a split, it splits the resulting subsets using the same logic, until it reaches the maximum depth (as defined by the max_depth parameter) or until further splits no longer reduce impurity.
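The impurity measure CART minimizes by default is the Gini index, 1 minus the sum of squared class proportions in a node. A minimal sketch of that computation:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A pure node has impurity 0; an evenly mixed node is maximally impure
# for two classes at 0.5.
print(gini([0, 0, 0, 0]))  # 0.0
print(gini([0, 0, 1, 1]))  # 0.5
```

This is why the pure setosa leaf in the diagram above shows gini = 0.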
Like every other machine learning algorithm out there, there are a few hyperparameters that can be fine-tuned for optimum performance. These hyperparameters may be used for regularization, reducing the degrees of freedom and thereby curbing overfitting. One of the standout features of decision trees is that they make very few assumptions about the data (unlike linear models) and adapt themselves to the training data quite well.
Some of the common hyperparameters used while calling the decision tree function are:
- max_depth = Specifies the maximum depth of the tree, which loosely translates to how deep the tree will grow to find pure classifications.
- min_samples_split = The minimum number of samples a node must have, before it can be split.
- min_samples_leaf = The minimum number of samples a leaf node must have.
- max_leaf_nodes = Maximum number of leaf nodes.
- max_features = Maximum number of features that are evaluated for splitting each node.
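Putting these together, a regularized tree might be configured as follows. The specific values are illustrative assumptions, not recommendations from the article:

```python
from sklearn.tree import DecisionTreeClassifier

# Each constraint below restricts tree growth, trading training
# accuracy for better generalization.
reg_tree = DecisionTreeClassifier(
    max_depth=4,           # cap the number of split levels
    min_samples_split=10,  # a node needs >= 10 samples before it can split
    min_samples_leaf=4,    # every leaf must keep >= 4 samples
    max_leaf_nodes=8,      # at most 8 leaves overall
    max_features=2,        # evaluate 2 features per split
)
print(reg_tree.get_params()["max_depth"])  # 4
```

Tightening any of these makes the model simpler; loosening them lets it chase the training data more closely.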
Instability and Why we go for Random Forests
We have seen how easy and versatile decision trees are to build and use. However, they do have a few limitations. Firstly, decision trees love orthogonal boundaries, or splits: if we just rotate our data by 45 degrees, the boundaries produced are complex and unnecessarily convoluted. Secondly, decision tree models do not generalize well because the CART algorithm is greedy: it splits to gain the purest nodes at the current level, without caring whether that split will lead to the most efficient splits further down the line. In such cases, a feature might heavily dominate the splits, always becoming the first feature to be split upon. Such bias and lack of variance in splits produce highly sensitive, overfit models that do not generalize well to real-world data.
Almost all the problems listed above for decision trees are fixed by random forests, which also bring along a whole new class of ensemble algorithms. These algorithms are super fun, thoughtful, and have an inherent form of genius within them; we will check them out in the second part of this series.
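As a preview, a random forest is simply an ensemble of such trees, each trained on a random bootstrap sample of the data, with predictions made by majority vote. A minimal sketch on the same iris data (the n_estimators and random_state values are arbitrary assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
x, y = iris.data[:, 2:], iris.target

# 100 trees, each fit on a bootstrap sample; randomness across
# trees averages out the instability of any single tree.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(x, y)
print(forest.predict([[1.4, 0.2]]))  # setosa, as with the single tree
```

The averaging across many decorrelated trees is what tames the variance and sensitivity of the individual decision tree.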
Author : Piyush Daga