Random Forest Simplified

As we saw in the previous article, decision trees are a powerful, intuitive and simple-to-use class of machine learning algorithms. They can be applied to both classification and regression tasks, and they generally make very few assumptions about the data. We also learnt how to create our own trees for classifying the iris dataset, and figured out how to interpret the results.

Then we turned our attention to regularization and to reducing over-fitting. Despite all this, we saw that decision trees have their pitfalls, and some of them are ingrained deep in the basic splitting algorithm that builds the trees: the algorithm goes for the split that produces the purest subsets at that particular level, without any regard for the consequences of that split further down the tree. We also mentioned that ensemble methods can be used to overcome these challenges. Let us see how.


An ensemble, in our case, means collecting several votes on an answer and using some aggregation mechanism to produce one final value. This is useful because, instead of relying on a single extremely proficient classifier (a strong learner), we can combine several average classifiers (weak learners). A weak learner tends to be more biased than a good classifier: it may predict one class well and the others poorly, whereas a good classifier has low bias and treats all classes almost equally. If the weak learners are sufficiently numerous and diverse, each of them contributes the class it predicts well, and in a voting system this collection of one-class predictors can add up to a very good classifier.

You might ask: how is this even possible? Consider the following analogy. We have a slightly biased coin that comes up heads 51% of the time and tails 49% of the time. If we toss the coin 1000 times, we expect about 510 heads and 490 tails. If we do the math, we find that the probability of obtaining a majority of heads after 1000 tosses is close to 75%, and the more we toss, the higher that probability becomes.
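The "math" mentioned above can be sketched with the binomial distribution. The helper below is a hypothetical function written for this illustration (it is not part of any library); it counts a strict majority, that is, more than half of the tosses landing heads, and works in log space so that large toss counts do not overflow.

```python
import math


def prob_majority_heads(n_tosses, p_heads):
    """Probability that strictly more than half of n_tosses come up heads."""
    log_p = math.log(p_heads)
    log_q = math.log(1.0 - p_heads)
    total = 0.0
    for k in range(n_tosses // 2 + 1, n_tosses + 1):
        # Log of the binomial probability of exactly k heads.
        log_pmf = (
            math.lgamma(n_tosses + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n_tosses - k + 1)
            + k * log_p
            + (n_tosses - k) * log_q
        )
        total += math.exp(log_pmf)
    return total


# Compare a few toss counts for the 51% coin from the analogy.
for n in (10, 1000, 10000):
    print(n, round(prob_majority_heads(n, 0.51), 3))
```

The printed values should grow with the number of tosses, which is the point of the analogy: a tiny edge per toss turns into a large edge for the majority.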


So, by the same logic, if we build an ensemble of 1000 weak classifiers that are individually correct only 51% of the time, we end up with a classifier that produces the correct answer about 75% of the time. However, this only holds if the classifiers are uncorrelated and make perfectly independent decisions, which is unlikely to be the case when they are all trained on the same training data.
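The effect of correlation can be seen in a small simulation. This is only a sketch: the `shared_weight` mechanism, in which each voter copies one shared decision with some probability, is an assumption made for illustration and is not how real classifiers become correlated.

```python
import random


def majority_vote_accuracy(n_voters, p_correct, shared_weight, n_trials, seed=0):
    """Estimate how often a majority of voters is correct.

    shared_weight is the (hypothetical) probability that a voter copies a
    single shared decision instead of deciding independently.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # One decision that correlated voters all copy in this trial.
        shared_vote = rng.random() < p_correct
        correct_votes = 0
        for _ in range(n_voters):
            if rng.random() < shared_weight:
                correct_votes += shared_vote
            else:
                correct_votes += rng.random() < p_correct
        wins += correct_votes > n_voters / 2
    return wins / n_trials


independent = majority_vote_accuracy(1000, 0.51, shared_weight=0.0, n_trials=500)
correlated = majority_vote_accuracy(1000, 0.51, shared_weight=0.5, n_trials=500)
print("independent voters:", independent)
print("correlated voters: ", correlated)
```

Under these assumptions the independent ensemble keeps its advantage, while the correlated one falls back towards the accuracy of a single voter, because the voters tend to make the same mistakes together.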

Now that we have figured out why ensembles are so useful, let us learn how to build a diverse set of classifiers. We will use scikit-learn’s VotingClassifier class to combine a Random Forest, a Logistic Regression and a Support Vector Classifier.

Voting Classifier


Program to build and test a voting classifier with scikit-learn.
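The program refers to x_train, x_test, y_train and y_test without showing how they are created. One possible setup is sketched here; the sample count, noise level, random seeds and default 75/25 split are assumptions, since the article does not state which settings were used.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Assumed settings: 500 noisy samples and a fixed seed for repeatability.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)

# Default split keeps 75% of the rows for training and 25% for testing.
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(x_train.shape, x_test.shape)
```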

# Importing the required packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Building Individual Classifiers
log_clf = LogisticRegression()
rnd_clf = RandomForestClassifier()
svm_clf = SVC()

# Making a voting classifier that considers the result of all three individual 
# classifiers
voting_clf = VotingClassifier(
                    estimators = [('lr', log_clf), ('rf', rnd_clf), ('svc', svm_clf)],
                    voting = 'hard')

# Fitting the voting classifier on the well-known moons dataset
voting_clf.fit(x_train, y_train)

Now, we will look at each classifier's accuracy on the moons dataset to figure out whether our voting classifier actually performs as advertised.

from sklearn.metrics import accuracy_score

# Fit each classifier in turn and report its accuracy on the test set
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))

LogisticRegression 0.864
RandomForestClassifier 0.872
SVC 0.888
VotingClassifier 0.896

There we go! We can see that our voting (ensemble) classifier outperforms the individual classifiers, although by a slight margin.

Before we move on to Random Forests as our classifier of choice, there is one major concept we need to understand, because it is what makes Random Forests as popular as they are: Bagging and Pasting.

Bagging and Pasting

One way to get a diverse set of classifiers is to combine several unrelated, independent ones in a voting classifier. As we have seen, the results are good, but the improvement is modest. Another, more common method is to train the same classifier on different random subsets of the data.

When sampling is performed with replacement, the process is called Bagging.

When sampling is performed without replacement, the process is called Pasting.


Aggregation for:

  1. Classification –> Statistical Mode (Most frequent class predicted).
  2. Regression –> Average of all predicted values.
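The two sampling schemes and the two aggregation rules above can be sketched with Python's standard library. The row indices and the predictions are made-up values for illustration only.

```python
import random
from statistics import mean, mode

rng = random.Random(42)
training_rows = list(range(10))  # stand-in for the row indices of a training set

# Bagging: sampling WITH replacement, so a row may be drawn more than once.
bagging_sample = rng.choices(training_rows, k=len(training_rows))

# Pasting: sampling WITHOUT replacement, so every drawn row is distinct.
pasting_sample = rng.sample(training_rows, k=7)

print("bagging:", sorted(bagging_sample))
print("pasting:", sorted(pasting_sample))

# Hypothetical predictions from five predictors for a single instance.
class_votes = ["setosa", "versicolor", "setosa", "setosa", "virginica"]
regression_outputs = [2.0, 2.5, 3.0, 2.5, 2.0]

final_class = mode(class_votes)         # most frequent predicted class
final_value = mean(regression_outputs)  # average of the predicted values

print(final_class, final_value)
```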

In general, data scientists prefer bagging as the resampling method in Random Forests, because the available data is usually limited and sampling with replacement lets the same instances be reused across many trees. If the available dataset is large and represents the distribution of classes well, pasting may be preferred.
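In scikit-learn, both schemes are available through BaggingClassifier, where the bootstrap flag switches between sampling with and without replacement. The sketch below compares the two on the moons dataset; the dataset settings, the number of trees and the max_samples value are assumptions chosen for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed dataset settings, used only for this illustration.
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=42)

scores = {}
for name, with_replacement in (("bagging", True), ("pasting", False)):
    clf = BaggingClassifier(
        DecisionTreeClassifier(random_state=42),
        n_estimators=200,            # number of trees in the ensemble
        max_samples=100,             # rows drawn for each tree
        bootstrap=with_replacement,  # True = bagging, False = pasting
        random_state=42,
    )
    clf.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, clf.predict(x_test))

print(scores)
```

Which of the two scores higher depends on the data and the seed, so a single run like this one should not be read as a general verdict.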

Now, let us move on to the main entree, Random Forests:

Random Forests

By now we already know that Random Forests are a collection (ensemble) of decision trees, generally trained via the bagging method (sometimes pasting is also used), typically with the max_samples hyperparameter set to the size of the training data.

Let us build a basic Random Forest Classifier to get a feel for the easy syntax and basic parameters that need to be passed.

# Importing Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier

# Initializing the classifier to build 500 trees, each having a maximum of 16 leaf nodes
# and n_jobs = -1 --> utilizing all available CPU cores on the local machine.
rnd_clf = RandomForestClassifier(n_estimators = 500, max_leaf_nodes = 16, n_jobs = -1)

# Fitting / Training the classifier to model the input data.
rnd_clf.fit(x_train, y_train)

# Using the trained classifier to predict the classes for test cases.
y_pred = rnd_clf.predict(x_test)

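In general, a Random Forest Classifier has the hyperparameters of a decision tree classifier (to control how each tree is grown) as well as those of a Bagging Classifier (to control the overall ensemble), with a few exceptions. For example:

max_leaf_nodes –> Maximum number of leaf nodes in each tree.

n_estimators –> How many trees will be built.

Note that splitter, a decision tree hyperparameter, is not exposed by scikit-learn's RandomForestClassifier.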
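One way to check which hyperparameters an estimator actually exposes is its get_params() method. The short sketch below looks up a handful of names on a default RandomForestClassifier; the names chosen are just examples.

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns a dict of every hyperparameter and its current value.
params = RandomForestClassifier().get_params()

for name in ("n_estimators", "max_leaf_nodes", "bootstrap", "splitter"):
    # Names missing from the dict are not parameters of this estimator.
    print(name, "->", params.get(name, "not a RandomForestClassifier parameter"))
```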

This is a good stopping point for anyone who wants to understand decision trees, voting classifiers, bagging classifiers and Random Forests, and to use them to make predictions and support statistical decisions. Beyond this lie advanced topics that demand a post of their own. For further reading, these would be the key searches:

  1. Why build extra trees?
  2. Boosting
  3. AdaBoost
  4. Gradient Boosting
  5. Stacking

Hope you enjoyed the article.

Author: Piyush Daga


Post Author: Ruchi
