Random Forest Regression


Description

Random forest regression trains a machine learning model using an ensemble of decision trees.

Individual decision tree models are not robust to minor changes in data; random forests overcome this limitation by training multiple decision trees that are exposed to different combinations of the features and observations from the training data.


Application

Similar to other machine learning regression models, random forest regression can be used to predict single or multiple “target” output values from “feature” inputs. Outputs from a random forest model are the averaged outputs of the decision trees composing the ensemble. See the Advantages and Disadvantages lists below for when to use or avoid this type of model.
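The platform exposes this as a no-code step, but as a point of reference, here is a minimal sketch of the same technique using scikit-learn's RandomForestRegressor (an assumed stand-in for the platform's implementation, not its actual code) on synthetic data:

```python
# Minimal sketch of random forest regression with scikit-learn
# (assumed equivalent of the platform's no-code step).
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data: 500 observations, 8 "feature" inputs, 1 "target" output.
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Each prediction is the average of the individual trees' outputs.
print(model.predict(X_test[:5]))
print("R^2 on held-out data:", model.score(X_test, y_test))
```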


How to use

The inputs, usage, and most of the advanced parameters of the Random Forest Regression model are the same as those of the Decision Tree Regression model.

The additional advanced parameter, Number of Estimators, controls how many individual decision trees are trained to compose the random forest. More trees generally improve performance, although with diminishing returns at higher values (>100). Increasing the number of estimators also increases training and prediction time. The sketch below illustrates this trade-off.
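As a rough illustration of those diminishing returns, this sketch (again assuming scikit-learn's n_estimators as the equivalent of Number of Estimators) cross-validates forests of increasing size; the jump from 1 to 10 trees is typically far larger than the jump from 100 to 300:

```python
# Illustrative only: how forest size affects cross-validated R^2.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)

for n in (1, 10, 100, 300):
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"n_estimators={n:>3}: mean R^2 = {score:.3f}")
```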


More on this step

Single decision trees are sensitive to training data; minor variations in the training data can produce very different trees. Random forests overcome this limitation by training many trees that see randomised views of the training data: typically, each tree is fit on a random bootstrap sample of the rows and considers a random subset of the features at each split. Predictions are generated by querying each tree independently and averaging the results, as the sketch below demonstrates.
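The averaging can be made concrete with a short sketch. In scikit-learn (an assumed equivalent of the platform's implementation), the fitted trees are stored in the estimators_ attribute, so the forest's output can be reproduced by querying each tree and taking the mean:

```python
# Sketch: a forest's prediction is the mean of its trees' predictions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# Each tree was fit on a bootstrap sample of the rows (and may consider
# a random feature subset at each split, controlled by max_features).
# Query the trees independently and average their outputs.
per_tree = np.array([tree.predict(X[:3]) for tree in forest.estimators_])
print(per_tree.mean(axis=0))   # manual average over the 50 trees
print(forest.predict(X[:3]))   # matches the forest's own output
```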

Combining predictions from multiple diverse trees greatly reduces overfitting (model variance) and generally leads to much better performance than a single tree.

Advantages

  • Fast to train and scales well with data size.
  • Robust to data types, quality, and preparation methods - works well with continuous and categorical data and is insensitive to feature scaling or normalisation.
  • Non-parametric - meaning there’s no set underlying mathematical equation predefining the interactions between features (as there is with some other model types, such as linear regression). How features interact is learned from the data, rather than imposed by assumption.

Disadvantages

  • Harder to interpret than single decision trees - although each tree is still interpretable, trying to understand the decisions of many trees at a time is much harder.
  • Predictions are not necessarily smooth or continuous, although they will likely be smoother than those of a single decision tree.

Out of memory issues

Training random forest models can consume a large amount of memory because the model consists of many decision trees. The larger or more complex the trees within the random forest become, the more memory is consumed. During hyper-parameter optimisation in particular, it is easy to hit the memory limits and end up with an error message.

To reduce memory consumption, limit the Maximum Depth of a Tree or reduce the Number of Estimators (the number of decision trees within the random forest model). Both parameters can be found in the advanced options. As the model becomes smaller and less complex, memory consumption decreases; this is, of course, a trade-off between accuracy and memory consumption. The sketch below shows both levers.
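In scikit-learn terms (assuming max_depth and n_estimators as the equivalents of Maximum Depth of a Tree and Number of Estimators), the two memory-saving levers look like this:

```python
# Sketch of the two memory-saving levers described above.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=1000, n_features=10, random_state=0)

# Shallower trees and fewer estimators give a smaller model in memory,
# usually at some cost in accuracy.
compact = RandomForestRegressor(n_estimators=50, max_depth=8, random_state=0)
compact.fit(X, y)
print("R^2 on training data:", compact.score(X, y))
```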


