Learning Curve

Modified on Tue, 14 Mar, 2023 at 1:58 PM

Description

This function is applied to trained models and creates a learning curve. With learning curves you can analyse whether a model is over- or underfitting and whether more data would help to improve model accuracy.
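
The step computes the learning curve internally from the selected models and data. Purely as an illustration of the underlying idea, the sketch below uses scikit-learn's learning_curve with a placeholder Ridge model and synthetic data (these names are assumptions and not part of this step):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import learning_curve

    # Placeholder model and synthetic data, for illustration only
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    train_sizes, train_scores, test_scores = learning_curve(
        Ridge(), X, y,
        train_sizes=np.linspace(0.2, 1.0, 5),  # 20 %, 40 %, ..., 100 % of the data
        cv=5,                                  # 5-fold cross validation
        scoring="r2",
    )

    # A persistent gap between the two curves hints at overfitting; two low,
    # converged curves hint at underfitting; a test curve still rising at
    # 100 % suggests that more data could help.
    print("train R2:", train_scores.mean(axis=1))
    print("test  R2:", test_scores.mean(axis=1))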


Application

A learning curve can help to answer questions such as:

  • Is the model over- or underfitting? Or is the model “just right”?
  • Do I have enough data or should I bring more data to improve model accuracy?
  • Could I achieve the same model accuracy with less data?

Read the linked article in the FAQ section to understand how to read a learning curve with regard to the questions above.


How to use

You need at least one trained model to create a learning curve. The step also needs a train and a test dataset to generate the learning curve. There are three different ways to define the test dataset.

Provide separate train and test data sets

You can reuse the train and test datasets from a previous train test split in this step. The advantage of this approach is that you have full control over the train and test datasets.

  • Select the dataset used for model training in the field Training Data.
  • Select the dataset that was held back for model testing in the field Test Data (optional).
  • Select the Models for which to create a learning curve. You can select a single model or multiple models.
  • Select Use existing from the Testing Data options. This option only appears if a dataset was selected in the Test Data field.
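
Conceptually, this option re-trains the model on growing fractions of the training dataset and always scores it against the same held-back test dataset. The sketch below illustrates that idea with placeholder scikit-learn objects (model, data and fractions are assumptions, not the step's actual internals):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the Training Data and Test Data fields
    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    for fraction in (0.2, 0.4, 0.6, 0.8, 1.0):         # Granularity = 5
        n = int(fraction * len(X_train))
        model = Ridge().fit(X_train[:n], y_train[:n])  # train on a subset
        score = model.score(X_test, y_test)            # always the same test set
        print(f"{fraction:.0%} of training data -> test R2 = {score:.3f}")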

Generate test dataset via percentage

A single dataset is used and split into two parts: one part is used for model training, the other to test the model. The fraction used for model testing is specified by the user as a percentage of the initial dataset.

For this approach you could use the initial full dataset (before any train test split) to generate the learning curve.

  • Select a dataset in the field Training Data.
  • Leave the field Test Data (optional) empty.
  • Select the Models for which to create a learning curve. You can select a single model or multiple models.
  • Select Percentage from the Testing Data options.
  • Specify the percentage of the dataset to use as testing data in the field Test percentage. The value is given in % and must be greater than 0 and less than 100. The default value is 20 %.
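
Assuming the split behaves like a standard random train/test split (scikit-learn's train_test_split is used here purely for illustration), a Test percentage of 20 corresponds to a test_size of 0.2:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the single dataset selected as Training Data
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    test_percentage = 20  # default; must be greater than 0 and less than 100
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_percentage / 100, random_state=0
    )
    print(len(X_train), "training rows,", len(X_test), "test rows")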

Generate test dataset via cross validation 

A single dataset is used, and cross validation takes care of both training and testing the model. You have to select how many folds will be used for cross validation.

The dataset will be split into K folds, and for each point on the learning curve the model will be trained K times. Each time a different fold is held back as test data, while the other K-1 folds are used as training data. If less than 100 % of the data is used, only a subset of the training folds is used.

  • Select a dataset in the field Training Data.
  • Leave the field Test Data (optional) empty.
  • Select the Models for which to create a learning curve. You can select a single model or multiple models.
  • Select Cross Validation from the Testing Data options.
  • Specify the number of Folds to use for cross validation. The default value is 5. For more information on how to select this number, see this article.
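
The sketch below shows the general mechanism with scikit-learn's learning_curve and an explicit KFold splitter; the model, data and scoring are placeholders and not necessarily what this step uses internally:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, learning_curve

    # Placeholder model and data, for illustration only
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    folds = 5  # default number of folds
    sizes, train_scores, test_scores = learning_curve(
        Ridge(), X, y,
        train_sizes=np.linspace(0.2, 1.0, 5),  # Granularity = 5
        cv=KFold(n_splits=folds, shuffle=True, random_state=0),
    )
    # One score per fold at each point; averaging over the folds gives the curve
    print("mean test score per point:", test_scores.mean(axis=1))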

Further steps common to all approaches

  • The Granularity parameter defines how many points are calculated along the learning curve. The default value is 5.
  • The last point on the learning curve will always be “100 % of available data used”.
  • The other points are evenly distributed to span the range between 0 and 100.
  • The first point cannot be 0 % (no training data); it is placed at 100/Granularity %.
  • For the default value of 5 the following points are calculated: 20, 40, 60, 80, and 100 % of the available data used.
  • Granularity = 10 would calculate a point for each multiple of 10 up to 100 % (see the short sketch after these steps).
  • Select the Outputs for which the learning curve should be calculated. Multiple outputs can be selected if available. For each output a learning curve is calculated and plotted separately.

If multiple models are selected, only outputs which are common to all models will be presented. If the models don’t have a common output, the list will be empty and no output can be selected.

  • Click Apply to generate the learning curve plot.
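
The evaluation points follow directly from the Granularity parameter. A short sketch of the arithmetic (NumPy is used here only for illustration):

    import numpy as np

    # Points on the curve: 100/N, 2*100/N, ..., 100 % of the available data
    granularity = 5
    print(np.linspace(100 / granularity, 100, granularity))
    # [ 20.  40.  60.  80. 100.]

    granularity = 10
    print(np.linspace(100 / granularity, 100, granularity))
    # [ 10.  20.  30.  40.  50.  60.  70.  80.  90. 100.]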

Examples

Refer to this article for some learning curve examples. All examples in that article used the cross validation approach to calculate the learning curve.


More on this step

  • The higher the Granularity parameter (N), the higher the calculation effort needed to generate the learning curve. N model trainings are required for each model to complete the step.
  • For cross validation the effort is multiplied by the number of folds (K), so the total number of trainings is N * K (for example, Granularity = 10 with 5 folds requires 50 trainings per model).
  • The bigger or more complex the model, the longer the calculations will take. The server running the calculation might also run out of memory for complex models and a large number of points to calculate.
  • On the other hand, higher values for the Granularity parameter result in much better curve shapes.
  • Using cross validation reduces sensitivity to outliers (extreme values) or unbalanced data and therefore is more likely to produce smoother learning curves.
