Learning Curve

Modified on Tue, 14 Mar, 2023 at 1:58 PM

Description

This function is applied to trained models and creates a learning curve. With learning curves you can analyse whether a model is over- or underfitting and whether more data would help to improve model accuracy.
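
The step computes the learning curve internally from the selected models and data. Purely as an illustration of the underlying idea, the sketch below uses scikit-learn's learning_curve with a placeholder Ridge model and synthetic data (these names are assumptions and not part of this step):

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import learning_curve

    # Placeholder model and synthetic data, for illustration only
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    train_sizes, train_scores, test_scores = learning_curve(
        Ridge(), X, y,
        train_sizes=np.linspace(0.2, 1.0, 5),  # 20 %, 40 %, ..., 100 % of the data
        cv=5,                                  # 5-fold cross validation
        scoring="r2",
    )

    # A persistent gap between the two curves hints at overfitting; two low,
    # converged curves hint at underfitting; a test curve still rising at
    # 100 % suggests that more data could help.
    print("train R2:", train_scores.mean(axis=1))
    print("test  R2:", test_scores.mean(axis=1))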


Application

A learning curve can help to answer questions such as:

  • Is the model over- or underfitting? Or is the model “just right”?
  • Do I have enough data or should I bring more data to improve model accuracy?
  • Could I achieve the same model accuracy with less data?

Read the linked article in the FAQ section to understand how to read a learning curve with regard to the questions above.


How to use

You need at least one trained model to create a learning curve. The step also needs a train and a test dataset to generate the learning curve. There are three different ways to define the test dataset.

Provide separate train and test data sets

You can reuse the train and test datasets from a previous train test split in this step. The advantage of this approach is that you have full control over the train and test datasets.

  • Select the dataset used for model training in the field Training Data.
  • Select the dataset that was held back for model testing in the field Test Data (optional).
  • Select the Models for which to create a learning curve. You can select a single model or multiple models.
  • Select Use existing from the Testing Data options. This option only appears if a dataset was selected in the Test Data field.
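
Conceptually, this option re-trains the model on growing fractions of the training dataset and always scores it against the same held-back test dataset. The sketch below illustrates that idea with placeholder scikit-learn objects (model, data and fractions are assumptions, not the step's actual internals):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the Training Data and Test Data fields
    X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    for fraction in (0.2, 0.4, 0.6, 0.8, 1.0):         # Granularity = 5
        n = int(fraction * len(X_train))
        model = Ridge().fit(X_train[:n], y_train[:n])  # train on a subset
        score = model.score(X_test, y_test)            # always the same test set
        print(f"{fraction:.0%} of training data -> test R2 = {score:.3f}")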

Generate test dataset via percentage

A single dataset is used and split into two parts: one part is used for model training, the other to test the model. The fraction used for model testing is specified by the user as a percentage of the initial dataset.

For this approach you could use the initial full dataset (before any train test split) to generate the learning curve.

  • Select a dataset in the field Training Data.
  • Leave the field Test Data (optional) empty.
  • Select the Models for which to create a learning curve. You can select a single model or multiple models.
  • Select Percentage from the Testing Data options.
  • Specify the percentage of the dataset to use as testing data in the field Test percentage. The value is given in % and must be greater than 0 and less than 100. The default value is 20 %.
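
Assuming the split behaves like a standard random train/test split (scikit-learn's train_test_split is used here purely for illustration), a Test percentage of 20 corresponds to a test_size of 0.2:

    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    # Placeholder data standing in for the single dataset selected as Training Data
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    test_percentage = 20  # default; must be greater than 0 and less than 100
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_percentage / 100, random_state=0
    )
    print(len(X_train), "training rows,", len(X_test), "test rows")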

Generate test dataset via cross validation 

A single dataset is used, and cross validation takes care of both training and testing the model. You have to select how many folds will be used for cross validation.

The dataset will be split into K folds, and for each point on the learning curve the model will be trained K times. Each time a different fold is held back as test data, while the other K-1 folds are used as training data. If less than 100 % of the data is used, only a subset of the training folds is used.

  • Select a dataset in the field Training Data.
  • Leave the field Test Data (optional) empty.
  • Select the Models for which to create a learning curve. You can select a single model or multiple models.
  • Select Cross Validation from the Testing Data options.
  • Specify the number of Folds to use for cross validation. The default value is 5. For more information on how to select this number, see this article.
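
The sketch below shows the general mechanism with scikit-learn's learning_curve and an explicit KFold splitter; the model, data and scoring are placeholders and not necessarily what this step uses internally:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import KFold, learning_curve

    # Placeholder model and data, for illustration only
    X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

    folds = 5  # default number of folds
    sizes, train_scores, test_scores = learning_curve(
        Ridge(), X, y,
        train_sizes=np.linspace(0.2, 1.0, 5),  # Granularity = 5
        cv=KFold(n_splits=folds, shuffle=True, random_state=0),
    )
    # One score per fold at each point; averaging over the folds gives the curve
    print("mean test score per point:", test_scores.mean(axis=1))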

Further steps common to all approaches

  • The Granularity parameter defines how many points are calculated along the learning curve. The default value is 5.
  • The last point on the learning curve will always be “100 % of available data used”.
  • The other points are evenly distributed to span the range between 0 and 100.
  • The first point cannot be 0 % (no training data); it is placed at 100/Granularity %.
  • For the default value of 5 the following points are calculated: 20, 40, 60, 80, and 100 % of the available data used.
  • Granularity = 10 would calculate a point for each multiple of 10 up to 100 % (see the short sketch after these steps).
  • Select the Outputs for which the learning curve should be calculated. Multiple outputs can be selected if available. For each output a learning curve is calculated and plotted separately.

If multiple models are selected, only outputs which are common to all models will be presented. If the models don’t have a common output, the list will be empty and no output can be selected.

  • Click Apply to generate the learning curve plot.
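
The evaluation points follow directly from the Granularity parameter. A short sketch of the arithmetic (NumPy is used here only for illustration):

    import numpy as np

    # Points on the curve: 100/N, 2*100/N, ..., 100 % of the available data
    granularity = 5
    print(np.linspace(100 / granularity, 100, granularity))
    # [ 20.  40.  60.  80. 100.]

    granularity = 10
    print(np.linspace(100 / granularity, 100, granularity))
    # [ 10.  20.  30.  40.  50.  60.  70.  80.  90. 100.]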

Examples

Refer to this article for some learning curve examples. All examples in that article used the cross validation approach to calculate the learning curve.


More on this step

  • The higher the Granularity parameter (N), the higher the calculation effort needed to generate the learning curve. N model trainings are required for each model to complete the step.
  • For cross validation the effort is multiplied by the number of folds (K), so the total number of trainings is N * K (for example, Granularity = 10 with 5 folds requires 50 trainings per model).
  • The bigger or more complex the model, the longer the calculations will take. The server running the calculation might also run out of memory for complex models and a large number of points to calculate.
  • On the other hand, higher values for the Granularity parameter result in much better curve shapes.
  • Using cross validation reduces sensitivity to outliers (extreme values) or unbalanced data and therefore is more likely to produce smoother learning curves.
