Cross validation


Cross-validation is a resampling technique that helps to check whether the sampling and size of a dataset are sufficient for modelling. There are different methods of cross validation. In this article, we cover only the options available on the platform.

K-fold cross validation

A K-fold cross validation means that the dataset is split into K random subsets (folds) in order to evaluate the model. The model is trained K times, and in each run a different fold is held back so the model does not see it during training. The model is then tested on that fold, and a cross validation score (e.g. Root Mean Squared Error) is calculated for each of those K models.

For example, for a cross validation with 5 folds, the training data is split into 5 random subsets of about the same size. Five different models are trained, and for each run a different subset of the data is used for validation, as indicated by the figure below. For each of the five cross validation runs a cross validation score is calculated, which can be checked on the platform via the Model Evaluation function.
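As an illustration outside the platform, the same 5-fold procedure can be sketched with scikit-learn. The synthetic data, the RandomForestRegressor and the random seeds below are illustrative assumptions, not platform defaults.

    # Minimal sketch of 5-fold cross validation with scikit-learn.
    # The synthetic data and the regressor are illustrative choices only.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, cross_val_score

    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

    cv = KFold(n_splits=5, shuffle=True, random_state=0)      # 5 random folds
    scores = cross_val_score(
        RandomForestRegressor(random_state=0), X, y,
        cv=cv,
        scoring="neg_root_mean_squared_error",                # scikit-learn reports negated RMSE
    )
    print("RMSE per fold:", np.round(-scores, 2))             # one score per trained model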

Group K-fold cross validation

The K-fold technique splits the dataset randomly into K folds. Any pattern or structure in the data is not considered for the split. This is indicated by the left sketch below for 4 folds. If the dataset consists of groups of data, a random split would end up with each group represented in each fold (the left sketch is over-simplified and just highlights the point that group boundaries are not respected in the split).

If you want to keep groups of data together during cross validation, you can apply Group K-fold cross validation. This is indicated by the right sketch below. Group K-fold will respect data groups and not tear them apart for the split. The following statements summarise Group K-fold (a short code sketch follows the list):

  • One group is always only part of one fold. A group is never split up and put into several folds.
  • A fold can contain more than one group (depending on dataset structure, i.e. number and sizes of groups).
  • The number of groups should at least be equal to the number of folds (following from the first bullet point).
  • As a result, not all folds will necessarily have exactly the same size.
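The behaviour described in these bullet points can be illustrated with scikit-learn's GroupKFold. The tiny dataset and group labels below are made up purely for the example.

    # Sketch of Group K-fold: each group ends up in exactly one test fold.
    import numpy as np
    from sklearn.model_selection import GroupKFold

    X = np.arange(12).reshape(-1, 1)                          # 12 samples
    groups = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 3])   # 4 groups of unequal size

    gkf = GroupKFold(n_splits=4)
    for i, (train_idx, test_idx) in enumerate(gkf.split(X, groups=groups)):
        # Each test fold contains one complete group here, and because the
        # groups have different sizes, the folds are not all the same size.
        print(f"fold {i}: test groups = {sorted(set(groups[test_idx]))}, size = {len(test_idx)}")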

When to apply Group K-fold

Using Group K-fold can make sense in the following circumstances:

  • If you have (time) series data you can keep each series together with Group K-fold.
  • You want to use your model to make predictions for completely new groups the model has never seen. With a random split, the model would see every group in each cross validation loop, so the model assessment would be overly optimistic regarding its ability to predict results for completely new data. In that case it is much better to use Group K-fold: in each cross validation loop one or more complete groups are held back and not seen during training (see the sketch after this list).
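The second point can be sketched with synthetic grouped data. The group-dependent offsets, the cluster centres and the model choice below are assumptions made only to show how a random K-fold can look overly optimistic compared with Group K-fold.

    # Sketch: random K-fold vs. Group K-fold on data with a strong per-group effect.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GroupKFold, KFold, cross_val_score

    rng = np.random.default_rng(0)
    n_groups, per_group = 10, 20
    groups = np.repeat(np.arange(n_groups), per_group)

    # Each group sits in its own region of the design space and has its own offset.
    centers = rng.normal(scale=5.0, size=(n_groups, 3))
    offsets = rng.normal(scale=5.0, size=n_groups)
    X = centers[groups] + rng.normal(scale=0.3, size=(n_groups * per_group, 3))
    y = offsets[groups] + rng.normal(scale=0.1, size=n_groups * per_group)

    model = RandomForestRegressor(random_state=0)
    for name, cv in [("random K-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
                     ("Group K-fold ", GroupKFold(n_splits=5))]:
        rmse = -cross_val_score(model, X, y, groups=groups, cv=cv,
                                scoring="neg_root_mean_squared_error")
        # The random split leaks group information into training and looks
        # optimistic; Group K-fold tests on completely unseen groups.
        print(f"{name}: mean RMSE = {rmse.mean():.2f}")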

How to use and interpret cross validation

If your data is evenly distributed over the entire design space and you have enough data samples, the train-test split should have no effect on model performance. The randomly sampled training set should cover the entire design space as well, and the test set should have the same characteristics.

If your design space is poorly sampled or you simply do not have enough data, the train-test split might have an effect on model performance. Depending on which data goes into the training set, and is therefore seen by the model, model performance might be quite different.

Cross validation can help to quantify this effect. As the dataset is split into several folds, in each cross validation loop a different part of the data is held back, and the model is trained and tested on different data each time. Ideally, the cross validation score (e.g. root mean squared error) is the same for each loop. That would mean that the model performance is independent of how your dataset is split. This typically indicates one of two options:

  • Your data sampling is good and model performance is good as well no matter how you split the data (i.e., similar and relatively low scores).
  • Your data sampling is very poor and model performance is also very bad no matter how you split the data (i.e., similar but quite high scores).

In reality, even a very good dataset will show some differences between the scores, but these should be rather small. The bigger the differences, the more sensitive the model is to which data is included in training. That indicates that you do not have enough data, or that the sampling of the data is poor, at least in parts of your design space.
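As a rough way to gauge this sensitivity outside the platform, one can look at the spread of the per-fold scores. The data and model below are again only illustrative.

    # Sketch: compare the per-fold RMSE values and their spread.
    # A small spread suggests the model is insensitive to the split;
    # a large spread points to too little data or poor sampling.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import KFold, cross_val_score

    X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=1)

    rmse = -cross_val_score(
        RandomForestRegressor(random_state=0), X, y,
        cv=KFold(n_splits=5, shuffle=True, random_state=1),
        scoring="neg_root_mean_squared_error",
    )
    print("RMSE per fold:", np.round(rmse, 2))
    print("mean / std   :", round(rmse.mean(), 2), "/", round(rmse.std(), 2))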
