Train Test Split

Modified on Thu, 14 Sep, 2023 at 1:54 PM

Description

The manipulator provides an easy way to split your dataset into two subsets on which you can train and test your models.

Application

To check model performance, and especially to compare different models, you need a dataset that none of the models has seen before. Only then can you tell whether your models generalise well. If you have already split your data outside of the platform, you can simply upload and import two separate datasets. Otherwise, you can use this function to split your dataset into a larger subset, which is used to train models, and a smaller subset, which is used to test model performance and compare models with each other.
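
As a rough illustration of what a random split does, here is a minimal sketch in plain Python. It is not the platform's implementation; the function name `train_test_split`, its `train_percentage` and `seed` arguments, and the example data are assumptions made for this sketch.

```python
import random


def train_test_split(rows, train_percentage=80, seed=None):
    """Randomly split rows into (train, test). Illustrative sketch only."""
    rng = random.Random(seed)      # seed is an assumption, used here for repeatability
    shuffled = list(rows)          # copy so the caller's data is left untouched
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train_percentage / 100)
    return shuffled[:n_train], shuffled[n_train:]


rows = list(range(100))            # stand-in for 100 rows of a dataset
train, test = train_test_split(rows, train_percentage=80, seed=0)
print(len(train), len(test))
```

Every row ends up in exactly one of the two subsets, so the test subset contains only rows the model has not seen during training.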

Even though the original purpose of this function is to split a dataset into train and test data, you can also use it to create a random subset of a dataset for any other purpose (e.g. to create smaller subsets of different sizes in order to build a Learning Curve manually).
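
The idea of drawing random subsets of different sizes can be sketched as follows. This is only an illustration in plain Python, not how the platform does it; the percentages, the fixed seed, and the stand-in data are assumptions.

```python
import random

rows = list(range(1000))           # stand-in for a dataset with 1000 rows
rng = random.Random(0)             # fixed seed, assumed here for repeatability

# Draw one random subset per size; each could serve as a training set
# for one point of a manually built learning curve.
subsets = {
    pct: rng.sample(rows, round(len(rows) * pct / 100))
    for pct in (10, 25, 50, 75)
}

for pct, subset in subsets.items():
    print(pct, len(subset))
```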

The train test split is not well suited to (time) series data, because rows are assigned to the two subsets at random and their order is not preserved. In these cases, refer to How to do a train-test split on time series?

How to use

You first need to assign the dataset you want to split in the field Data. All other parameters and fields are preset, so if you don't want to change them, you can click Apply directly.

Train Percentage	This parameter defines the size of the training dataset as a percentage of the original dataset. The default value is 80%. Refer to How to choose a good train-test split? for more information on how to choose this parameter.
GroupBy Columns	This field is optional and is only available if the selected data is tabular (it is not available for 3D data). If one or more columns are selected, they are used to define a list of unique groups (similar to the Group By function), and the train-test split is performed on this list of groups. The data from each group is then added to the corresponding set (train or test). See the Example section below. ⚠️ If the groups don’t all contain the same number of rows, the final train-test split ratio of data points can differ from the specified train-test split ratio of groups. A warning informs you of the final split.
Training data name	The name of the new dataset with the training data. The size of this dataset (i.e. number of rows) is N% of the original dataset, where N is the value of the parameter Train Percentage.
Test data name	The name of the new dataset with the test data. The size of this dataset (i.e. number of rows) is (100-N)% of the original dataset, where N is the value of the parameter Train Percentage.
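
The behaviour described for GroupBy Columns can be sketched in plain Python as follows. This is a simplified illustration, not the platform's implementation; the function name `group_train_test_split`, the column name `group`, the seed, and the group sizes are all assumptions made for this sketch.

```python
import random
from collections import defaultdict


def group_train_test_split(rows, group_key, train_percentage=80, seed=None):
    """Split whole groups, not single rows, into (train, test). Sketch only."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)   # collect the rows of each group

    group_ids = sorted(groups)               # list of unique groups
    rng = random.Random(seed)                # seed is an assumption of this sketch
    rng.shuffle(group_ids)

    # The percentage is applied to the number of groups, not the number of rows.
    n_train = round(len(group_ids) * train_percentage / 100)
    train = [r for g in group_ids[:n_train] for r in groups[g]]
    test = [r for g in group_ids[n_train:] for r in groups[g]]
    return train, test


# Hypothetical data: group A has 10 rows, groups B, C and D have 2 rows each.
rows = (
    [{"group": "A", "value": i} for i in range(10)]
    + [{"group": "B", "value": i} for i in range(2)]
    + [{"group": "C", "value": i} for i in range(2)]
    + [{"group": "D", "value": i} for i in range(2)]
)
train, test = group_train_test_split(rows, "group", train_percentage=75, seed=0)
train_share = 100 * len(train) / len(rows)
print(f"train share of rows: {train_share}%")
```

With these made-up group sizes, 75% of the groups go to the train set, but the share of rows in the train set is either 87.5% or 37.5%, depending on where the large group A lands. This is the kind of deviation the warning reports.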

Example

Here is an example in which the data was split with and without the group-by option. With the group-by option, groups are kept together: groups A, B and D are in the train set, and group C is in the test set. With the normal train-test split, the groups are broken up: rows from groups A, B and D are spread across the train and test sets.

More on this step

The train test split is done randomly. Each time you rerun the train test split, the two resulting datasets are different. Likewise, if you load the same dataset into a new train test split step, the result will be different. If you want to reuse the same train and test sets, perform the split once and then reuse the resulting datasets. You might need to export the datasets and import them into other notebooks.
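
The "split once, then reuse" workflow can be sketched outside the platform as follows. This is only an illustration in plain Python; the file names `train.csv` and `test.csv`, the helper functions, and the example data are assumptions, and the platform's own export and import functions are not shown.

```python
import csv
import random
import tempfile
from pathlib import Path


def save_rows(path, rows, fieldnames):
    """Write a list of dict rows to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def load_rows(path):
    """Read a CSV file back into a list of dict rows."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical dataset; values are stored as strings, as CSV would return them.
rows = [{"id": str(i), "value": str(i * i)} for i in range(10)]

random.shuffle(rows)                 # unseeded: a rerun gives a different split
train, test = rows[:8], rows[8:]

# Store the split once...
out_dir = Path(tempfile.mkdtemp())
save_rows(out_dir / "train.csv", train, ["id", "value"])
save_rows(out_dir / "test.csv", test, ["id", "value"])

# ...and reload the very same subsets wherever they are needed.
reloaded_train = load_rows(out_dir / "train.csv")
reloaded_test = load_rows(out_dir / "test.csv")
```

Because the subsets are read back from the stored files and not regenerated, every later analysis works with identical train and test sets.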