How to choose a good train-test split?

Modified on Mon, 19 Dec, 2022 at 4:53 PM

A good train-test split (the ratio of data used for training versus testing a model) depends on the amount of data available. Here is a good practice you could start with:

If the data set contains fewer than 100k points: An 80/20 split is a reasonable choice for most cases (80% of the data used for training and 20% used for testing).
If the data set contains more than 100k points: 20k data points is a reasonable size for your test data set, as long as they are well distributed over your design space. You can therefore select a split ratio so that the test set contains 20k points (e.g. for 1 million points, a 98/2 split gives a test set of 20k points).
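
The rule of thumb above can be sketched as a small helper. This is only an illustration, not part of any particular tool: the function name `choose_split` and its parameters are hypothetical, and the 100k threshold, 20k test size, and 20% fraction are simply the values suggested in this article.

```python
def choose_split(n_points, threshold=100_000, target_test_size=20_000,
                 small_test_fraction=0.2):
    """Return (n_train, n_test) following the rule of thumb in this article.

    Hypothetical helper: the defaults mirror the article's suggested values.
    """
    if n_points < 2:
        raise ValueError("need at least 2 points to form a train and a test set")

    if n_points < threshold:
        # Smaller data sets: hold out a fixed fraction (80/20 split by default).
        n_test = round(n_points * small_test_fraction)
    else:
        # Larger data sets: hold out a fixed number of points instead.
        n_test = target_test_size

    # Keep at least one point on each side of the split.
    n_test = max(1, min(n_test, n_points - 1))
    return n_points - n_test, n_test


n_train, n_test = choose_split(1_000_000)
print(n_train, n_test)                      # 980000 20000
print(f"{n_test / 1_000_000:.0%} held out")  # 2% held out
```

If you split the data with scikit-learn, `train_test_split` accepts an integer `test_size`, so the returned `n_test` could be passed to it directly. Note that this helper only picks the size of the test set; it does not check that the test points are well distributed over your design space.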