How to choose a good train-test split?

Modified on Mon, 19 Dec, 2022 at 4:53 PM

A good train-test split (the ratio of data used for training versus testing a model) depends on the amount of data available. Here is a rule of thumb to start with:

  • If the data set contains fewer than 100k points: an 80/20 split is a reasonable choice for most cases (80% of the data used for training and 20% used for testing).
  • If the data set contains more than 100k points: at this scale, 20k data points is a reasonable size for the test set, as long as those points are well distributed over your design space. You can therefore select a split ratio so that the test set contains 20k points (e.g. for 1 million points, a 98/2 split gives a test set of 20k points).
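
The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names `split_sizes` and `train_test_split_indices` are hypothetical and not part of any product API, and the uniform random shuffle is an assumption used as one simple way to spread the test points over the data set.

```python
import random


def split_sizes(n_points, threshold=100_000, max_test_points=20_000,
                small_test_fraction=0.2):
    """Return (n_train, n_test) following the rule of thumb above.

    Hypothetical helper: up to `threshold` points use an 80/20 split,
    above it use a fixed test set of `max_test_points` points.
    """
    if n_points > threshold:
        n_test = max_test_points
    else:
        n_test = round(n_points * small_test_fraction)
    return n_points - n_test, n_test


def train_test_split_indices(n_points, seed=0):
    """Shuffle the row indices and cut them into train and test lists.

    Assumption: a uniform random shuffle is good enough to cover the
    design space; a structured data set may need a stratified split.
    """
    n_train, _ = split_sizes(n_points)
    indices = list(range(n_points))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


if __name__ == "__main__":
    for n in (5_000, 1_000_000):
        n_train, n_test = split_sizes(n)
        print(f"{n} points -> {n_train} train / {n_test} test "
              f"({100 * n_test / n:.0f}% test)")
```

If you use scikit-learn, `sklearn.model_selection.train_test_split` accepts an integer `test_size`, which is interpreted as the absolute number of test samples, so the 20k target can be passed directly.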
