The right train-test split (the ratio of data used for training versus testing a model) depends on how much data is available. Here is a good practice to start with:
- If the data set contains fewer than 100k points: an 80/20 split is a reasonable choice in most cases (80% of the data used for training and 20% for testing).
- If the data set contains more than 100k points: at this scale, 20k data points is a reasonable size for the test set, as long as it is well distributed over your design space. You can therefore choose the split ratio so that the test set contains 20k points (e.g. for 1 million points, a 98/2 split gives a test set of 20k points).
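The rule of thumb above can be sketched in a few lines of Python. This is only an illustration: the function names `choose_test_size` and `train_test_indices` and their parameters are hypothetical, not part of any product or library API. It also treats a data set of exactly 100k points like the larger case, which gives the same 20k test set as an 80/20 split. A uniform random shuffle is assumed here as a simple way to pick the test points; it does not by itself guarantee that the test set is well distributed over your design space.

```python
import random


def choose_test_size(n_points, threshold=100_000, target_test_size=20_000):
    """Return how many points to hold out for testing (hypothetical helper)."""
    if n_points <= 0:
        raise ValueError("n_points must be positive")
    if n_points < threshold:
        # Smaller data sets: hold out 20% (an 80/20 split).
        return round(0.2 * n_points)
    # Larger data sets: hold out a fixed number of points.
    return target_test_size


def train_test_indices(n_points, seed=0):
    """Shuffle the row indices and return (train_indices, test_indices)."""
    test_size = choose_test_size(n_points)
    indices = list(range(n_points))
    # Assumption: a seeded uniform shuffle, so the split is reproducible.
    random.Random(seed).shuffle(indices)
    return indices[test_size:], indices[:test_size]


if __name__ == "__main__":
    for n in (5_000, 1_000_000):
        size = choose_test_size(n)
        print(f"{n} points: {size} test points ({size / n:.0%} of the data)")
```

The returned index lists can then be used to select rows from your own data set. If the data is clustered in parts of the design space, a stratified or otherwise space-aware selection may be a better fit than a plain shuffle.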