Remove Duplicates

Modified on Tue, 4 Apr, 2023 at 3:09 PM

Description

This function removes duplicate entries from your dataset. Duplicates can be identified by a single column or by a combination of multiple columns.

Application

Depending on your train-test split, duplicate data points can end up in both the training set and the test set. The test metric is then corrupted, because the test data is supposed to be unknown to the model. This should be avoided under all circumstances.
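The leakage described above can be checked outside the platform. The following is a minimal sketch in Python with pandas, which is an assumption made for illustration and not part of this step; the column names and the split are hypothetical.

```python
import pandas as pd

# Hypothetical dataset: rows 0 and 2 are identical in every column.
data = pd.DataFrame({
    "speed": [10, 20, 10, 30],
    "load": [1.0, 2.0, 1.0, 3.0],
    "target": [0.5, 0.7, 0.5, 0.9],
})

# A hypothetical split that puts the duplicated point on both sides.
train = data.iloc[[0, 1]]
test = data.iloc[[2, 3]]

# An inner merge on all columns returns the test rows that also occur in training.
leaked = test.merge(train.drop_duplicates(), how="inner", on=list(data.columns))
print(len(leaked))  # number of test rows the model has already seen
```

If the merge returns any rows, the test metric is computed partly on data the model was trained on.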

Duplicate data points can also mean that the model sees the same point several times during training. In some cases this is intentional, to focus the model on parts of the design space that are of higher importance (in these cases, make sure to split the data manually to avoid the situation described above!). In most cases, however, it is not intended: each data point should have the same weight in training, which means the model should see each data point the same number of times.
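To see how unevenly points are weighted before removing anything, you can count how often each combination occurs. This is a small sketch in pandas (an assumption for illustration; the columns are hypothetical), not a feature of this step.

```python
import pandas as pd

# Hypothetical dataset in which the point (10, 1) appears three times.
data = pd.DataFrame({
    "speed": [10, 20, 10, 10, 30],
    "load": [1, 2, 1, 1, 3],
})

# Number of times each distinct (speed, load) combination occurs.
counts = data.groupby(["speed", "load"]).size()

# Combinations that occur more than once carry extra weight in training.
print(counts[counts > 1])
```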

In both cases Remove Duplicates will help you to clean your dataset so that it contains no duplicates.

How to use

Create the step and assign a tabular dataset to it in the field Data.
Select all Columns which should be considered for the check.
Choose whether to overwrite the existing dataset with the result of this step or to save the result under a new name, by disabling or enabling the option Save output under new name.
Click Apply to execute the step.

The result will contain only rows that are distinct in the selected columns. If duplicate rows exist, only the first occurrence is kept. The columns that are not selected are not sampled or aggregated. See the examples below for more details.
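The "keep the first occurrence" behaviour can be mimicked with the pandas call drop_duplicates. This is only an analogy under the assumption that you work in Python; it is not a statement about how the step is implemented, and the column names are hypothetical.

```python
import pandas as pd

# Hypothetical dataset: sensor "a" appears twice with different readings.
data = pd.DataFrame({
    "sensor": ["a", "a", "b"],
    "reading": [1.0, 9.0, 5.0],
})

# Only "sensor" is checked for duplicates. The "reading" value is taken
# from the first occurrence; it is neither averaged nor sampled.
result = data.drop_duplicates(subset=["sensor"], keep="first")
print(result)
```

The reading 9.0 is dropped together with its row, even though it differs from the reading that is kept.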

Typically you would apply this step to all columns that are inputs and outputs of your models, so that the same set of data is not presented to the model twice.

Make sure to select all relevant columns that define a unique data point. Otherwise Remove Duplicates might delete data you want to keep! For example, if data points are uniquely defined by three variables (e.g. car type, speed, and sensor), you need to select all three. If you don't select speed, you will only get one point per combination of car type and sensor.
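The effect of leaving out a defining column can be sketched in pandas (an assumption for illustration; the data and column names are hypothetical):

```python
import pandas as pd

# Hypothetical measurements, uniquely defined by car_type, speed and sensor.
data = pd.DataFrame({
    "car_type": ["suv", "suv", "suv", "suv"],
    "speed": [50, 100, 50, 100],
    "sensor": ["front", "front", "rear", "rear"],
    "value": [0.1, 0.2, 0.3, 0.4],
})

# All three defining columns selected: nothing is removed.
full_key = data.drop_duplicates(subset=["car_type", "speed", "sensor"])

# Speed left out: only one row per car type and sensor survives.
partial_key = data.drop_duplicates(subset=["car_type", "sensor"])

print(len(full_key), len(partial_key))
```

With the full key all four rows remain; without speed, two valid measurements are lost.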

Examples

Consider the following example dataset:

Row	A	B	C
1	1	1	1
2	1	2	1
3	1	3	2
4	1	4	2
5	2	1	1
6	2	2	1
7	2	3	2
8	2	4	2

If you selected only column A that would result in the following dataset:

Row	A	B	C
1	1	1	1
5	2	1	1

Selecting columns A and B would not change the dataset, because no two rows have the same combination of A and B.

If instead you selected A and C the result would be this dataset:

Row	A	B	C
1	1	1	1
3	1	3	2
5	2	1	1
7	2	3	2

This can be extended to any number of columns.
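The three cases above can be reproduced with pandas. This is a sketch for readers who want to verify the results in Python; pandas is an assumption here and is not claimed to be what the step uses internally.

```python
import pandas as pd

# The example dataset from above, with "Row" as the index.
data = pd.DataFrame(
    {
        "A": [1, 1, 1, 1, 2, 2, 2, 2],
        "B": [1, 2, 3, 4, 1, 2, 3, 4],
        "C": [1, 1, 2, 2, 1, 1, 2, 2],
    },
    index=pd.Index(range(1, 9), name="Row"),
)

only_a = data.drop_duplicates(subset=["A"])        # check column A only
a_and_b = data.drop_duplicates(subset=["A", "B"])  # check A and B together
a_and_c = data.drop_duplicates(subset=["A", "C"])  # check A and C together

print(only_a.index.tolist())   # rows kept when selecting A
print(len(a_and_b))            # number of rows kept when selecting A and B
print(a_and_c.index.tolist())  # rows kept when selecting A and C
```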

More on this step

As you can see in the examples above, no logic or sampling is applied to the columns not involved in the search for duplicates; the values from the first occurrence are simply kept. If, for example, you would like to search for duplicates in A and C and get the average value of B across the duplicates, you can use Group By to achieve that.
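The difference between keeping the first value and averaging can be sketched in pandas, using the example dataset from above. The groupby call below stands in for the Group By step as an illustration only; how that step is configured on the platform is not shown here.

```python
import pandas as pd

# The example dataset from the Examples section.
data = pd.DataFrame({
    "A": [1, 1, 1, 1, 2, 2, 2, 2],
    "B": [1, 2, 3, 4, 1, 2, 3, 4],
    "C": [1, 1, 2, 2, 1, 1, 2, 2],
})

# Removing duplicates in A and C keeps B from the first occurrence.
first_kept = data.drop_duplicates(subset=["A", "C"])

# Grouping by A and C and averaging B uses every duplicate instead.
averaged = data.groupby(["A", "C"], as_index=False)["B"].mean()

print(first_kept["B"].tolist())
print(averaged["B"].tolist())
```

Both results have four rows, but the B column differs: the first keeps the values 1 and 3, the second the group means 1.5 and 3.5.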