Description
This function filters a dataset based on a single categorical column. It is a positive filter: all selected values are kept in the dataset.
Please note that this function can ONLY be used with datasets that DO NOT contain missing data. You may need to clean your data using the 'Remove Missing' function or, alternatively, fill in any missing data before uploading it to the platform.
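As a rough illustration of the two clean-up options mentioned above (removing incomplete rows or filling in missing values), here is a minimal plain-Python sketch. It is not the platform's 'Remove Missing' implementation; representing missing entries as `None` and using the placeholder `"unknown"` are assumptions made only for this example.

```python
# Hypothetical example data; None marks a missing entry (an assumption).
rows = [
    {"test_id": 1, "material": "steel"},
    {"test_id": 2, "material": None},
    {"test_id": 3, "material": "aluminium"},
]


def has_missing(row):
    """Return True if any field of the row is missing."""
    return any(value is None for value in row.values())


# Option 1: drop every row that contains a missing entry.
complete = [row for row in rows if not has_missing(row)]

# Option 2: keep all rows and replace missing entries with a placeholder.
filled = [
    {key: ("unknown" if value is None else value) for key, value in row.items()}
    for row in rows
]

print(len(rows), len(complete), len(filled))
```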
Application
Filter Categorical can help to reduce the size of the data and focus on relevant information. For example, you can focus on one type of test or one type of material.
It can also be used to do a manual train-test split if you have only a small number of time series.
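The manual train-test split can be pictured as two filters on the same column with complementary value selections, each result saved under its own name. The sketch below uses plain Python rather than the platform itself; the column name `series` and the run identifiers are hypothetical.

```python
# Hypothetical dataset: each row belongs to one time series, identified by "series".
rows = [
    {"series": "run_A", "t": 0, "value": 1.0},
    {"series": "run_A", "t": 1, "value": 1.2},
    {"series": "run_B", "t": 0, "value": 0.9},
    {"series": "run_C", "t": 0, "value": 1.1},
    {"series": "run_C", "t": 1, "value": 1.3},
]

# Values selected in the first filter (training) and in the second filter (testing).
train_series = {"run_A", "run_B"}
test_series = {"run_C"}

# Each filter keeps only the rows whose series is among the selected values.
train = [row for row in rows if row["series"] in train_series]
test = [row for row in rows if row["series"] in test_series]

print(len(train), len(test))
```

Keeping whole series together on one side of the split avoids having parts of the same time series in both the training and the test data.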
How to use
- Create the step and assign a tabular dataset to it in the field Data.
- Select a Column on which to apply the filter.
- The field Value(s) defines the filter. The value(s) selected here will be kept in the dataset; all other values will be removed.
- If you want to select more than one value, enable the option Allow multiple values.
- You can make the step Interactive.
- If this option is left blank, the step can only be changed by re-entering edit mode and changing the options (especially the field Value(s)).
- If this option is enabled, the Value(s) field is exposed in the notebook and can be changed interactively without entering edit mode.
- You can save the resulting dataset under a new name. Enable Save output under different name to do so. If that option is unticked, the existing dataset will be overwritten.
- Click Apply to filter the data.
At the bottom of the step, an info message displays the number of rows before and after applying the filter (“Filtered from X to Y rows”, see screenshots above).
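For readers who think in code, the behaviour of the step can be approximated as follows. This is a minimal sketch in plain Python, not the platform's implementation; the function name `filter_categorical` and the example data are hypothetical.

```python
def filter_categorical(rows, column, values):
    """Keep only the rows whose `column` entry is one of `values` (positive filter)."""
    selected = set(values)
    if not selected:
        # An empty selection is treated as an error, as in the interactive step.
        raise ValueError("Select at least one value to keep.")
    kept = [row for row in rows if row[column] in selected]
    message = f"Filtered from {len(rows)} to {len(kept)} rows"
    return kept, message


# Hypothetical example data.
rows = [
    {"test_id": 1, "material": "steel"},
    {"test_id": 2, "material": "aluminium"},
    {"test_id": 3, "material": "steel"},
    {"test_id": 4, "material": "titanium"},
]

# Selecting two values corresponds to enabling "Allow multiple values".
kept, message = filter_categorical(rows, "material", ["steel", "titanium"])
print(message)  # Filtered from 4 to 3 rows
```

The function returns a new list instead of modifying `rows`, which corresponds to saving the output under a different name.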
Examples
- You can also apply this step to numerical columns. For this step, the column will then be interpreted as categorical data (the datatype of the column is not changed!). Identical numbers will be counted as identical categories.
- For float data this will usually result in a very large number of categories, as values have to be exactly identical. It is usually better to use Filter Numerical for float data.
- For integer data the Filter Categorical step can make sense if the unique value count is not too high (e.g. test number).
- If your filter step is Interactive and you Allow multiple values, it is possible to deselect all entries in the Value(s) field. This will result in an error in that step. Enter edit mode to select values and rerun the filter.
- If you want to change the filter values completely, first select the new values before removing the old values.
- If a Filter Categorical step is exposed on a Dashboard, dashboard users can only change the filter if it is made Interactive.
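The note on numerical columns can be illustrated by counting the categories that exact matching produces. The sketch below is plain Python with hypothetical column names and values; it only shows why float columns tend to yield many categories, and it says nothing about how the platform compares values internally.

```python
# Hypothetical integer column: repeated test numbers form a few clean categories.
test_number = [1, 1, 2, 2, 3, 3]

# Hypothetical float column: values that look equal can differ in the last digits.
force = [0.1 + 0.2, 0.3, 1.0, 1.0000001, 2.5, 2.5]

# Each distinct value becomes its own category.
print(len(set(test_number)))  # 3
print(len(set(force)))        # 5
```

Here `0.1 + 0.2` and `0.3` end up in different categories because their floating-point representations are not exactly identical, which is why a range-based filter is usually the better choice for float data.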