Description
This function creates a heatmap of a dataset on missing data and therefore enables a quick visual check for missing data.
Application
Collecting real data often involves missing data. Data sheets are not always filled completely and consistently. Sensors fail during test and don’t send a signal. Especially with bigger datasets it can be tedious to search for missing data. With this function you can quickly check a dataset for missing data before blindly starting to remove missing data with Remove Missing.
How to use
- Select the dataset to analyse in the field Data.
- Select which Columns shall be analysed and plotted in the heatmap.
- Both numerical and categorical columns can be analysed.
- Select the Number of buckets for rows. This has to be an integer number between 1 and the number of rows in the dataset.
- The dataset will be split evenly into buckets (or bins) according to the entered number. Each bucket represents the same number of rows, i.e. each bin contains
#rows/#buckets
rows. - For each bucket one cell will be plotted in the heatmap.
- The dataset will be split evenly into buckets (or bins) according to the entered number. Each bucket represents the same number of rows, i.e. each bin contains
- Click Apply to run the step and generate the heatmap.
The heatmap layout is as follows:
On the y-axis the number of rows is plotted. The number of cells on the y-axis is according to the number of buckets.
On the x-axis one cell for each selected column will be plotted.
The z-value (colour) is the fraction of missing data in each cell in percent with values between 0% (no missing data, bright color) and 100% (all data in that bucket missing, dark red). Each cell is colored according to its z-value. When you hover the mouse pointer over a cell the x-, y-, and z-value are shown.
Examples
Consider this example table with missing data:
The table has just 16 entries. If you plot the missing data with 16 buckets the heatmap would look like this, giving a clear indication in which rows data is missing:
If you plot the same data with just one bucket the heatmap looks like this:
You don’t see how much data is missing and on which position but you immediately see which columns have non-zero and where you need to look closer.
Here one last example with four buckets. This would represent the typical approach for a larger dataset for which you can’t choose one bucket per row:
The y-axis is fixed in height. For very small datasets it can make sense to choose the maximum number of rows as the number of buckets so each cell on the heatmap represents a single column. That way you will get direct feedback on missing rows.
For larger datasets this becomes infeasible as the cells would become so small that they would not be recognisable. Use a smaller number of bins; up to 100 buckets is a reasonable choice.
The quickest way to check very big datasets is to use a single bucket and to check the z-value of this single bucket. If the data is complete the value has to be 0. Any value deviating from 0 indicates missing data.
Was this article helpful?
That’s Great!
Thank you for your feedback
Sorry! We couldn't be helpful
Thank you for your feedback
Feedback sent
We appreciate your effort and will try to fix the article