Description
With this function you can group a dataset based on one or multiple columns and perform calculations on each group.
Application
Some examples of when you would typically use Group By:
- You have time series data with a sampling rate of 1 Hz spanning several days and want to know the average value for each minute, hour, or day.
- You have data from several types of tests and want to know the maximum value occurring for each test type.
- Or, on the same dataset you want to know how many data points you have for each test type.
- You have a dataset with different materials and want to know the average value of a material property for each material in your dataset (e.g. the composites dataset from Challenge 1 - Composite Materials).
In general, whenever you can group your data by one or more variables (typically categorical variables) and want to get statistics (mean, min, max, count, …) or other characteristics for each group, this can be done with Group By.
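For readers familiar with pandas, here is a rough sketch of the same idea outside the tool. The dataframe and the column names `test_type` and `value` are made up purely for illustration:

```python
import pandas as pd

# Hypothetical example data: a categorical test type with measured values
df = pd.DataFrame({
    'test_type': ['A', 'A', 'B', 'B', 'B'],
    'value':     [1.0, 3.0, 2.0, 5.0, 4.0],
})

# Group by the categorical column and compute one statistic per group,
# e.g. the maximum value occurring for each test type
result = df.groupby('test_type')['value'].max().reset_index()
print(result)
#   test_type  value
# 0         A    3.0
# 1         B    5.0
```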
How to use
- Choose the dataset to work on in the field Data.
- Choose the column(s) by which to group the dataset in the field GroupBy columns.
- You need to choose at least one column. But a group can also be defined by multiple columns if required.
- The columns can have any data type. But if you choose a column with data type `float`, you will hardly get any groups, as the numbers have to be exactly the same to form a group. Typically, you would use `categorical` columns for grouping. `Integer` and `Boolean` columns would work as well.
- Choose an Operation which should be applied to each group. See the section below for details.
- The operation is not automatically applied to all available columns of the dataset. Instead, you have to choose the columns to which the operation is applied for each group. Choose those in the field Columns.
- A column can't be assigned to both GroupBy columns and Columns at the same time; this will result in an error when running the step. You have to assign a column to either GroupBy columns or Columns.
- Typically, you should save the result of Group By in a new dataset by enabling Save output under different name because the resulting dataset is completely restructured (see below).
- Click Apply to execute the Group By operation.
Structure of output
The resulting dataset has the following structure:
- The number of rows is equivalent to the number of groups found in the dataset. There will be one row for each group. (If custom code is used this could differ, as the custom code can return any number of rows.)
- The Group By columns will be kept so you can identify each group.
- All columns selected in the Columns field are kept in the dataset. All other columns will not be part of the new dataset.
The output would look like this:
| GroupBy Col1 | GroupBy Col2 | … | Column1 | Column2 | … | ColumnK |
|---|---|---|---|---|---|---|
| Group1 ID1 | Group1 ID2 | … | value | value | … | value |
| Group2 ID1 | Group2 ID2 | … | value | value | … | value |
| … | … | … | … | … | … | … |
| GroupN ID1 | GroupN ID2 | … | value | value | … | value |
Available operations
The following operations are available in Group By:

| Operation | Description |
|---|---|
| Min | Returns the minimum value within a group for each selected column. This also works for categorical data and will return the first string after ordering all entries found in a group alphanumerically. |
| Max | Returns the maximum value within a group for each selected column. This also works for categorical data and will return the last string after ordering all entries found in a group alphanumerically. |
| Mean | Returns the mean value within a group for each selected column. Throws an error for categorical columns due to the wrong data type. |
| Sum | Returns the sum of all rows within a group for each selected column. For categorical columns this yields a single long string for each group with all entries of the group concatenated. |
| Count | Counts the number of non-empty rows within a group for each selected column. If the dataset is complete (i.e. no missing values) this will return the same value for any selected column. Works on both numerical and categorical data. |
| Unique Count | Returns the number of distinct values within a group for each selected column. Works on both numerical and categorical data. |
| Custom Code | The custom code is applied to each group. See below for more details. |
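As a rough orientation, these operations map onto the following pandas aggregations. This is only a sketch with a made-up dataframe; the tool's internal implementation may differ:

```python
import pandas as pd

df = pd.DataFrame({'g': ['A', 'A', 'B'], 'v': [1.0, 2.0, 3.0]})
grouped = df.groupby('g')['v']

grouped.min()      # Min
grouped.max()      # Max
grouped.mean()     # Mean (numerical columns only)
grouped.sum()      # Sum
grouped.count()    # Count (non-empty rows)
grouped.nunique()  # Unique Count (distinct values)
```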
Examples
We use the dataset from Challenge 1 - Composite Materials for these examples.
Question 1: What is the average fiber length for each Carbon fiber in the dataset?
To answer that question we can use Group By. See the required settings in the screenshot below.
And this will be the resulting dataset:
You have only the two columns involved in the dataset (`Carbon fiber` as GroupBy column and `Fiber length` as the only column on which the operation was performed). To work further with this information you could do the following (a pandas sketch of both steps follows after this list):
- Rename the column `Fiber length` to `Average fiber length` so that it is clear what the column contains.
- Join this new dataset with the original dataset. That way you now additionally have the average fiber length for each carbon fiber in your original dataset, and you can compare the actual fiber length of each data point with the average fiber length of its group.
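Purely for illustration, the two follow-up steps could look like this in pandas. The stand-in dataframe and its values are invented; only the column names are taken from the example:

```python
import pandas as pd

# Hypothetical stand-in for the composites dataset
df = pd.DataFrame({
    'Carbon fiber':  ['Fiber A', 'Fiber A', 'Fiber B'],
    'Fiber length':  [5.0, 7.0, 6.0],
})

# Group By: average fiber length per carbon fiber
avg = df.groupby('Carbon fiber', as_index=False)['Fiber length'].mean()

# Step 1: rename the result column so its content is clear
avg = avg.rename(columns={'Fiber length': 'Average fiber length'})

# Step 2: join the grouped result back onto the original dataset
df = df.merge(avg, on='Carbon fiber', how='left')
print(df)
```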
Question 2: What is the maximum ultimate strain that was achieved in a composite for each glass fiber?
To answer that question we can use Group By as well. See the required settings in the screenshot below.
And this will be the resulting dataset:
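The pandas sketch for this question is analogous, again with invented stand-in data and column names taken only from the example:

```python
import pandas as pd

# Hypothetical stand-in data with a glass fiber column
df = pd.DataFrame({
    'Glass fiber':     ['Glass A', 'Glass A', 'Glass B'],
    'Ultimate strain': [1.8, 2.2, 1.9],
})

# Maximum ultimate strain achieved per glass fiber
max_strain = df.groupby('Glass fiber', as_index=False)['Ultimate strain'].max()
print(max_strain)
```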
More on the custom code operation
To know which packages can be imported, refer to the article on Custom Code. In other respects, however, the custom code environment here differs from the custom code interface of Quick Columns and from the Custom code function itself.
- The custom code needs to be encapsulated within a subroutine called `apply_to_groups`. A template of this subroutine is already pre-filled in the code field when you select the Custom code operation. Just fill the body of the subroutine with your code and change the return statement according to your requirements (a minimal sketch follows after this list).
- The method `apply_to_groups` is called for each group found in the dataset. That is, this code is applied to each group. The input is a dataframe called `group_df` which forms a single group (i.e. all rows which form one group).
- The dataframe `group_df` contains all columns selected in the fields GroupBy columns and Columns.
- The method needs to return a dataframe. That is, apply any required operation on the dataframe `group_df` and return it (or a copy). The Group By function will then construct the final output by appending the dataframes returned for each group.
- You can create other subroutines and call those from `apply_to_groups`.
- If you want to use any package, it needs to be imported within the subroutine!
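The exact pre-filled template in the tool may look different, but a minimal skeleton of such a subroutine could be sketched like this (the row-count logic is just a placeholder):

```python
def apply_to_groups(group_df):
    # Imports must happen inside the subroutine
    import pandas as pd

    # Placeholder logic: return one row per group with the number of rows it contains
    return pd.DataFrame({'Row count': [len(group_df)]})
```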
Example: Calculate the Area Under Curve (AUC)
This could be done on (time) series data. The formula for AUC, computed here with the trapezoidal rule, is:

$$\mathrm{AUC} \approx \sum_{i=1}^{n-1} \frac{y_i + y_{i+1}}{2}\,(x_{i+1} - x_i)$$
The `sklearn` package offers this function out of the box, which we can use here. In the return statement we construct a dataframe consisting of just a single column and a single row. The returned dataframe will have the column `AUC` besides the ID column which identifies each group (this part is taken care of by Group By itself).
```python
def apply_to_groups(group_df):
    # Imports must happen inside the subroutine
    import pandas as pd
    from sklearn import metrics

    # Compute the area under the curve y(x) for this group and return it
    # as a one-row dataframe with a single column 'AUC'.
    # Note: metrics.auc expects the x values to be monotonic.
    return pd.DataFrame({'AUC': [metrics.auc(group_df['x'], group_df['y'])]})
```
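If you want to try the subroutine outside the tool, you could call it on a small test dataframe yourself. The columns `x` and `y` match the example above; the data is made up:

```python
import pandas as pd

# One group of a hypothetical series dataset with columns 'x' and 'y'
group_df = pd.DataFrame({'x': [0.0, 1.0, 2.0], 'y': [0.0, 1.0, 0.0]})

print(apply_to_groups(group_df))
#    AUC
# 0  1.0
```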