Mastering Monolith: A Step-by-Step Workflow for Success

General recommendations

Split up your work into stages and notebooks

Most workflows can be broken down into distinct stages, and it is usually best practice to split these stages into separate notebooks. At the end of each notebook, export the datasets and models relevant for the next stage.

The stages are:

  • Data exploration
  • Data pre-processing (transformation)
  • Modelling and Model Evaluation
  • Apply Models

The sections of this article follow these stages.

Use subsets of data for prototyping

The larger a dataset becomes, the more memory and computing power are required to load and process it, and the slower the platform will feel. If you are still building up the final pipeline for your project and have a large dataset (>10,000 data points), it is recommended to use a smaller subset of your data to build a prototype of the required steps and intended models.

The goal of building the prototype is to test that all steps are properly configured and produce the intended results, not to achieve good model performance. Working on a subset significantly reduces the time needed to build up the entire workflow.

Once the workflow is fully established, you can update the initial dataset and load the complete data. The notebook(s) will update and each step will rerun on the full dataset. This happens automatically on the backend, and you can return to the notebook once the update is done.
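
In Monolith this is done by swapping the dataset in the notebook itself, but the underlying idea can be sketched in a few lines of pandas (the file names and the subset size of 1,000 rows are illustrative assumptions):

    import pandas as pd

    # Load the full dataset (the file name is a placeholder).
    df = pd.read_csv("full_dataset.csv")

    # Draw a reproducible random subset to prototype the workflow on.
    prototype = df.sample(n=min(1000, len(df)), random_state=42)
    prototype.to_csv("prototype_dataset.csv", index=False)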

Data exploration

The first stage is typically data exploration, to get a good understanding of the data. The usual steps are:

  • Confirm that the data is correct.
  • Explore trends and confirm expectations.
  • Check the distribution of the columns you want to use as inputs and outputs, with the following questions in mind. The answers to these questions can limit the model performance you can achieve.
    • How is the data sampled?
    • Is the design space fully covered?
    • Is the data evenly sampled?
    • Is the data accumulated at certain values or at the edges?
    • Are there outliers that need to be removed in the next stage?
  • Check for missing data in the dataset. Missing data should also be removed in the next stage to avoid problems in the modelling phase.
  • Understand relationships in the data.
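
The checks listed above can be illustrated with a short pandas sketch; the column name used here is a hypothetical example, not from the platform:

    import pandas as pd

    df = pd.read_csv("prototype_dataset.csv")

    # Ranges, means and spread of every numeric column (sanity check).
    print(df.describe())

    # Missing values per column, to be removed in the pre-processing stage.
    print(df.isna().sum())

    # Distribution of one input column: look for uneven sampling,
    # accumulation at the edges, and obvious outliers.
    print(df["inlet_velocity"].value_counts(bins=10).sort_index())

    # Pairwise correlations as a first look at relationships in the data.
    print(df.corr(numeric_only=True))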

Data pre-processing and cleaning

The next stage is pre-processing of the data. This can have one or more of the following objectives, depending on the state of the original dataset:

  • Data cleaning
  • Bring data into required shape
  • Split up or combine data (depending on context)

If the data is already in a suitable format, or if you did all the pre-processing outside of Monolith before uploading and importing the data, you can skip this stage.

Data cleaning

Cleaning the data usually involves one or multiple of the following steps:

Remove outliers

Whether this is required, and for which columns, should be clear from the preceding data exploration stage.

You can use one of two approaches:

Remove duplicates

Duplicate entries should be removed so a model doesn’t get biased by them. Use Remove Duplicates to do so.

Remove missing

Incomplete or completely missing data points can’t be used for model training. Remove those to avoid problems in later stages. Use Remove Missing to do so.
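
For reference, the three cleaning steps correspond roughly to the following pandas operations (the column name and the 1.5 * IQR outlier rule are assumptions for illustration):

    import pandas as pd

    df = pd.read_csv("prototype_dataset.csv")

    # Remove outliers: keep rows within 1.5 * IQR of the chosen column.
    q1, q3 = df["pressure_drop"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df = df[df["pressure_drop"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

    # Remove duplicates: identical rows would bias the model.
    df = df.drop_duplicates()

    # Remove missing: incomplete rows cannot be used for training.
    df = df.dropna()
    df.to_csv("clean_dataset.csv", index=False)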

Bring data into required shape

Add additional columns

Sometimes not all required parameters are already in the data, but they can be derived from the existing data. You can use Quick Columns to calculate additional parameters.

Convert dimensions

In principle, it is not important for ML models that the dimensions of all involved parameters match, as the conversion factors will be included in the model. But if for some reason you need or want to convert the dimension of a column, you can also use Quick Columns.
Change dataset structure

Sometimes it might be required to change from long to wide or from wide to long format.

It can also make sense to remove unnecessary columns to make the dataset smaller, which can improve the experience of working with it.
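
The same reshaping operations, sketched in pandas with hypothetical column names:

    import pandas as pd

    df = pd.read_csv("clean_dataset.csv")

    # Add a derived column (the equivalent of a Quick Column).
    df["aspect_ratio"] = df["length"] / df["width"]

    # Convert a dimension, e.g. millimetres to metres.
    df["length_m"] = df["length"] / 1000.0

    # Drop columns that are not needed to keep the dataset small.
    df = df.drop(columns=["operator_comment"])

    # Reshape from wide to long format (pivot goes the other way).
    long_df = df.melt(id_vars=["sample_id"], var_name="parameter", value_name="value")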

Grouping data

Your data may consist of groups of data points. Sometimes it is necessary to calculate certain attributes or parameters of these groups (e.g. group average, minimum value, maximum value, …).

This can be achieved with Group By.
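
In code terms, a Group By aggregation looks like this (the group and value columns are hypothetical):

    import pandas as pd

    df = pd.read_csv("clean_dataset.csv")

    # Aggregate each group to its mean, minimum and maximum value.
    group_stats = df.groupby("test_group")["measured_value"].agg(["mean", "min", "max"])
    print(group_stats)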

Split data

Split data into separate datasets according to physical regimes or geometrical families.

Split by physical regimes

Examples of physical regimes: different flow regimes (laminar, turbulent), changes in physical behaviour, or changes in the underlying equations.

In those cases it is often not possible to find one model which can capture the physics in the entire range. Instead it is better to train multiple models on different regimes separately.

Split by geometrical families

If geometries are very different, sort data according to geometrical families.

Check carefully if any of those conditions apply to your case.
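
A regime split is essentially a filter on one or more columns. A minimal sketch, assuming a Reynolds-number column and a transition value of 4000 chosen purely for illustration:

    import pandas as pd

    df = pd.read_csv("clean_dataset.csv")

    # Split into two datasets by physical regime and model them separately.
    laminar = df[df["reynolds_number"] < 4000]
    turbulent = df[df["reynolds_number"] >= 4000]

    laminar.to_csv("laminar_dataset.csv", index=False)
    turbulent.to_csv("turbulent_dataset.csv", index=False)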

Join data

Join data if your original data comes from different sources and is therefore distributed over multiple datasets. To combine them into a single dataset use:

  • Join: combines data horizontally (common rows but different columns).
  • Append: combines data vertically (common columns but different rows).
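
The pandas equivalents, assuming a shared key column called sample_id and illustrative file names:

    import pandas as pd

    geometry = pd.read_csv("geometry_data.csv")
    results = pd.read_csv("test_results.csv")
    extra_results = pd.read_csv("extra_test_results.csv")

    # Join: combine horizontally on a shared key (same rows, new columns).
    combined = geometry.merge(results, on="sample_id", how="inner")

    # Append: combine vertically (same columns, additional rows).
    all_results = pd.concat([results, extra_results], ignore_index=True)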

Modelling and model evaluation

The next stage is to train and test different model types. Once you have found the model type most suitable for your problem, experiment with its hyperparameters to find the best model setup.

  • First, split the data into a train and a test dataset.
  • Train different types of models. The Guided ML/Bulk modelling feature supports you in choosing suitable models and training them in batch.
  • Use the Model evaluation tools to test model performance.
  • Guided ML trains each model type with default parameter settings. Take the two most promising model types and use hyperparameter optimisation (HPO) to check whether these models can be improved with adjusted hyperparameters.
    • Usually run HPO with randomised search and about 10 samples. To improve the statistics, you can run it two or three times and check whether a similar model topology is suggested each time.
    • Then switch to manual parameter settings and search close to the settings suggested by HPO to check if further improvement is possible.
  • Use the model evaluation tools after each model training to check how the model performance changes.
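
The same sequence, sketched with scikit-learn as a stand-in for the platform's Guided ML and HPO features; the model type, input/output columns and search space are illustrative assumptions:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import r2_score
    from sklearn.model_selection import RandomizedSearchCV, train_test_split

    df = pd.read_csv("clean_dataset.csv")
    X = df[["inlet_velocity", "length", "width"]]
    y = df["pressure_drop"]

    # 1. Split into train and test data.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # 2./3. Train a candidate model type and evaluate it on the test set.
    baseline = RandomForestRegressor(random_state=0).fit(X_train, y_train)
    print("baseline R2:", r2_score(y_test, baseline.predict(X_test)))

    # 4. Randomised hyperparameter search with about 10 samples.
    search = RandomizedSearchCV(
        RandomForestRegressor(random_state=0),
        param_distributions={"n_estimators": [50, 100, 200, 400],
                             "max_depth": [3, 5, 10, None]},
        n_iter=10,
        random_state=0,
    )
    search.fit(X_train, y_train)
    print("tuned R2:", r2_score(y_test, search.best_estimator_.predict(X_test)))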

Continue or iterate through the previous stages

At the end of this stage the result might be:

  • At least one of the models meets your requirements. In this case, move on to the next stage with that model.
  • No model meets the requirements.

The latter case is not uncommon, as machine learning is usually a very iterative process. There are several options for how to proceed:

  • Revisit your requirements. Are they realistic? Could they be lowered?
  • Use the Learning Curve to check if more data would help to improve model performance.
  • Is there a general issue with the data? Here are a few typical reasons:
    • Noise: the level of noise in the data can impose a limit on the achievable model accuracy. Go back to the exploration stage, check the level of noise in your data and compare it with your prediction errors. If the noise and the prediction error are on a similar level, this might explain why no further model improvement is possible. Check whether data quality can be improved with regard to the noise level.
    • Missing inputs: it might be that not all relevant inputs are selected in the model, or even present in the dataset. If an input is available in the dataset but not selected in the model, this is easy to fix; if it is not in the data, you need to find a way to access that information.
    • Data distribution: the data might be distributed in a way that makes the model good in some regions only. Go back to the exploration stage and check the distribution of the inputs used for the model. If the data sampling is not suitable for your accuracy requirements, you need to resample the data.
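
A learning-curve check can be reproduced with scikit-learn: if the validation score is still rising at the largest training size, more data is likely to help (columns and model are the same assumptions as in the modelling sketch):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import learning_curve

    df = pd.read_csv("clean_dataset.csv")
    X = df[["inlet_velocity", "length", "width"]]
    y = df["pressure_drop"]

    # Score the model on growing fractions of the training data.
    sizes, train_scores, val_scores = learning_curve(
        RandomForestRegressor(random_state=0), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
    )
    for n, score in zip(sizes, val_scores.mean(axis=1)):
        print(f"{n} training points -> mean validation R2 {score:.3f}")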

Apply models

Once you have found a model which meets your requirements, it is recommended to build a final notebook reduced to the minimum steps required to reproduce the successful model. Depending on your intended use of the model, there are several options for applying it.

Explain

Use the model to improve your understanding of the problem by using the explanation tools available on the platform.

  • Sensitivity Analysis: changes in which inputs impact the output(s) most?
  • Explain Prediction: how important is it to include certain inputs in the model?

With these tools you can identify which inputs offer the biggest lever and therefore deserve the most attention. They can also be used in the modelling phase, as they indicate which inputs to focus on when trying to improve data quality or collect more data.
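
Sensitivity Analysis and Explain Prediction are platform features; for intuition only, permutation importance in scikit-learn answers a similar question (how much does the test score drop when an input is shuffled?), again with the assumed columns from the earlier sketches:

    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("clean_dataset.csv")
    X = df[["inlet_velocity", "length", "width"]]
    y = df["pressure_drop"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

    # Drop in test score when each input is shuffled: a rough sensitivity measure.
    result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
    for name, importance in zip(X.columns, result.importances_mean):
        print(f"{name}: {importance:.3f}")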

Predict

Use the model to make predictions for new sets of input parameters.

  1. Use the interactive prediction tools and expose them on a Dashboard to make them easily accessible for other users.
  2. Import new, unknown sets of data and use Dataset Prediction with the final model. Use the export capabilities if you want to use the predicted values in other tools.
  3. Use Monolith’s REST API to send new, unknown data points to the model and get a prediction back in return. This enables you to exchange data/predictions with other external tools automatically.
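
For option 3, the exact endpoint, authentication and payload schema are defined by your Monolith deployment and its API documentation; the snippet below only illustrates the general shape of such a call, with placeholder URL, token and fields:

    import requests

    # Placeholders only: consult the Monolith REST API documentation for the
    # real endpoint, authentication scheme and payload format.
    url = "https://<your-monolith-instance>/api/models/<model-id>/predict"
    headers = {"Authorization": "Bearer <api-token>"}
    payload = {"inlet_velocity": 12.5, "length": 0.4, "width": 0.1}

    response = requests.post(url, json=payload, headers=headers, timeout=30)
    response.raise_for_status()
    print(response.json())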

Optimise

Use the model to run an optimisation and find the optimal set of inputs for a desired/required target.

Related Functions

  • Targeted Optimisation
  • Min/Max Optimisation
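
Targeted Optimisation and Min/Max Optimisation do this for you on the platform. Conceptually it amounts to searching the model's input space for a desired output, which can be sketched with scipy (the target value, bounds and model are illustrative assumptions):

    import pandas as pd
    from scipy.optimize import minimize
    from sklearn.ensemble import RandomForestRegressor

    df = pd.read_csv("clean_dataset.csv")
    X = df[["inlet_velocity", "length", "width"]]
    y = df["pressure_drop"]
    model = RandomForestRegressor(random_state=0).fit(X, y)

    TARGET = 150.0  # desired output value, made up for illustration

    def objective(inputs):
        # Squared distance between the model prediction and the target.
        prediction = model.predict(pd.DataFrame([inputs], columns=X.columns))[0]
        return (prediction - TARGET) ** 2

    # Derivative-free search within the observed input ranges.
    result = minimize(objective, x0=X.mean().to_numpy(),
                      bounds=list(zip(X.min(), X.max())), method="Powell")
    print("optimal inputs:", dict(zip(X.columns, result.x)))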
