General recommendations
Split up your work into stages and notebooks
Most workflows can be broken down into distinct stages, and it is usually best practice to split these stages into separate notebooks. At the end of each notebook, export the datasets and models relevant for the next stage.
The stages are:
- Data exploration
- Data pre-processing and cleaning
- Modelling and model evaluation
- Apply models
The sections of this article follow these stages.
Use subsets of data for prototyping
The larger a dataset becomes, the more memory and computing power are required to load and process it, so the platform will become slower and feel less responsive. If you are still building up the final pipeline for your project and have big data (>10,000 data points), it is recommended to use a smaller subset of your data to build a prototype of the required steps and intended models.
The goal of building the prototype is to test that all steps are properly configured and produce the intended results, not to achieve good model performance. Building the entire workflow this way takes significantly less time.
Once the workflow is completely established, you can update the initial dataset and load the complete data. The notebook(s) will update and each step will rerun on the full dataset. This happens automatically in the backend, and you can return to the notebook once the update is done.
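As an illustration outside the platform, here is a minimal pandas sketch of drawing a reproducible subset for prototyping; the file name and sampling fraction are placeholders, not part of a Monolith workflow:

```python
import pandas as pd

# Load the full dataset (file name is a placeholder).
df = pd.read_csv("measurements.csv")

# Draw a random 10% subset for prototyping; a fixed random_state keeps the
# subset reproducible between prototype runs.
subset = df.sample(frac=0.1, random_state=42)

# Export the subset and use it as the initial dataset while prototyping.
subset.to_csv("measurements_prototype.csv", index=False)
```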
Data exploration
The first stage is typically data exploration, to build a good understanding of the data. Typical steps include (a short sketch of these checks in code follows the list):
- Confirm that the data is correct.
- Explore trends and confirm expectations.
- Check the distribution of the columns you want to use as input and output, with the following questions in mind. The answers to these questions may define the model performance you can achieve.
- How is the data sampled?
- Is the design space fully covered?
- Is the data evenly sampled?
- Is the data accumulated at certain values or at the edges?
- Are there outliers that need to be removed in the next stage?
- Check for missing data in the dataset.
- Missing data should be removed in the next stage as well to avoid problems in the modelling phase.
- Understand relationships in the data.
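As a rough sketch of these checks in code (assuming a pandas DataFrame with placeholder file and column names; on the platform the equivalent checks are done with the exploration tools):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("measurements_prototype.csv")  # placeholder file name

# Summary statistics: value ranges, means and spreads per column.
print(df.describe())

# Missing data per column; anything above zero should be handled in pre-processing.
print(df.isna().sum())

# Distributions: histograms reveal uneven sampling, gaps in the design space,
# accumulation at certain values or edges, and obvious outliers.
df.hist(bins=30)
plt.show()

# Relationships: a correlation matrix gives a first hint of input/output links.
print(df.corr(numeric_only=True))
```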
Data pre-processing and cleaning
The next stage is pre-processing of the data. This can have one or several objectives, depending on the state of the original dataset:
- Data cleaning
- Bring data into required shape
- Split up or combine data (depending on context)
If the data is already in a suitable format, or if you did all pre-processing outside of Monolith before uploading and importing the data, you can skip this stage.
Data cleaning
Cleaning the data usually involves one or multiple of the following steps:
| Step | Description |
| --- | --- |
| Remove outliers | Whether this is required, and for which columns, should be clear from the data exploration stage. You can use one of two approaches. |
| Remove duplicates | Duplicate entries should be removed so a model does not get biased by them. Use Remove Duplicates to do so. |
| Remove missing | Incomplete or completely missing data points cannot be used for model training. Remove them to avoid problems in later stages. Use Remove Missing to do so. |
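Outside the platform, the same cleaning steps could look roughly like this in pandas (column names and the outlier rule are placeholders; on the platform use the functions named above):

```python
import pandas as pd

df = pd.read_csv("measurements_prototype.csv")  # placeholder file name

# Remove duplicate rows so the model is not biased towards repeated entries.
df = df.drop_duplicates()

# Remove rows with missing values in the columns used for modelling
# ("pressure" and "temperature" are placeholder column names).
df = df.dropna(subset=["pressure", "temperature"])

# Remove outliers with a simple rule identified during data exploration,
# here a plausible physical range for "pressure".
df = df[df["pressure"].between(0.0, 10.0)]
```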
Bring data into required shape
| Step | Description |
| --- | --- |
| Add additional columns | Sometimes not all required parameters are in the data, but they can be derived from the existing columns. Use Quick Columns to calculate additional parameters. |
| Convert dimensions | In principle it does not matter for ML models whether the dimensions of all involved parameters match, as the conversion factors will be absorbed by the model. If you nevertheless need or want to convert the dimension of a column, you can also use Quick Columns. |
| Change dataset structure | Sometimes it is necessary to change from long to wide or from wide to long format. It can also make sense to remove unnecessary columns to make the dataset smaller, which improves the experience of working with it. |
| Grouping data | Your data may consist of groups, and you may need to calculate certain attributes of these groups (e.g. group average, minimum value, maximum value). This can be achieved with Group By. |
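A rough pandas equivalent of these reshaping steps (file and column names are placeholders; on the platform use Quick Columns and Group By):

```python
import pandas as pd

df = pd.read_csv("cleaned.csv")  # placeholder file name

# Derive an additional column from existing ones (analogous to Quick Columns);
# "force" and "area" are placeholder column names.
df["stress"] = df["force"] / df["area"]

# Reshape from wide to long format if the next stage requires it.
long_df = df.melt(id_vars=["sample_id"], var_name="parameter", value_name="value")

# Group-level attributes (analogous to Group By): mean, min and max per group.
group_stats = df.groupby("sample_id")["stress"].agg(["mean", "min", "max"]).reset_index()
```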
Split data
Split data into separate datasets according to physical regimes or geometrical families.
| Approach | Description |
| --- | --- |
| Split by physical regimes | Examples of physical regimes: different flow regimes (laminar, turbulent), changes in physical behaviour, or changes in the underlying equations. In those cases it is often not possible to find a single model that captures the physics over the entire range; instead, it is better to train separate models on the different regimes. |
| Split by geometrical families | If geometries are very different, sort the data according to geometrical families. |
Check carefully whether any of these conditions apply to your case (a sketch of both splits follows the related functions below).
Related Functions
- Filter Numerical
- Filter Categorical
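As a sketch of both kinds of split in pandas (the column names, threshold and family label are placeholders; on the platform use Filter Numerical and Filter Categorical):

```python
import pandas as pd

df = pd.read_csv("cleaned.csv")  # placeholder file name

# Split by a physical regime with a numerical filter, here a Reynolds-number
# threshold between laminar and turbulent flow (placeholder column and value).
laminar = df[df["reynolds"] < 2300]
turbulent = df[df["reynolds"] >= 2300]

# Split by geometrical family with a categorical filter (placeholder label).
family_a = df[df["geometry_family"] == "A"]
```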
Join data
Join data if your original data comes from different sources and is therefore distributed over multiple datasets. To combine them into a single dataset use:
| Function | Description |
| --- | --- |
| Join | Combine data horizontally: common rows but different columns. |
| Append | Combine data vertically: common columns but different rows. |
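A minimal pandas sketch of the two operations (file names and the key column are placeholders; on the platform use Join and Append):

```python
import pandas as pd

geometry = pd.read_csv("geometry.csv")        # placeholder source datasets
test_results = pd.read_csv("results.csv")

# Join: combine horizontally on a shared key column (same rows, new columns).
combined = geometry.merge(test_results, on="sample_id", how="inner")

# Append: combine vertically (same columns, additional rows).
campaign_1 = pd.read_csv("campaign_1.csv")
campaign_2 = pd.read_csv("campaign_2.csv")
all_campaigns = pd.concat([campaign_1, campaign_2], ignore_index=True)
```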
Modelling and model evaluation
The next stage is to train and test different model types. Once you have found the model type most suitable for your problem, tune its hyperparameters to find the best model setup.
- The first step is to split the data into a train and a test dataset (see the sketch after this list).
- Train different types of models. The Guided ML/Bulk modelling feature supports you in choosing suitable models and training them in batch.
- Use the Model evaluation tools to test model performance.
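Outside the platform, the split-train-evaluate loop looks roughly like this with scikit-learn (input/output columns and model types are placeholders, not the platform's Guided ML):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("combined.csv")          # placeholder file name
X = df[["pressure", "temperature"]]       # placeholder input columns
y = df["stress"]                          # placeholder output column

# Hold out a test set that the models never see during training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a few model types with default settings and compare test performance.
for model in (LinearRegression(), RandomForestRegressor(random_state=42)):
    model.fit(X_train, y_train)
    print(type(model).__name__, r2_score(y_test, model.predict(X_test)))
```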
Guided ML trains each model type with default parameter settings. Take the two most promising model types and use hyperparameter optimisation (HPO) to check whether these models can be improved with adjusted hyperparameters.
Check out the help article on the specific model to learn more about the meaning of its hyperparameters.
Usually run HPO with randomised search and about 10 samples. To improve statistics you can run HPO two or three times and check whether a similar model topology is suggested each time.
Then switch to manual parameter settings and search close to the values suggested by HPO to check whether further improvement is possible.
Use the model evaluation tools after each model training to check how model performance changes.
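The same idea expressed in scikit-learn terms, as a sketch only (the model type and search space are placeholders, and this is not the platform's HPO implementation):

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Randomised search with about 10 sampled hyperparameter settings,
# evaluated with cross-validation on the training data from the split above.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": randint(50, 500),
        "max_depth": randint(2, 20),
    },
    n_iter=10,
    cv=5,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```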
Continue or iterate through previous stage
At the end of this stage the result might be:
- At least one of the models meets your requirements. In this case, move on to the next stage with that model.
- No model meets the requirements.
The latter case is not uncommon, as machine learning is usually a very iterative process. There are several options for how to proceed:
- Revisit your requirements. Are they realistic? Could they be lowered?
- Use the Learning Curve to check whether more data would help to improve model performance (a sketch of this idea follows the table below).
- Is there a general issue with the data? Here are a few typical reasons:
| Issue | Description |
| --- | --- |
| Noise | The level of noise in the data can impose a limit on the achievable model accuracy. Go back to the exploration stage, check the level of noise in your data, and compare it with your prediction errors. If noise and prediction error are at a similar level, this may explain why no further model improvement is possible. Check whether data quality can be improved with regard to the noise level. |
| Missing inputs | Not all relevant inputs may be selected in the model, or present in the dataset. If an input is available in the dataset but not selected in the model, that is easy to fix; if it is missing from the data, you need to find a way to access that information. |
| Data distribution | The data might be distributed in a way that makes the model accurate in some regions only. Go back to the exploration stage and check the distribution of the inputs used for the model. If the data sampling does not suit your accuracy requirements, you need to resample the data. |
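To illustrate the Learning Curve idea mentioned above, here is a scikit-learn sketch (the model type and the `X`/`y` data come from the earlier placeholder snippets, not from the platform's tool):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Learning curve: if the validation score is still rising at the largest
# training size, collecting more data is likely to improve the model.
sizes, train_scores, val_scores = learning_curve(
    RandomForestRegressor(random_state=42),
    X, y,                                    # X / y as defined earlier
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5,
)
print(sizes, val_scores.mean(axis=1))
```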
Apply models
If you have found a model that meets your requirements, it is recommended to build a final notebook reduced to the minimal steps required to reproduce the successful model. Depending on how you intend to use the model, there are several ways to apply it.
Explain
Use the explanation tools available on the platform to improve your understanding of the problem.
| Tool | Question answered |
| --- | --- |
| Sensitivity Analysis | Changes in which inputs impact the output(s) most? |
| Explain Prediction | How important is it to include certain inputs in the model? |
With these tools you can identify which inputs to focus on, as they offer the biggest lever. They can also be used in the modelling phase, since they indicate which inputs to prioritise when improving data quality or collecting more data.
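As an open-source analogue of these explanation tools (not the platform's implementation), permutation importance measures how much the test score drops when a single input is shuffled; `model`, `X_test` and `y_test` are the placeholders from the modelling sketch above:

```python
from sklearn.inspection import permutation_importance

# Shuffle one input at a time and measure how much the test score drops;
# a large drop means the model relies heavily on that input.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
for name, importance in zip(X_test.columns, result.importances_mean):
    print(name, round(float(importance), 3))
```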
Predict
Use the model to make predictions for new sets of input parameters.
- Use interactive prediction tools and expose them on a Dashboard to make them easily accessible for other users.
- Share Dashboard links with other users to collaborate.
- Scalar Prediction, Curve Prediction, Surface Prediction
- Import new, unknown sets of data and use Dataset Prediction with the final model. Use the export capabilities if you want to use the predicted values in other tools.
- Use Monolith's REST API to send new, unknown data points to the model and get a prediction back in return. This enables you to exchange data/predictions with other external tools automatically.
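A generic sketch of calling a prediction endpoint over REST; the URL, authentication header and payload layout below are placeholders for illustration only, so consult the Monolith REST API documentation for the actual interface:

```python
import requests

# Placeholder endpoint, token and input layout; replace with the values
# documented for your Monolith instance and model.
response = requests.post(
    "https://<your-monolith-instance>/api/models/<model-id>/predict",
    headers={"Authorization": "Bearer <api-token>"},
    json={"inputs": [{"pressure": 1.2, "temperature": 300.0}]},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```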
Optimise
Use the model in an optimisation to find the optimal set of inputs for a desired or required target (a sketch follows the list of related functions below).
Related Functions
- Targeted Optimisation
- Min/Max Optimisation
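To illustrate the idea behind targeted optimisation (a simple random-search stand-in, not the platform's optimiser), candidate inputs are sampled within placeholder bounds and the prediction closest to the target is kept; `model` is the trained model from the modelling sketch:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
TARGET = 250.0  # placeholder target value for the model output

# Sample candidate inputs within placeholder bounds.
candidates = pd.DataFrame({
    "pressure": rng.uniform(0.0, 10.0, 10_000),
    "temperature": rng.uniform(250.0, 400.0, 10_000),
})

# Keep the candidate whose predicted output is closest to the target.
errors = np.abs(model.predict(candidates) - TARGET)
best = candidates.iloc[int(np.argmin(errors))]
print(best.to_dict(), model.predict(best.to_frame().T)[0])
```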