Next Test Recommender Evaluation

Modified on Fri, 05 Jan 2024 at 10:05 AM


The Next Test Recommender (NTR) Evaluation step enables you to quickly and automatically identify the best recommenders to apply to an existing data set to maximise learning efficiency. Once the best recommender is identified, it can be used within the Next Test Recommender (BETA) step on live data.


The Next Test Recommender (BETA) step recommends the next most impactful tests users should carry out to maximise their learning. Multiple recommenders are available in the NTR step, and it is not obvious which recommender will perform best on the type of data you are working with. The NTR Evaluation step helps identify which recommender will gain knowledge fastest. It also shows if and when these recommenders are better than a random or user-based strategy.

How to use

The step will run on existing data, and will require the following information:

Data for evaluation: This is the data that will be used to run and evaluate the different recommenders. Please only select data that includes the results of completed tests.
Inputs to get recommendation for: Select the inputs for which you want the recommenders to find the best next values to test. These are typically configurations or test parameters. At least two parameters are required for the step to run.
Outputs to better understand: Select the outputs that you want to better understand with the tests. This usually relates to performance (e.g. battery life, efficiency, strength, …). Multiple columns can be selected if a combination of outputs needs to be understood.
Acquisition strategy: Choose a training data acquisition strategy that reflects your real-life testing approach. This includes:
  • Number of initial data points: defines how many points are initially used to run the first iteration of the recommenders
  • Batch size: defines how many more tests will be run for each iteration.
Evaluation strategy: Ensure sufficient data is kept for reliable evaluation results. This uses the same parameters as the acquisition strategy:

  • Number of initial data points: defines how many points are initially used to evaluate the different recommenders
  • Batch size: defines how many more tests will be added to the evaluation data after each iteration. Enter ‘0’ to keep the evaluation dataset size constant. If the value is different from ‘0’, the test set will grow as more data becomes available, which also means that less data remains available for the recommenders themselves.
Evaluation model type: This is the type of model that will be used to evaluate the performance of each recommender. Random forest is usually a good default model type, but this can be changed depending on how the data is used.
Repeat calculations: This is the number of times that each curve is calculated. More repetitions will lead to more reliable results, but will take longer to run.
Limit iterations: For large data sets, this option lets you reduce the number of iterations so that the step stops sooner. If you have a limited budget (e.g. 200 tests max), you can also limit the iterations accordingly.
Holdout proportion: Select a holdout set from the beginning to evaluate the performance of the different recommenders. Unlike the evaluation strategy above, the holdout set is set apart from the very beginning and is therefore not biased by the selections of the different recommenders. The model errors calculated from this holdout set will not be visible in the curves displayed by this step, but can be accessed in the generated dataset.
Preprocess: Preprocessing the data allows automatic handling of data inconsistencies (such as missing data, NaN or inf values) without user intervention. If this is not ticked, using inconsistent data will return an error.
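To make the acquisition and evaluation strategy parameters concrete, here is a minimal sketch of the kind of loop the step runs internally. All names, the dataset, and the k-NN stand-in model are illustrative assumptions, not the platform's actual implementation: a fixed evaluation set is held out, an initial batch of training points is drawn, and a (here random) recommender adds a batch of tests per iteration while the evaluation error is recorded.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical completed-test data: 2 input parameters, 1 output.
X = rng.uniform(0, 1, size=(200, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.05, size=200)

# Acquisition-strategy parameters (illustrative values).
n_initial, batch_size, n_iterations = 10, 5, 8

# Fixed evaluation set (evaluation batch size 0 => constant size).
eval_idx = rng.choice(len(X), size=50, replace=False)
pool_idx = np.setdiff1d(np.arange(len(X)), eval_idx)
train_idx = list(rng.choice(pool_idx, size=n_initial, replace=False))

def predict_knn(X_train, y_train, X_query, k=3):
    """Simple k-NN regressor standing in for the evaluation model."""
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

curve = []  # (number of training tests, evaluation MSE) per iteration
for step in range(n_iterations):
    pred = predict_knn(X[train_idx], y[train_idx], X[eval_idx])
    mse = float(np.mean((pred - y[eval_idx]) ** 2))
    curve.append((len(train_idx), mse))
    # "Random" recommender: acquire the next batch uniformly from the pool.
    remaining = np.setdiff1d(pool_idx, train_idx)
    train_idx += list(rng.choice(remaining, size=batch_size, replace=False))

for n, mse in curve:
    print(n, round(mse, 4))
```

Repeating this loop once per recommender (and averaging over repetitions) yields the learning curves the step displays.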

Once the step is run, it will display a graph with curves looking like the figure below:

The curves show how good a model trained on the acquired data is after each iteration. The horizontal axis shows the number of test data points used to train the model, and the vertical axis shows the error of the trained model on a test set at each iteration. Curves will generally go down, but what matters is the speed at which they go down: the faster a curve drops, the quicker the model learns from the recommended data. Therefore, recommenders passing by the bottom left of the graph (e.g. recommender α in the figure) will be better than those on the top right (e.g. recommender δ in the figure).

These curves enable you to quickly compare the different recommenders, including the random sampling strategy.

There is a way to read these curves in a more quantitative way. If we focus on recommenders α and β, the figures below show how quantitative insights can be derived:

(1) Compare how many tests/iterations are required for each recommender to achieve a target model error (or a target space coverage). In the example above, recommender α requires only 15 tests to reach the target error value of 0.15, while recommender β requires 30 tests. This means that the same understanding of the design space can be achieved with half the tests when using recommender α.
(2) Compare the model errors for a given number of tests acquired by the different recommenders. In the example above, after 25 tests, the error of the model using recommender α is half that of the model using recommender β. This means that for the same number of tests, you will get a much better understanding of the design space with recommender α.

By default, the plot shown in the step displays the MSE (Mean Squared Error) on the evaluation set. However, the step also returns a dataset with more information that can be explored further in downstream steps:

  • recommender_name: name of the recommender used for the row.
  • rep: number of the repetition (starting at 0). If multiple repetitions are selected in the advanced options, the results of each repetition will be accessible.
  • step: iteration number of using the recommenders, starting at 0 for the initial batch only.
  • n_train_rows, n_test_rows, n_holdout_rows: number of points in the train, test (evaluation) and holdout data set after each step.
  • test_mean_squared_error, test_r_squared, test_mean_absolute_error, test_prop_within_5_percent: different metrics evaluating space coverage on the test (evaluation) data set.
  • holdout_mean_squared_error, holdout_r_squared, holdout_mean_absolute_error, holdout_prop_within_5_percent: different metrics evaluating space coverage on the holdout data set. If the holdout proportion is set to 0%, then these columns will be empty.
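As an example of downstream exploration, the returned dataset can be averaged over repetitions to get one clean curve per recommender. The column names below follow the list above; the rows and values are fabricated for illustration only.

```python
import pandas as pd

# Illustrative rows mimicking the dataset returned by the step
# (column names from the list above; values are made up).
df = pd.DataFrame({
    "recommender_name": ["alpha", "alpha", "beta", "beta"] * 2,
    "rep": [0, 0, 0, 0, 1, 1, 1, 1],
    "step": [0, 1, 0, 1, 0, 1, 0, 1],
    "n_train_rows": [10, 15, 10, 15] * 2,
    "test_mean_squared_error": [0.40, 0.16, 0.45, 0.31, 0.38, 0.14, 0.44, 0.29],
})

# Average the repetitions to get one learning curve per recommender.
curves = (
    df.groupby(["recommender_name", "n_train_rows"])["test_mean_squared_error"]
      .mean()
      .reset_index()
)
print(curves)
```

The same pattern works for the holdout metrics, which give an unbiased view since the holdout set is fixed before any recommender runs.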


Here is an example of a graph produced by the Next Test Recommender Evaluation step in the platform. In this example, we can see that using recommender Delta:

  • An MSE of 5 can be achieved with half the number of tests required by the random approach,

  • The full test campaign would achieve an MSE four times smaller than with the random approach.

More on this step

To learn more about how the step works, we recommend watching our webinar on this topic, available here.
