For the moment, our models focus on regression problems. However, there are a few ways to address categorical (i.e. classification) problems nonetheless.
1. Link the problem to a regression one
If your classification problem is based on underlying numerical values, it is sometimes possible to build a regression model and then post-process its results.
For example, suppose you want to know whether a part is manufacturable. Instead of framing this as a classification problem, you can predict numerical data (e.g. max strain, ...) and apply rules to it. For instance, once the max strain is predicted, the rule “if max_strain > 5%, then it's not manufacturable” can be implemented by creating a new column with the following custom code in a Quick Columns step:
df['manufacturable'] = np.where(df['max_strain'] > 0.05, 'No', 'Yes')
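Outside of a Quick Columns step, the same rule can be reproduced in a standalone script. This is a minimal sketch: the `part_id` column and the strain values are made up for illustration, and only the `max_strain`/`manufacturable` names come from the example above.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the regression model's output;
# the max_strain values here are invented for illustration.
df = pd.DataFrame({
    "part_id": [1, 2, 3],
    "max_strain": [0.02, 0.07, 0.04],
})

# Apply the 5% rule from the article to derive a categorical column.
df["manufacturable"] = np.where(df["max_strain"] > 0.05, "No", "Yes")

print(df["manufacturable"].tolist())  # ['Yes', 'No', 'Yes']
```

The regression model stays untouched; the classification happens entirely in this post-processing step, so the threshold can be adjusted later without retraining.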
This method works well when the categories are ordered, but might be harder to implement for categories that are not ordered (e.g. different types of materials, ...).
2. One-hot encode the categories
If the categories cannot be linked to numerical values, another option is to one-hot encode your categories and then train a model that predicts which category is the most likely. Using the same example, you can transform your data in the following way:
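As a sketch of that transformation: with only two classes, one-hot encoding collapses to a single 0/1 column. The data values below are invented; only the “Manufacturable” column name comes from the example.

```python
import pandas as pd

# Hypothetical training data with the category stored as Yes/No strings.
df = pd.DataFrame({
    "max_strain": [0.02, 0.07, 0.04],
    "Manufacturable": ["Yes", "No", "Yes"],
})

# Encode the category as 0/1 so a regression model can be trained on it.
df["Manufacturable"] = df["Manufacturable"].map({"No": 0, "Yes": 1})

print(df["Manufacturable"].tolist())  # [1, 0, 1]
```

For more than two categories, `pd.get_dummies(df["Manufacturable"])` would produce one 0/1 column per category instead.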
Then, you can train a model to predict the column “Manufacturable”. If the model predicts a value above 0.5, you can assume that the component is probably manufacturable; if the value is below 0.5, the model is predicting that the component is not manufacturable. An advantage of this method is that you can see how confident the model is: a prediction of 0.985 suggests the model is quite sure that the component can be manufactured, while a prediction of 0.634 indicates less certainty.
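The thresholding described above can be sketched as follows. The prediction values 0.985 and 0.634 come from the text; the third value and the confidence formula (distance from 0.5, rescaled to 0..1) are illustrative assumptions.

```python
import numpy as np

# Hypothetical regression outputs for the encoded "Manufacturable" column.
predictions = np.array([0.985, 0.634, 0.21])

# Apply the 0.5 threshold from the text to recover category labels.
labels = np.where(predictions > 0.5, "Yes", "No")

# Distance from the 0.5 boundary as a rough confidence score
# (0 = completely unsure, 1 = very sure).
confidence = np.abs(predictions - 0.5) * 2

print(list(labels))          # ['Yes', 'Yes', 'No']
print(confidence.round(2))   # [0.97 0.27 0.58]
```

Note that this score is only a heuristic: a regression model's output is not a calibrated probability, but larger distances from 0.5 do suggest more decisive predictions.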