3 Mistakes to Avoid When You Write Your Machine Learning Model

Some tips on how to optimize the development process of a Machine Learning model in order to avoid surprises during the deployment phase.

Photo by Nick Owuor (astro.nic.visuals) on Unsplash

Finally, I can breathe a sigh of relief: my Machine Learning model performs well on both the training and the test set. All the metrics I use to evaluate it report very high scores. I can finally say that my work is almost done: just deploy the model and that’s it.

Instead, it is precisely in the deployment phase that all the problems arise: on new data the model seems to give poor results and, above all, there seem to be problems with the code implementation.

In this article I describe three common mistakes that absolutely must be avoided when developing a Machine Learning model, in order to prevent surprises during the deployment phase.

1 Forgetting to Save the Scalers

In the first phase of developing a Machine Learning model, we clean the data and then normalize or standardize them.

One of the possible errors during this preprocessing phase could be to perform operations of this type:

df['normalized_column'] = df['column']/df['column'].max()

In the previous example, everything seems to work on the original dataset. At deployment time, however, the problem arises of how to normalize the same column on new data: which value should you use as the scale factor?
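
To see why this is a problem, here is a minimal illustration with made-up values: recomputing the maximum on the new data changes the scale factor, so the same raw value is mapped to a different normalized value than during training.

import pandas as pd

# Training data: the maximum is 100, so values are scaled to [0, 1]
train = pd.DataFrame({'column': [10, 50, 100]})
train['normalized_column'] = train['column']/train['column'].max()  # 0.1, 0.5, 1.0

# New data at deployment: recomputing the maximum gives a different scale factor
new = pd.DataFrame({'column': [10, 20]})
new['wrong'] = new['column']/new['column'].max()  # 0.5, 1.0 (inconsistent with training)
new['right'] = new['column']/100                  # 0.1, 0.2 (consistent with training)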

A possible solution to the problem could be to store the scale factor somewhere:

# Compute the scale factor on the training data and store it for later use
scale_factor = df['column'].max()
with open("column_scale_factor.txt", "w") as file:
    file.write(str(scale_factor))
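
During the deployment phase, it is then enough to read the scale factor back from the file and reuse it on the incoming data (a minimal sketch; new_df is a hypothetical DataFrame containing the new data):

# At deployment time, read the stored scale factor and reuse it
with open("column_scale_factor.txt", "r") as file:
    scale_factor = float(file.read())

new_df['normalized_column'] = new_df['column']/scale_factor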

The previous solution is very simple. As an alternative, a more powerful scaler could be used, such as those provided by the Scikit-learn preprocessing package:

from sklearn.preprocessing import MaxAbsScaler
import numpy as np

scaler = MaxAbsScaler()
feature = np.array(df[column]).reshape(-1,1)
scaler.fit(feature)
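
The fitted scaler can then be applied with the transform() method to obtain the normalized column (a short sketch; the name of the output column is just an illustration):

# Apply the fitted scaler to obtain the normalized feature
df[column + '_scaled'] = scaler.transform(feature).ravel()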

Once the scaler is fitted, do not forget to save it! You will use it during the deployment phase!

We can save the scaler by exploiting the joblib Python library:

import joblib

joblib.dump(scaler, 'scaler_' + column + '.pkl')
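
At deployment time you can load the scaler back with joblib and apply exactly the same transformation to the new data (a minimal sketch; new_df is a hypothetical DataFrame with the incoming data):

import joblib
import numpy as np

# Load the fitted scaler and apply the transformation learned on the training set
scaler = joblib.load('scaler_' + column + '.pkl')
new_feature = np.array(new_df[column]).reshape(-1,1)
new_df[column + '_scaled'] = scaler.transform(new_feature).ravel()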

2 Multiplying Your Notebooks Without Any Criterion

When looking for the optimal model to represent our data, we may test different algorithms and different scenarios before obtaining the best possible model.

In this case, one possible mistake is creating a new notebook every time we test a new algorithm. Over time, the risk is ending up with a folder crammed with files on your filesystem, and therefore not being able to accurately track the steps taken during model development.

So how can we solve the problem?

It is not enough to comment the code: we also have to give the notebooks meaningful names. A possible solution is to prefix each notebook title with a progressive number that indicates exactly at which point of the pipeline a given step is performed.

Here is a possible example:

01_01_preprocessing_standard_scaler.ipynb
01_02_preprocessing_max_abs_scaler.ipynb
02_01_knn.ipynb
02_02_decisiontree.ipynb
...

In the previous example, we have used the following naming format:

stepnumber_alternativenumber_name.ipynb

Thus, the first number indicates the step number in the pipeline, while the second number (after the underscore) indicates a possible alternative.
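
A nice side effect of this convention is that a simple alphabetical sort of the folder already reproduces the order of the pipeline, as in the following sketch:

from pathlib import Path

# List the notebooks: the numeric prefixes keep them sorted in pipeline order
for notebook in sorted(Path('.').glob('*.ipynb')):
    print(notebook.name)
# 01_01_preprocessing_standard_scaler.ipynb
# 01_02_preprocessing_max_abs_scaler.ipynb
# 02_01_knn.ipynb
# 02_02_decisiontree.ipynb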

Continue reading on Towards Data Science