Deployment of Machine Learning Models (with an example)

Last year, I wrote a blog post on the development and release of Type4Py. Type4Py is a machine learning model for code. In a nutshell, it predicts type annotations for Python source code files and enables developers to add types gradually to their codebases. At the time of the Type4Py release, its deployment was pretty simple. I didn’t use containerization (Docker) or Kubernetes, and the model was deployed on a single machine. There were two clear downsides to this initial approach. First, I could not easily deploy the ML model and its pipeline on another machine, because I had to install Type4Py and all its dependencies on each new machine. Second, the ML application could not scale well since a single machine’s resources are limited.

In this blog post, I want to explain how you can containerize your ML application and scale it with Kubernetes. I will use Type4Py as the running example, but you can replace it with any other ML model. Before going into details, I want to make it clear that I am new to the field of deploying ML models (a.k.a. MLOps) and am still learning new things. So I am NOT an expert in MLOps by any means. The main goal of this blog post is to share my experiences with the deployment of ML models and help other researchers like myself deploy their models and make them accessible to the world. I also want to document the steps I took to deploy and scale our own deep learning model, Type4Py.

Alright! Without further ado, let’s get into the topic of this post, i.e., how to deploy your ML model. First, I outline the steps needed to take an ML model from a research environment to “production”. Then, I explain each step.

Here are the high-level steps to deploy an ML model:

  1. Export your pre-trained model to the ONNX format.
  2. Create a small REST API application to query your model.
  3. Containerize your ML application using Docker.
  4. Deploy and scale your ML application with Kubernetes. (OPTIONAL)
  5. Monitor your ML application.


1. Exporting ML models

In this step, I assume that you have already trained, evaluated, and saved your ML model with optimal hyper-parameters to disk. All the popular ML/DL frameworks, like scikit-learn, have a guide on how to save trained models. Oftentimes, ML frameworks are optimized to speed up model training, not prediction/inference. Therefore, it is highly recommended to export or convert your trained model to the ONNX format. ONNX is an open standard for representing ML models. An exported model can be served by an optimized runtime such as ONNX Runtime, which often speeds up inference, and it can be queried on various systems such as cell phones, embedded systems, workstations, etc. To export your ML model to the ONNX format, check out their guide here. ONNX supports both traditional ML models (scikit-learn) and deep learning models (PyTorch and TensorFlow).

2. Creating a REST API to query a model

An ML model is trained to solve a problem. For example, Type4Py is trained to predict type annotations for Python. That said, an ML model is usually part of a bigger system. A famous example is Spotify, a music streaming platform. Obviously, the Spotify platform is a huge system. It has an ML component that recommends music to its users. The ML component is queried from the front end, such as their website or mobile app. Similarly, we want to query our ML model on the back-end side and give the prediction results to other components in the system or to the front end. For instance, Type4Py is deployed on a Kubernetes cluster (back-end) and its VS Code extension (front-end) queries the Type4Py model via a REST API call.

In short, you need to create a tiny REST API with a prediction endpoint. There are many web frameworks to do so. I suggest using Flask, which is small and lightweight. Also, most ML models are created in Python and hence they can easily be integrated into a Flask app. For the sake of this post, let’s name the prediction endpoint “/predict”. This endpoint accepts a raw input/sample (via a POST request), pre-processes it, extracts features, queries the trained model, and finally returns predictions as a response. This way, your ML application is deployed as a service and can be queried simply by sending a request to “http://localhost/predict”. As an example, Type4Py’s predict endpoint accepts a Python source code file (raw text) and returns predicted type information as a JSON response.
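To make this more concrete, here is a minimal sketch of such a Flask app. It is not Type4Py’s actual server code. The helper functions extract_features and query_model are hypothetical placeholders that stand in for your own pre-processing pipeline and model call, and the shape of the JSON response is just an assumption for this example.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def extract_features(source_code: str) -> dict:
    """Hypothetical pre-processing step; replace it with your own pipeline."""
    lines = source_code.splitlines()
    n_defs = sum(line.lstrip().startswith("def ") for line in lines)
    return {"n_lines": len(lines), "n_defs": n_defs}


def query_model(features: dict) -> list:
    """Placeholder for the real model call (e.g. an ONNX Runtime session)."""
    label = "has_functions" if features["n_defs"] else "no_functions"
    return [{"label": label}]


@app.route("/predict", methods=["POST"])
def predict():
    raw = request.get_data(as_text=True)  # the raw sample sent by the client
    if not raw.strip():
        return jsonify({"error": "empty request body"}), 400
    features = extract_features(raw)
    return jsonify({"predictions": query_model(features)})
```

Once the app is running, a client can query it by sending the raw sample as the body of a POST request, for example with curl -X POST --data-binary @example.py http://localhost:5000/predict (5000 is the default port of Flask’s development server, and example.py is any file you want to send).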

Note: Flask’s development server is NOT suitable for the production environment. It is highly recommended to use a production-ready WSGI HTTP server like Gunicorn with Nginx as a proxy. For more info, check out this guide.


3. Containerizing ML applications

ML applications can have a large and complex pipeline and hence they might have tons of dependencies. For instance, Type4Py’s Docker image is 2 GB in size, not counting its pre-trained model. So what’s the common practice to ship such large applications? The answer is containerization. Containerization means packaging an application with all its dependencies and required run-time libraries. It allows you to run your application in isolation, i.e., in a “container”, on different environments and operating systems. One of the most widely used containerization tools is Docker. You might have already used Docker to run applications on different machines. To containerize your ML application, all you have to do is create a Dockerfile that starts from a base image and installs all the dependencies that your application needs. Assuming that your ML pipeline is entirely written in Python, you can use the Python base image and install dependencies using pip. Lastly, in the Dockerfile, you specify an entry point to your application, which is often a command that runs the Flask application exposing your ML pipeline and model via a REST API. As an example, you can see Type4Py’s Dockerfile to get an idea of how a Docker image for an ML application with a REST API is created.

At this point, you should be almost ready to deploy your ML application. For deployment, there are generally two options. The first option is to ship your containerized ML application with a trained model to the users directly. They will run your application locally on their machines as a service and query the ML model via the REST API. This can be a viable option if your ML model is small and your Docker image plus the ML model is less than 1 GB. On the other hand, some ML models, like language models, can be pretty large and have millions, if not billions, of learnable parameters. Clearly, this kind of model cannot be delivered to users directly. Therefore, such models are deployed on one or more powerful servers or in a cloud environment. If you have access to a (powerful) server, maybe at your organization, you can purchase a domain for your application and make your ML application accessible to the world. For instance, we have

Also, releasing Docker images can be part of your project’s continuous integration. If your project is hosted on GitHub, you can use GitHub Packages to publish your Docker images.

4. Scaling ML applications with Kubernetes

You can skip this step if you do not have access to a cluster of machines or you do not anticipate many users for your ML application at the time of its initial release. Kubernetes is a container orchestration platform that automatically manages distributed containerized applications at a massive scale. Kubernetes (a.k.a. K8s) follows a control plane/worker model: control-plane nodes (historically called master nodes) manage the worker nodes that run your containers. Thanks to Kubernetes, you can easily deploy and scale your ML application on a cluster of machines/nodes without worrying about scheduling given available resources. If you have followed the previous steps, all you have to do to deploy your ML application with K8s is create a K8s deployment and a K8s service. A K8s deployment file specifies the container image to pull for your ML app and, if any, the CLI args or environment variables required to run the application.
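To give an idea of what the deployment and service mentioned above contain, here is a small sketch that builds both as Python dictionaries and prints them as JSON, which kubectl accepts alongside YAML. This is not Type4Py’s actual configuration. The app name, the image path, the port, and the MODEL_PATH environment variable are all hypothetical values chosen for this example.

```python
import json


def build_manifests(name: str, image: str, replicas: int = 2, port: int = 5000) -> list:
    """Build a minimal Deployment and Service for one containerized app."""
    labels = {"app": name}  # ties the Service to the Deployment's pods
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,  # number of pods to run
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,  # where to pull the image from
                            "ports": [{"containerPort": port}],
                            # Hypothetical environment variable for the app.
                            "env": [{"name": "MODEL_PATH", "value": "/models/model.onnx"}],
                        }
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        # No "type" is set, so this is a ClusterIP service (reachable inside
        # the cluster). How you expose it externally depends on your cluster.
        "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": port}]},
    }
    return [deployment, service]


# Hypothetical app name and image path.
manifests = build_manifests("ml-app", "ghcr.io/example/ml-app:latest")
manifest_list = {"apiVersion": "v1", "kind": "List", "items": manifests}
print(json.dumps(manifest_list, indent=2))
```

If you save the printed JSON to a file, say ml-app.json, you can create both objects with kubectl apply -f ml-app.json. Most people write these files directly in YAML. Generating them from Python is only a convenience here to keep the examples in one language.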

A K8s deployment creates a set of pods. Each pod is often a stateless application. In this case, a pod is an instance of your ML application that processes prediction requests via REST API calls. Note that your ML application should be stateless, meaning that it doesn’t rely on state kept in memory between requests. In other words, if your application is stopped and restarted, it should be able to process requests without knowing about previous requests. If your application needs to know about previous states, then it can store its state or data in a database. A K8s service exposes your ML application to the outside and also load-balances incoming traffic among your K8s deployment’s pods, which is awesome! So you don’t need to worry about load-balancing. To better illustrate the deployment setup described above, the figure below shows the front-end and back-end of Type4Py, an ML application that provides type auto-completion in VS Code.

Type4Py deployment using K8s


Thanks to the K8s service, scaling your ML application is straightforward: depending on the traffic and load, you just need to tell K8s to create more replicas/pods. In K8s, with a deployment file, deploying and scaling basically comes down to knowing several simple commands like “kubectl apply” or “kubectl scale”. If you don’t know the Kubernetes CLI commands, see this cheat sheet.

5. Monitoring ML applications

Frankly, I know very little about monitoring ML applications. I just started to learn more about this topic. If you are using Kubernetes for your deployment, you can use Prometheus, a widely known open-source monitoring and alerting system, to monitor your application. Given that we use Flask in this guide to create a REST API, you can use a Prometheus exporter for Flask applications like this one. It provides monitoring data like requests per second, average response time, etc. It also provides a Grafana dashboard to visualize gathered monitoring data, which is nice.
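To show what kind of data such an exporter gathers, here is a hand-rolled sketch that counts requests and sums up response times in a Flask app and exposes them on a “/metrics” endpoint in Prometheus’ plain-text format. This is NOT the API of any exporter library. The metric names are made up for this example, and in practice you would let an exporter do this work for you.

```python
import time

from flask import Flask, Response, g, jsonify, request

app = Flask(__name__)

# In-process metrics: they reset on restart and are kept per worker process.
request_count = {}  # (method, path, status) -> number of requests
timing = {"seconds_sum": 0.0}  # total time spent handling requests


@app.before_request
def start_timer():
    g.start = time.perf_counter()


@app.after_request
def record_metrics(response):
    start = getattr(g, "start", None)
    if request.path != "/metrics" and start is not None:  # skip the scrapes
        key = (request.method, request.path, response.status_code)
        request_count[key] = request_count.get(key, 0) + 1
        timing["seconds_sum"] += time.perf_counter() - start
    return response


@app.route("/predict", methods=["POST"])
def predict():
    return jsonify({"predictions": []})  # placeholder for the real model call


@app.route("/metrics")
def metrics():
    # Hypothetical metric names, written in the Prometheus text format.
    lines = ["# TYPE app_requests_total counter"]
    for (method, path, status), n in sorted(request_count.items()):
        labels = f'method="{method}",path="{path}",status="{status}"'
        lines.append(f"app_requests_total{{{labels}}} {n}")
    lines.append("# TYPE app_request_seconds_sum counter")
    lines.append(f"app_request_seconds_sum {timing['seconds_sum']}")
    return Response("\n".join(lines) + "\n", mimetype="text/plain")
```

Prometheus scrapes such an endpoint periodically and derives figures like requests per second or average response time from the raw counters at query time.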

Next steps

As I said, I am new to MLOps and am still learning new things about deploying ML models. To learn more, you can take Andrew Ng’s course on ML engineering for production. Also, on YouTube, there are some talks and guides on MLOps by engineers who work(ed) at famous tech companies. Aside from YouTube, there are some research papers on the deployment of machine learning models and their challenges (see this paper). Personally, I would like to know more about monitoring ML applications.

Wrapping Up

I hope that this post was useful and you can now deploy your first ML model somewhere so that people can try it. Your model does not have to be a giant deep learning model. It can be a small classic model that solves an important or interesting problem. Also, if you have questions or issues regarding the deployment of ML models, let me know in the comments below. I am also open to suggestions or ideas. If you would like to add something relevant to this post, I would be happy to update it with more information.
