Machine learning projects differ substantially from conventional software projects. We often need to perform steps such as exploratory data analysis, feature engineering, model training, model review, and model retraining. Developing a machine learning application is not a linear process, so we need to keep track of the many activities and artifacts produced throughout development. For this, we use MLOps, short for machine learning operations.
In this article, we will discuss what MLOps is, what its components are, why we need it, and what benefits it provides. We will also cover some common uses and best practices for MLOps.
What is MLOps?
MLOps or machine learning operations is a set of software development practices that focuses on streamlining the process of model development, its deployment to the production environment, monitoring, and maintenance of the model in a machine learning project. You can term MLOps as DevOps for machine learning applications.
An MLOps team consists of data engineers, machine learning engineers, data scientists, software developers, and IT professionals who collaborate to make the project successful.
The Need for MLOps
With machine learning and artificial intelligence, businesses have unlocked previously untapped sources of income, increased sales revenue, reduced costs by leveraging predictions for decision-making, and improved customer service. None of this is possible without robust, accurate machine learning applications, and building them requires a well-defined set of practices for machine learning operations.
- MLOps serves as a framework for machine learning engineers and data scientists to create optimized machine learning applications using the available data, resources, and constraints.
- Creating a machine learning application is a difficult task. We often need to change the development approach, choose different algorithms, change datasets, tune hyper-parameters, or revise the entire setup. While making changes, we need to record their impact on model performance. For this, we need to store the model, its hyper-parameters, the input dataset, and other artifacts for every model developed during the project. Managing this process requires MLOps tools and techniques for better understanding and efficiency.
- Usually, a team working on a machine learning project consists of data engineers, data scientists, software developers, and other team members. To achieve a synchronous workflow among the team members, we need to follow MLOps practices.
- A machine learning project consists of experimentation, iteration, and continuous improvement of the process. MLOps encompasses these steps and helps the developers build high-quality machine learning models with good efficiency.
- We need to make sure that the tasks performed by the members of the team developing the machine learning application are in accordance with their responsibilities and expertise. For instance, a junior machine learning engineer who has just joined the team may not be able to retrain a model or deploy it to the production environment correctly, and a mistake could cause a significant loss to the organization. MLOps provides governance mechanisms so that we can track and regulate the actions performed by team members.
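The experiment-tracking need described above is usually met with a tool such as MLflow or Weights & Biases. As a minimal sketch of the underlying idea only, the toy tracker below (all names are hypothetical, not any real library's API) records the hyper-parameters, metrics, and artifact paths of each run so that runs can be compared later:

```python
import time
import uuid

class ExperimentTracker:
    """Toy stand-in for an experiment tracker such as MLflow.
    Records hyper-parameters, metrics, and artifact paths per run."""

    def __init__(self):
        self.runs = []

    def log_run(self, params, metrics, artifacts):
        run = {
            "run_id": uuid.uuid4().hex,
            "timestamp": time.time(),
            "params": params,        # e.g. learning rate, tree depth
            "metrics": metrics,      # e.g. accuracy, F1 score
            "artifacts": artifacts,  # e.g. model path, dataset version
        }
        self.runs.append(run)
        return run["run_id"]

    def best_run(self, metric):
        # Return the run with the highest value for the given metric.
        return max(self.runs, key=lambda r: r["metrics"][metric])

tracker = ExperimentTracker()
tracker.log_run({"max_depth": 3}, {"accuracy": 0.81}, {"model": "models/run1.pkl"})
tracker.log_run({"max_depth": 6}, {"accuracy": 0.87}, {"model": "models/run2.pkl"})
print(tracker.best_run("accuracy")["params"])  # {'max_depth': 6}
```

Real trackers add much more (UI, artifact storage, lineage), but the core value is the same: every model ever trained remains comparable and reproducible.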
Benefits of MLOps
- MLOps practices help us streamline processes in machine learning projects, which increases efficiency and improves the quality of the models. They also enable rapid model development, deployment, and production.
- Using MLOps tools, we can monitor and analyze hundreds of machine learning models. MLOps helps us perform continuous integration, continuous delivery, and continuous deployment that helps us achieve high scalability.
- Using MLOps, we can reproduce machine learning pipelines, enable tight collaboration across data teams, and reduce conflicts with the DevOps and IT teams. It helps us minimize the release time for ML models.
- MLOps also helps us minimize risks in the machine learning project. We can monitor the machine learning models, perform regulatory scrutiny and drift checks, maintain transparency, and ensure compliance with the organization’s data security measures in a machine learning project using the MLOps tools.
Components of MLOps
The components in a machine learning process may vary from project to project. However, a few steps appear in almost every machine learning development process.
- Data Collection: To train a machine learning model, we first need to have some training data. For training, we often need to gather data from different sources like data warehouses and live data streams.
- Exploratory data analysis: The real-world data is messy and has a high level of noise. We need to perform data preprocessing to make the data useful. For this, we perform data quality assessment, data cleaning, data transformation, and data reduction in various phases. We iteratively explore, share, and prepare data to create reproducible, editable, and shareable datasets for the machine learning lifecycle.
- Data preparation and Feature Engineering: During data preparation and feature engineering, we iteratively transform, aggregate, and reduce the data to create refined features. We also create a feature store to make the features visible across data teams.
- Model Training: In model training, you can use various algorithms and open source machine learning libraries such as scikit-learn, TensorFlow, and Keras to train a model on the available dataset. While training, we also evaluate the model against various statistical metrics, reevaluate the hyper-parameters, refine the dataset, and retrain to obtain more accurate and efficient models. You can also use automated machine learning tools such as AutoML to train models and create reviewable, deployable code.
- Model review and governance: In a machine learning project, we also need to track model lineage and model versions. Additionally, we need to manage model artifacts and transitions through the entire machine learning lifecycle. We also need to share source codes and artifacts and collaborate with other teams in the organization to implement the tasks in an efficient manner.
- Model Inference and Serving: We need to manage the frequency of model refresh, inference request times, and other production-specific metrics in testing and quality assurance. To automate the pre-production pipeline, we can use CI/CD tools such as repositories and orchestrators.
- Model deployment and monitoring: We can automate permissions and cluster creation to production registered models. We also need to enable REST API model endpoints.
- Model Retraining: We can create alerts and automation for situations of model drift due to dissimilarity in training and inference data.
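The stages above can be sketched end to end. The following toy pipeline substitutes a trivial threshold "model" for a real learning algorithm (which in practice would come from a library such as scikit-learn), purely to show how data preparation, training, and evaluation chain together:

```python
import random
import statistics

# Toy dataset: (feature, label) pairs; the label is 1 when feature > 0.5.
random.seed(0)
data = [(x, int(x > 0.5)) for x in (random.random() for _ in range(200))]

def prepare(data):
    # Data preparation: split into training and test sets (80/20).
    split = int(len(data) * 0.8)
    return data[:split], data[split:]

def train(train_set):
    # "Training": learn a decision threshold as the midpoint
    # between the mean feature value of each class.
    pos = [x for x, y in train_set if y == 1]
    neg = [x for x, y in train_set if y == 0]
    return (statistics.mean(pos) + statistics.mean(neg)) / 2

def evaluate(threshold, test_set):
    # Model evaluation: accuracy on held-out data.
    correct = sum(int(x > threshold) == y for x, y in test_set)
    return correct / len(test_set)

train_set, test_set = prepare(data)
threshold = train(train_set)
accuracy = evaluate(threshold, test_set)
print(f"learned threshold={threshold:.2f}, accuracy={accuracy:.2f}")
```

In a real project each function would be a pipeline stage with its own artifacts (dataset versions, model files, metric reports), which is exactly what the MLOps tooling above tracks.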
MLOps vs DevOps
DevOps and MLOps serve similar purposes: both aim to ship reliable software quickly. However, there are fundamental differences between the practices involved in the two.
- MLOps is experimental. We often need to change training data, tune hyper-parameters, and even machine learning algorithms in order to achieve our goals for a machine learning project. In a software project, this is not the case. In software projects, there is a defined process of software development.
- MLOps teams include data scientists, data engineers, and machine learning engineers who focus on exploratory data analysis, model development, and experimentation. DevOps teams, on the other hand, consist mainly of software developers who know how to build production-class services; MLOps teams may not have that experience.
- Testing a software project involves tasks like unit testing and integration testing. Testing a machine learning application involves model validation and model training in addition to the tasks used in software development.
- There are differences in how we deploy a software project and a machine learning project. We cannot deploy a machine learning model that has been trained on offline data as a prediction service. We need to develop pipelines to retrain and redeploy a model for desired outcomes. On the other hand, you can deploy a software application to the production environment after testing and QA and it will work as desired.
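To make the testing difference concrete, here is a minimal sketch of a model-validation gate, the kind of check an ML project runs in addition to unit and integration tests. All names and thresholds are illustrative assumptions, not a standard API:

```python
def validate_model(candidate_metrics, baseline_metrics,
                   min_accuracy=0.80, max_regression=0.02):
    """Gate a candidate model before deployment: it must clear an
    absolute accuracy bar and must not regress too far below the
    model currently in production."""
    checks = {
        "meets_min_accuracy":
            candidate_metrics["accuracy"] >= min_accuracy,
        "no_large_regression":
            candidate_metrics["accuracy"]
            >= baseline_metrics["accuracy"] - max_regression,
    }
    return all(checks.values()), checks

# A candidate at 0.86 accuracy vs. a 0.87 baseline passes both checks.
ok, report = validate_model({"accuracy": 0.86}, {"accuracy": 0.87})
print(ok, report)
```

Unlike a unit test, this gate depends on data: the same code can pass today and fail tomorrow if the evaluation set or the baseline changes, which is why ML testing must be automated inside the pipeline rather than run once at release time.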
How to Implement MLOps in Machine Learning Projects?
We can implement machine learning operations (MLOps) manually, with machine learning pipeline automation, or with CI/CD pipeline automation.
Manual MLOps Implementation
Manual MLOps implementation suits teams that are just starting with machine learning. For small projects in their initial phases, a manual, data-scientist-driven workflow is sufficient, provided the models are rarely changed or retrained.
- In a manual MLOps implementation, every step, such as exploratory data analysis, data preprocessing, model training, and validation, is manual. ML engineers and data scientists perform each task by hand.
- In manual MLOps, there is a disconnect between machine learning and operations. The manual process separates the data scientists who train and create the models from the engineers who serve the machine learning models as a prediction service. Here, the machine learning engineers and data scientists create and provide a machine learning model as an artifact. The model is then used by other software engineers to deploy on their API infrastructure.
- Manual MLOps causes infrequent release iterations. The machine learning team manages models that don't change frequently, and new or retrained models are deployed only a few times a year.
- Manual MLOps doesn’t have continuous integration and deployment due to less frequent implementation changes and model version deployments.
- In manual MLOps, we don’t conduct active performance monitoring or log model predictions and actions.
Manual MLOps isn’t very efficient because machine learning models often fail in real production environments: either they don’t work at all or they are highly inaccurate. To tackle this, we need to adopt MLOps practices with continuous integration and deployment, and enable continuous training of the model. With these in place, we can rapidly test, build, and deploy machine learning models with updated data and requirements.
MLOps Using Machine Learning Pipeline Automation
In MLOps with machine learning pipeline automation, we focus mainly on continuous training of the model by automating the ML pipeline. Through continuous training, we can achieve continuous delivery of model prediction services. This implementation suits models that operate in a constantly changing environment, where we must actively respond to changes in the input data to keep predictions accurate.
- MLOps using ML pipeline automation helps in rapid experimentation. The machine learning experimentation steps are orchestrated and done automatically. Also, the model is continuously trained in production using fresh data based on live ML pipeline triggers.
- In this implementation of MLOps, we have experimental-operational symmetry: the same ML pipeline used in the development environment is also used in pre-production and production.
- When we implement MLOps using the ML pipeline, we need reusable, composable, and shareable code components. Hence, the source code is modularized which also helps in efficient debugging in case of errors.
- In a manual MLOps implementation, we train a model and then deploy it for prediction. With pipeline automation, we deploy the entire ML pipeline instead; the pipeline runs automatically to serve the trained model as the prediction service. Automating the whole process gives us continuous delivery of trained, validated models. In the production pipeline, we add automated data validation and model validation steps while training the model on live data.
- In MLOps using ML pipeline, we use a feature store that works as a centralized repository that facilitates standardized definition, storage, and access of features for training and serving ml models. We also perform metadata management to store information about each execution of the ML pipeline. This helps us identify data and artifacts lineage, achieve reproducibility, debug errors, and perform comparisons.
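The feature store mentioned above exists so that training and serving compute features with identical logic. Production systems use dedicated tools (Feast, Vertex AI Feature Store, etc.); the toy in-memory version below, with entirely hypothetical names, only sketches the core idea:

```python
class FeatureStore:
    """Toy in-memory feature store: one place to define a feature's
    transformation so training and serving use identical logic."""

    def __init__(self):
        self._transforms = {}

    def register(self, name, fn):
        self._transforms[name] = fn

    def compute(self, name, raw_record):
        # Both the training pipeline and the prediction service call
        # this method, so feature logic cannot drift between the two.
        return self._transforms[name](raw_record)

store = FeatureStore()
store.register("age_bucket", lambda rec: min(rec["age"] // 10, 9))
store.register("is_weekend", lambda rec: int(rec["day"] in ("sat", "sun")))

record = {"age": 34, "day": "sun"}
features = {name: store.compute(name, record)
            for name in ("age_bucket", "is_weekend")}
print(features)  # {'age_bucket': 3, 'is_weekend': 1}
```

Centralizing feature definitions like this prevents training/serving skew, one of the most common sources of silent accuracy loss in production models.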
MLOps with ML pipeline automation gives us continuous training of the model. However, if we want to change the algorithm or the data format changes, this implementation breaks down. To tackle this, we can use MLOps with continuous integration and continuous deployment, which automates the development, testing, and deployment of the ML pipelines themselves.
MLOps With Continuous Integration and Continuous Deployment
Alongside continuous training, we also need continuous integration and deployment for the best results. Fast, reliable updates of pipelines in the production environment require a robust CI/CD pipeline. With a CI/CD pipeline in place, machine learning engineers and data scientists can experiment with algorithms, features, and hyper-parameters to get the best model out of the available resources.
- Having a CI/CD pipeline helps us retrain models daily, update them in minutes, and deploy them to hundreds of servers with a single click. MLOps with CI/CD has various components such as source control, test and build services, deployment services, a model registry, a pipeline orchestrator, and a metadata and feature store.
- With a CI/CD pipeline available, we can try out new ML algorithms and modeling techniques, with the experiment steps orchestrated. The output of experimentation is the source code of the machine learning pipeline steps, which is pushed to the source code repository. We run various tests on the source code, and the outputs are stored as packages, executables, and other artifacts. This is continuous integration.
- The artifacts created by the continuous integration stage are continuously deployed to the target environment. This is called continuous delivery. After each delivery, we get a deployed ML pipeline with a new implementation of the ML model.
- The ML Pipeline is automatically executed in the production environment in response to any trigger. The output of this stage is a newly trained model that we push to the model registry.
- After registration, we serve the trained model as a prediction service, giving us a deployed model prediction service.
- Based on live data, we evaluate model performance. Depending on the results, either a trigger is issued to re-execute the pipeline or, if the results are poor, a new experiment cycle is started. This analysis is done manually by data scientists.
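The evaluation-and-trigger step above can be sketched as a simple monitoring rule. The thresholds and names here are illustrative assumptions; in practice the trigger would call the pipeline orchestrator's API rather than return a string:

```python
def should_trigger_retraining(live_accuracy, baseline_accuracy,
                              tolerance=0.05):
    """Issue a pipeline trigger when live performance drops more than
    `tolerance` below the accuracy measured at deployment time."""
    return live_accuracy < baseline_accuracy - tolerance

def monitor(window, baseline_accuracy):
    # `window` is a list of (prediction, actual) pairs from live traffic.
    live_accuracy = sum(p == a for p, a in window) / len(window)
    if should_trigger_retraining(live_accuracy, baseline_accuracy):
        return "trigger: execute training pipeline"
    return "ok: keep serving current model"

# Live accuracy of 0.70 against a 0.90 baseline -> retraining is triggered.
window = [(1, 1)] * 7 + [(1, 0)] * 3
print(monitor(window, baseline_accuracy=0.90))
```

A real deployment would also account for label delay (ground truth often arrives long after the prediction) and use statistical drift tests on the input features, not just raw accuracy.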
MLOps Tools and Platforms
There are various commercial solutions from companies like Amazon, Microsoft, and Google that you can use to build, train, and deploy machine learning models following MLOps principles.
- Amazon provides Amazon Sagemaker for developing, training, deploying, and monitoring machine learning models.
- Google provides the Google Cloud MLOps suite. It has the following components.
- Dataflow is used for data extraction, validation, and transformation. We can also use Dataflow to evaluate models.
- Vertex AI Workbench is used to develop and train machine learning models.
- Cloud Build is used to build and test machine learning pipelines.
- TensorFlow Extended (TFX) is used to deploy machine learning pipelines.
- Kubeflow Pipelines is used to orchestrate model deployments on top of Google Kubernetes Engine.
- Microsoft also provides us with the Microsoft Azure MLOps suite for the same tasks.
In this article, we discussed machine learning operations (MLOps), including the need for it, its benefits, and the different ways to implement it. We also briefly compared MLOps with DevOps. To learn more about MLOps, you can read this MLflow tutorial.