This article is part of our coverage of the latest in AI research.
Despite the growing interest in applied machine learning, organizations continue to face enormous challenges in integrating ML into real-world applications. A considerable percentage of machine learning projects either get abandoned before they’re finished or fail to deliver on their promises.
Applied ML is a young and evolving area. MLOps, an emerging field of practices for deploying and maintaining machine learning models, has inspired many tools and platforms for ML pipelines. However, much remains to be done.
A recent study by scientists at the University of California, Berkeley, sheds light on best practices for operationalizing machine learning in different organizations. The paper, which is based on a survey of machine learning engineers in various industries, contains vital lessons for the successful deployment and maintenance of ML models, and guidance for the development of future MLOps tools.
The MLOps pipeline
The MLOps pipeline is a recurring cycle that consists of four tasks. First, the ML team must collect, clean, and store data to train the models. If the organization is building a supervised machine learning model, the team must also label the data, either manually or with the help of semi-automated tools.
The team must then do feature engineering and model experimentation. In this stage, the ML engineers try different machine learning algorithms and create features that correspond to those models. For example, deep neural networks might require little or no feature engineering, but classic ML models such as linear regression or support vector machines might require extra engineering efforts, such as using dimensionality reduction to select a minimal set of relevant features.
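As a concrete illustration of that extra engineering effort, dimensionality reduction can be done in a few lines. The sketch below implements a minimal PCA with NumPy; the function name and data shapes are illustrative, not taken from the paper:

```python
import numpy as np

def pca_reduce(X, k):
    # Center the data, then project onto the top-k principal components.
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))   # 100 samples, 20 raw features
X_small = pca_reduce(X, 5)       # keep only 5 components
print(X_small.shape)             # (100, 5)
```

The reduced matrix can then feed a linear model or SVM with far fewer, less correlated features.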
The next step is evaluation and deployment. Machine learning engineers compare different ML models based on their performance on a held-out validation dataset. In applied machine learning pipelines, deployment is usually done in a staged manner: The model is first rolled out to a subset of users and evaluated against the current production models. If it works well, it is then expanded to more users, tested again, and so on.
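Staged rollouts are commonly implemented with deterministic user bucketing, so the same user always sees the same model while the exposed fraction grows. A minimal sketch, assuming hash-based bucketing (the function name and percentages are illustrative):

```python
import hashlib

def rollout_bucket(user_id: str, percent_new: int) -> str:
    """Deterministically assign a user to the candidate or current model."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100       # stable 0-99 bucket per user
    return "candidate" if bucket < percent_new else "current"

# Start by exposing ~5% of users, then widen the rollout over time.
assignments = [rollout_bucket(f"user-{i}", 5) for i in range(1000)]
print(assignments.count("candidate"))
```

Raising `percent_new` to 25, 50, and eventually 100 widens the rollout without reassigning users who already see the candidate.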
Finally, the ML pipeline needs constant monitoring and response. The ML team must continuously monitor the model’s performance on different user and data subsets and look for signs of drift, degradation, and other problems. At the same time, the engineers need tools to collect fresh data from user interactions with the model. Once the ML model’s performance drops below a certain threshold, they must restart the pipeline: gather fresh data, then train, validate, and deploy a new version of the model.
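A monitoring loop of this kind can be as simple as a rolling accuracy window that flags when retraining is due. A minimal sketch (the class name, window size, and threshold are illustrative, not from the paper):

```python
from collections import deque

class DriftMonitor:
    """Track rolling accuracy on live traffic and flag when retraining is needed."""
    def __init__(self, window=500, threshold=0.90):
        self.outcomes = deque(maxlen=window)   # old outcomes age out automatically
        self.threshold = threshold

    def record(self, correct: bool):
        self.outcomes.append(correct)

    def needs_retraining(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False                       # wait until the window is full
        return sum(self.outcomes) / len(self.outcomes) < self.threshold

monitor = DriftMonitor(window=3, threshold=0.9)
for correct in [True, True, False]:
    monitor.record(correct)
print(monitor.needs_retraining())   # True: 2/3 ≈ 0.67 is below the 0.9 threshold
```

In practice the same signal would trigger the retraining pipeline rather than just returning a boolean.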
The “three Vs” of MLOps
Based on their interviews with machine learning engineers, the UC Berkeley scientists define three criteria for the success of MLOps pipelines: velocity, validation, and versioning. They call this the “Three Vs” of MLOps.
Velocity is the speed at which the ML team can iterate. Developing machine learning models is a scientific process. It requires constant observation, hypothesizing, development, and testing. The faster an ML team can develop, train, and test new ML prototypes, the quicker it can reach the optimal solution. “ML engineers attributed their productivity to development environments that prioritized high experimentation velocity and debugging environments that allowed them to test hypotheses quickly,” according to the paper.
Validation measures how fast ML teams can find errors in their models. When using machine learning in real-world applications, organizations want to find and fix errors as early as possible, before users are exposed to them.
Finally, versioning is the capability to keep multiple versions of ML models. Sometimes, a new ML model that works well on validation datasets ends up performing worse in production. In such cases, ML teams must be able to quickly switch back to the older model until they can debug and update their new one.
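A minimal sketch of such a version store with rollback, assuming models are callable objects (all names here are illustrative):

```python
class ModelRegistry:
    """Keep every deployed model version so production can roll back quickly."""
    def __init__(self):
        self.versions = {}    # version tag -> model object
        self.active = None

    def register(self, tag, model, activate=True):
        self.versions[tag] = model
        if activate:
            self.active = tag

    def rollback(self, tag):
        if tag not in self.versions:
            raise KeyError(f"unknown version {tag!r}")
        self.active = tag

    def predict(self, x):
        return self.versions[self.active](x)

registry = ModelRegistry()
registry.register("v1", lambda x: x * 2)
registry.register("v2", lambda x: x * 3)   # the new model goes live
registry.rollback("v1")                    # v2 misbehaves in production
print(registry.predict(10))                # 20 — served by v1 again
```

Real registries also persist model artifacts and their evaluation metrics, but the core idea is the same: the old version stays one switch away.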
Based on these key tenets and their interviews with machine learning engineers, the authors of the paper provide some practical insights for successful MLOps pipelines.
High velocity
Compared to traditional software development, machine learning engineering is much more experimental. It is therefore natural that a considerable percentage of ML models never make it to production.
“What matters is making sure ideas can be prototyped and validated quickly—so that bad ones can be pruned away immediately,” the authors of the paper write.
In their study, they document some interesting strategies that can help establish high-velocity ML pipelines. One example is cross-team collaboration, where data scientists and subject matter experts work together to choose the best hypotheses, features, and models. This helps discard unfeasible hypotheses at the ideation stage, before allocating development and computation resources.
In some cases, iterating on the data yielded quicker results than cycling through different machine learning algorithms and model configurations.
Some teams set up their machine learning pipelines to “kill ideas with minimal gain in early stages to avoid wasting future time” and focus on “ideas with the largest performance gains in the earliest stage of deployment.” They do this by setting up sandboxes (usually Jupyter notebooks) where they can quickly stress-test their ideas.
Another interesting approach is minimizing code changes. To do this, engineers develop their machine learning modules to switch models by modifying config files instead of source code. This way, they can create multiple versions of config files and have them quickly validated by the same code.
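A config-driven setup can be sketched as a small builder that maps config values to model constructors and pins the random seed, so experiments differ only in their config files. The model names, fields, and string stand-ins below are illustrative:

```python
import json
import random

CONFIG = json.loads("""
{
  "model": "logistic",
  "learning_rate": 0.01,
  "seed": 42
}
""")

# Each entry maps a config name to a constructor; the strings stand in
# for real model objects in this sketch.
MODELS = {
    "logistic": lambda cfg: f"LogisticModel(lr={cfg['learning_rate']})",
    "tree":     lambda cfg: f"TreeModel(depth={cfg.get('max_depth', 3)})",
}

def build(cfg):
    random.seed(cfg["seed"])   # fixed seed so experiments are reproducible
    return MODELS[cfg["model"]](cfg)

print(build(CONFIG))   # LogisticModel(lr=0.01)
```

Switching to a tree model is then a one-line config change, with the training code untouched.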
“Because ML experimentation requires many considerations to yield correct results—e.g., setting random seeds, accessing the same versions of code libraries and data—constraining engineers to config-only changes can reduce the number of bugs,” the researchers write.
Early validation
Machine learning models must be updated constantly to keep pace with changes in the environment, the data, customer requirements, and business goals. To achieve this, organizations need evaluation procedures that adapt to a changing world, avoid repeating past failures, and prevent bad ML models from reaching production.
One of the important takeaways from interviews with machine learning engineers is to regularly update validation datasets. This is a change from the standard practice in academia, which is to test models against fixed validation datasets. “Dynamic validation sets served two purposes,” the researchers write. “(1) the obvious goal of making sure the validation set reflects live data as much as possible, given new learnings about the problem and shifts in the aggregate data distribution, and (2) the more subtle goal of addressing localized shifts that subpopulations may experience (e.g., low accuracy for a specific label).”
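One way to realize a dynamic validation set is a fixed-size buffer that ages out old examples and reports accuracy per subpopulation rather than as one aggregate number. A minimal sketch (class and field names are illustrative, not from the paper):

```python
from collections import defaultdict, deque

class DynamicValidationSet:
    """Refresh the validation set with recent live examples and track
    per-subgroup accuracy so localized shifts are not averaged away."""
    def __init__(self, max_size=1000):
        self.examples = deque(maxlen=max_size)   # old examples age out

    def add(self, features, label, subgroup):
        self.examples.append((features, label, subgroup))

    def evaluate(self, model):
        hits = defaultdict(int)
        totals = defaultdict(int)
        for features, label, subgroup in self.examples:
            totals[subgroup] += 1
            hits[subgroup] += int(model(features) == label)
        return {g: hits[g] / totals[g] for g in totals}

val = DynamicValidationSet(max_size=4)
val.add({"f": 1}, 1, "mobile")
val.add({"f": 2}, 0, "mobile")
val.add({"f": 3}, 1, "desktop")
print(val.evaluate(lambda features: 1))   # {'mobile': 0.5, 'desktop': 1.0}
```

The per-subgroup breakdown is what surfaces the "low accuracy for a specific label" problem the researchers describe, which a single aggregate score would hide.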
An interesting strategy is to use a “shadow stage,” where a candidate machine learning model is integrated into the production system, but its predictions are not surfaced to users. This enables ML engineers to dynamically validate their models against live data without putting the business at risk. Note that applications with a feedback loop from the user (e.g., recommender systems) don’t support shadow deployments, since the candidate’s predictions never reach users and therefore never generate the feedback needed to evaluate them.
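The shadow stage itself can be sketched as a thin wrapper that serves the production model while silently logging the candidate’s predictions on the same traffic (an illustrative sketch, not the paper’s implementation):

```python
class ShadowDeployment:
    """Serve the production model while logging the candidate's predictions
    on the same live inputs for later offline comparison."""
    def __init__(self, production, shadow):
        self.production = production
        self.shadow = shadow
        self.log = []                     # (input, prod_pred, shadow_pred)

    def predict(self, x):
        prod_pred = self.production(x)
        shadow_pred = self.shadow(x)      # computed but never shown to users
        self.log.append((x, prod_pred, shadow_pred))
        return prod_pred                  # users only see the production output

prod = lambda x: x + 1        # stand-in production model
candidate = lambda x: x * 2   # stand-in shadow candidate
deployment = ShadowDeployment(prod, candidate)
print(deployment.predict(3))  # 4 — the production answer
```

The accumulated log can then be compared against ground-truth labels offline to decide whether the candidate is safe to promote.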
Finally, an important part of successful MLOps validation is to update validation metrics to reflect the right business goals. As products evolve, their key performance indicators and growth metrics change. For example, at one point, the company’s goal might be to increase the number of active users regardless of the revenue they generate. At a later stage, the same company might want to increase the share of paying users. The deployed ML models must be evaluated based on their contribution to these key metrics. This requires close coordination between the ML, product, and business teams.
Good software engineering
Robust MLOps requires sound software engineering and DevOps practices, according to the experience of the ML engineers interviewed by the UC Berkeley team.
For example, regular retraining of ML models helped teams keep their ML models up to date and avoid drift. This required the software and data engineering teams to set up the right pipelines to continuously gather and label fresh data. A complement to this practice was to have a robust versioning system that kept track of different versions of ML models and their performance metrics. This allowed the engineers to set up automated or semi-automated processes to fall back to older versions when the production model’s performance dropped below a certain threshold.
In some applications, software engineers had to deliberately add a layer of classic rule-based heuristics to stabilize the behavior of ML models. This can be important in many applications where machine learning models may learn and behave based on erroneous correlations in the features of their input vectors. “This combination of modern model-driven ML and old-fashioned rule-based AI indicates a need for managing filter (and versions of filters) in addition to managing learned models,” the researchers write.
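Such a rule-based layer can be as simple as clamping a learned model’s output to a plausible range. A minimal sketch, with a stand-in model (all names and bounds below are hypothetical):

```python
def ml_price_model(features):
    # Stand-in for a learned model; it may extrapolate wildly on odd inputs.
    return features["base"] * features["demand_multiplier"]

def with_guardrails(predict, low, high):
    """Wrap a learned model with a rule-based filter that clamps
    implausible outputs to a safe range."""
    def guarded(features):
        return min(max(predict(features), low), high)
    return guarded

safe_model = with_guardrails(ml_price_model, low=1.0, high=500.0)
print(safe_model({"base": 10.0, "demand_multiplier": 900.0}))   # 500.0, clamped
```

As the quoted passage suggests, these filters need versioning of their own, since a bound that is sensible today may be wrong after the next retraining.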
Finally, a common theme across successful machine learning projects is a preference for simplicity. While research in academia often focuses on pushing the limits of the state of the art, in applied machine learning, using the simplest possible model is a winning strategy. Simpler models cost less to train and run, and they are often more interpretable. In one interesting case, the ML engineers reported that they developed a hybrid approach, in which they used a neural network to create a feature embedding layer that was then used as input for several simple classifier models.
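That hybrid pattern can be sketched with a shared embedding feeding a cheap downstream classifier. Here a random linear layer stands in for the trained network and a nearest-centroid rule stands in for the simple classifier; everything below is illustrative, not the interviewed team’s system:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(20, 4))   # stand-in for a trained embedding layer

def embed(x):
    # Hypothetical learned embedding: one linear layer plus ReLU.
    return np.maximum(x @ W, 0.0)

def centroid_classifier(X_train, y_train):
    """A cheap classifier built on top of the shared embedding:
    predict the class whose embedded centroid is nearest."""
    E = embed(X_train)
    centroids = {c: E[y_train == c].mean(axis=0) for c in np.unique(y_train)}
    def predict(x):
        e = embed(x)
        return min(centroids, key=lambda c: np.linalg.norm(e - centroids[c]))
    return predict

X = rng.normal(size=(30, 20))
y = np.array([0] * 15 + [1] * 15)
predict = centroid_classifier(X, y)
print(predict(X[0]))
```

Several such lightweight classifiers can share the one expensive embedding, which is where the cost and interpretability benefits come from.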
Creating better MLOps tools
The paper is an interesting study in the challenges and lessons of implementing machine learning in real-world applications. The researchers conclude that successful MLOps practices center around “having high velocity, validating as early as possible, and maintaining multiple versions of models for minimal production downtime.”
Therefore, MLOps stacks should be built with the goal of addressing the three Vs.
“MLOps tool builders may want to prioritize ‘10x’ better experiences across velocity, validating early, or versioning for their products,” the researchers write.