In this special guest feature, Emily Kruger, Vice President of Product at Kaskada, discusses a topic on the minds of many data scientists and data engineers these days: maximizing the impact of machine learning in production environments. Kaskada is a machine learning company that enables collaboration among data scientists and data engineers. Its machine learning studio for feature engineering on event-based data gives organizations a single platform for feature creation and feature serving, unifying the feature engineering process across teams.
Machine learning is changing the way the world does business. Everywhere you look machine learning is powering customer-facing and business-critical systems and delivering outsized impact. Trends toward hyper-personalization, automated operations, and real-time decisioning continue to drive investment, and enterprises are betting millions of dollars on their machine learning capabilities.
However, this investment and impact from advancing ML technology are not evenly distributed. Multiple sources show high tech pulling away from other industries, led by the big four – Apple, Amazon, Facebook, and Google. These companies have large teams and tens of millions of dollars to invest in ML. To be successful and remain competitive, all enterprises need the capabilities to deliver production-grade ML and match the speed of innovation of larger firms.
Delivering machine learning to production environments is not simple, however, and tends to be rife with inefficiencies. In most organizations, data scientists and data engineers work in siloed environments. Neither team has the skills to build ML systems on their own, causing friction and lost time. For instance, typical data science tools, like Jupyter notebooks, cannot be easily productionized, and this work must be rewritten by engineers to be used in production. There are separate development environments, with no reuse of data features, no shared pipeline, and other communication limitations which can delay ML projects by months or quarters.
The infrastructure required for data aggregation, processing, and serving in production is also complex. Data engineers need to build data pipelines manually, typically with open source tools that are not ideally suited to this use case. Building these systems takes months or years, and even then, they require continued overhead to maintain and keep them running smoothly. As data volume and ML needs grow, these pipelines reach their scaling limits and need to be re-architected and rebuilt from the ground up, and the process begins again. The in-house expertise needed to build and maintain these systems is immense and, as a result, most companies are only realizing a fraction of the impact that they should from their ML investments.
Enter ML Platforms
Data scientists and data engineers need integrated tools that speed the development and delivery of ML-powered products. ML platforms are an emerging solution embraced by many enterprises and are purpose-built to help get ML to production efficiently and reliably.
Many big tech firms have already developed proprietary ML platforms in-house. They provide data scientists and engineers with tools for model serving and online feature stores, allowing them to ingest, catalog, and deploy features, as well as share features across teams. These systems deploy models seamlessly and deliver feature vectors to applications in milliseconds for near real-time decisions. But proprietary platforms take years and considerable sums to build. It’s not easy to create a platform that can manage change along three axes: the model, the data, and the code itself. Experienced and expensive data engineering teams are required to build this in-house. We recommend that companies build a custom ML platform only when ML is part of their core IP.
New commercial ML platforms are reducing the cost of entry, however. Companies can now achieve ML in production with commercially available platforms built on the technology, insights, and best practices of big tech companies. These platforms integrate with or replace the data science workflows and tools in use today and are available without the cost, talent, and time requirements needed to build this capability in-house. We see a host of platforms focused on solving bottlenecks in the ML process, such as model development and serving, model governance, experimentation and versioning, feature engineering, and more.
Do You Need an ML Platform?
At what point does it make sense to invest in an ML platform? For many companies, the trigger is the failure of existing systems and processes to scale. This could be when you move beyond one or two ML models running in production, or when your data volume exceeds your current processing capability, or when your data organization grows to the point where verbal collaboration becomes challenging. Here are a few common bottlenecks and possible flavors of ML platforms to help you address these challenges:
Feature Engineering for Production
The Problem: More often than not, the most informative features for your model are ones that have been painstakingly crafted by your data scientists. Unfortunately, in order to use these inside your production models, your data engineers need to reimplement them in the production system, wasting valuable time duplicating work and often leading to inconsistencies in results. This handoff can take weeks or months, if it happens at all.
The Solution: Feature stores are an emerging ML technology designed to address getting features to production quickly and reliably. Feature stores allow data scientists to use the same features to train models and to deploy to production, eliminating errors and the need for rewrites. Another benefit? Feature stores allow data scientists across teams to share their work, reducing duplication of effort and allowing common feature definitions to be used across your organization.
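To make the idea concrete, here is a minimal sketch of the train/serve consistency a feature store provides. The `FeatureStore` class and feature names below are invented for illustration, not any particular product’s API:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class FeatureStore:
    """Toy in-memory feature store: each feature is defined once."""
    _definitions: Dict[str, Callable[[dict], Any]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[dict], Any]) -> None:
        # A feature is a named function over raw event data.
        self._definitions[name] = fn

    def compute(self, names: List[str], raw: dict) -> dict:
        # The same definitions back both offline training and online serving,
        # so the two code paths cannot drift apart.
        return {n: self._definitions[n](raw) for n in names}

store = FeatureStore()
store.register("purchase_count", lambda raw: len(raw["purchases"]))
store.register("avg_order_value",
               lambda raw: sum(raw["purchases"]) / max(len(raw["purchases"]), 1))

# Training pipelines and the serving layer both call compute(),
# eliminating the reimplementation step described above.
vector = store.compute(["purchase_count", "avg_order_value"],
                       {"purchases": [20.0, 35.0, 5.0]})
```

Real feature stores add persistence, backfills, and low-latency online lookups, but the core contract is the same: one definition, shared by training and production.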
Model Management and Deployment
The Problem: As your data teams grow, so do the number of production models and the number of new experiments being run. While your data scientists save models to their local machines, the process may work OK for a time, but eventually work gets lost locally and efforts get duplicated across teams. As you try to centralize these efforts, you may end up with a warehouse of models that are not properly tracked or organized.
Even if your model versioning and organization are under control, deploying a fully trained model to production raises many new operational questions, from “how do you manage deployments?” to “where does the model live?” to “how do I roll back to a known-good model if something goes wrong?”
The Solution: Model orchestration tools are becoming more and more prevalent. These tools handle the versioning, storage, organization, and/or deployment of your ML models in a safe, efficient, and reproducible way. Some model orchestration tools cover only a portion of this workflow, such as experimentation and version control, while others focus more on automated deployment of models. Consider whether you need an end-to-end platform for model orchestration, or whether the portion of this workflow that is the biggest challenge in your organization should be addressed first.
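A rough sketch of the versioning-and-rollback piece of model orchestration follows. All names here are hypothetical; real registries add artifact storage, metadata, and access control:

```python
class ModelRegistry:
    """Toy model registry: versioned, append-only history with rollback."""
    def __init__(self):
        self._versions = []      # every published model artifact, in order
        self._production = None  # index of the version currently serving

    def publish(self, model: dict) -> int:
        # Publishing never overwrites: each model gets a new version id.
        self._versions.append(model)
        return len(self._versions) - 1

    def promote(self, version: int) -> None:
        self._production = version

    def rollback(self) -> None:
        # Revert to the previously published version, if one exists.
        if self._production is not None and self._production > 0:
            self._production -= 1

    def serving(self) -> dict:
        return self._versions[self._production]

registry = ModelRegistry()
v0 = registry.publish({"name": "churn-model", "auc": 0.81})
v1 = registry.publish({"name": "churn-model", "auc": 0.74})  # a regression
registry.promote(v1)
registry.rollback()  # the known-good v0 is serving again
```

Because history is append-only, “roll back to an old, good model” is a pointer move rather than a redeployment scramble.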
Prediction Logging and Monitoring
The Problem: Getting a model into production is only half the battle. Once it’s there you’ll need to continuously monitor the data inputs, as well as the prediction outputs to understand whether your model is still performing well. As the number of models grows, this analysis and operational burden can overwhelm your data scientists without the right tools and systems in place. And even then, with some models making thousands of predictions a minute, time is of the essence to catch data issues before they cost you too much.
The Solution: Prediction logging and monitoring systems need to be in place, so that if an issue arises from one of your data sources, it won’t cascade into a costly system failure. And while these systems can help you address costly issues as they occur, they can also help you avoid mistakes, or notice when your model is starting to slowly degrade. Preventing performance from degrading by a few percentage points can avoid an outsized impact on customer experience.
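As an illustration, here is a toy monitor that logs predictions alongside an input feature and flags drift with a simple mean-deviation check. The class and thresholds are invented for the sketch; production systems run richer per-feature statistical tests (e.g. KS statistic or population stability index):

```python
import statistics

class PredictionMonitor:
    """Sketch: log predictions and flag drift in one numeric input."""
    def __init__(self, baseline_mean: float, tolerance: float):
        self.baseline_mean = baseline_mean  # input mean seen during training
        self.tolerance = tolerance          # allowed absolute deviation
        self.log = []

    def record(self, feature_value: float, prediction: float) -> None:
        self.log.append((feature_value, prediction))

    def drifted(self) -> bool:
        # Alarm when the live input mean wanders outside the tolerance band.
        live_mean = statistics.mean(v for v, _ in self.log)
        return abs(live_mean - self.baseline_mean) > self.tolerance

monitor = PredictionMonitor(baseline_mean=50.0, tolerance=5.0)
for value in (72.0, 68.0, 75.0):   # live traffic has shifted upward
    monitor.record(value, prediction=0.9)

alert = monitor.drifted()  # live mean is far from the training baseline
```

Catching a shift like this early is what keeps a slowly degrading model from quietly eroding customer experience.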
The era of commercial ML platforms is changing the economics of running ML in production. Companies are now shortening their development timelines, which empowers them to deliver personalization, recommendations, operational excellence, and other value to their products. In short, running ML in production is changing how today’s organizations serve their customers. And commercial ML platforms are opening up new markets and opportunities, empowering organizations of all sizes to truly compete with big tech.