Presented By O’Reilly and Intel AI
Put AI to Work
April 29-30, 2018: Training
April 30-May 2, 2018: Tutorials & Conference
New York, NY

Machine learning meets DevOps: Paying down the high-interest credit card

Sameer Wadkar (Comcast NBCUniversal), Nabeel Sarwar (Comcast NBCUniversal)
2:35pm–3:15pm Tuesday, May 1, 2018
Implementing AI
Location: Sutton North/Center
Average rating: 3.50 (2 ratings)

Who is this presentation for?

  • Machine learning engineers, architects, tech leads, developers, and DevOps engineers

Prerequisite knowledge

  • A conceptual understanding of machine learning, distributed systems and containerization, and DevOps

What you'll learn

  • Learn how to use DevOps principles and methodologies to integrate machine learning models into a data processing pipeline
  • Explore the challenges involved in operationalizing machine learning models in a high-velocity streaming environment

Description

Machine learning teams need to work with raw data in ad hoc ways to create the features that drive model development. Operationalization teams then inherit these ad hoc transformations and translate them into formal data pipelines that apply the transformations to raw data and run the model to make predictions. This handoff creates friction, slows down model operationalization, and is at odds with the business need to deploy models rapidly.

Sameer Wadkar and Nabeel Sarwar explain how to seamlessly integrate the model development and model deployment processes, enabling rapid turnaround from development to operationalization in high-velocity streaming environments. The features of the system include:

  • Integrating model training and model deployment: Machine learning teams typically start with raw data and apply ad hoc transformations to produce input datasets for model training. This is necessary for rapid prototyping. However, a model is considered formally trained only when it is developed on a versioned dataset created by a formal, unified transformation pipeline that converts raw streaming input into a training-input dataset. The same transformation pipeline then feeds the prediction phase by applying the trained model to streaming data (see the sketch after this list). This ensures models can be deployed rapidly once they are trained, because no additional effort is needed to recode the ad hoc transformations.
  • A model as an environment composed of versioned artifacts: Version everything! From an operations perspective, a model is a well-defined execution environment supporting versioned instances of raw input datasets, transformation data pipelines, training datasets, and training/prediction pipelines operating in highly cohesive (versioned) and highly decoupled (messaging) environments. Both model training and model prediction receive inputs from streams, apply a consistent set of transformations, and write results (training data rows or predictions) to data stores/streams, which are external to the model environment but integrate with the model environment using well-defined interfaces.
  • Evaluating model performance: Multiple model versions can be compared by executing them in parallel. Different model versions may use different feature sets (e.g., averaging over different time windows) or simply different hyperparameters. The former plug into a different version of the transformation pipeline, while the latter use the same set of transformation pipelines. Thus the environment seamlessly supports A/B testing, multivariate testing, live but still dark deployments, and more. Furthermore, incorporating real-world feedback into model evaluation enables metrics-driven deployment and retirement decisions.
  • A feature store and model repository: A feature is an immutable mapping from a raw input attribute via the data transformation pipeline. A feature store is a metadata repository that is tightly integrated with a data lake metastore. It documents the provenance of how a feature maps from a raw attribute via the transformation pipelines. Features in turn are mapped to model versions in a model repository.
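To make the idea concrete, here is a minimal sketch of a shared, versioned transformation pipeline that feeds both training-set generation and streaming prediction, with feature-store-style metadata recording how each feature maps back to a raw attribute. The class and attribute names (FeatureSpec, TransformationPipeline, ModelEnvironment, bytes_per_session) are illustrative assumptions, not part of the speakers' actual system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class FeatureSpec:
    """Feature-store-style metadata: which raw attribute a feature comes from
    and which pipeline version produced it."""
    name: str
    source_attribute: str
    pipeline_version: str


@dataclass
class TransformationPipeline:
    """A versioned set of raw-attribute -> feature transformations,
    shared by training-set generation and online prediction."""
    version: str
    transforms: Dict[str, Callable[[dict], float]]

    def features(self, raw_event: dict) -> Dict[str, float]:
        return {name: fn(raw_event) for name, fn in self.transforms.items()}

    def specs(self) -> List[FeatureSpec]:
        # Provenance records that a feature store could persist.
        return [FeatureSpec(name, "raw_event", self.version) for name in self.transforms]


@dataclass
class ModelEnvironment:
    """A model version bound to the exact pipeline version it was trained against."""
    model_version: str
    pipeline: TransformationPipeline
    predict_fn: Callable[[Dict[str, float]], float]

    def training_row(self, raw_event: dict, label: float) -> dict:
        # The same transformations build the versioned training dataset...
        return {**self.pipeline.features(raw_event), "label": label}

    def predict(self, raw_event: dict) -> float:
        # ...and serve the streaming prediction path, so ad hoc feature code
        # never has to be recoded at deployment time.
        return self.predict_fn(self.pipeline.features(raw_event))


if __name__ == "__main__":
    # Two model versions sharing one pipeline version can run in parallel
    # on the same stream (A/B testing or a dark deployment).
    pipeline_v1 = TransformationPipeline(
        version="pipeline-1.0",
        transforms={"bytes_per_session": lambda e: e["bytes"] / max(e["sessions"], 1)},
    )
    model_a = ModelEnvironment("model-1.0", pipeline_v1, lambda f: 0.1 * f["bytes_per_session"])
    model_b = ModelEnvironment("model-1.1", pipeline_v1, lambda f: 0.2 * f["bytes_per_session"])

    event = {"bytes": 1200, "sessions": 3}
    print(model_a.predict(event), model_b.predict(event))
    print(pipeline_v1.specs())
```

Binding a model version to the exact pipeline version it was trained with is what lets a feature store trace any prediction back to the raw attributes and transformations that produced it.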

Sameer Wadkar

Comcast NBCUniversal

Sameer Wadkar is a senior principal architect for machine learning at Comcast NBCUniversal, where he works on operationalizing machine learning models to enable rapid turnaround from model development to model deployment. He oversees data ingestion from data lakes, streaming data transformations, and model deployment in hybrid environments ranging from on-premises systems to cloud and edge devices. Previously, he developed big data systems for market reconstruction that handled billions of out-of-order financial transactions per day to support surveillance of trading activity across multiple markets, and he implemented natural language processing (NLP) and computer vision systems for public and private sector clients. He is the author of Pro Apache Hadoop and blogs about data architectures and big data.

Nabeel Sarwar

Comcast NBCUniversal

Nabeel Sarwar is a machine learning engineer at Comcast NBCUniversal, where he operationalizes machine learning pipelines to improve customer experience, operations, field work, and everything in between. He also oversees data ingestion, feature engineering, and the generation and deployment of AI models. Nabeel holds a BA in astrophysics from Princeton University.