Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Transforming a machine learning prototype to a deployable solution leveraging Spark in healthcare

Rachita Chandra (IBM Watson Health)
4:20pm–5:00pm Wednesday, March 7, 2018

Who is this presentation for?

  • Big data developers, data scientists, solutions architects

Prerequisite knowledge

  • Familiarity with big data fundamentals and Python

What you'll learn

  • Explore challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment


Extensive research has been conducted at the intersection of machine learning and healthcare. With an anticipated 48% annual growth in healthcare data, building scalable healthcare solutions is more crucial now than ever before. Fortunately, there has been an equivalent surge in the algorithms, software packages, optimization techniques, and hardware available for extracting insights from healthcare data.

However, despite these advances, minimal research has been conducted to understand the challenges and considerations associated with transforming research prototypes into real-world healthcare solutions. Often machine learning research is conducted in silos: researchers build prototypes, which are subsequently picked up by software developers who turn them into scalable solutions. As a result, the trade-offs involved in moving from a research prototype to a deployable healthcare solution are poorly understood.

Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment. The original prototype, which tackled prediction of healthcare costs, worked well for a dataset of 5 million users in a nondistributed environment. It utilized several Python data science libraries and machine learning models. However, as the dataset grew beyond 1 TB, computational resources became the bottleneck, and the need to adopt Spark became apparent. Since the research prototype was already mature, porting components of the existing pipeline to leverage Spark was more effective than building a Spark codebase from scratch. The deployable solution is an end-to-end multitenant enterprise application comprising several components: user authentication, request handling, data transformations, quality checks, analytics, machine learning modules, a visualization interface, and error handling.

Topics include:

  • The process of finding compatible modules and mapping machine learning libraries from a nondistributed environment to a distributed environment
  • Architectural changes and additional modules needed in the codebase to enable the application to be deployed
  • A comparison of the machine learning model implementations across the distributed and nondistributed systems and the performance benchmarks obtained in both environments
  • The debugging challenges faced and tools utilized to identify and circumvent these errors
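
As a rough illustration of the first topic, mapping libraries from a nondistributed environment to a distributed one, the sketch below sorts a prototype's components into those with a known Spark MLlib counterpart and those that would need custom work. The abstract does not name the libraries the prototype used, so the choice of scikit-learn, the table entries, and the names SKLEARN_TO_SPARK and plan_port are all assumptions made for illustration, not the speaker's actual mapping.

```python
# Hypothetical lookup table: single-machine estimator -> Spark MLlib counterpart.
# Assumption: the prototype used scikit-learn; the abstract does not say so.
SKLEARN_TO_SPARK = {
    "sklearn.linear_model.LinearRegression": "pyspark.ml.regression.LinearRegression",
    "sklearn.ensemble.RandomForestRegressor": "pyspark.ml.regression.RandomForestRegressor",
    "sklearn.ensemble.GradientBoostingRegressor": "pyspark.ml.regression.GBTRegressor",
    "sklearn.preprocessing.StandardScaler": "pyspark.ml.feature.StandardScaler",
}


def plan_port(prototype_components):
    """Split component names into (mapped, unmapped).

    mapped   -- dict of component name -> Spark class path from the table
    unmapped -- list of components absent from the table, in input order
    """
    mapped = {}
    unmapped = []
    for name in prototype_components:
        if name in SKLEARN_TO_SPARK:
            mapped[name] = SKLEARN_TO_SPARK[name]
        else:
            unmapped.append(name)
    return mapped, unmapped


if __name__ == "__main__":
    components = [
        "sklearn.preprocessing.StandardScaler",
        "sklearn.ensemble.GradientBoostingRegressor",
        "sklearn.svm.SVR",  # not in the table above, so it is reported as unmapped
    ]
    mapped, unmapped = plan_port(components)
    for source, target in mapped.items():
        print(f"{source} -> {target}")
    print("Needs custom work:", unmapped)
```

An explicit table like this keeps the porting decisions reviewable: anything that lands in the unmapped list is a candidate for a custom implementation or a different model choice.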

Rachita Chandra

IBM Watson Health

Rachita Chandra is a solutions architect at IBM Watson Health, where she brings together end-to-end machine learning solutions in healthcare. She has experience implementing large-scale, distributed machine learning algorithms. Rachita holds both a master’s and a bachelor’s degree in electrical and computer engineering from Carnegie Mellon.