Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Mastering data with Spark and machine learning

Sonal Goyal (Nube)
14:55–15:35 Thursday, 2 May 2019
Average rating: 1.00 (4 ratings)

Who is this presentation for?

  • Data scientists, data engineers, MDM practitioners, and data architects



Prerequisite knowledge

  • Familiarity with Spark, Cassandra, and Elastic (useful but not required)

What you'll learn

  • Understand the challenges of unifying and mastering data across multiple systems
  • Explore an end-to-end system built to solve them


Enterprise data on customers, vendors, and products is often siloed and represented differently across systems, which hurts analytics, compliance, regulatory reporting, and 360-degree views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. Further, each source and data type has its own schema and format, data volumes run into millions of records, and linking similar records is a computationally expensive fuzzy matching exercise, all of which makes unification a challenging undertaking.
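To make the cost of fuzzy matching concrete, here is a minimal sketch in plain Python. It is not the system described in this session: the sample records, the choice of city as a blocking key, the `difflib` similarity measure, and the 0.6 threshold are all assumptions made for illustration. It compares every pair of records against only the pairs that share a blocking key, a common way to cut down the number of comparisons.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample records; names and cities are made up for illustration.
records = [
    {"id": 1, "name": "Acme Corporation", "city": "London"},
    {"id": 2, "name": "ACME Corp.", "city": "London"},
    {"id": 3, "name": "Globex Ltd", "city": "Leeds"},
    {"id": 4, "name": "Globex Limited", "city": "Leeds"},
    {"id": 5, "name": "Initech", "city": "London"},
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def all_pairs(rows):
    """Every unordered pair of rows: n * (n - 1) / 2 comparisons."""
    return list(combinations(rows, 2))


def blocked_pairs(rows, key):
    """Only the pairs of rows that share a blocking key."""
    blocks = {}
    for row in rows:
        blocks.setdefault(key(row), []).append(row)
    return [pair for block in blocks.values() for pair in combinations(block, 2)]


naive = all_pairs(records)
blocked = blocked_pairs(records, key=lambda r: r["city"])

# The 0.6 threshold is an arbitrary assumption for this sketch.
matches = [
    (a["id"], b["id"])
    for a, b in blocked
    if similarity(a["name"], b["name"]) >= 0.6
]

print(len(naive), len(blocked), matches)
```

With five records the naive approach makes 10 comparisons and blocking makes 4. At millions of records the naive pair count grows quadratically, which is why a distributed engine and some form of candidate reduction matter.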

Sonal Goyal offers an overview of the design and architecture of a modern master data application built on Spark, Cassandra, machine learning, and Elastic. The application unifies nontransactional master data in domains such as customer, organization, and product across systems such as ERP, CRM, and the custom applications of different business units, using the Spark Data Source API and machine learning. Sonal explains how the abstraction offered by the Data Source API lets users consume and manipulate the different datasets easily. Once the required attributes are aligned, Spark clusters and classifies probable matches using a human-in-the-loop feedback system. The matched and clustered records are persisted to Cassandra and exposed to data stewards through an AJAX-based GUI. The Spark job also indexes the records to Elastic, which lets data stewards query and search clusters more effectively.
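The clustering step can be pictured with a small sketch. The abstract does not say which clustering method the application uses, so the following is an assumption: pairwise match decisions are treated as edges in a graph, steward feedback overrides the model's prediction for a pair, and clusters are the connected components found with a union-find structure. All names and data here are hypothetical, and the code is plain Python rather than Spark.

```python
from collections import defaultdict


def cluster(record_ids, matched_pairs):
    """Group record ids into clusters: connected components of the match graph."""
    parent = {rid: rid for rid in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for rid in record_ids:
        groups[find(rid)].append(rid)
    return sorted(sorted(members) for members in groups.values())


# Hypothetical classifier output: record pair -> predicted match.
predicted = {(1, 2): True, (2, 3): True, (4, 5): True, (3, 4): False}

# Hypothetical steward feedback; it takes precedence over the prediction.
steward_feedback = {(4, 5): False}

decisions = {**predicted, **steward_feedback}
matched = [pair for pair, is_match in decisions.items() if is_match]

clusters = cluster([1, 2, 3, 4, 5], matched)
print(clusters)
```

In this toy run the steward rejects the predicted match between records 4 and 5, so they stay in separate clusters while records 1, 2, and 3 are grouped together. In a feedback loop of this kind, such corrections could also be fed back as labeled examples for retraining the classifier.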

Sonal covers the end-to-end flow, design, and architecture of the different components, as well as the per-source and per-type configuration that supports different and previously unknown datasets and schemas. Along the way, she details the performance gains from using Spark, machine learning for data matching, and stewardship, as well as the role of Cassandra and Elastic in the application.
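One way to picture per-source configuration is a mapping from each source's field names to a shared set of attributes. The sketch below is only an illustration under assumed names: the source labels, field names, and the `align` helper are hypothetical and are not taken from the application described here, which reads its sources through the Spark Data Source API rather than plain dictionaries.

```python
# Hypothetical per-source configuration: raw field name -> common attribute.
SOURCE_CONFIG = {
    "crm": {"fields": {"FullName": "name", "EmailAddress": "email", "Town": "city"}},
    "erp": {"fields": {"cust_nm": "name", "cust_email": "email", "cust_city": "city"}},
}

# The attributes every aligned record exposes, whatever its source.
COMMON_ATTRIBUTES = ("name", "email", "city")


def align(source: str, row: dict) -> dict:
    """Map one raw row onto the common attributes and record its origin."""
    mapping = SOURCE_CONFIG[source]["fields"]
    aligned = {attr: None for attr in COMMON_ATTRIBUTES}
    for raw_field, value in row.items():
        attr = mapping.get(raw_field)
        if attr is not None:  # fields without a mapping are ignored
            aligned[attr] = value.strip() if isinstance(value, str) else value
    aligned["source"] = source
    return aligned


crm_row = {"FullName": " Jane Doe ", "EmailAddress": "jane@example.com", "Town": "London", "Score": 7}
erp_row = {"cust_nm": "Doe, Jane", "cust_email": "jane@example.com"}

aligned_rows = [align("crm", crm_row), align("erp", erp_row)]
for row in aligned_rows:
    print(row)
```

Supporting a new source then means adding an entry to the configuration rather than changing the matching code, and attributes a source does not provide are left as `None`.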


Sonal Goyal


Sonal Goyal is the founder and CEO of Nube Technologies, a startup focused on big data preparation and analytics. Nube Technologies builds business applications for better decision making through better data. Sonal and the team at Nube help customers build better, more effective models by ensuring that their underlying master data is accurate. The company’s fuzzy matching product, Reifier, helps companies get a holistic view of enterprise data. By linking and resolving entities across various sources, Reifier helps optimize the sales and marketing funnel, enhance security and risk management, and improve the consolidation and reporting of business data.