Presented By O'Reilly and Cloudera
December 5–6, 2016: Training
December 6–8, 2016: Tutorials & Conference

How Mediacorp has leveraged Apache Spark and Microsoft Cloud to analyze patterns of user behavior for actionable insights

Andrea Gagliardi La Gala (Microsoft), Brandon Lee (Mediacorp)
1:45pm–2:25pm Thursday, December 8, 2016
Spark & beyond
Location: Summit 2
Level: Intermediate
Average rating: 4.50 (2 ratings)

Prerequisite Knowledge

  • Professional experience in big data analytics solutions (useful but not required)

What you'll learn

  • Understand how to best leverage Spark for data analysis, integrate it with an existing DWH/BI cloud-based infrastructure, and build robust data pipelines that ingest data, transform it into insights, and propagate these to the business to drive informed actions


Exploring and analyzing high-volume, semistructured datasets to derive deep insights into an audience’s behavior and interests is key to the media and advertising industry. Mediacorp and Microsoft have partnered to adopt Apache Spark as the technology underpinning Mediacorp’s big data and analytics platform. The platform is versatile and cloud based, and was designed from the ground up to give the data science team tools for exploratory and interactive data analysis, along with full control over data processing and over deploying the resulting algorithms to production.

Andrea Gagliardi La Gala and Brandon Lee discuss the key motivations that led Mediacorp to adopt Spark to integrate with its existing data warehouse infrastructure and augment it with additional processing capabilities, the resulting architecture and business benefits, and how Mediacorp has also kept costs under control as data volumes have grown.

Andrea and Brandon also present the outcomes and lessons learned from their firsthand experience with Spark, covering the APIs, file formats, partitioning strategies, and DataFrame and Spark SQL interfaces they found useful for working with data and boosting developer productivity.

Andrea and Brandon conclude by discussing the services and strategies used to complement Apache Spark and develop robust data pipelines that schedule, orchestrate, and manage data processing and analytics jobs in production. These pipelines not only ingest raw data at scale and transform it into insights but also propagate those insights to the business to drive informed actions.

Andrea Gagliardi La Gala


Andrea Gagliardi La Gala is a data solution architect at Microsoft, where he helps organizations gain a competitive edge by leveraging cloud-based big data and machine-learning technologies. Andrea has 16 years’ experience in IT, delivering large-scale software solutions across a range of sectors and focusing on distributed computing frameworks and analytics.

Brandon Lee


Brandon Lee is assistant vice president and senior data scientist at Mediacorp, where his research focuses on processing methods for user profiling and end-to-end productization of Mediacorp’s big data analytics platform. Brandon has spent more than 20 years in data science and research, working with Fortune 500 companies and startups. Previously, at Samsung R&D, he introduced an award-winning big data framework, based on Hadoop and MapReduce, to implement distributed machine-learning algorithms for the semiconductor business.