Exploring and analyzing high-volume, semistructured datasets to derive deep insights into an audience’s behavior and interests is key to the media and advertising industry. Mediacorp and Microsoft have partnered to adopt Apache Spark as the technology underpinning Mediacorp’s big data and analytics platform. This versatile, cloud-based platform was designed from the ground up to give the Data Science team tools for exploratory and interactive data analysis, along with full control over data processing and over the production deployment of the algorithms they devise.
Andrea Gagliardi La Gala and Brandon Lee discuss the key motivations that led to the adoption of Spark to integrate and augment the existing data warehouse infrastructure with unprecedented processing capabilities, the resulting architecture and the business benefits derived from it, and how Mediacorp has, as a significant side effect, kept costs under control as data volumes increase.
Andrea and Brandon also present the outcomes and lessons learned from their firsthand experience with Spark, covering the APIs, file formats, partitioning strategies, and DataFrame and Spark SQL interfaces they found useful for working with data and boosting developers’ productivity.
Andrea and Brandon conclude by discussing the services and strategies used to complement Apache Spark and develop robust data pipelines to schedule, orchestrate, and manage data processing and analytics jobs in production. These pipelines not only ingest raw data at scale and transform it into insights but also propagate those insights to the business to drive informed actions.
Andrea Gagliardi La Gala is a data solution architect at Microsoft, where he helps organizations gain a competitive edge by leveraging cloud-based big data and machine-learning technologies. Andrea has 16 years’ experience in IT, delivering large-scale software solutions across a range of sectors and focusing on distributed computing frameworks and analytics.
Brandon Lee is assistant vice president and senior data scientist at Mediacorp, where his research focuses on processing methods for user profiling and end-to-end productization of Mediacorp’s big data analytics platform. Brandon has spent more than 20 years in the data science and research fields, working with Fortune 500 companies and startups. Previously, at Samsung R&D, he introduced an award-winning big data framework, based on Hadoop and MapReduce, to implement distributed machine-learning algorithms for the electronics semiconductor business.
©2016, O'Reilly Media, Inc. • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners.
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.