Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Spark at scale in Bing: Use cases and lessons learned

4:20pm–5:00pm Wednesday, March 15, 2017
Spark & beyond
Location: LL21 C/D
Level: Beginner
Secondary topics: Architecture, Data Platform, Media
Average rating: 3.00 (3 ratings)

Who is this presentation for?

  • Software developers, data engineers, and data architects

Prerequisite knowledge

  • A basic working knowledge of Spark and Kafka

What you'll learn

  • Understand Spark use cases and architecture
  • Learn how to deal with scale issues in Spark applications (for example, when processing messages from a very large number of Kafka topics and partitions or checkpointing massive state)

Description

Apache Spark plays a key role in addressing several big data challenges in Bing. Its diverse set of capabilities enables a variety of internet-scale workloads that power Bing services. Two factors drive the adoption of Spark for next-generation data processing investments in Bing: the value Spark adds to the business and how well it fits the existing data platform architecture, complementing existing internal and external big data frameworks.

Kaarthik Sivashanmugam shares the Bing team’s experiences with Spark, discussing how Spark is employed in use cases that include batch processing of a document corpus spanning the web and near real-time processing of events corresponding to hundreds of millions of search queries. Kaarthik also explores the challenges the team faced in adopting Spark and implementing scalable data processing pipelines and explains how those challenges led the team to customize Spark and build extensions.

Kaarthik Sivashanmugam

Microsoft

Kaarthik Sivashanmugam is a principal software engineer on the Shared Data platform team at Microsoft. Kaarthik is the tech lead for the Mobius project specializing in Spark Streaming. Prior to joining the Shared Data platform team, he was on the Bing Ads team, where he built a near real-time analytics platform using Kafka, Storm, and Elasticsearch and used it to implement data processing pipelines. Earlier at Microsoft, Kaarthik was involved in the development of Data Quality Services in Azure and also contributed to multiple releases of SQL Server Integration Services as a hands-on engineering manager. Before joining Microsoft, Kaarthik was a senior software engineer at a semantic technology startup, where he built an ontology-based semantic metadata platform and used it to implement solutions for KYC/AML analytics.