Mar 15–18, 2020

Real-time forecasting at scale using Delta Lake

Rashmina Menon (GumGum), Jatinder Assi (GumGum)
11:00am–11:40am Tuesday, March 17, 2020
Location: LL21A

Who is this presentation for?

Data engineers, data architects, developers

Level

Intermediate

Description

An inventory impression is the real estate available to show a potential ad on a publisher page. GumGum receives around 30 billion programmatic inventory impressions, amounting to 25 TB of data, each day. By generating near-real-time inventory forecasts based on campaign-specific targeting rules, GumGum enables account managers to set up successful future campaigns. Rashmina Menon and Jatinder Assi highlight the data pipelines and architecture that help the company achieve a forecast response time of less than 30 seconds at this scale.

Spark jobs efficiently sample the inventory impressions using AMIND sampling and write the samples to Delta Lake. You’ll discover best practices and techniques for making efficient use of Delta Lake. GumGum caches the data on the cluster using Databricks Delta caching, which accelerates reads and reduces IO time as much as possible, and Rashmina and Jatinder detail the advantages of Delta caching over conventional Spark caching. You’ll learn how GumGum enables time series forecasting with zero downtime for end users using auto ARIMA and sinusoids that capture the trends in the inventory data. The session covers in detail AMIND sampling, Delta Lake as the store for the sampled data, Databricks Delta caching for efficient reads and cluster use, and time series forecasting.
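To make the "sinusoids" part of the forecasting step concrete, here is a minimal sketch of fitting a single sinusoid with a known period to a series and extrapolating one step ahead. It is an illustration under stated assumptions, not GumGum's implementation: the function names `fit_sinusoid` and `predict`, the weekly period of 7, and the impression counts are all hypothetical, and the auto ARIMA component is not shown.

```python
import math


def fit_sinusoid(values, period):
    """Fit y ~ a + b*sin(2*pi*t/period) + c*cos(2*pi*t/period) for t = 0, 1, 2, ...

    Assumes an integer period > 2 and a whole number of periods in `values`.
    Under that assumption the three basis functions are orthogonal, so the
    least-squares coefficients reduce to simple projections.
    """
    n = len(values)
    if not isinstance(period, int) or period <= 2 or n == 0 or n % period != 0:
        raise ValueError("need an integer period > 2 and a whole number of periods")
    w = 2 * math.pi / period
    a = sum(values) / n
    b = 2 / n * sum(y * math.sin(w * t) for t, y in enumerate(values))
    c = 2 / n * sum(y * math.cos(w * t) for t, y in enumerate(values))
    return a, b, c


def predict(coeffs, period, t):
    """Evaluate the fitted sinusoid at time index t (may be beyond the history)."""
    a, b, c = coeffs
    w = 2 * math.pi / period
    return a + b * math.sin(w * t) + c * math.cos(w * t)


# Hypothetical history: four weeks of daily impression counts (in millions)
# with a weekly cycle. These numbers are made up for illustration.
PERIOD = 7
history = [
    100 + 20 * math.sin(2 * math.pi * t / PERIOD) + 5 * math.cos(2 * math.pi * t / PERIOD)
    for t in range(4 * PERIOD)
]

coeffs = fit_sinusoid(history, PERIOD)
next_day = predict(coeffs, PERIOD, len(history))
print(coeffs, next_day)
```

In practice the period would come from the data (for example daily or weekly seasonality), and the residual after removing the sinusoid could be handed to a model such as ARIMA.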

Prerequisite knowledge

  • A working knowledge of Spark or other distributed systems

What you'll learn

  • Discover how to build scalable distributed systems that reduce IO bottlenecks as much as possible
  • Learn how to work with Delta Lake or Databricks Delta caching
  • Identify how to increase the performance of Spark jobs with optimal cost using Delta Lake and Delta caching

Rashmina Menon

GumGum

Rashmina Menon is a senior data engineer at GumGum, a computer vision company. She’s passionate about building distributed and scalable systems and end-to-end data pipelines that provide visibility into meaningful data through machine learning and reporting applications.

Jatinder Assi

GumGum

Jatinder Assi is a data engineering manager at GumGum and is enthusiastic about building scalable distributed applications and business-driven data products.
