Real-time forecasting at scale using Delta Lake
Who is this presentation for?
- Data engineers, data architects, and developers
An inventory impression is the real estate available for showing potential ads on a publisher page. GumGum receives around 30 billion programmatic inventory impressions each day, amounting to 25 TB of data. By generating near-real-time inventory forecasts based on campaign-specific targeting rules, GumGum enables its account managers to set up successful future campaigns. Rashmina Menon and Jatinder Assi highlight the data pipelines and architecture that help the company achieve a forecast response time of less than 30 seconds at this scale.
Spark jobs efficiently sample the inventory impressions using AMIND sampling and write the samples to Delta Lake. You’ll discover best practices and techniques for making efficient use of Delta Lake. GumGum caches the data on the cluster using Databricks Delta caching, which accelerates reads and reduces IO time, and Rashmina and Jatinder detail the advantages of Delta caching over conventional Spark caching. You’ll learn how GumGum enables time series forecasting with zero downtime for end users, using auto ARIMA and sinusoids that capture the trends in the inventory data. The session covers in detail AMIND sampling, Delta Lake for storing the sampled data, Databricks Delta caching for efficient reads and cluster use, and time series forecasting.
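To make the idea of forecasting from sampled impressions concrete, here is a minimal Python sketch. It is not AMIND sampling and not GumGum's pipeline: it assumes plain uniform hash-based sampling, a hypothetical 5% sample rate, synthetic impression records, and a made-up targeting rule. It only illustrates why a small sample can answer a targeting query quickly: count the matching sampled impressions, then scale by the inverse of the sample rate.

```python
import hashlib

SAMPLE_RATE = 0.05  # hypothetical rate, chosen only for this illustration


def keep(impression_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically keep an impression when its hash falls below the rate."""
    digest = hashlib.sha256(impression_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < rate


def matches(impression: dict, rule: dict) -> bool:
    """A rule maps a field name to the set of values a campaign targets."""
    return all(impression.get(field) in allowed for field, allowed in rule.items())


def estimate_matches(sampled: list, rule: dict, rate: float = SAMPLE_RATE) -> float:
    """Scale the count of matching sampled impressions up to the full population."""
    return sum(1 for imp in sampled if matches(imp, rule)) / rate


# Synthetic impressions (assumption: real records carry many more fields).
impressions = [
    {
        "id": f"imp-{i}",
        "country": "US" if i % 2 == 0 else "CA",
        "device": "mobile" if i % 3 == 0 else "desktop",
    }
    for i in range(200_000)
]

sampled = [imp for imp in impressions if keep(imp["id"])]
rule = {"country": {"US"}, "device": {"mobile"}}  # hypothetical targeting rule

estimate = estimate_matches(sampled, rule)
exact = sum(1 for imp in impressions if matches(imp, rule))
print(f"sampled {len(sampled)} of {len(impressions)} impressions")
print(f"estimated matches: {estimate:.0f}, exact matches: {exact}")
```

Hashing the impression ID, rather than drawing a random number, keeps the sample reproducible across reruns of the job. The trade-off is the usual one for sampling: a lower rate means less data to store and scan, but a noisier estimate for narrow targeting rules.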
Prerequisite knowledge
- A working knowledge of Spark or other distributed systems
What you'll learn
- Discover how to build scalable distributed systems that reduce IO bottlenecks as much as possible
- Learn how to work with Delta Lake or Databricks Delta caching
- Identify how to increase the performance of Spark jobs with optimal cost using Delta Lake and Delta caching
Rashmina Menon is a senior data engineer at GumGum, a computer vision company. She’s passionate about building distributed and scalable systems and end-to-end data pipelines that provide visibility into meaningful data through machine learning and reporting applications.
Jatinder Assi is a data engineering manager at GumGum and is enthusiastic about building scalable distributed applications and business-driven data products.