Presented by O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference

From Kafka to BigQuery: A guide for delivering billions of daily events

Ofir Sharony (MyHeritage)
1:45pm–2:25pm Wednesday, December 6, 2017
Average rating: 4.57 (7 ratings)

Who is this presentation for?

  • Developers, data engineers, architects, and anyone interested in building data pipelines

What you'll learn

  • Explore considerations for batch and streaming load to your analytics system of choice
  • Understand basic loading concepts such as processing time, event time, and data partitioning
  • Learn the common pitfalls of loading your data for analysis and the trade-offs between different loading techniques


MyHeritage collects billions of events every day, including request logs from web servers and backend services, events describing user activities across different platforms, and change data capture logs recording every change made to its database records.
Delivering these events to analytics is a complex task, requiring a robust and scalable data pipeline.

Ofir Sharony shares MyHeritage’s journey to find a reliable and efficient way to achieve real-time analytics and offers an overview of the system the company decided on: shipping events to Apache Kafka and loading them into Google BigQuery for analysis. Along the way, Ofir compares several data loading techniques, helping you make better choices when building your next data pipeline.

Topics include:

  • Batch loading to Google Cloud Storage and using a load job to deliver data to BigQuery
  • Streaming data via the BigQuery API as a DIY streaming application
  • Streaming data to BigQuery with Kafka Connect
  • Streaming data with Apache Beam along with its Cloud Dataflow runner
  • Batch versus streaming load
  • Processing time partitioning versus event time partitioning
  • Considerations for running your pipeline on-premises versus in the cloud
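One of the techniques listed above, streaming data via the BigQuery API as a DIY application, can be sketched in Python with the `google-cloud-bigquery` client. This is a hedged illustration, not code from the talk: the table id, event shape, and helper names are placeholders.

```python
import uuid

def prepare_rows(events):
    """Pair each event payload with a stable insertId so BigQuery's
    streaming API can best-effort de-duplicate retried batches.
    Events are plain dicts; 'event_id' is an assumed field name."""
    row_ids = [str(e.get("event_id") or uuid.uuid4()) for e in events]
    return events, row_ids

def stream_to_bigquery(events, table_id="my-project.analytics.events"):
    """Push a batch of event dicts to BigQuery via the streaming API.

    Requires the google-cloud-bigquery package and application-default
    credentials; the table id above is a hypothetical placeholder."""
    # Deferred import so prepare_rows stays usable without GCP installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    rows, row_ids = prepare_rows(events)
    errors = client.insert_rows_json(table_id, rows, row_ids=row_ids)
    if errors:
        raise RuntimeError(f"BigQuery rejected rows: {errors}")
```

Supplying explicit `row_ids` matters in a DIY streamer: producers typically retry on transient failures, and without a deduplication key each retry lands as a duplicate row in the analytics table.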

For more information, take a look at Ofir’s recent blog post on the subject.


Ofir Sharony


Ofir Sharony is a senior member of MyHeritage’s backend team, where he is currently focused on building pipelines on-premises and in the cloud using batch and streaming technologies. An expert in building data pipelines, Ofir acquired most of his experience planning and developing scalable server-side solutions.



Ofir Sharony
12/07/2017 2:28pm +08

You can find the slides here: