Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

How to avoid drowning in logs: Streaming 80 billion events and batch processing 40 TB/hour (sponsored by Pure Storage)

Ivan Jibaja (Pure Storage)
5:25pm–6:05pm Wednesday, 09/12/2018
Location: 1A 01/02

What you'll learn

  • Learn how Pure Storage uses Spark for both streaming and batch jobs, helping engineers understand the state of its continuous integration pipeline


Continuous integration (CI) pipelines generate massive amounts of messy log data. Pure Storage engineering runs over 70,000 tests per day, creating a triage problem that would otherwise require at least 20 triage engineers. Instead, Spark’s flexible computing platform lets the company write a single application for both streaming and batch jobs, so a team of only three triage engineers can understand the state of the company’s CI pipeline. Spark indexes log data for real-time reporting (streaming), uses machine learning for performance modeling and prediction (batch), and reindexes old data for newly encoded patterns (batch). Ivan Jibaja discusses the use case for big data analytics technologies, the architecture of the solution, and lessons learned.
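The core idea the abstract describes, one classification codebase shared by a streaming path (index new log lines as they arrive) and a batch path (reindex old logs when a new failure pattern is encoded), can be sketched roughly as follows. This is a minimal illustration in plain Python rather than Spark; all function names and failure patterns here are hypothetical assumptions, not Pure Storage's actual implementation.

```python
import re

# Known failure signatures. When a new pattern is added here, the batch
# path can rerun the same logic over stored logs (hypothetical examples).
PATTERNS = {
    "timeout": re.compile(r"timed? ?out", re.IGNORECASE),
    "oom": re.compile(r"out of memory", re.IGNORECASE),
}

def classify(line: str) -> str:
    """Tag a log line with the first matching failure signature."""
    for label, pattern in PATTERNS.items():
        if pattern.search(line):
            return label
    return "unclassified"

def index_stream(lines):
    """Streaming path: classify log lines one at a time as they arrive."""
    for line in lines:
        yield (classify(line), line)

def reindex_batch(corpus):
    """Batch path: the same classify() applied over a stored corpus."""
    return [(classify(line), line) for line in corpus]

logs = [
    "ERROR: request timed out after 30s",
    "worker killed: out of memory",
    "test passed in 1.2s",
]
print(list(index_stream(logs)))
print(reindex_batch(logs))
```

In Spark terms, the same pattern holds: a Structured Streaming query and a batch job can call into one shared library of parsing and classification logic, which is what makes a three-engineer triage team feasible.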

This session is sponsored by Pure Storage.


Ivan Jibaja

Pure Storage

Ivan Jibaja is a tech lead for the big data analytics team at Pure Storage. Previously, he was a part of the core development team that built the FlashBlade from the ground up. Ivan holds a PhD in computer science with a focus on systems and compilers from the University of Texas at Austin.