Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Hive as a service

Szehon Ho (Criteo), Pawel Szostek (Criteo)
11:50am–12:30pm Thursday, March 8, 2018
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • Developers and those in operations and business

Prerequisite knowledge

  • A basic understanding of Hive and Mesos

What you'll learn

  • Explore the evolution of Criteo's Hive platform

Description

Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo’s Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.

The resulting platform is based on Mesos, which has allowed Criteo to scale on demand and use resources more efficiently, iterate on development much faster than on bare metal, and roll out new versions without downtime for users. It has also allowed the company to eliminate the last single point of failure (SPOF) in its Hive stack. Szehon and Pawel detail Criteo’s data architecture and explain how the company solved challenges in security, monitoring, scheduling, and load balancing on multiple layers. They also discuss the gains made by this process.

Szehon Ho

Criteo

Szehon Ho is a staff software engineer on the analytics data storage team at Criteo, where he works on Criteo’s Hive platform. Previously, he was a software engineer on the Hive team at Cloudera. He was a committer and PMC member in the Apache Hive open source community, working on features like Hive on Spark and Hive monitoring and metrics, among others.

Pawel Szostek

Criteo

Pawel Szostek is a senior software engineer on Criteo’s analytics data storage team, where he works on various projects, including implementing an improved HyperLogLog algorithm. Previously, he was a researcher at CERN in Geneva.