Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform

Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)
4:20pm–5:00pm Thursday, 09/13/2018
Secondary topics: Data Integration and Data Pipelines
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • System infrastructure engineers, architect engineers, and data scientists

Prerequisite knowledge

  • Basic knowledge of Apache Kafka and stream processing

What you'll learn

  • Explore a data hub-based reference platform architecture used in production
  • Understand best practices for developing a state-of-the-art streaming platform utilizing Apache Kafka

Description

Recruit Group is one of the largest web service providers in Japan. It operates many services across diverse business fields, including travel and restaurant reservations, human resource services, and POS systems. Analyzing the application logs collected from these services enables the company to provide more insightful services to individual and corporate customers. Rough estimates put the log volume at around 1 TB per day, and the number of servers and instances producing logs is expected to exceed 1,000 in the future.

Recruit Group had to design a platform that could handle these ever-changing requirements. It began with a project to collect and analyze all the application logs generated by these services efficiently and easily. The first step was to develop a platform that receives extensive logs from upstream applications and transfers them to downstream ones efficiently. This platform is based on the data hub architecture and utilizes Apache Kafka for high performance and scalability. The Kafka cluster was deployed on Google Compute Engine, alongside managed services in Google Cloud Platform, such as BigQuery and Pub/Sub, for analysis.

Recruit Group faced quite a few technical problems while developing this platform. Kenji Hayashida and Toru Sasaki share some of these critical problems and explain how the company solved them. Along the way, you'll explore the platform's architecture and hear lessons learned and best practices drawn from the experience.

Topics include:

  • How to collect application logs from many services easily
  • How to manage the schema evolution of each log and adapt new schemas to each analysis platform
  • A reference network architecture for a data hub connecting many existing services
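On the schema-evolution topic above, one widely used approach (an assumption here — the talk does not specify Recruit's exact mechanism) is to keep changes backward compatible by giving newly added fields defaults, so records written under the old schema remain readable under the new one. A hedged sketch with hypothetical field names:

```python
# Sketch of backward-compatible schema evolution: schema v2 adds an
# optional "user_agent" field with a default, so v1 records can still be
# read under v2. Field names are hypothetical; Avro's schema-resolution
# rules for readers and writers work along the same lines.
SCHEMA_V2_DEFAULTS = {"user_agent": "unknown"}  # fields added in v2

def read_with_v2(record: dict) -> dict:
    """Interpret a record written under v1 or v2 using the v2 schema."""
    out = dict(SCHEMA_V2_DEFAULTS)  # start from defaults for new fields
    out.update(record)              # writer-supplied fields win
    return out

old = {"service": "pos", "timestamp": 1536800000}   # written under v1
new = {"service": "pos", "timestamp": 1536800001,
       "user_agent": "Mozilla/5.0"}                 # written under v2

assert read_with_v2(old)["user_agent"] == "unknown"
assert read_with_v2(new)["user_agent"] == "Mozilla/5.0"
```

In practice this default-handling is usually delegated to a serialization framework such as Avro together with a schema registry, rather than hand-written per consumer.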

Kenji Hayashida

Recruit Lifestyle co., ltd.

Kenji Hayashida is a Japan-based data engineer at Recruit Lifestyle Co., Ltd., part of Recruit Group, where he has worked on projects such as advertising technology, content marketing, and the company's data pipeline. Kenji started his career as a software engineer at HITECLAB while he was in college. He is the author of a popular data science textbook and holds a master's degree in information engineering from Osaka University. In his free time, Kenji enjoys programming competitions such as TopCoder, Google Code Jam, and Kaggle.


Toru Sasaki

NTT DATA Corporation

Toru Sasaki is a system infrastructure engineer and leads the OSS professional services team at NTT Data Corporation. He is interested in open source distributed computing systems, such as Apache Hadoop, Apache Spark, and Apache Kafka. Over his career, Toru has designed and developed many clusters utilizing these products to solve his customers’ problems. He is a coauthor of one of the most popular Apache Spark books written in Japanese.