Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform

Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)
4:20pm–5:00pm Thursday, 09/13/2018
Secondary topics: Data Integration and Data Pipelines
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • System infrastructure engineers, architect engineers, and data scientists

Prerequisite knowledge

  • Basic knowledge of Apache Kafka and stream processing

What you'll learn

  • Explore a data hub-based reference platform architecture used in production
  • Understand best practices for developing a state-of-the-art streaming platform utilizing Apache Kafka

Description

Recruit Group is one of the largest web service providers in Japan. It operates many services across diverse business fields, including travel and restaurant reservations, human resource services, and POS systems. Analyzing the application logs collected from these services enables the company to provide more insightful services to individual and corporate customers. Rough estimates put the log volume at around 1 TB per day, and the number of servers and instances producing logs is expected to exceed 1,000 in the future.

Recruit Group had to design a platform that could handle these ever-changing requirements. It began with a project to collect and analyze all the application logs generated by these services efficiently and easily. The first step was to develop a platform that receives extensive logs from upstream applications and transfers them to downstream ones efficiently. This platform is based on the data hub architecture and utilizes Apache Kafka for high performance and scalability. The Kafka cluster was deployed on Google Compute Engine, alongside managed services in Google Cloud Platform, such as BigQuery and Pub/Sub, for analysis.

Recruit Group faced quite a few technical problems while developing this platform. Kenji Hayashida and Toru Sasaki share some of these critical problems and explain how the company solved them. Along the way, you'll explore the platform's architecture and hear lessons learned and best practices drawn from the experience.

Topics include:

  • How to collect application logs from many services easily
  • How to manage the schema evolution of each log and adapt new schemas to each analysis platform
  • A reference network architecture for a data hub connecting many existing services
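On the schema-evolution topic above, one widely used approach (an assumption here — the talk does not specify Recruit's exact mechanism) is to keep changes backward compatible by giving newly added fields defaults, so records written under the old schema remain readable under the new one. A hedged sketch with hypothetical field names:

```python
# Sketch of backward-compatible schema evolution: schema v2 adds an
# optional "user_agent" field with a default, so v1 records can still be
# read under v2. Field names are hypothetical; Avro's schema-resolution
# rules for readers and writers work along the same lines.
SCHEMA_V2_DEFAULTS = {"user_agent": "unknown"}  # fields added in v2

def read_with_v2(record: dict) -> dict:
    """Interpret a record written under v1 or v2 using the v2 schema."""
    out = dict(SCHEMA_V2_DEFAULTS)  # start from defaults for new fields
    out.update(record)              # writer-supplied fields win
    return out

old = {"service": "pos", "timestamp": 1536800000}   # written under v1
new = {"service": "pos", "timestamp": 1536800001,
       "user_agent": "Mozilla/5.0"}                 # written under v2

assert read_with_v2(old)["user_agent"] == "unknown"
assert read_with_v2(new)["user_agent"] == "Mozilla/5.0"
```

In practice this default-handling is usually delegated to a serialization framework such as Avro together with a schema registry, rather than hand-written per consumer.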

Kenji Hayashida

Recruit Lifestyle co., ltd.

Kenji Hayashida is a Japan-based data engineer at Recruit Lifestyle Co., Ltd., part of Recruit Group, where he has worked on projects such as advertising technology, content marketing, and the company's data pipeline. Kenji started his career as a software engineer at HITECLAB while he was in college. He is the author of a popular data science textbook and holds a master's degree in information engineering from Osaka University. In his free time, Kenji enjoys programming competitions such as TopCoder, Google Code Jam, and Kaggle.


Toru Sasaki

NTT DATA Corporation

Toru Sasaki is a system infrastructure engineer and leads the OSS professional services team at NTT Data Corporation. He is interested in open source distributed computing systems, such as Apache Hadoop, Apache Spark, and Apache Kafka. Over his career, Toru has designed and developed many clusters utilizing these products to solve his customers’ problems. He is a coauthor of one of the most popular Apache Spark books written in Japanese.