Presented By O'Reilly and Cloudera
Make Data Work
Feb 17–20, 2015 • San Jose, CA

Data Discovery on Hadoop

Sumeet Singh (Yahoo), Thiruvel Thirumoolan (Yahoo!, Inc.)
2:20pm–3:00pm Thursday, 02/19/2015
Hadoop Platform
Location: 210 B/F
Average rating: 4.50 (2 ratings)

Hadoop has allowed us to move towards a unified source of truth for all of an organization’s data. As the scale of operations increases, managing data location, schema knowledge and evolution, fine-grained access control based on business rules, and audit and compliance needs becomes critical.

In this talk, we will share an approach to tackling these challenges. We will explain how to register existing HDFS files, provide broader but controlled access to data through a data discovery tool with schema browse and search functionality, and leverage existing Hadoop ecosystem components such as Pig, Hive, HBase, and Oozie to share data seamlessly across applications. Integration with data movement tools automates the availability of new data. In addition, the approach opens up easy ad hoc access for analyzing and visualizing data through SQL on Hadoop and popular BI tools. Along the way, we will highlight how the approach minimizes data duplication, eliminates wasteful data retention, and addresses data provenance, lineage, and integrity.

Sumeet Singh

Yahoo

Sumeet Singh is a Senior Director of Products at Yahoo, responsible for platforms product management and customer engagements. In this role, he also leads the Hadoop products team, which is responsible for both Apache open source contributions and Yahoo projects. Sumeet has 15 years of product management, product development, and strategy consulting experience in the technology industry. He earned his MBA from UCLA Anderson School of Management and an MS from Rensselaer Polytechnic Institute, NY.

Thiruvel Thirumoolan

Yahoo!, Inc.

Thiruvel Thirumoolan is a developer on the Hive and HCatalog team at Yahoo!. In this role, he is responsible for deploying Hive, HiveServer2, and HCatalog across all the Hadoop clusters at Yahoo! and ensuring they work at scale for the usage patterns of Yahoos. He also contributes features and fixes to the Apache Hive community. He has a bachelor's degree from Anna University and has been working on the Hadoop team at Yahoo! for more than 4 years. Hadoop is his favorite theme at Yahoo!'s internal Hack Days, and he also mines the trove of Hadoop logs for usage patterns and insights.