Implementing solutions with Apache Hadoop requires understanding not just Hadoop, but a broad range of related projects in the Hadoop ecosystem such as Hive, Pig, Oozie, Sqoop, and Flume. The good news is that there’s an abundance of materials – books, web sites, conferences, etc. – for gaining a deep understanding of Hadoop and these related projects. The bad news is there’s still a scarcity of information on how to integrate these components to implement complete solutions. In this tutorial we’ll walk through an end-to-end case study of a clickstream analytics engine to provide a concrete example of how to architect and implement a complete solution with Hadoop. We’ll use this example to illustrate important topics such as:
• Modeling data in Hadoop and selecting optimal storage formats for data stored in Hadoop
• Moving data between Hadoop and external data management systems such as relational databases
• Moving event-based data such as logs and machine generated data into Hadoop
• Accessing and processing data in Hadoop
• Orchestrating and scheduling workflows on Hadoop
Throughout the example, best practices and considerations for architecting applications on Hadoop will be covered. This tutorial will be valuable for developers, architects, or project leads who are already knowledgeable about Hadoop, and are now looking for more insight into how it can be leveraged to implement real-world applications.
You will need an understanding of Hadoop concepts and components in the Hadoop ecosystem, as well as an understanding of traditional data management systems (e.g. relational databases), and knowledge of programming languages and concepts.
Pre-req for tutorial: It’s not required for the audience to follow along, but if they are interested in doing so they should have a set up with various projects in the Hadoop ecosystem. For that, we recommend installing Cloudera’s QuickStart VM: tiny.cloudera.com/quick-start.
Mark Grover is a committer on Apache Bigtop, a committer and PMC member on Apache Sentry (incubating) and a contributor to
Apache Hadoop, Apache Spark, Apache Hive, Apache Sqoop and Apache Flume. He is currently co-authoring O’Reilly’s Hadoop Application Architectures title and is a section author of O’Reilly’s book on Apache Hive – Programming Hive. He has written a few guest blog posts and spoken at many conferences about technologies in the hadoop ecosystem.
Gwen Shapira is a Solutions Architect at Cloudera and leader of IOUG Big Data SIG. Gwen Shapira studied computer science, statistics and operations research at the University of Tel Aviv, and then went on to spend the next 15 years in different technical positions in the IT industry. She specializes in scalable and resilient solutions and helps her customers build high-performance large-scale data architectures using Hadoop. Gwen Shapira is a frequent presenter at conferences and regularly publishes articles in technical magazines and her blog.
Ted has worked on close to 60 Clusters over 2-3 dozen clients with over 100’s of use cases. He has 18 years of professional experience working for start-ups, the US government, a number of the worlds largest banks, commercial firms, bio firms, retail firms, hardware appliance firms, and the US’s largest non-profit financial regulator. He has architecture experience across topic such as Hadoop, Web 2.0, Mobile, SOA (ESB, BPM), and Big Data. Ted is a regular committer to Flume, Avro, Pig and YARN.
Jonathan has spent more than 15 years as a software developer, with a focus in the last few years on processing large data sets using tools such as Hadoop. Currently, Jonathan is a Solutions Architect on the Partner Engineering team at Cloudera. Before joining Cloudera he was a Lead Engineer on the Big Data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is also a co-founder of the Chicago Hadoop User Group and the Chicago Big Data meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is co-authoring a book on architecting applications with Apache Hadoop for O’Reilly Media.
For exhibition and sponsorship opportunities, email firstname.lastname@example.org
For information on trade opportunities with O'Reilly conferences, email email@example.com
For media-related inquiries, contact Maureen Jennings at firstname.lastname@example.org
View a complete list of Strata + Hadoop World contacts
©2015, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.