Presented by O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Hadoop application architectures: Fraud detection (Half Day)

Jonathan Seidman (Cloudera), Ted Malaska (Blizzard Entertainment), Gwen Shapira (Confluent), Mark Grover (Lyft)
1:30pm–5:00pm Tuesday, 03/29/2016
Average rating: 4.48 (23 ratings)

Prerequisite knowledge

This tutorial is intended for developers, architects, and project leads who are already knowledgeable about Hadoop or similar distributed data processing systems.

Materials or downloads needed in advance

This is not a hands-on tutorial, so no special preparation is necessary. The tutorial will include a live demo of the full project on Cloudera's QuickStart VM. The code for the demo will be available on GitHub so the audience can follow along.

Description

Implementing a scalable, low-latency architecture requires understanding a broad range of frameworks, such as Kafka, HBase, HDFS, Flume, Spark, Spark Streaming, and Impala, among many others. The good news is that there’s an abundance of resources—books, websites, conferences, etc.—for gaining a deep understanding of these related projects. The bad news is there’s still a scarcity of information on how to integrate these components to implement complete solutions.

Jonathan Seidman, Ted Malaska, Gwen Shapira, and Mark Grover walk participants through building a fraud-detection system, using an end-to-end case study as a concrete example of how to architect and implement real-time systems with Hadoop ecosystem components such as Kafka, HBase, Impala, and Spark. They cover best practices and considerations for architecting real-time applications, giving developers, architects, and project leads who are already knowledgeable about Hadoop or similar distributed data processing systems more insight into how these components can be leveraged to implement real-world applications.

Topics include:

  • Modeling data in Kafka, HBase, and Hadoop and selecting optimal formats for storing data
  • Integrating multiple data collection, processing, and storage systems
  • Collecting and analyzing event-based data such as logs and machine-generated data and storing the data in Hadoop
  • Querying and reporting on data
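
For illustration only (this is not the tutorial's demo code, which is available on GitHub), the following minimal Scala sketch shows the general shape of the kind of pipeline the topics above describe: transaction events are consumed from Kafka with Spark Streaming and suspicious ones are flagged. The broker address, the "transactions" topic name, the comma-separated record layout, and the amount threshold are all assumptions made for the example.

  import kafka.serializer.StringDecoder
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object FraudDetectionSketch {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("fraud-detection-sketch")
      val ssc = new StreamingContext(conf, Seconds(5))

      // Read transaction events from a hypothetical "transactions" Kafka topic.
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      val events = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("transactions"))

      // Assume each record is "userId,timestamp,amount"; flag amounts over an
      // arbitrary threshold as suspicious.
      val suspicious = events
        .map { case (_, value) => value.split(",") }
        .filter(fields => fields(2).toDouble > 10000.0)

      // In a full pipeline these alerts would be written to HBase for profile
      // lookups and to HDFS for offline analysis; here they are just printed.
      suspicious.print()

      ssc.start()
      ssc.awaitTermination()
    }
  }

A real system would replace the fixed threshold with per-user profiles kept in HBase and would persist raw events to HDFS for reporting with Impala, which is the sort of integration this tutorial walks through.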
Photo of Jonathan Seidman

Jonathan Seidman

Cloudera

Jonathan Seidman is a software engineer on the partner engineering team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz Worldwide, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Photo of Ted Malaska

Ted Malaska

Blizzard Entertainment

Ted Malaska is a group technical architect on the Battle.net team at Blizzard, helping support great titles like World of Warcraft, Overwatch, and Hearthstone. Previously, Ted was a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem, and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has also contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many more. Ted is a coauthor of Hadoop Application Architectures, a frequent speaker at many conferences, and a frequent blogger on data architectures.

Photo of Gwen Shapira

Gwen Shapira

Confluent

Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building real-time, reliable data-processing pipelines using Apache Kafka. Gwen is an Oracle ACE Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn't coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.

Photo of Mark Grover

Mark Grover

Lyft

Mark Grover is a product manager at Lyft. Mark is a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He has also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He is a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.