Presented By O'Reilly and Cloudera
Make Data Work
December 1–3, 2015 • Singapore

Hadoop application architectures: Fraud detection

Gwen Shapira (Confluent), Ted Malaska (Capital One), Mark Grover (Lyft), Jonathan Seidman (Cloudera)
9:00am–12:30pm Tuesday, 12/01/2015
Hadoop Platform
Location: 321-322 Level: Intermediate
Average rating: 4.16 (19 ratings)

Prerequisite Knowledge

Attendees should have an understanding of components and concepts in the Hadoop ecosystem such as HDFS, HBase, and MapReduce, as well as a familiarity with writing applications in languages like Java or Scala.


Computer Requirements

The tutorial will cover best practices and considerations for architecting applications on Hadoop, and in particular how to create a fraud detection application using those best practices. The tutorial is not "hands-on": during the presentation, we will not walk you through building the application on your own Hadoop installation. You are welcome to try it later on your own, though!

Code for the demo and associated instructions are available at:


Implementing a scalable low-latency architecture requires understanding a broad range of frameworks, such as Kafka, HBase, HDFS, Flume, Spark, Spark Streaming, and Impala among many others. The good news is that there’s an abundance of materials – books, websites, conferences, etc. – for gaining a deep understanding of these related projects. The bad news is there’s still a scarcity of information on how to integrate these components to implement complete solutions.

In this tutorial we’ll walk through the end-to-end case study of building a fraud detection system to provide a concrete example of how to architect and implement real-time systems. We’ll use this example to illustrate important topics, such as:

  • Modeling data in Kafka, HBase, and Hadoop and selecting optimal formats for storing data
  • Integrating multiple data collection, processing, and storage systems
  • Collecting and analyzing event-based data such as logs and machine-generated data and storing the data in Hadoop
  • Querying and reporting on data
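To make the data-modeling bullet concrete, here is a minimal sketch of what a transaction event and its storage key might look like. The schema, field names, and row-key scheme are hypothetical illustrations of the modeling choices discussed, not the tutorial's actual demo code:

```python
import json

# Hypothetical transaction-event schema (illustrative only; the tutorial's
# real schema lives in its demo repository).
def make_event(profile_id: str, amount_cents: int, ts_ms: int) -> dict:
    """Build a transaction event; json.dumps(event) would be the Kafka value."""
    return {"profile_id": profile_id, "amount_cents": amount_cents, "ts_ms": ts_ms}

def row_key(event: dict, max_ts_ms: int = 10**13) -> str:
    """HBase-style row key: profile id plus a reversed timestamp, so a
    prefix scan on one profile returns its newest events first."""
    reverse_ts = max_ts_ms - event["ts_ms"]
    return f'{event["profile_id"]}:{reverse_ts:013d}'

event = make_event("user-42", 1999, 1_448_928_000_000)
payload = json.dumps(event)  # what a Kafka producer would send
key = row_key(event)         # where an HBase writer would store it
print(key)  # -> user-42:8551072000000
```

The reversed timestamp is a common HBase idiom for newest-first scans; a real design would also weigh concerns such as key salting to avoid region hot-spotting.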

Throughout the example, best practices and considerations for architecting real-time applications will be covered. This tutorial will be valuable for developers, architects, and project leads who are already knowledgeable about Hadoop or similar distributed data processing systems and are now looking for more insight into how those systems can be leveraged to implement real-world applications.
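As a toy illustration of the kind of real-time logic such an architecture serves, the sketch below applies a sliding-window velocity rule in plain Python. It stands in for a Spark Streaming stage only conceptually; the class name, window size, and threshold are invented for this example:

```python
from collections import defaultdict, deque

# Minimal rule-based fraud check over an event stream -- a plain-Python
# stand-in for a streaming stage; window and threshold are hypothetical.
class VelocityCheck:
    """Flag a profile making more than `max_events` transactions
    within a sliding window of `window_ms` milliseconds."""

    def __init__(self, window_ms: int = 60_000, max_events: int = 4):
        self.window_ms = window_ms
        self.max_events = max_events
        self.history = defaultdict(deque)  # profile_id -> recent timestamps

    def process(self, profile_id: str, ts_ms: int) -> bool:
        """Return True if this event looks fraudulent."""
        window = self.history[profile_id]
        window.append(ts_ms)
        # Evict events that have fallen out of the sliding window.
        while window and ts_ms - window[0] > self.window_ms:
            window.popleft()
        return len(window) > self.max_events

checker = VelocityCheck()
events = [("user-42", t) for t in (0, 10_000, 20_000, 30_000, 40_000)]
flags = [checker.process(pid, ts) for pid, ts in events]
print(flags)  # -> [False, False, False, False, True]
```

In the production shape the tutorial describes, this decision would be made per micro-batch against state in HBase rather than in local memory, and flagged events would flow to an alerting topic.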


Gwen Shapira


Gwen Shapira is a system architect at Confluent, where she helps customers achieve success with their Apache Kafka implementations. She has 15 years of experience working with code and customers to build scalable data architectures, integrating relational and big data technologies. Gwen currently specializes in building reliable real-time data processing pipelines using Apache Kafka. She is an Oracle ACE Director, the coauthor of Hadoop Application Architectures, and a frequent presenter at industry conferences. She is also a committer on Apache Kafka and Apache Sqoop. When Gwen isn't coding or building data pipelines, you can find her pedaling her bike, exploring the roads and trails of California and beyond.


Ted Malaska

Capital One

Ted Malaska is a director of enterprise architecture at Capital One. Previously, he was the director of engineering in the Global Insight Department at Blizzard; a principal solutions architect at Cloudera, helping clients find success with the Hadoop ecosystem; and a lead architect at the Financial Industry Regulatory Authority (FINRA). He has contributed code to Apache Flume, Apache Avro, Apache YARN, Apache HDFS, Apache Spark, Apache Sqoop, and many other projects. Ted is a coauthor of Hadoop Application Architectures, a frequent conference speaker, and a frequent blogger on data architectures.


Mark Grover


Mark Grover is a product manager at Lyft. Mark’s a committer on Apache Bigtop, a committer and PPMC member on Apache Spot (incubating), and a committer and PMC member on Apache Sentry. He’s also contributed to a number of open source projects, including Apache Hadoop, Apache Hive, Apache Sqoop, and Apache Flume. He’s a coauthor of Hadoop Application Architectures and wrote a section in Programming Hive. Mark is a sought-after speaker on topics related to big data. He occasionally blogs on topics related to technology.


Jonathan Seidman


Jonathan Seidman is a software engineer on the cloud team at Cloudera. Previously, he was a lead engineer on the big data team at Orbitz, helping to build out the Hadoop clusters supporting the data storage and analysis needs of one of the most heavily trafficked sites on the internet. Jonathan is a cofounder of the Chicago Hadoop User Group and the Chicago Big Data Meetup and a frequent speaker on Hadoop and big data at industry conferences such as Hadoop World, Strata, and OSCON. Jonathan is the coauthor of Hadoop Application Architectures from O’Reilly.

Comments on this page are now closed.


Mark Grover
12/02/2015 12:46am +08

Hi everyone,
Ted, Jonathan, Gwen and I thank you for coming to our tutorial today. Hope you enjoyed it.

The slides are at (which redirects to

We look forward to seeing you at our other sessions!

Mark Grover
12/01/2015 12:00am +08

Hi Umanga, the tutorial is not hands-on. The code and examples used in the tutorial are just for demo purposes. You're right that the demo runs on a distributed system, but setting it up yourself doesn't really help you learn how to architect such a system, which is the end goal of this tutorial.

We encourage you instead to pay attention to the architectural considerations and decisions we made in building a fraud detection system.

See you tomorrow!

Umanga Bista
11/26/2015 11:27pm +08

Do we need any environment setup before attending? From the GitHub link, I can see we may need a five-node setup, which may be time-consuming. Is the Cloudera QuickStart VM okay for this tutorial session?