Presented By O'Reilly and Cloudera
Make Data Work
March 28–29, 2016: Training
March 29–31, 2016: Conference
San Jose, CA

Toppling the mainframe: Enterprise-grade streaming under 2 ms on Hadoop

Ilya Ganelin (Capital One Data Innovation Lab)
1:50pm–2:30pm Thursday, 03/31/2016
Data Innovations

Location: 210 D/H
Average rating: 3.33 (6 ratings)

Prerequisite knowledge

Attendees should have a basic familiarity with the Hadoop ecosystem (including YARN and HDFS), as well as general familiarity with streaming technology, such as Storm or Spark Streaming, to understand some of the challenges of stream computing.

Description

These days everyone is excited about big data and fast data. Capital One has embraced this new generation of technology with open arms. However, as Robert Heinlein was fond of reminding us, “TANSTAAFL: There ain’t no such thing as a free lunch.”

For many years, there’s been a very real battle around the standard operating model of software. Tech giants like Oracle and IBM have traditionally built massively expensive enterprise-ready products, while the open source community provides free, albeit usually inferior, software. For a product to be enterprise-ready, it must guarantee complete reliability alongside performance and flexibility. There are notable successes in the open source world, such as Linux, OpenSSH, and OpenSSL, but the realm of distributed stream computing has lacked comparable solutions.

Capital One set out to determine whether it could build or find enterprise-ready technology in the open source world that tackles difficult streaming problems while providing performance, durability, and availability equivalent to those of a mainframe computer. Ilya Ganelin details Capital One’s attempt to answer this question in a rigorous and complete way, not just by making a prototype or discovering exciting new tools, but by creating an open source-based, enterprise-ready product that can transparently replace an enormously expensive proprietary solution. Ilya presents Capital One’s novel solution for real-time decisioning on Apache Apex.

Topics include:

  • A detailed dive into the business requirements of a new real-time decisioning platform for model building, feature computation, and model scoring
  • A survey and analysis of the leading open source technologies for stream processing and the tradeoffs Capital One considered when selecting its technology stack
  • Capital One’s solution, based on Apache Apex, which provides unparalleled performance on Hadoop and meets the stringent performance, scalability, and durability requirements necessary for enterprise-grade decision making

Ilya Ganelin

Capital One Data Innovation Lab

Ilya Ganelin is a roboticist turned data engineer. After a few years building self-discovering robots at the University of Michigan and another few years working on embedded DSP software with cell phones and radios at Boeing, he landed in the world of big data at the Capital One Data Innovation Lab. Ilya is an active contributor to the core components of Apache Spark and a committer to Apache Apex with the goal of learning what it takes to build a next-generation distributed computing platform. Ilya is an avid bread maker, cook, skier, and race-car driver.