Presented By O'Reilly and Cloudera
Make Data Work
September 26–27, 2016: Training
September 27–29, 2016: Tutorials & Conference
New York, NY

How GE analyzes billions of mission-critical events in real time using Apache Apex, Spark, and Kudu

Venkatesh Sivasubramanian (GE Digital), Luis Ramos (GE Digital)
2:05pm–2:45pm Thursday, 09/29/2016
IoT & real-time
Location: 1 E 12/1 E 13
Level: Intermediate
Average rating: 3.50 (2 ratings)

Prerequisite knowledge

  • A general understanding of data infrastructure and stream processing systems

What you'll learn

  • Explore GE's approach to performing analytics on a massive volume of time series data using Apache Apex, Spark, and Kudu

Description

    Digital consumer companies are disrupting the old guard and changing the way we do business in fundamental ways. Uber, Airbnb, and Zipcar, for example, have disrupted the traditional taxi, hotel, and car rental businesses by using software to create new business models. Opportunities in the industrial world are expected to outpace consumer business cases, and time series data is growing exponentially as new machines around the world get connected. Venkatesh Sivasubramanian and Luis Ramos explain how GE makes it faster and easier for systems to access a massive volume of time series data through a common layer and perform analytics on it. They take what they’ve learned from Apache Arrow and apply it today to highly efficient time series storage using Apache Apex, Spark, and Kudu.

    At the heart of GE’s digital portfolio is the Predix platform, a cloud-based platform as a service (PaaS) for the Industrial IoT. Predix provides the tools, framework, guidelines, and best practices to enable you to create solutions that run industrial-scale analytics. Distributed processing is a de facto standard when dealing with a lot of data. But because there are many heterogeneous data processing systems geared to different workloads, the need to agree on and standardize the communication layer becomes paramount. Apache Arrow is working with several products to do just that: agree on a common in-memory columnar storage layer that avoids serializing data between different systems. Venkatesh and Luis discuss GE’s approach, which applies similar concepts to time series-centric data.

    Topics include:

    • Complexities with industrial use cases (e.g., aviation and oil and gas)
    • Why GE chose Apache Apex (incubating) and how it simplifies real-time streaming ingestion and processing
    • Apache Spark for running in-stream analytics and machine-learning algorithms
    • Experiments with Apache Kudu (incubating) and lessons learned

    Venkatesh Sivasubramanian

    GE Digital

    Venkatesh Sivasubramanian is a senior director at GE Digital, where he drives the architecture and development of Data Services for Predix, an Industrial IoT platform. Prior to joining GE Digital, he worked as a lead engineer on the Big Fast Data team at WalmartLabs, building its stream processing engine and distributed systems. Venkatesh holds a master’s degree in software engineering from the Birla Institute of Technology and Science (BITS), India.

    Luis Ramos

    GE Digital

    Luis Ramos is a senior staff engineer at GE Digital who recently transitioned from GE Global Research, where he drove initiatives on industrial big data projects during the early stages of Predix. Now with the Predix Data Services team, Luis leads the Time Series Service development team. Prior to GE, he worked at startups, where he contributed to Hadoop ecosystem projects and built an analytics system that was used by major telecom companies including Verizon, Sprint, and T-Mobile for smartphone usage and MND. Luis holds a master’s degree in computer science from Cal State Fullerton.