Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference

Distributed real-time highly available stream processing

Yu-Xi Lim (Teralytics), Michal Wegrzyn (Teralytics)
5:05pm5:45pm Thursday, December 7, 2017
Average rating: *****
(5.00, 2 ratings)

Who is this presentation for?

  • Data engineers

Prerequisite knowledge

  • A basic understanding of functional reactive programming (i.e., Scala) and streaming frameworks

What you'll learn

  • Explore a functional reactive programming architecture pattern that naturally lends itself to designing highly distributed, scalable systems
  • Learn an orchestration strategy for the pattern that leads to high availability and ensures data consistency and integrity


Real-time event stream processing is increasingly becoming a staple of data-driven decision making in many organizations. However, providing such data processing capability is not without challenges. Real-time decision-making support systems must be able to handle high event traffic volume and execute sophisticated analyses and must be highly available and scalable with high data consistency and integrity guarantees. However, satisfying all these requirements simultaneously is difficult.

Yu-Xi Lim and Michal Wegrzyn outline a high-throughput distributed software pattern capable of processing event streams in real time. It can easily define an event flow (i.e., a stream-in/stream-out transformer) and shard the state, pass state fragments to independent state machines which would process independent stream shards, and then splice resulting state fragments into a global state if necessary. With proper orchestration, this framework becomes capable of performing asynchronous checkpointing and seamless recovery of individually failed instances, leading to a highly reliable, highly available framework.

Yu-Xi and Michal illustrate the framework via a case study of Teralytics, a company that analyzes cellular location data to derive transportation patterns and compute transportation-related metrics. Yu-Xi and Michal demonstrate how the algorithm for computing one such metric can be formulated in the framework and how the resulting component reflects the traits given above.

Photo of Yu-Xi Lim

Yu-Xi Lim


Yu-Xi Lim is lead data scientist at Teralytics, where he leads the technical team in the company’s Singapore office. Yu-Xi is interested in applying data science to retail and travel. Previously, he led teams at Southeast Asian ecommerce giant Lazada and at TravelShark; was vice president of engineering at payment startup Fastacash; and was a software engineer in Microsoft’s Windows Division. Yu-Xi holds a PhD in electrical and computer engineering from Georgia Tech, where he did research on WiFi positioning systems.

Michal Wegrzyn


Michal Wegrzyn is a software engineer at Teralytics, where he is responsible for developing batch and streaming big data analytics applications. He previously worked with CellVision and Orange Poland where he built and delivered telco analytics applications. He holds a master’s degree in electronics and telecommunications from AGH University of Science and Technology.