Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Data processing at the speed of 100 Gbps using Apache Crail

Patrick Stuedi (IBM Research)

4:40pm–5:20pm Thursday, March 28, 2019

Data Engineering & Architecture
Location: 2008

Secondary topics: Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Level

Advanced

What you'll learn

Learn how to use Apache Crail to run and deploy machine learning and other data processing workloads on modern clusters equipped with fast networking and storage hardware.

Description

Once a staple of HPC clusters, high-performance network and storage devices are now everywhere. For a fraction of the cost, you can rent 40/100 Gbps RDMA networks and high-end NVMe flash devices that support millions of IOPS, tens of GB/s of bandwidth, and latencies under 100 microseconds. But how do you leverage the speed of high-throughput, low-latency I/O hardware in distributed data processing systems like Spark, Flink, or TensorFlow?
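To put the quoted figures in perspective, here is a rough back-of-the-envelope sketch in Python. It is not part of the talk. The function names are hypothetical, units are decimal (1 GB = 10^9 bytes), and protocol overhead is ignored, so the results are ideal lower bounds rather than measurements.

```python
def gbps_to_gbytes_per_s(link_gbps: float) -> float:
    """Convert a link rate in gigabits/s to gigabytes/s (8 bits per byte)."""
    return link_gbps / 8.0


def min_transfer_seconds(size_gb: float, link_gbps: float) -> float:
    """Ideal time to move size_gb gigabytes over one link, ignoring overhead."""
    return size_gb / gbps_to_gbytes_per_s(link_gbps)


def bytes_in_flight(link_gbps: float, latency_us: float) -> float:
    """Bandwidth-delay product: bytes that must be outstanding to keep the
    link busy for one latency interval."""
    bytes_per_second = link_gbps * 1e9 / 8.0
    return bytes_per_second * latency_us / 1e6


if __name__ == "__main__":
    for rate in (40, 100):
        print(f"{rate} Gbps link: {gbps_to_gbytes_per_s(rate)} GB/s, "
              f"{min_transfer_seconds(100, rate)} s for 100 GB, "
              f"{bytes_in_flight(rate, 100)} bytes in flight at 100 us")
```

Under these assumptions, a 100 Gbps link carries 12.5 GB/s, so moving 100 GB of shuffle data takes at least 8 seconds, and about 1.25 MB must be in flight to keep the link full at 100 microseconds of latency.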

Patrick Stuedi offers an overview of Apache Crail (incubating), a fast, distributed data store designed specifically for high-performance network and storage devices. Crail focuses on ephemeral data, such as shuffle data or temporary datasets in complex job pipelines, with the goal of enabling data sharing at the speed of the hardware in an accessible way. From a user perspective, Crail offers a hierarchical storage namespace implemented over distributed or disaggregated DRAM and flash. At its core, Crail supports multiple storage backends (DRAM, NVMe flash, and 3D XPoint) and networking protocols (RDMA and TCP/sockets). Patrick explores Crail’s design, use cases, and performance results on a 100 Gbps cluster.

Patrick Stuedi

IBM Research

Patrick Stuedi is a member of the research staff at IBM Research Zurich. His research interests include distributed systems, networking, and operating systems. The general theme of his work is exploring how modern networking and storage hardware can be exploited in distributed systems. Previously, he was a postdoc at Microsoft Research Silicon Valley. Patrick is the creator of several open source projects, including DiSNI (RDMA for Java), DaRPC (low-latency RPC), and Apache Crail (incubating). He holds a PhD from ETH Zurich.
