Mar 15–18, 2020

Schedule: Data Engineering sessions

11:00am–11:40am Tuesday, March 17, 2020
Location: LL20A
Alasdair Allan (Babilim Light Industries)
Much of the data we collect is thrown away, but that's about to change: the power envelope needed to run machine learning models on embedded hardware has fallen dramatically, enabling you to put the smarts on the device rather than in the cloud. Alasdair Allan explains how the data you throw away can be processed in real time at the edge and what that means for how you deal with data.
11:50am–12:30pm Tuesday, March 17, 2020
Location: LL20A
Secondary topics:  Cloud Platforms and SaaS
Alexander Pierce (Pepperdata)
Alex Pierce evaluates Amazon Elastic MapReduce (EMR), Azure HDInsight, and Google Cloud Dataproc, three leading cloud services, with respect to their Hadoop and big data autoscaling capabilities, and offers guidance to help you determine which flavor of autoscaling best fits your business needs.
1:45pm–2:25pm Tuesday, March 17, 2020
Location: LL20A
Secondary topics:  Cloud Platforms and SaaS
Jacques Nadeau (Dremio)
Jacques Nadeau leads a deep dive into important considerations when choosing between data lake storage options: speed, cost, and consistency. You'll learn about these differences and how caching and ephemeral storage can affect these trade-offs. Jacques demonstrates technologies that improve the analytical experience by compensating for slow reads.
2:35pm–3:15pm Tuesday, March 17, 2020
Location: LL20A
Michael Freedman (TimescaleDB | Princeton University)
Time series data accumulates very quickly across DevOps, IoT, industrial and energy, finance, and other domains, with monitoring and IoT applications generating tens of millions of metrics per second and petabytes of data. Michael Freedman shows you how to build a distributed time series database that offers the power of full SQL at scale.
4:15pm–4:55pm Tuesday, March 17, 2020
Location: LL20A
Secondary topics:  Data Management and Storage
Kamil Bajda-Pawlikowski explores Presto, an open source SQL engine featuring low-latency queries, high concurrency, and the ability to query multiple data sources. With Kubernetes, you can easily deploy and manage Presto clusters across hybrid and multicloud environments with built-in high availability, autoscaling, and monitoring.
5:05pm–5:45pm Tuesday, March 17, 2020
Location: LL20A
Ben Galewsky (National Center for Supercomputing Applications), Lindsey Gray (Fermi National Accelerator Laboratory), Andrew Melo (Vanderbilt University)
Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges, some unique to high energy physics and some common to big science projects across disciplines. Ben Galewsky, Lindsey Gray, and Andrew Melo highlight how much of this work can inform industry data science at scale.
11:00am–11:40am Wednesday, March 18, 2020
Location: LL20A
Sophie Watson (Red Hat), William Benton (Red Hat)
Cloud native infrastructure like Kubernetes has obvious benefits for machine learning systems, allowing you to scale out experiments, train on specialized hardware, and conduct A/B tests. What isn’t obvious are the challenges that come up on day two. Sophie Watson and William Benton share their experience helping end users navigate these challenges and make the most of new opportunities.
11:50am–12:30pm Wednesday, March 18, 2020
Location: LL20A
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Data lakes are hot again; with S3 from AWS as the data lake storage, the modern data lake architecture separates compute from storage. You can choose from a variety of elastic, scalable, and cost-efficient technologies when designing a cloud data lake. Tomer Shiran and Jacques Nadeau share best practices for building a data lake on AWS, as well as various services and open source building blocks.
1:45pm–2:25pm Wednesday, March 18, 2020
Location: LL20A
Secondary topics:  Streaming and IoT
Denise Gosnell (DataStax)
Self-organizing networks rely on sensor communication and a centralized mechanism, like a cell tower, for transmitting the network's status. Denise Gosnell walks you through what happens if the tower goes down and how a graph data structure gets involved in the network's healing process. You'll see graphs in this dynamic network and how path information helps sensors come back online.
2:35pm–3:15pm Wednesday, March 18, 2020
Location: LL20A
Wangda Tan (Cloudera), Arpit Agarwal (Cloudera)
In 2020, Hadoop is still evolving fast. You'll learn the current status of the Apache Hadoop community and the exciting present and future of Hadoop 3.x. Wangda Tan and Arpit Agarwal cover new features such as Hadoop on Cloud, GPU support, NameNode federation, Docker, 10x scheduling improvements, and Ozone, and they offer upgrade guidance from 2.x to 3.x.
4:15pm–4:55pm Wednesday, March 18, 2020
Location: LL20A
Zhe Zhang (LinkedIn), Huangming Xie (LinkedIn)
Compute efficiency optimization is of critical importance in the big data era, as data science and ML algorithms become increasingly complex and data size increases exponentially over time. Opportunities exist throughout the resource use funnel, which Zhe Zhang and Huangming Xie characterize using a CLUE framework.
