Mar 15–18, 2020

Schedule: Data, Analytics, and AI Architecture sessions

1:30pm–5:00pm Monday, March 16, 2020
Location: LL21B
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Anurag Khandelwal (RISELab, UC Berkeley)
We walk the audience through the landscape of streaming systems for each stage of an end-to-end data processing pipeline: messaging, compute, and storage. The audience will get an overview of the inception and growth of the serverless paradigm. We then take a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar Functions.
11:00am–11:40am Tuesday, March 17, 2020
Location: Expo Hall
Sandeep U (Intuit), Giriraj Bagadi (Intuit), Sunil Goplani (Intuit)
Data quality metrics today focus on quantifying whether "data is a mess." But what are lead indicators to track before data actually becomes a mess? This talk shares our experiences in developing lead indicators for data quality for our production data pipelines at Intuit. The talk covers details of lead indicators, tools developed to optimize, and lessons that moved the needle on data quality.
1:45pm–2:25pm Tuesday, March 17, 2020
Location: Expo Hall
Eitan Anzenberg (Bill.com)
Although the field of optical character recognition (OCR) has been around for almost half a century, document parsing and field extraction from images remain an open research topic. We utilize an end-to-end deep learning and OCR architecture to predict regions of interest within documents and automatically extract their text.
4:15pm–4:55pm Tuesday, March 17, 2020
Location: LL21 C
Lior Gavish (Barracuda)
Lior Gavish breaks down a machine learning (ML)-based system that detects a highly evasive type of email-based fraud. The system combines innovative techniques for labeling and classifying highly unbalanced datasets with a distributed cloud application capable of processing high-volume communication in real time.
5:05pm–5:45pm Tuesday, March 17, 2020
Location: 230 A
Batch processing can benefit immensely from adopting some techniques from the stream processing world. In this talk, we will share how Apache Hudi (Incubating), an open source project created at Uber and currently incubating with the ASF, can bridge this gap and enable more productive, efficient batch data engineering.
5:05pm–5:45pm Tuesday, March 17, 2020
Location: LL20A
Ben Galewsky (National Center for Supercomputing Applications), Gray Lindsey (Fermi National Accelerator Laboratory), Andrew Melo (Vanderbilt University)
Building a data engineering pipeline for serving segments of a 200 PB dataset to particle physicists around the globe poses many challenges. Some of them are unique to high energy physics and some apply to big science projects across disciplines, but many can inform industry data science at scale.
2:35pm–3:15pm Wednesday, March 18, 2020
Location: LL21 C
Micah Wylde (Lyft)
At Lyft, we process millions of events per second in real time to compute prices, balance marketplace dynamics, and detect fraud, among many other use cases. This talk will cover how we are using Kubernetes, along with Flink, Beam, and Kafka, to enable service engineers and data scientists to easily build real-time data applications.
4:15pm–4:55pm Wednesday, March 18, 2020
Location: LL21A
Chendi Xue (Intel), Jian Zhang (Intel)
This presentation explains how we accelerate Spark SQL with AVX-supported vectorization technology. The session covers both design and evaluation, including how we enabled columnar processing in Spark SQL, how we use Arrow as the intermediate data format, and how we leverage AVX-enabled Gandiva for data processing, followed by a performance analysis with system metrics and a breakdown.
4:15pm–4:55pm Wednesday, March 18, 2020
Location: LL20A
Zhe Zhang (LinkedIn), Huangming Xie (LinkedIn)
Compute efficiency optimization is of critical importance in the big data era, as data science and ML algorithms become increasingly complex and data size increases exponentially over time. Opportunities exist throughout the resource utilization funnel, which we characterize using a CLUE framework.
