Presented By
O’Reilly + Cloudera

Make Data Work

March 25-28, 2019
San Francisco, CA

Schedule: Data preparation, data governance, and data lineage sessions

Much of the ML in use within companies falls under supervised learning, which means proper training data (or labeled examples) are essential. The rise of deep learning has made this even more pronounced, as many modern neural network architectures rely on large amounts of training data. Issues pertaining to data security, privacy, and governance persist and are not necessarily unique to ML applications. But the hunger for large amounts of training data, the advent of new regulations like GDPR, and the importance of managing risk mean that a stronger emphasis on reproducibility and data lineage is very much needed.

9:00am–12:30pm Tuesday, March 26, 2019

Hands-on machine learning with Kafka-based streaming pipelines

Data Engineering & Architecture, Streaming and IoT
Location: 2007

Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)

Average rating: 3.85 (13 ratings)

Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines, including periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.

9:00am–12:30pm Tuesday, March 26, 2019

Hands-on with Cloudera SDX: Setting up your own shared data experience

Data Engineering & Architecture
Location: 2008

Santosh Kumar (Cloudera), Andre Araujo (Cloudera), Wim Stoop (Cloudera)

Average rating: 5.00 (1 rating)

Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX.

11:00am–11:40am Wednesday, March 27, 2019

Scaling data lineage at Netflix to improve data infrastructure reliability and efficiency

Data Engineering & Architecture
Location: 2001

Jitender Aswani (Netflix), Di Lin (Netflix), Girish Lingappa (Netflix)

Average rating: 3.40 (15 ratings)

Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.

2:40pm–3:20pm Wednesday, March 27, 2019

Understanding the data universe with a data catalog

Executive Briefing and best practices, Strata Business Summit
Location: 2018

John Haddad (Informatica)

Average rating: 4.60 (5 ratings)

Just like a powerful space telescope that scans the universe, a data catalog scans the data universe to help data scientists and analysts find data, collaborate, and curate data for analytic and data governance projects. John Haddad explains how a data catalog can help you find the data you need and trust for analytic and data governance projects.

4:20pm–5:00pm Wednesday, March 27, 2019

Executive Briefing: Overview of data governance

Executive Briefing and best practices, Strata Business Summit
Location: 2020

Paco Nathan (derwen.ai)

Average rating: 3.67 (6 ratings)

Effective data governance is foundational for AI adoption in the enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, processes, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa.

11:00am–11:40am Thursday, March 28, 2019

Disrupting data discovery

Data Engineering & Architecture
Location: 2001

Mark Grover (Lyft), Tao Feng (Lyft)

Average rating: 4.40 (10 ratings)

Lyft has reduced the time it takes to discover data by 10x by building its own data portal, Amundsen. Mark Grover and Tao Feng offer a demo of Amundsen and lead a deep dive into its architecture, covering how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. They also explore the future roadmap, unsolved problems, and its collaboration model.

11:00am–11:40am Thursday, March 28, 2019

ML and AI at scale at PayPal

Data Engineering & Architecture
Location: 2002

Subhadra Tatavarti (PayPal), Chen Kovacs (PayPal)

Average rating: 4.12 (8 ratings)

The PayPal data ecosystem is large, with 250+ PB of data transacting in 200+ countries. Given this massive scale and complexity, discovering and accessing the right datasets in a frictionless environment is a challenge. Subhadra Tatavarti and Chen Kovacs explain how PayPal’s data platform team is helping solve this problem with a combination of self-service, integrated, and interoperable products.

2:40pm–3:20pm Thursday, March 28, 2019

Anomaly detection using deep learning to measure the quality of large datasets

Data Science, Machine Learning & AI
Location: 2016

Sridhar Alla (BlueWhale), Syed Nasar (Cloudera)

Average rating: 2.86 (7 ratings)

Any business, big or small, depends on analytics, whether the goal is revenue generation, churn reduction, or sales and marketing. No matter the algorithm and techniques used, the result depends on the accuracy and consistency of the data being processed. Sridhar Alla and Syed Nasar share techniques used to evaluate the quality of data and the means to detect anomalies in the data.

2:40pm–3:20pm Thursday, March 28, 2019

Transforming behavioral analytics at Atlassian

Data Engineering & Architecture
Location: 2002

Rohan Dhupelia (Atlassian), Jimmy Li (Atlassian)

Average rating: 4.67 (3 ratings)

Analytics is easy, but good analytics is hard. Atlassian knows this all too well. Rohan Dhupelia and Jimmy Li explain how the company's push to become truly data-driven has transformed the way it thinks about behavioral analytics, from how it defines its events to how it ingests and analyzes them.

4:40pm–5:20pm Thursday, March 28, 2019

New directions in record linkage

Data Engineering & Architecture
Location: 2024

Yves Thibaudeau (US Census Bureau)

Average rating: 3.33 (3 ratings)

The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, there's been a lot of change in computing capabilities and new techniques, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications.

4:40pm–5:20pm Thursday, March 28, 2019

Taming large state to join datasets for personalization

Data Engineering & Architecture
Location: 2002

Sonali Sharma (Netflix), Shriya Arora (Netflix)

Average rating: 3.00 (2 ratings)

With so much data being generated in real time, what if we could combine all these high-volume data streams and provide near real-time feedback for model training, improving personalization and recommendations and taking the customer experience to a whole new level? Sonali Sharma and Shriya Arora explain how to do exactly that using Flink's keyed state.

©2019, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com