Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Schedule: Data preparation, data governance, and data lineage sessions

Much of the ML in use within companies falls under supervised learning, which means proper training data (labeled examples) is essential. The rise of deep learning has made this even more pronounced, as many modern neural network architectures rely on large amounts of training data. Issues pertaining to data security, privacy, and governance persist and are not necessarily unique to ML applications. But the hunger for large amounts of training data, the advent of new regulations like GDPR, and the importance of managing risk mean that a stronger emphasis on reproducibility and data lineage is very much needed.

9:00am-12:30pm Tuesday, March 26, 2019
Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)
Average rating: 3.85 (13 ratings)
Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines, including periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.
9:00am-12:30pm Tuesday, March 26, 2019
Santosh Kumar (Cloudera), Andre Araujo (Cloudera), Wim Stoop (Cloudera)
Average rating: 5.00 (1 rating)
Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX.
11:00am-11:40am Wednesday, March 27, 2019
Jitender Aswani (Netflix), Di Lin (Netflix), Girish Lingappa (Netflix)
Average rating: 3.40 (15 ratings)
Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.
2:40pm-3:20pm Wednesday, March 27, 2019
John Haddad (Informatica)
Average rating: 4.60 (5 ratings)
Just like a powerful space telescope that scans the universe, a data catalog scans the data universe to help data scientists and analysts find data, collaborate, and curate data for analytic and data governance projects. John Haddad explains how a data catalog can help you find the data you need and trust for analytic and data governance projects.
4:20pm-5:00pm Wednesday, March 27, 2019
Paco Nathan (derwen.ai)
Average rating: 3.67 (6 ratings)
Effective data governance is foundational for AI adoption in the enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, processes, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa.
11:00am-11:40am Thursday, March 28, 2019
Mark Grover (Lyft), Tao Feng (Lyft)
Average rating: 4.40 (10 ratings)
Lyft has reduced the time it takes to discover data by 10x by building its own data portal, Amundsen. Mark Grover and Tao Feng offer a demo of Amundsen and lead a deep dive into its architecture, covering how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. They also explore Amundsen's future roadmap, unsolved problems, and collaboration model.
11:00am-11:40am Thursday, March 28, 2019
Subhadra Tatavarti (PayPal), Chen Kovacs (PayPal)
Average rating: 4.12 (8 ratings)
The PayPal data ecosystem is large, with 250+ PB of data transacting in 200+ countries. Given this massive scale and complexity, discovering and accessing the right datasets in a frictionless environment is a challenge. Subhadra Tatavarti and Chen Kovacs explain how PayPal’s data platform team is helping solve this problem with a combination of self-service, integrated, and interoperable products.
2:40pm-3:20pm Thursday, March 28, 2019
Sridhar Alla (BlueWhale), Syed Nasar (Cloudera)
Average rating: 2.86 (7 ratings)
Any business, big or small, depends on analytics, whether the goal is revenue generation, churn reduction, or sales and marketing. No matter the algorithm and the techniques used, the result depends on the accuracy and consistency of the data being processed. Sridhar Alla and Syed Nasar share techniques used to evaluate the quality of data and the means to detect anomalies in the data.
2:40pm-3:20pm Thursday, March 28, 2019
Rohan Dhupelia (Atlassian), Jimmy Li (Atlassian)
Average rating: 4.67 (3 ratings)
Analytics is easy, but good analytics is hard. Atlassian knows this all too well. Rohan Dhupelia and Jimmy Li explain how the company's push to become truly data driven has transformed the way it thinks about behavioral analytics, from how it defined its events to how it ingests and analyzes them.
4:40pm-5:20pm Thursday, March 28, 2019
Yves Thibaudeau (US Census Bureau)
Average rating: 3.33 (3 ratings)
The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, computing capabilities have changed a great deal and new techniques have emerged, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications.
4:40pm-5:20pm Thursday, March 28, 2019
Sonali Sharma (Netflix), Shriya Arora (Netflix)
Average rating: 3.00 (2 ratings)
With so much data being generated in real time, what if we could combine all these high-volume data streams and provide near real-time feedback for model training, improving personalization and recommendations and taking the customer experience to a whole new level? Sonali Sharma and Shriya Arora explain how to do exactly that, using Flink's keyed state.