Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools, and frameworks (open source or proprietary), platforms (on-premises, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Monday, Mar 25 - Tuesday, Mar 26: 2-Day Training (Platinum & Training passes)
Tuesday, Mar 26: Tutorials (Gold & Silver passes)
Wednesday, Mar 27: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: Ballroom
Strata Data Conference Keynotes
10:30am
Morning break
Thursday, Mar 28: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: Ballroom
Strata Data Conference Keynotes
10:30am
Morning break
9:00am - 5:00pm Monday, March 25 & Tuesday, March 26
Location: 2018
Jorge Lopez (Amazon Web Services), Roy Hasson (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Gautam Srinivasan (Amazon Web Services), Anthony Nguyen (Amazon Web Services)
Average rating: ****.
(4.50, 4 ratings)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.
9:00am - 5:00pm Monday, March 25 & Tuesday, March 26
Location: 3016
Jesse Anderson (Big Data Institute)
Average rating: ***..
(3.00, 1 rating)
Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem. Read more.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2004
Fabian Hueske (Ververica)
Average rating: *****
(5.00, 1 rating)
Fabian Hueske offers an overview of Apache Flink via the SQL interface, covering stream processing and Flink's various modes of use. Then you'll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API. Read more.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2005
Mark Madsen (Teradata), Todd Walter (Archimedata)
Average rating: ****.
(4.21, 28 ratings)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2006
Jonathan Seidman (Cloudera), Ted Malaska (Capital One)
Average rating: ****.
(4.00, 6 ratings)
The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects. Read more.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2007
Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)
Average rating: ***..
(3.85, 13 ratings)
Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines and doing periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques. Read more.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2008
Santosh Kumar (Cloudera), Andre Araujo (Cloudera), Wim Stoop (Cloudera)
Average rating: *****
(5.00, 1 rating)
Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX. Read more.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2004
Matt Fuller (Starburst)
Average rating: ***..
(3.57, 7 ratings)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today. Read more.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2005
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Average rating: **...
(2.67, 12 ratings)
Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams. Read more.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2006
Sourav Dey (Manifold), Alex Ng (Manifold)
Average rating: ****.
(4.25, 4 ratings)
Many teams are still run as if data science is mainly about experimentation, but those days are over. Now it must offer turnkey solutions to take models into production. Sourav Dey and Alex Ng explain how to streamline an ML project and help your engineers work as an integrated part of your production teams, using a Lean AI process and the Orbyter package for Docker-first data science. Read more.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2007
Holden Karau (Independent), Francesca Lazzeri (Microsoft), Trevor Grant (IBM)
Average rating: ***..
(3.00, 2 ratings)
Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. Read more.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2008
Jason Wang (Cloudera), Brandon Freeman (Cloudera), Michael Kohs (Cloudera), Akihiro Nishikawa (Cloudera), Toby Ferguson (Cloudera)
Average rating: ***..
(3.20, 5 ratings)
There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2001
Jitender Aswani (Netflix), Di Lin (Netflix), Girish Lingappa (Netflix)
Average rating: ***..
(3.40, 15 ratings)
Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2002
Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)
Average rating: ****.
(4.50, 4 ratings)
Jian Chang and Sanjian Chen outline the design of the AI engine on Alibaba's TSDB service, which enables fast and complex analytics of large-scale retail data. They then share a successful case study of the Fresh Hema Supermarket, a major “new retail” platform operated by Alibaba Group, highlighting solutions to the major technical challenges in data cleaning, storage, and processing. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2004
Shubham Tagra (Qubole)
Average rating: ***..
(3.50, 8 ratings)
Did you know you can run Presto in AWS at a tenth of the cost with AWS Spot nodes, with just a few architectural enhancements to Presto? Shubham Tagra explores the gaps in Presto architecture, explains how to use Spot nodes, covers enhancements, and showcases the improvements in terms of reliability and TCO achieved through them. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2006
Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)
Average rating: ****.
(4.67, 3 ratings)
Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions due to running over AWS Spot Instances and having 70% CPU utilization. Read more.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2008
Diego Oppenheimer (Algorithmia)
Average rating: ****.
(4.00, 11 ratings)
You've invested heavily in cleaning your data, feature engineering, training, and tuning your model—but now you have to deploy your model into production, and you discover it's a huge challenge. Diego Oppenheimer shares common architectural patterns and best practices from the most advanced organizations that are deploying models for scalability and accessibility. Read more.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2001
Sandeep Uttamchandani (Intuit)
Average rating: ****.
(4.57, 7 ratings)
How efficient is your data platform? The single metric Intuit uses is time to reliable insights: the total time spent to ingest, transform, catalog, analyze, and publish. Sandeep Uttamchandani shares three design patterns/frameworks Intuit implemented to deal with three challenges to determining time to reliable insights: time to discover, time to catalog, and time to debug for data quality. Read more.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2002
Kurt Brown (Netflix)
Average rating: ****.
(4.22, 9 ratings)
The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data tech (Spark and Flink), enabling services (federated metadata management), and machine learning support. But with power comes complexity. Kurt Brown explains how Netflix is working toward an easier, "self-service" data platform without sacrificing any enabling capabilities. Read more.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2004
Lars Volker (Cloudera), Michael Ho (Cloudera)
Average rating: ****.
(4.50, 6 ratings)
In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Lars Volker and Michael Ho cover the efforts to address the scalability limitations in the now legacy Thrift RPC framework by using Apache Kudu's RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos. Read more.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2006
Average rating: ****.
(4.60, 5 ratings)
In a large global health services company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast, from relevant sources, to act on this data quickly and share the insights with consumers with the same speed and urgency. Join Mohammad Quraishi to learn why streaming data architectures are a necessity—Kafka and Hadoop are key. Read more.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2008
Tobias Knaup (Mesosphere), Joerg Schad (ArangoDB)
Average rating: ****.
(4.50, 2 ratings)
There are many great tutorials for training your deep learning models, but training is only a small part of the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, from exploratory analysis through training, model storage, model serving, and monitoring. Read more.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2001
James Taylor (Lyft)
Average rating: ***..
(3.56, 9 ratings)
James Taylor offers an overview of an automated feedback loop at Lyft to adapt ETL based on the aggregate cost of queries run across the cluster. He also discusses future work to enhance the system through the use of materialized views to reduce the number of ad hoc joins and sorting performed by the most expensive queries by transparently rewriting queries when possible. Read more.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2002
Yaron Haviv (iguazio)
Average rating: ****.
(4.00, 2 ratings)
Faced with the need to handle increasing volumes of data, alternative datasets ("alt data"), and AI, many enterprises are working to design or redesign their big data architectures, but traditional batch platforms fail to generate sufficient ROI. Yaron Haviv shares a continuous analytics approach that yields faster answers for the business while remaining simpler and less expensive for IT. Read more.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2004
Zhenxiao Luo (Twitter)
Average rating: ****.
(4.09, 11 ratings)
From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Zhenxiao Luo explains how Uber supports real-time analytics with deep learning on the fly, without any data copying. Read more.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2006
Adem Efe Gencer (LinkedIn)
Average rating: ***..
(3.50, 2 ratings)
Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control. Read more.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2008
Denise Gosnell (DataStax)
Average rating: ****.
(4.73, 11 ratings)
The graph community has spent years defining and describing its passion: applying graph thinking to solve difficult problems. Denise Gosnell leverages years of experience shipping large-scale applications built on graph databases to share practical and tangible decisions that come into play when designing and delivering distributed graph applications...or playing SimCity 2000. Read more.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2001
Alex Kira (Uber)
Average rating: ****.
(4.00, 13 ratings)
Uber operates at scale, with thousands of microservices serving millions of rides a day, leading to 100+ PB of data. Alex Kira details Uber's journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected the system to make it highly available and horizontally scalable. Read more.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2002
Jowanza Joseph (Pluralsight), Karthik Ramasamy (Streamlio)
Average rating: ****.
(4.00, 1 rating)
After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, Jowanza Joseph and Karthik Ramasamy decided to explore a new platform that would take advantage of Kubernetes and support a simpler data processing DSL. Join in to discover why they chose Apache Pulsar and learn tips and tricks for using Pulsar Functions. Read more.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2004
Julien Le Dem (WeWork)
Average rating: ****.
(4.83, 6 ratings)
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. Read more.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2006
Sean Glover (Lightbend)
Average rating: ****.
(4.00, 1 rating)
The best way to run stateful services with complex operational needs like Kafka is to use the operator pattern. Sean Glover offers an overview of the Strimzi Kafka Operator, a popular new open source Operator-based Apache Kafka implementation on Kubernetes. Read more.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2008
Secondary topics: Model lifecycle management
Corey Zumar (Databricks)
Average rating: ****.
(4.89, 9 ratings)
Developing applications that leverage machine learning is difficult. Practitioners need to be able to reproduce their model development pipelines, as well as deploy models and monitor their health in production. Corey Zumar offers an overview of MLflow, which simplifies this process by managing, reproducing, and operationalizing machine learning through a suite of model tracking and deployment APIs. Read more.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2001
Gwen Shapira (Confluent)
Average rating: ****.
(4.64, 11 ratings)
As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. Gwen Shapira discusses how data engineering requirements have changed in a cloud native world and shares architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines. Read more.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2002
Rustem Feyzkhanov (Instrumental)
Average rating: ***..
(3.50, 8 ratings)
Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture. Read more.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2004
Tim Armstrong (Cloudera)
Average rating: ****.
(4.80, 5 ratings)
As the popularity and utilization of Apache Impala deployments increase, clusters often become victims of their own success when demand for resources exceeds the supply. Tim Armstrong dives into the latest resource management features in Impala to maintain high cluster availability and optimal performance and provides examples of how to configure them in your Impala deployment. Read more.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2006
Average rating: ****.
(4.50, 2 ratings)
GE produces a third of the world's power and 60% of its airplane engines—a critical portion of the world's infrastructure that requires meticulous monitoring of the hundreds of sensors streaming data from each turbine. June Andrews and John Rutherford explain how GE's monitoring and diagnostics teams released the first real-time ML systems used to determine turbine health into production. Read more.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2008
Skyler Thomas (MapR), Terry He (MapR Technologies)
Average rating: ****.
(4.75, 4 ratings)
Kubeflow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by Kubeflow. Read more.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2001
Mark Grover (Lyft), Tao Feng (Lyft)
Average rating: ****.
(4.40, 10 ratings)
Lyft has reduced the time it takes to discover data by 10x by building its own data portal, Amundsen. Mark Grover and Tao Feng offer a demo of Amundsen and lead a deep dive into its architecture, covering how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. They also explore the future roadmap, unsolved problems, and its collaboration model. Read more.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2002
Subhadra Tatavarti (PayPal), Chen Kovacs (PayPal)
Average rating: ****.
(4.12, 8 ratings)
The PayPal data ecosystem is large, with 250+ PB of data transacting in 200+ countries. Given this massive scale and complexity, discovering and accessing the right datasets in a frictionless environment is a challenge. Subhadra Tatavarti and Chen Kovacs explain how PayPal’s data platform team is helping solve this problem with a combination of self-service integrated and interoperable products. Read more.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2004
Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Presto Software Foundation)
Average rating: ***..
(3.33, 3 ratings)
Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward. Read more.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2006
Sijie Guo (StreamNative), Penghui Li (Zhaopin)
Average rating: ****.
(4.00, 1 rating)
Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. Sijie Guo and Penghui Li discuss the event bus requirements for Zhaopin.com, one of China's biggest online recruitment services providers, and explain why the company chose Apache Pulsar. Read more.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2008
Yue Li (MemVerge), Shouwei Chen (Rutgers University)
Average rating: *****
(5.00, 4 ratings)
JD.com recently designed a brand-new architecture to optimize Spark computing clusters. Yue Li and Shouwei Chen detail the problems the team faced when building it and explain how the company now benefits from the in-memory distributed filesystem. Read more.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2024
Thomas Phelan (HPE BlueData)
Average rating: ****.
(4.50, 2 ratings)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). But TDE is difficult to configure and manage—particularly when run in Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them. Read more.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2007
Eric Jonas (UC Berkeley)
Average rating: ****.
(4.50, 2 ratings)
Eric Jonas offers a quick history of cloud computing, including an accounting of the predictions of the 2009 "Berkeley View of Cloud Computing" paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential. Read more.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2001
Jason Wang (Cloudera), Sushant Rao (Cloudera)
Average rating: ****.
(4.00, 2 ratings)
Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms. Read more.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2002
Average rating: ****.
(4.75, 4 ratings)
Juan Paulo Gutierrez explains how a small team in Tokyo went through several evolutions as they built an analytics service to help 200+ businesses accelerate their decision-making process. Join in to hear about the background, challenges, architecture, success stories, and best practices as they built and productionalized Rakuten Analytics. Read more.
Add to your personal schedule
11:50am12:30pm Thursday, March 28, 2019
Location: 2004
Fabian Hueske (Ververica)
Average rating: ****.
(4.30, 10 ratings)
Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL. Read more.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2006
Vivek Pasari (Netflix), Jitender Aswani (Netflix)
Average rating: ***..
(3.14, 7 ratings)
Netflix has over 125 million members spread across 191 countries. Each day its members interact with its client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. Vivek Pasari dives into the data engineering and architecture that enables application performance measurement at this scale. Read more.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2008
Fait Poms (Stanford University), Will Crichton (Stanford University)
Average rating: ****.
(4.75, 4 ratings)
Video is now the largest source of data on the internet, so we need tools to make it easier to process and analyze. Fait Poms and Will Crichton offer an overview of Scanner, the first open source distributed system for building large-scale video processing applications, and explore real-world use cases. Read more.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2024
Alkis Simitsis (Micro Focus), Shivnath Babu (Unravel Data Systems | Duke University)
Average rating: **...
(2.67, 3 ratings)
Alkis Simitsis and Shivnath Babu share an automated technique for root cause analysis (RCA) for big data stack applications using deep learning techniques, with Spark and Impala as examples. The concepts they discuss apply generally to the big data stack. Read more.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2007
Avner Braverman (Binaris)
Average rating: ****.
(4.00, 3 ratings)
What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code. Read more.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2001
Krishna Gade (Fiddler Labs)
Average rating: ****.
(4.67, 3 ratings)
Join Krishna Gade to learn how to address engineering and organizational challenges for AI fairness and operationalize these concepts in a production AI system—and crucially, create a culture of trust in AI. Read more.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2002
Jacques Nadeau (Dremio)
Average rating: ****.
(4.60, 5 ratings)
Apache Arrow Flight is a new initiative focused on providing high-performance communication within data engineering and data science infrastructure. Jacques Nadeau explains how Flight works and where it has been integrated. He also discusses how Flight can be used to abstract physical data management from logical access and shares benchmarks of workloads that have been improved by Flight. Read more.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2004
Haifeng Chen (Intel)
Average rating: ****.
(4.00, 3 ratings)
Spark SQL is widely used, but it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. Haifeng Chen shares a Spark adaptive execution engine built to address these challenges. It can handle task parallelism, join conversion, and data skew dynamically during runtime, guaranteeing the best plan is chosen using runtime statistics. Read more.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2006
Matvey Arye (TimescaleDB)
Average rating: ***..
(3.75, 4 ratings)
Matvey Arye offers an overview of two newly released features of TimescaleDB—automated adaptation of time-partitioning intervals and continuous aggregations in near real time—and discusses how these capabilities ease time series data management. Along the way, he also shares real-world use cases, including TimescaleDB's use with other technologies such as Kafka. Read more.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2008
Arun Kumar (University of California, San Diego)
Average rating: ****.
(4.00, 2 ratings)
Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2001
Xiao Li (Databricks), Wenchen Fan (Databricks)
Average rating: ***..
(3.25, 4 ratings)
Xiao Li and Wenchen Fan offer an overview of the major features and enhancements in Apache Spark 2.4 and give insight into upcoming releases. Then you'll get the chance to ask all your burning Spark questions. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2002
Rohan Dhupelia (Atlassian), Jimmy Li (Atlassian)
Average rating: ****.
(4.67, 3 ratings)
Analytics is easy, but good analytics is hard. Atlassian knows this all too well. Rohan Dhupelia and Jimmy Li explain how the company's push to become truly data driven has transformed the way it thinks about behavioral analytics, from how it defined its events to how it ingests and analyzes them. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2004
Eva Andreasson (Cloudera), Mark Brine (Cloudera), Michael Kohs (Cloudera)
Average rating: **...
(2.00, 3 ratings)
Michael Kohs, Eva Andreasson, and Mark Brine explain how Cloudera’s Finance Department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. They also share guidelines for deploying modern data warehousing in a hybrid cloud environment, outlining when you should choose a private cloud service over a public one, the available options, and some dos and don'ts. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2006
Akshai Sarma (Yahoo), Nathan Speidel (Yahoo)
Average rating: ***..
(3.67, 3 ratings)
Akshai Sarma and Nathan Speidel offer an overview of Bullet, a scalable, pluggable, lightweight multitenant system for querying any data flowing through a streaming system without storing it. Using sketch-based algorithms, Bullet efficiently supports otherwise intractable operations like top K, count distinct, and windowing without any storage. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2008
Paul Curtis (Weaveworks)
Average rating: ****.
(4.50, 2 ratings)
What do you do when your technology doesn’t easily fit on a single laptop and consists of many components? Paul Curtis explains how MapR Technologies rolled out a containerized, scalable, globally available, and easily updatable environment using a combination of Kubernetes to orchestrate, a shared data fabric to store and persist, and AppLariat to provide the user interface. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2024
John Bennett (Netflix), Siamac Mirzaie (Netflix)
Average rating: ***..
(3.33, 3 ratings)
Data has become a foundational pillar for security teams operating in organizations of all shapes and sizes. This new norm has created a need for platforms that enable engineers to harness data for various security purposes. John Bennett and Siamac Mirzaie offer an overview of Netflix's internal platform for quickly deploying data-based detection capabilities in the corporate environment. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2001
Li Gao (Lyft), Bill Graham (Lyft)
Average rating: ****.
(4.00, 2 ratings)
Li Gao and Bill Graham discuss the challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2002
Igor Canadi (Rockset), Dhruba Borthakur (Rockset)
Average rating: ****.
(4.00, 1 rating)
Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2004
Secondary topics: Data Platforms
Adrian Lungu (Adobe), Serban Teodorescu (Adobe)
Average rating: ****.
(4.75, 4 ratings)
Adrian Lungu and Serban Teodorescu explain how, inspired by the green-blue deployment technique, the Adobe Audience Manager team developed an active-passive database migration procedure that lets them test database clusters in production, minimizing risk without compromising innovation. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2006
Vaclav Surovec (Deutsche Telekom), Gabor Kotalik (Deutsche Telekom)
Average rating: ****.
(4.00, 1 rating)
Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2008
Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)
Average rating: ***..
(3.33, 3 ratings)
Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMOF and explain how it improves Spark analytics performance. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2024
Julien Delange (Twitter), Neng Lu (Twitter)
Average rating: **...
(2.67, 3 ratings)
Julien Delange and Neng Lu explain how Twitter uses the Heron stream processing engine to monitor and analyze its network infrastructure—implementing a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. Join in to learn the key technologies used, the architecture, and the challenges Twitter faced. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2001
Holden Karau (Independent), Rachel Warren (Salesforce Einstein)
Average rating: ****.
(4.60, 5 ratings)
Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (a.k.a. tuning) or our jobs may be eaten by Cthulhu. Holden Karau and Rachel Warren explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads, including new settings in 2.4. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2002
Sonali Sharma (Netflix), Shriya Arora (Netflix)
Average rating: ***..
(3.00, 2 ratings)
With so much data being generated in real time, what if we could combine all these high-volume data streams and provide near real-time feedback for model training, improving personalization and recommendations and taking the customer experience to a whole new level? Sonali Sharma and Shriya Arora explain how to do exactly that, using Flink's keyed state. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2006
Jinchul Kim (SK Telecom)
Average rating: **...
(2.17, 6 ratings)
Druid supports autoscaling for data ingestion, but it's only available on AWS EC2. You can't rely on the feature in your private cloud. Jinchul Kim demonstrates automatic scale-out and scale-in on Kubernetes, details the benefits of this approach, and discusses the development of Druid Helm charts, rolling updates, and custom metric usage for horizontal autoscaling. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2008
Patrick Stuedi (IBM Research)
Average rating: ****.
(4.00, 1 rating)
Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2018
Secondary topics: Model lifecycle management
Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)
Average rating: ****.
(4.00, 1 rating)
Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2024
Yves Thibaudeau (US Census Bureau)
Average rating: ***..
(3.33, 3 ratings)
The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, there's been a lot of change in computing capabilities and new techniques, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications. Read more.