Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools, and frameworks (open source or proprietary), platforms (on-premises, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Monday, Mar 25 - Tuesday, Mar 26: 2-Day Training (Platinum & Training passes)
Tuesday, Mar 26: Tutorials (Gold & Silver passes)
Wednesday, Mar 27: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: Ballroom
Strata Data Conference Keynotes
10:30am
Morning break
Thursday, Mar 28: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: Ballroom
Strata Data Conference Keynotes
10:30am
Morning break
9:00am - 5:00pm Monday, March 25 & Tuesday, March 26
Location: 2018
Secondary topics:  AI and Data technologies in the cloud, Storage
Jorge Lopez (Amazon Web Services)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Jorge Lopez shows you how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. Then, you'll build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.
9:00am - 5:00pm Monday, March 25 & Tuesday, March 26
Location: 3016
Secondary topics:  Streaming, realtime analytics, and IoT
Jesse Anderson (Big Data Institute)
Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2004
Secondary topics:  Streaming, realtime analytics, and IoT
Fabian Hueske (Ververica)
This hands-on session introduces Flink via the SQL interface. You will receive an overview of stream processing and a survey of Apache Flink with its various modes of use. Then we’ll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2005
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2006
Secondary topics:  AI and machine learning in the enterprise
Jonathan Seidman (Cloudera), Ted Malaska (Capital One)
The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation we’ll provide guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2007
Secondary topics:  Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Model lifecycle management
Boris Lublinsky (Lightbend), Dean Wampler (Lightbend)
This hands-on tutorial examines production use of ML in streaming data pipelines: how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, pros and cons of microservices vs. systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.
9:00am - 12:30pm Tuesday, March 26, 2019
Location: 2008
Secondary topics:  Data preparation, data governance, and data lineage, Storage
Santosh Kumar (Cloudera)
Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar offers an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with all the skills and experience you need to set up your own SDX.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2004
Secondary topics:  Streaming, realtime analytics, and IoT
Matt Fuller (Starburst)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL-on-Anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from gigabytes to petabytes. In this tutorial, attendees will learn Presto usage and best practices and can work through optional hands-on exercises.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2005
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline (messaging, compute, and storage) for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2006
Secondary topics:  AI and machine learning in the enterprise
Sourav Dey (Manifold), Alex Ng (Manifold)
Many teams are still run as if data science is mainly about experimentation, but those days are over. Now it must be turnkey to take models into production. Sourav Dey and Alex Ng explain how to streamline a machine learning project and help your engineers work as an integrated part of your production teams, using a Lean AI process and the Orbyter package for Docker-first data science.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2007
Secondary topics:  AI and Data technologies in the cloud, Model lifecycle management
Holden Karau (Google), Francesca Lazzeri (Microsoft), Trevor Grant (IBM), Ilan Filonenko (Bloomberg LP)
This workshop will quickly introduce what Kubeflow is and how we can use it to train and serve models across different cloud environments (and on-prem). We’ll have a script ready to do the initial setup work so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. We will start with a simple model, with follow-up links.
1:30pm - 5:00pm Tuesday, March 26, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud
Jason Wang (Cloudera), Tony Wu (Cloudera), Vinithra Varadharajan (Cloudera)
Moving to the cloud poses challenges, from re-architecting to be cloud-native to keeping data context consistent across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover cloud architecture and its challenges in depth; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Media, Marketing, Advertising
Jitender Aswani (Netflix), Di Lin (Netflix)
Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing the platform’s reliability, increasing trust in data, and improving data infrastructure efficiency.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2002
Secondary topics:  AI and machine learning in the enterprise, Automation in data science and big data, Data Platforms, Retail and e-commerce, Storage, Temporal data and time-series analytics
Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)
We share the design of the AI Engine on the Alibaba TSDB service, which enables fast and complex analytics of large-scale retail data. We present a successful case study of the Fresh Hema Supermarket, a major “New Retail” platform operated by Alibaba Group, and highlight our solutions to the major technical challenges in data cleaning, storage, and processing.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2004
Secondary topics:  AI and Data technologies in the cloud
Shubham Tagra (Qubole)
Running Presto in AWS at 1/10th the cost with AWS Spot nodes can be achieved with a few architectural enhancements to Presto. This talk explains the gaps in the Presto architecture that prevent the use of Spot nodes, covers these enhancements, and showcases the improvements in reliability and TCO achieved through them.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT
Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)
Live Aggregators (LA) is a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA consumes billions of Kafka messages and does over 1.5 billion writes to Cassandra per day. It is 80% cheaper than competing streaming solutions due to running over AWS spot instances and having 70% CPU utilization.
11:00am - 11:40am Wednesday, March 27, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud, Automation in data science and big data, Model lifecycle management
Diego Oppenheimer (Algorithmia)
You've invested heavily in cleaning your data, feature engineering, and training and tuning your model, but now you have to deploy your model into production and you discover it's a huge challenge. In this talk, you'll learn common architectural patterns and best practices from the most advanced organizations that are deploying models for scalability and accessibility.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Financial Services
Sandeep Uttamchandani (Intuit)
How efficient is your data platform? The single metric Intuit uses is time to reliable insights: the total of time spent to ingest, transform, catalog, analyze, and publish. Sandeep Uttamchandani shares three design patterns/frameworks Intuit implemented to deal with three challenges to determining time to reliable insights: time to discover, time to catalog, and time to debug for data quality.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2002
Secondary topics:  Data Platforms, Media, Marketing, Advertising
Kurt Brown (Netflix)
The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data technologies (e.g., Spark and Flink), enabling services (e.g., federated metadata management), and machine learning support. But with power comes complexity. I'll talk through how we are investing towards an easier, "self-service" data platform without sacrificing our enabling capabilities.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2004
Secondary topics:  Streaming, realtime analytics, and IoT
Lars Volker (Cloudera), Michael Ho (Cloudera)
In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Lars Volker and Michael Ho cover the efforts to address the scalability limitations in the now legacy Thrift RPC framework by using Apache Kudu's RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Data Integration and Data Pipelines, Data Platforms, Health and Medicine, Streaming, realtime analytics, and IoT
In a large global health service company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast from relevant sources so that they can act on it quickly and share the insights with consumers with the same speed and urgency. Streaming data architectures are a necessity. Kafka and Hadoop are key.
11:50am - 12:30pm Wednesday, March 27, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud, Automation in data science and big data, Model lifecycle management, Storage
Tobias Knaup (Mesosphere), Joerg Schad (Mesosphere, Inc.)
There are many great tutorials for training your deep learning models using TensorFlow, Keras, Spark, or one of the many other frameworks. But training is only a small part of the overall deep learning pipeline. This talk gives an overview of building a complete automated deep learning pipeline, from exploratory analysis through training, model storage, model serving, and monitoring.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Transportation and Logistics
James Taylor (Lyft)
James Taylor offers an overview of an automated feedback loop at Lyft to adapt ETL based on the aggregate cost of queries run across the cluster. He also discusses future work to enhance the system through the use of materialized views to reduce the number of ad hoc joins and sorting performed by the most expensive queries by transparently rewriting queries when possible.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2002
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines
Yaron Haviv (iguazio)
Faced with the need to handle increasing volumes of data, alternative data sets ("alt data") and AI, many enterprises are working to design or redesign their big data architectures. While traditional batch platforms fail to generate sufficient ROI, Yaron Haviv suggests a Continuous Analytics approach yielding faster answers for the business while remaining simpler and less expensive for IT.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2004
Secondary topics:  Data Platforms, Storage, Streaming, realtime analytics, and IoT, Transportation and Logistics
Zhenxiao Luo (Uber)
From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Inside Uber, analysts would like to run analytics on any data source, in real time. This talk shares Uber’s engineering effort to enable real-time analytics on any data source on the fly, without any data copy.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Streaming, realtime analytics, and IoT
Adem Efe Gencer (LinkedIn)
This talk will describe our work and experiences towards alleviating the management overhead of large-scale Kafka clusters using Cruise Control at LinkedIn.
2:40pm - 3:20pm Wednesday, March 27, 2019
Location: 2008
Secondary topics:  Graph technologies and analytics
Denise Gosnell (DataStax)
The graph community has spent years defining and describing our passion: applying graph thinking to solve difficult problems. This talk will leverage years of experience from shipping large-scale applications built on graph databases. We’ll discuss some practical and tangible decisions that come into play when designing and delivering distributed graph applications … or playing SimCity 2000.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Transportation and Logistics
Alex Kira (Uber)
Uber operates at scale, with thousands of microservices serving millions of rides a day, leading to more than a hundred petabytes of data. We will describe our journey towards a unified and scalable data workflow system at Uber used to manage this data. We will talk about the challenges we faced and how we have re-architected our system to make it highly available and horizontally scalable.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2002
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Retail and e-commerce
Jowanza Joseph (OneClickRetail), Karthik Ramasamy (Streamlio)
After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, we evaluated our solution and decided to explore a new platform that would (1) take advantage of Kubernetes and (2) support a simpler data processing DSL. We settled on Apache Pulsar because of its native support for Kubernetes and Pulsar Functions, a serverless functions model on top of Pulsar.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2004
Secondary topics:  Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT
Julien Le Dem (WeWork)
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Streaming, realtime analytics, and IoT
Sean Glover (Lightbend)
Introducing Strimzi, a Kafka project for Kubernetes. The best way to run stateful services with complex operational needs like Kafka is to use the operator pattern. This talk will review a popular new open source operator-based Apache Kafka implementation on Kubernetes called the Strimzi Kafka Operator.
4:20pm - 5:00pm Wednesday, March 27, 2019
Location: 2008
Secondary topics:  Model lifecycle management
Corey Zumar (Databricks)
Developing applications that leverage machine learning is difficult. Practitioners need to be able to reproduce their model development pipelines, as well as deploy models and monitor their health in production. Corey Zumar offers an overview of MLflow, which simplifies this process by managing, reproducing, and operationalizing machine learning through a suite of model tracking and deployment APIs.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2001
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines
Gwen Shapira (Confluent)
As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. Gwen Shapira discusses how data engineering requirements have changed in a cloud-native world and shares architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2002
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines
Rustem Feyzkhanov (Instrumental)
Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2004
Tim Armstrong (Cloudera)
As the popularity and utilization of Apache Impala deployments increase, clusters often become victims of their own success when demand for resources exceeds the supply. Tim Armstrong dives into the latest resource management features in Impala to maintain high cluster availability and optimal performance and provides examples of how to configure them in your Impala deployment.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2006
Secondary topics:  Streaming, realtime analytics, and IoT, Transportation and Logistics
GE produces a third of the world's power and 60% of airplane engines. These engines form a critical portion of the world's infrastructure and require meticulous monitoring of the hundreds of sensors streaming data from each turbine. Here, we share the case study of releasing into production the first real-time ML systems used to determine turbine health by GE's monitoring and diagnostics teams.
5:10pm - 5:50pm Wednesday, March 27, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud, Model lifecycle management, Storage
Skyler Thomas (MapR), Terry He (MapR Technologies)
Kubeflow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. This talk will explore the problems of state and storage and how distributed persistent storage can logically extend the compute flexibility provided by Kubeflow.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2001
Secondary topics:  Data preparation, data governance, and data lineage, Transportation and Logistics
Mark Grover (Lyft), Tao Feng (Lyft)
In this talk, we'll discuss how Lyft has reduced the time taken to discover data by 10x by building its own data portal, Amundsen. We will give a demo of Amundsen, dive deep into its architecture, and discuss how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. We will close with the future roadmap, unsolved problems, and collaboration model.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2002
Secondary topics:  Data Platforms, Data preparation, data governance, and data lineage, Financial Services
Subhadra Tatavarti (PayPal), Vadim Kutsyy (PayPal)
The PayPal data ecosystem is large, with 250+ PB of data transacting in 200+ countries. Given this massive scale and complexity, discovering and accessing the right datasets in a frictionless environment is a challenge. Subhadra Tatavarti and Vadim Kutsyy explain how PayPal’s data platform team is helping solve this problem with a combination of self-service integrated and interoperable products.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2004
Secondary topics:  Storage, Streaming, realtime analytics, and IoT
Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Facebook)
Presto, an open source distributed SQL engine, is designed for interactive queries and the ability to query multiple data sources. With the ever-growing list of connectors (e.g., Apache Kudu, Pulsar, Netflix Iceberg, Elasticsearch), the recently introduced cost-based optimizer in Presto must account for heterogeneous data sources with incomplete statistics and new use cases such as geospatial analytics.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2006
Secondary topics:  Data Platforms, Media, Marketing, Advertising, Streaming, realtime analytics, and IoT
Sijie Guo (Streamlio), Penghui Li (ZhaoPin)
Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. This talk will focus on the event bus requirements for Zhaopin.com, one of the biggest Chinese online recruitment services providers, and why they chose Apache Pulsar.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2008
Secondary topics:  Data Platforms, Retail and e-commerce, Storage
Zhen Fan (JD.com), Yue Li (MemVerge)
JD.com has designed a brand new architecture to optimize its Spark computing clusters. We will show the problems we faced before and how we now benefit from the in-memory distributed filesystem.
11:00am - 11:40am Thursday, March 28, 2019
Location: 2024
Secondary topics:  AI and Data technologies in the cloud, Security and Privacy
Thomas Phelan (BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for Big Data is HDFS configured with Transparent Data Encryption (TDE). TDE is difficult to configure and manage - even more so when run in Docker containers. This session will discuss these challenges and how to overcome them.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2001
Secondary topics:  AI and Data technologies in the cloud, Data Platforms, Storage
Jason Wang (Cloudera), Tony Wu (Cloudera), Suraj Acharya (Cloudera)
We start with a general overview of cloud paradigms and cloud architectures for big data platforms (focusing on AWS and Azure); then we give an actionable understanding of cloud architecture with a dive into core cloud concepts: compute and virtual machine architectures, cloud storage, authentication and authorization, encryption, security and security best practices, and user management.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2002
Secondary topics:  Data Platforms, Retail and e-commerce
Learn about how a small team in Tokyo went through several evolutions as they built an analytics service to help 200+ businesses accelerate their decision-making process. This presentation will cover the background, challenges, architecture, success stories, and best practices as they built and productionalized Rakuten Analytics.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  Data Integration and Data Pipelines, Streaming, realtime analytics, and IoT
Fabian Hueske (Ververica)
Processing streaming data with SQL is gaining a lot of attention. In this talk, Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. Moreover, Fabian will present a selection of common use cases and demonstrate how easily they can be addressed by Flink SQL.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  Data Platforms, Media, Marketing, Advertising, Streaming, realtime analytics, and IoT
Vivek Pasari (Netflix)
Netflix has over 125 million members spread across 191 countries. Each day its members interact with its client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. Vivek Pasari dives into the data engineering and architecture that enables application performance measurement at this scale.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  Deep Learning, Media, Marketing, Advertising
Alex Poms (Stanford University), Will Crichton (Stanford University)
Systems like Spark made it possible to process big numerical/textual data on hundreds of machines. Today, the majority of data in the world is video. Scanner is the first open-source distributed system for building large-scale video processing applications. Scanner is being used at Stanford for analyzing TBs of film with deep learning on GCP, and at Facebook for synthesizing VR video on AWS.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2024
Secondary topics:  Automation in data science and big data, Deep Learning
Alkis Simitsis (Micro Focus), Shivnath Babu (Unravel Data Systems | Duke University)
Alkis Simitsis and Shivnath Babu share an automated, deep learning-based technique for root cause analysis (RCA) of big data stack applications, using Spark and Impala. The concepts they discuss apply generally to the big data stack.
11:50am - 12:30pm Thursday, March 28, 2019
Location: 2007
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Data Platforms
Avner Braverman (Binaris)
What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2001
Krishna Gade (Fiddler Labs)
Join Krishna Gade to learn how to address engineering and organizational challenges for AI fairness, operationalize these concepts in a production AI system, and, crucially, create a culture of trust in AI.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2002
Jacques Nadeau (Dremio)
Apache Arrow Flight is a new initiative focused on providing high-performance communication within data engineering and data science infrastructure. Jacques Nadeau explains how Flight works and where it has been integrated. He also discusses how Flight can be used to abstract physical data management from logical access and shares benchmarks of workloads that have been improved by Flight.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  Streaming, realtime analytics, and IoT
Haifeng Chen (Intel)
Spark SQL is widely used today. However, it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. To address these challenges, we introduced the Spark adaptive execution engine, which can handle task parallelism, join conversion, and data skew dynamically at runtime, guaranteeing the best plan is chosen using runtime statistics.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  Streaming, realtime analytics, and IoT
Michael Freedman (Timescale)
In this talk, I focus on two newly-released features of TimescaleDB (automated adaptation of time-partitioning intervals and continuous aggregations in near-real-time), and discuss how these capabilities ease time-series data management. I discuss how these capabilities have been leveraged across several different use cases, including in use with other technologies such as Kafka. Read more.
1:50pm - 2:30pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  Automation in data science and big data, Storage, Streaming, realtime analytics, and IoT
Arun Kumar (University of California, San Diego)
This talk presents a couple of recent research techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, we show how to avoid joins before ML to reduce runtimes and memory/storage footprints. Open source software prototypes and sample ML code in both R and Python will also be shown. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2001
Xiao Li (Databricks), Wenchen Fan (Databricks)
This talk will provide an overview of the major features and enhancements in the Apache Spark 2.4 release and upcoming releases, followed by a Q&A session. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2002
Secondary topics:  Data Platforms, Data preparation, data governance, and data lineage
Rohan Dhupelia (Atlassian), Jimmy Li (Atlassian)
Analytics is easy; good analytics is hard. Here at Atlassian we know this all too well from our push to become a truly data-driven organisation. To achieve this, we've transformed the way we thought about behavioural analytics, from how we defined our events all the way to how we ingested and analysed them. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  AI and Data technologies in the cloud
Alan Choi (Cloudera), Eva Andreasson (Cloudera), Mark Brine (Cloudera)
Alan Choi, Eva Andreasson, and Mark Brine explain how Cloudera’s Finance Department used a hybrid model to speed up report delivery and reduce the cost of end-of-quarter reporting. They also share guidelines for deploying modern data warehousing in a hybrid cloud environment, outlining when you should choose a private cloud service over a public one, the available options, and some dos and don'ts. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  Storage, Streaming, realtime analytics, and IoT
Akshai Sarma (Yahoo), Michael Natkovich (Yahoo)
Bullet is a scalable, pluggable, light, multi-tenant query system on any data flowing through a streaming system without storing it. Bullet queries are submitted first and operate on data flowing through the system from the point of submission. Bullet efficiently supports intractable operations like Top K, Counting Distincts and Windowing without any storage using Sketch-based algorithms. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  AI and Data technologies in the cloud, Storage
Paul Curtis (MapR Technologies)
Just like almost everybody, we needed a way for ordinary users to stand up applications on top of Kubernetes, but we had additional requirements. And we had to do it without breaking the bank. Our field sales engineering force of sixty engineers around the globe now can spin up and down our technology quickly and simply using Kubernetes, the cloud, and shared data storage. Read more.
2:40pm - 3:20pm Thursday, March 28, 2019
Location: 2024
Secondary topics:  Data Platforms, Media, Marketing, Advertising, Security and Privacy
John Bennett (Netflix), Siamac Mirzaie (Netflix)
Data has become a foundational pillar for security teams operating in organizations of all shapes and sizes. This new norm has created a need for platforms that enable engineers to harness data for various security purposes. John Bennett and Siamac Mirzaie offer an overview of Netflix's internal platform for quickly deploying data-based detection capabilities in the corporate environment. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2001
Secondary topics:  Data Integration and Data Pipelines, Data Platforms
Li Gao (Lyft Inc.), Bill Graham (Lyft Inc.)
Li Gao and Bill Graham discuss the challenges the Lyft team faced and the solutions they developed to support Apache Spark on Kubernetes in production and at scale. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  Data Platforms
Adrian Lungu (Adobe), Serban Teodorescu (Adobe)
Inspired by the Green / Blue deployment technique, the Adobe Audience Manager team developed an Active / Passive database migration procedure that allows us to test our database clusters in production, minimising the risks without compromising the innovation. We successfully applied this approach twice to upgrade the entire technology stack. But it never was a smooth move. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  AI and machine learning in the enterprise, Data Platforms, Security and Privacy
Václav Surovec (Deutsche Telekom IT), Gabor Kotalik (Deutsche Telekom AG)
Knowledge of customers' location and travel patterns is important for many companies. One of them is T-Mobile Czech Republic, a German telco service operator. The Commercial Roaming project, built on Cloudera Hadoop, helped the company better analyze the behavior of its customers from 10 countries in a very secure way, so that it could provide better predictions and visualizations for management. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  Storage, Streaming, realtime analytics, and IoT
Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)
Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMOF and explain how it improves Spark analytics performance. Read more.
3:50pm - 4:30pm Thursday, March 28, 2019
Location: 2024
Secondary topics:  Data Integration and Data Pipelines, Security and Privacy, Streaming, realtime analytics, and IoT
Julien Delange (Twitter), Neng Lu (Twitter)
This presentation shows how Twitter uses the Heron data processing engine to monitor and analyze its network infrastructure. Within two months, infrastructure engineers implemented a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. The talk focuses on the key technologies used, the architecture, and the challenges. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2001
Secondary topics:  Automation in data science and big data
Holden Karau (Google), Rachel Warren (Salesforce Einstein)
Apache Spark is an amazing distributed system, but part of the bargain we've all made with the infrastructure demons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. This talk will look at auto-tuning jobs using historical & static job information using systems like Mahout, and internal Spark ML jobs as workloads including new settings in 2.4. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2002
Secondary topics:  Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Media, Marketing, Advertising
Sonali Sharma (Netflix), Shriya Arora (Netflix)
With so much data being generated in real time, what if we could combine all these high-volume data streams in real time and provide near-real-time feedback for model training, improving personalization and recommendations and thereby taking the customer experience on the product to a whole new level? Well, it is possible to tame large state joins for exactly that purpose using Flink's keyed state. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2004
Secondary topics:  AI and Data technologies in the cloud, Storage, Streaming, realtime analytics, and IoT
Igor Canadi (Rockset), Dhruba Borthakur (Rockset)
Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2006
Secondary topics:  AI and Data technologies in the cloud
Jinchul Kim (SK Telecom)
Druid supports an auto-scaling feature for data ingestion, but it is only available on AWS EC2, so we cannot rely on it in our private cloud. In this talk, we introduce auto scale-out/in on Kubernetes. We will show the benefits of our approach and where they come from, and share our development of a Druid Helm chart, rolling updates, and custom metric usage for horizontal auto scaling. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2008
Secondary topics:  Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT
Patrick Stuedi (IBM Research)
Modern networking and storage technologies like RDMA and NVMe are finding their way into the data center. Apache Crail (incubating) is a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. In this talk I will present Apache Crail, what it does, and how workloads based on TensorFlow or Spark can benefit from it. Read more.
4:40pm - 5:20pm Thursday, March 28, 2019
Location: 2024
Secondary topics:  Automation in data science and big data, Data preparation, data governance, and data lineage
Yves Thibaudeau (U.S. Census Bureau)
The U.S. Census Bureau has been involved in record-linkage projects for over 40 years. There has been a lot of change in computing capabilities and new techniques to support record-linkage. The Census Bureau is reviewing an inventory of linkage methodologies. We describe the progress made so far in identifying specific record-linkage techniques for specific applications. Read more.