Data Engineering & Architecture: Big data conference & machine learning training

Wednesday Mar 27: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am \| Location: Ballroom Strata Data Conference Keynotes
10:30am Morning break

Thursday Mar 28: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am \| Location: Ballroom Strata Data Conference Keynotes
10:30am Morning break

9:00am - 5:00pm Monday, March 25 & Tuesday, March 26

Building a serverless big data application on AWS

Location: 2018

Secondary topics: AI and Data technologies in the cloud, Storage

Jorge Lopez (Amazon Web Services), Roy Hasson (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Gautam Srinivasan (Amazon Web Services), Anthony Nguyen (Amazon Web Services)

Average rating:

(4.50, 4 ratings)

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures, looking at design patterns to ingest, store, and analyze your data. You'll then build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.

9:00am - 5:00pm Monday, March 25 & Tuesday, March 26

Professional Kafka development

Location: 3016

Secondary topics: Streaming, realtime analytics, and IoT

Jesse Anderson (Big Data Institute)

Average rating:

(3.00, 1 rating)

Jesse Anderson leads a deep dive into Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it. You'll also discover how to create consumers and publishers in Kafka and how to use Kafka Streams, Kafka Connect, and KSQL as you explore the Kafka ecosystem. Read more.

9:00am–12:30pm Tuesday, March 26, 2019

Introduction to Flink via Flink SQL

Location: 2004

Secondary topics: Streaming, realtime analytics, and IoT

Fabian Hueske (Ververica)

Average rating:

(5.00, 1 rating)

Fabian Hueske offers an overview of Apache Flink via the SQL interface, covering stream processing and Flink's various modes of use. Then you'll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API. Read more.

9:00am–12:30pm Tuesday, March 26, 2019

Architecting a data platform for enterprise use

Location: 2005

Mark Madsen (Teradata), Todd Walter (Archimedata)

Average rating:

(4.21, 28 ratings)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.

9:00am–12:30pm Tuesday, March 26, 2019

Foundations for successful data projects

Location: 2006

Secondary topics: AI and machine learning in the enterprise

Jonathan Seidman (Cloudera), Ted Malaska (Capital One)

Average rating:

(4.00, 6 ratings)

The enterprise data management space has changed dramatically in recent years, and this had led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects. Read more.

9:00am–12:30pm Tuesday, March 26, 2019

Hands-on machine learning with Kafka-based streaming pipelines

Location: 2007

Secondary topics: Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Model lifecycle management

Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)

Average rating:

(3.85, 13 ratings)

Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipeline and doing periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques. Read more.

9:00am–12:30pm Tuesday, March 26, 2019

Hands-on with Cloudera SDX: Setting up your own shared data experience

Location: 2008

Secondary topics: Data preparation, data governance, and data lineage, Storage

Santosh Kumar (Cloudera), Andre Araujo (Cloudera), Wim Stoop (Cloudera)

Average rating:

(5.00, 1 rating)

Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX. Read more.

1:30pm–5:00pm Tuesday, March 26, 2019

Learning Presto: SQL on anything

Location: 2004

Secondary topics: Streaming, realtime analytics, and IoT

Matt Fuller (Starburst)

Average rating:

(3.57, 7 ratings)

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today. Read more.

1:30pm–5:00pm Tuesday, March 26, 2019

Architecture and algorithms for end-to-end streaming data processing

Location: 2005

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(2.67, 12 ratings)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams. Read more.

1:30pm–5:00pm Tuesday, March 26, 2019

Streamlining a machine learning project team

Location: 2006

Secondary topics: AI and machine learning in the enterprise

Sourav Dey (Manifold), Alex Ng (Manifold)

Average rating:

(4.25, 4 ratings)

Many teams are still run as if data science is mainly about experimentation, but those days are over. Now it must offer turnkey solutions to take models into production. Sourav Day and Alex Ng explain how to streamline an ML project and help your engineers work as an integrated part of your production teams, using a Lean AI process and the Orbyter package for Docker-first data science. Read more.

1:30pm–5:00pm Tuesday, March 26, 2019

Cross-cloud model training and serving with Kubeflow

Location: 2007

Secondary topics: AI and Data technologies in the cloud, Model lifecycle management

Holden Karau (Independent), Francesca Lazzeri (Microsoft), Trevor Grant (IBM)

Average rating:

(3.00, 2 ratings)

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. Read more.

1:30pm–5:00pm Tuesday, March 26, 2019

Running multidisciplinary big data workloads in the cloud

Location: 2008

Secondary topics: AI and Data technologies in the cloud

Jason Wang (Cloudera), Brandon Freeman (Cloudera), Michael Kohs (Cloudera), Akihiro Ishikawa (Cloudera), Toby Ferguson (Cloudera)

Average rating:

(3.20, 5 ratings)

There are many challenges with moving multidisciplinary big data workloads to the cloud and running them. Jason Wang, Brandon Freeman, Michael Kohs, Akihiro Nishikawa, and Toby Ferguson explore cloud architecture and its challenges and walk you through using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Scaling data lineage at Netflix to improve data infrastructure reliability and efficiency

Location: 2001

Secondary topics: Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Media, Marketing, Advertising

Jitender Aswani (Netflix), Di Lin (Netflix), Girish Lingappa (Netflix)

Average rating:

(3.40, 15 ratings)

Hundreds of thousands of ETL pipelines ingest over a trillion events daily to populate millions of data tables downstream at Netflix. Jitender Aswani, Girish Lingappa, and Di Lin discuss Netflix’s internal data lineage service, which was essential for enhancing platform’s reliability, increasing trust in data, and improving data infrastructure efficiency. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Building the AI engine for retail in the new era

Location: 2002

Secondary topics: AI and machine learning in the enterprise, Automation in data science and big data, Data Platforms, Retail and e-commerce, Storage, Temporal data and time-series analytics

JIAN CHANG (Alibaba Group), Sanjian Chen (Alibaba Group)

Average rating:

(4.50, 4 ratings)

Jian Chang and Sanjian Chen outline the design of the AI engine on Alibaba's TSDB service, which enables fast and complex analytics of large-scale retail data. They then share a successful case study of the Fresh Hema Supermarket, a major “new retail” platform operated by Alibaba Group, highlighting solutions to the major technical challenges in data cleaning, storage, and processing. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Cost-effective Presto on AWS with Spot nodes

Location: 2004

Secondary topics: AI and Data technologies in the cloud

Shubham Tagra (Qubole)

Average rating:

(3.50, 8 ratings)

Did you know you can run Presto in AWS at a tenth of the cost with AWS Spot nodes, with just a few architectural enhancements to Presto. Shubham Tagra explores the gaps in Presto architecture, explains how to use Spot nodes, covers enhancements, and showcases the improvements in terms of reliability and TCO achieved through them. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time

Location: 2006

Secondary topics: Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)

Average rating:

(4.67, 3 ratings)

Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions due to running over AWS Spot Instances and having 70% CPU utilization. Read more.

11:00am–11:40am Wednesday, March 27, 2019

Automating DevOps for machine learning

Location: 2008

Secondary topics: AI and Data technologies in the cloud, Automation in data science and big data, Model lifecycle management

Diego Oppenheimer (Algorithmia)

Average rating:

(4.00, 11 ratings)

You've invested heavily in cleaning your data, feature engineering, training, and tuning your model—but now you have to deploy your model into production, and you discover it's a huge challenge. Diego Oppenheimer shares common architectural patterns and best practices of the most advanced organizations who are deploying your model for scalability and accessibility. Read more.

11:50am–12:30pm Wednesday, March 27, 2019

How Intuit reduced time to reliable insights for data pipelines

Location: 2001

Secondary topics: Data Integration and Data Pipelines, Financial Services

Sandeep U (Intuit)

Average rating:

(4.57, 7 ratings)

How efficient is your data platform? The single metric Intuit uses is time to reliable insights: the total of time spent to ingest, transform, catalog, analyze, and publish. Sandeep Uttamchandani shares three design patterns/frameworks Intuit implemented to deal with three challenges to determining time to reliable insights: time to discover, time to catalog, and time to debug for data quality. Read more.

11:50am–12:30pm Wednesday, March 27, 2019

The journey toward a self-service data platform at Netflix

Location: 2002

Secondary topics: Data Platforms, Media, Marketing, Advertising

Kurt Brown (Netflix)

Average rating:

(4.22, 9 ratings)

The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data tech (Spark and Flink), enabling services (federated metadata management), and machine learning support. But with power comes complexity. Kurt Brown explains how Netflix is working toward an easier, "self-service" data platform without sacrificing any enabling capabilities. Read more.

11:50am–12:30pm Wednesday, March 27, 2019

Accelerating analytical antelopes: Integrating Apache Kudu's RPC into Apache Impala

Location: 2004

Secondary topics: Streaming, realtime analytics, and IoT

Lars Volker (Cloudera), Michael Ho (Cloudera)

Average rating:

(4.50, 6 ratings)

In recent years, Apache Impala has been deployed to clusters that are large enough to hit architectural limitations in the stack. Lars Volker and Michael Ho cover the efforts to address the scalability limitations in the now legacy Thrift RPC framework by using Apache Kudu's RPC, which was built from the ground up to support asynchronous communication, multiplexed connections, TLS, and Kerberos. Read more.

11:50am–12:30pm Wednesday, March 27, 2019

Enabling insights and analytics with data streaming architectures and pipelines using Kafka and Hadoop

Location: 2006

Secondary topics: Data Integration and Data Pipelines, Data Platforms, Health and Medicine, Streaming, realtime analytics, and IoT

Mohammad Quraishi (Cigna)

Average rating:

(4.60, 5 ratings)

In a large global health services company, streaming data for processing and sharing comes with its own challenges. Data science and analytics platforms need data fast, from relevant sources, to act on this data quickly and share the insights with consumers with the same speed and urgency. Join Mohammad Quraishi to learn why streaming data architectures are a necessity—Kafka and Hadoop are key. Read more.

11:50am–12:30pm Wednesday, March 27, 2019

Deep learning beyond the learning

Location: 2008

Secondary topics: AI and Data technologies in the cloud, Automation in data science and big data, Model lifecycle management, Storage

Tobias Knaup (Mesosphere), Joerg Schad (ArangoDB)

Average rating:

(4.50, 2 ratings)

There are many great tutorials for training your deep learning models, but training is only a small part in the overall deep learning pipeline. Tobias Knaup and Joerg Schad offer an introduction to building a complete automated deep learning pipeline, starting with exploratory analysis, overtraining, model storage, model serving, and monitoring. Read more.

2:40pm–3:20pm Wednesday, March 27, 2019

Adaptive ETL to optimize query performance at Lyft

Location: 2001

Secondary topics: Data Integration and Data Pipelines, Transportation and Logistics

James Taylor (Lyft)

Average rating:

(3.56, 9 ratings)

James Taylor offers an overview of an automated feedback loop at Lyft to adapt ETL based on the aggregate cost of queries run across the cluster. He also discusses future work to enhance the system through the use of materialized views to reduce the number of ad hoc joins and sorting performed by the most expensive queries by transparently rewriting queries when possible. Read more.

2:40pm–3:20pm Wednesday, March 27, 2019

Goodbye, data lake: Why continuous analytics yield higher ROI

Location: 2002

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Yaron Haviv (iguazio)

Average rating:

(4.00, 2 ratings)

Faced with the need to handle increasing volumes of data, alternative datasets ("alt data"), and AI, many enterprises are working to design or redesign their big data architectures, but traditional batch platforms fail to generate sufficient ROI. Yaron Haviv shares a continuous analytics approach that yields faster answers for the business while remaining simpler and less expensive for IT. Read more.

2:40pm–3:20pm Wednesday, March 27, 2019

Real-time analytics at Uber: Bring SQL into everything

Location: 2004

Secondary topics: Data Platforms, Storage, Streaming, realtime analytics, and IoT, Transportation and Logistics

Zhenxiao Luo (Twitter)

Average rating:

(4.09, 11 ratings)

From determining the most convenient rider pickup points to predicting the fastest routes, Uber uses data-driven analytics to create seamless trip experiences. Zhenxiao Luo explains how Uber supports real-time analytics with deep learning on the fly, without any data copying. Read more.

2:40pm–3:20pm Wednesday, March 27, 2019

Cruise Control: Effortless management of Kafka clusters

Location: 2006

Secondary topics: Streaming, realtime analytics, and IoT

Adem Efe Gencer (LinkedIn)

Average rating:

(3.50, 2 ratings)

Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control. Read more.

2:40pm–3:20pm Wednesday, March 27, 2019

Taking graph applications to production

Location: 2008

Secondary topics: Graph technologies and analytics

Denise Gosnell (DataStax)

Average rating:

(4.73, 11 ratings)

The graph community has spent years defining and describing its passion: applying graph thinking to solve difficult problems. Denise Gosnell leverages years of experience shipping large-scale applications built on graph databases to share practical and tangible decisions that come into play when designing and delivering distributed graph applications. . .or playing SimCity 2000. Read more.

4:20pm–5:00pm Wednesday, March 27, 2019

Managing Uber's data workflows at scale

Location: 2001

Secondary topics: Data Integration and Data Pipelines, Transportation and Logistics

Alex Kira (Uber)

Average rating:

(4.00, 13 ratings)

Uber operates at scale, with thousands of microservices serving millions of rides a day, leading to 100+ PB of data. Alex Kira details Uber's journey toward a unified and scalable data workflow system used to manage this data and shares the challenges faced and how the company has rearchitected the system to make it highly available and horizontally scalable. Read more.

4:20pm–5:00pm Wednesday, March 27, 2019

Reducing stream processing complexity using Apache Pulsar Functions

Location: 2002

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Retail and e-commerce

Jowanza Joseph (Pluralsight), Karthik Ramasamy (Streamlio)

Average rating:

(4.00, 1 rating)

After two years of running streaming pipelines through Kinesis and Spark at One Click Retail, Jowanza Joseph and Karthik Ramasamy decided to explore a new platform that would take advantage of Kubernetes and support a simpler data processing DSL. Join in to discover why they chose Apache Pulsar and learn tips and tricks for using Pulsar Functions. Read more.

4:20pm–5:00pm Wednesday, March 27, 2019

From flat files to deconstructed databases: The evolution and future of the big data ecosystem

Location: 2004

Secondary topics: Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Julien Le Dem (WeWork)

Average rating:

(4.83, 6 ratings)

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. Read more.

4:20pm–5:00pm Wednesday, March 27, 2019

Put Kafka in jail with Strimzi

Location: 2006

Secondary topics: Streaming, realtime analytics, and IoT

Sean Glover (Lightbend)

Average rating:

(4.00, 1 rating)

The best way to run stateful services with complex operational needs like Kafka is to use the operator pattern. Sean Glover offers an overview of the Strimzi Kafka Operator, a popular new open source Operator-based Apache Kafka implementation on Kubernetes. Read more.

4:20pm–5:00pm Wednesday, March 27, 2019

MLflow: An open platform to simplify the machine learning lifecycle

Location: 2008

Secondary topics: Model lifecycle management

Corey Zumar (Databricks)

Average rating:

(4.89, 9 ratings)

Developing applications that leverage machine learning is difficult. Practitioners need to be able to reproduce their model development pipelines, as well as deploy models and monitor their health in production. Corey Zumar offers an overview of MLflow, which simplies this process by managing, reproducing, and operationalizing machine learning through a suite of model tracking and deployment APIs. Read more.

5:10pm–5:50pm Wednesday, March 27, 2019

Cloud native data pipelines with Apache Kafka

Location: 2001

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Gwen Shapira (Confluent)

Average rating:

(4.64, 11 ratings)

As microservices, data services, and serverless APIs proliferate, data engineers need to collect and standardize data in an increasingly complex and diverse system. Gwen Shapira discusses how data engineering requirements have changed in a cloud native world and shares architectural patterns that are commonly used to build flexible, scalable, and reliable data pipelines. Read more.

5:10pm–5:50pm Wednesday, March 27, 2019

Serverless workflows for orchestration hybrid cluster-based and serverless processing

Location: 2002

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Rustem Feyzkhanov (Instrumental)

Average rating:

(3.50, 8 ratings)

Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture. Read more.

5:10pm–5:50pm Wednesday, March 27, 2019

When SQL users run wild: Resource management features and techniques to tame Apache Impala

Location: 2004

Tim Armstrong (Cloudera)

Average rating:

(4.80, 5 ratings)

As the popularity and utilization of Apache Impala deployments increases, clusters often become victims of their own success when demand for resources exceeds the supply. Tim Armstrong dives into the latest resource management features in Impala to maintain high cluster availability and optimal performance and provides examples of how to configure them in your Impala deployment. Read more.

5:10pm–5:50pm Wednesday, March 27, 2019

Critical turbine maintenance: Monitoring and diagnosing planes and power plants in real time

Location: 2006

Secondary topics: Streaming, realtime analytics, and IoT, Transportation and Logistics

June Andrews (GE), John Rutherford (GE)

Average rating:

(4.50, 2 ratings)

GE produces a third of the world's power and 60% of its airplane engines—a critical portion of the world's infrastructure that requires meticulous monitoring of the hundreds of sensors streaming data from each turbine. June Andrews and John Rutherford explain how GE's monitoring and diagnostics teams released the first real-time ML systems used to determine turbine health into production. Read more.

5:10pm–5:50pm Wednesday, March 27, 2019

Persistent storage for machine learning in KubeFlow

Location: 2008

Secondary topics: AI and Data technologies in the cloud, Model lifecycle management, Storage

Skyler Thomas (MapR), Terry He (MapR Technologies)

Average rating:

(4.75, 4 ratings)

KubeFlow separates compute and storage to provide the ability to deploy best-of-breed open source systems for machine learning to any cluster running Kubernetes, whether on-premises or in the cloud. Skyler Thomas and Terry He explore the problems of state and storage and explain how distributed persistent storage can logically extend the compute flexibility provided by KubeFlow. Read more.

11:00am–11:40am Thursday, March 28, 2019

Disrupting data discovery

Location: 2001

Secondary topics: Data preparation, data governance, and data lineage, Transportation and Logistics

Mark Grover (Lyft), Tao Feng (Lyft)

Average rating:

(4.40, 10 ratings)

Lyft has reduced the time it takes to discover data by 10x by building its own data portal, Amundsen. Mark Grover and Tao Feng offer a demo of Amundsen and lead a deep dive into its architecture, covering how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. They also explore the future roadmap, unsolved problems, and its collaboration model. Read more.

11:00am–11:40am Thursday, March 28, 2019

ML and AI at scale at PayPal

Location: 2002

Secondary topics: Data Platforms, Data preparation, data governance, and data lineage, Financial Services

Subhadra Tatavarti (PayPal), Chen Kovacs (Paypal)

Average rating:

(4.12, 8 ratings)

The PayPal data ecosystem is large, with 250+ PB of data transacting in 200+ countries. Given this massive scale and complexity, discovering and access to the right datasets in a frictionless environment is a challenge. Subhadra Tatavarti and Chen Kovacs explain how PayPal’s data platform team is helping solve this problem with a combination of self-service integrated and interoperable products. Read more.

11:00am–11:40am Thursday, March 28, 2019

Presto: Tuning performance of SQL-on-anything analytics

Location: 2004

Secondary topics: Storage, Streaming, realtime analytics, and IoT

Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Presto Software Foundation)

Average rating:

(3.33, 3 ratings)

Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward. Read more.

11:00am–11:40am Thursday, March 28, 2019

How Zhaopin.com built its enterprise event bus using Apache Pulsar

Location: 2006

Secondary topics: Data Platforms, Media, Marketing, Advertising, Streaming, realtime analytics, and IoT

Sijie Guo (StreamNative), Penghui Li (Zhaopin)

Average rating:

(4.00, 1 rating)

Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. Sijie Guo and Penghui Li discuss the event bus requirements for Zhaopin.com, one of China's biggest online recruitment services providers, and explain why the company chose Apache Pulsar. Read more.

11:00am–11:40am Thursday, March 28, 2019

Optimizing computing cluster resource utilization with an in-memory distributed filesystem

Location: 2008

Secondary topics: Data Platforms, Retail and e-commerce, Storage

Yue Li (MemVerge), Shouwei Chen (Rutgers University)

Average rating:

(5.00, 4 ratings)

JD.com recently designed a brand-new architecture to optimize Spark computing clusters. Yue Li and Shouwei Chen detail the problems the team faced when building it and explain how the company benefits from the in-memory distributed filesystem now. Read more.

11:00am–11:40am Thursday, March 28, 2019

How to protect big data in a containerized environment

Location: 2024

Secondary topics: AI and Data technologies in the cloud, Security and Privacy

Thomas Phelan (HPE BlueData)

Average rating:

(4.50, 2 ratings)

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). But TDE is difficult to configure and manage—particularly when run in Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them. Read more.

11:00am–11:40am Thursday, March 28, 2019

Cloud programming simplified: A Berkeley view on serverless computing

Location: 2007

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Eric Jonas (UC Berkeley)

Average rating:

(4.50, 2 ratings)

Eric Jonas offers a quick history of cloud computing, including an accounting of the predictions of the 2009 "Berkeley View of Cloud Computing" paper, explains the motivation for serverless computing, describes applications that stretch the current limits of serverless, and then lists obstacles and research opportunities required for serverless computing to fulfill its full potential. Read more.

11:50am–12:30pm Thursday, March 28, 2019

Journey to the cloud: Architecting for the cloud through customer stories

Location: 2001

Secondary topics: AI and Data technologies in the cloud, Data Platforms, Storage

Jason Wang (Cloudera), Sushant Rao (Cloudera)

Average rating:

(4.00, 2 ratings)

Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms. Read more.

11:50am–12:30pm Thursday, March 28, 2019

Building Rakuten analytics: A story of evolutions

Location: 2002

Secondary topics: Data Platforms, Retail and e-commerce

Juan Paulo Gutierrez (Rakuten)

Average rating:

(4.75, 4 ratings)

Juan Paulo Gutierrez explains how a small team in Tokyo went through several evolutions as they built an analytics service to help 200+ businesses accelerate their decision-making process. Join in to hear about the background, challenges, architecture, success stories, and best practices as they built and productionalized Rakuten Analytics. Read more.

11:50am–12:30pm Thursday, March 28, 2019

Flink SQL in action

Location: 2004

Secondary topics: Data Integration and Data Pipelines, Streaming, realtime analytics, and IoT

Fabian Hueske (Ververica)

Average rating:

(4.30, 10 ratings)

Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL. Read more.

11:50am–12:30pm Thursday, March 28, 2019

How Netflix measures app performance on 250 million unique devices across 190 countries

Location: 2006

Secondary topics: Data Platforms, Media, Marketing, Advertising, Streaming, realtime analytics, and IoT

Vivek Pasari (Netflix), Jitender Aswani (Netflix)

Average rating:

(3.14, 7 ratings)

Netflix has over 125 million members spread across 191 countries. Each day its members interact with its client applications on 250 million+ devices under highly variable network conditions. These interactions result in over 200 billion daily data points. Vivek Pasari dives into the data engineering and architecture that enables application performance measurement at this scale. Read more.

11:50am–12:30pm Thursday, March 28, 2019

Scanner: Efficient video analysis at scale

Location: 2008

Secondary topics: Deep Learning, Media, Marketing, Advertising

Fait Poms (Stanford University), Will Crichton (Stanford University)

Average rating:

(4.75, 4 ratings)

Video is now the largest source of data on the internet, so we need tools to make it easier to process and analyze. Alex Poms and Will Crichton offer an overview of Scanner, the first open source distributed system for building large-scale video processing applications, and explore real-world use cases. Read more.

11:50am–12:30pm Thursday, March 28, 2019

Automation of root cause analysis for big data stack applications

Location: 2024

Secondary topics: Automation in data science and big data, Deep Learning

Alkis Simitsis (Micro Focus), Shivnath Babu (Unravel Data Systems | Duke University)

Average rating:

(2.67, 3 ratings)

Alkis Simitsis and Shivnath Babu share an automated technique for root cause analysis (RCA) for big data stack applications using deep learning techniques, using Spark and Impala. The concepts they discuss apply generally to the big data stack. Read more.

11:50am–12:30pm Thursday, March 28, 2019

Serverless for data and AI

Location: 2007

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Data Platforms

Avner Braverman (Binaris)

Average rating:

(4.00, 3 ratings)

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code. Read more.

1:50pm–2:30pm Thursday, March 28, 2019

Challenges in addressing bias, fairness, and transparency in AI

Location: 2001

Krishna Gade (Fiddler Labs)

Average rating:

(4.67, 3 ratings)

Join Krishna Gade to learn how to address engineering and organizational challenges for AI fairness and operationalize these concepts in a production AI system—and crucially, create a culture of trust in AI. Read more.

1:50pm–2:30pm Thursday, March 28, 2019

Loosely coupled data with Apache Arrow Flight

Location: 2002

Jacques Nadeau (Dremio)

Average rating:

(4.60, 5 ratings)

Apache Arrow Flight is a new initiative focused on providing high-performance communication within data engineering and data science infrastructure. Jacques Nadeau explains how Flight works and where it has been integrated. He also discusses how Flight can be used to abstract physical data management from logical access and sharse benchmarks of workloads that have been improved by Flight. Read more.

1:50pm–2:30pm Thursday, March 28, 2019

Spark adaptive execution: Unleash the power of Spark SQL

Location: 2004

Secondary topics: Streaming, realtime analytics, and IoT

Haifeng Chen (Intel)

Average rating:

(4.00, 3 ratings)

Spark SQL is widely used, but it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. Haifeng Chen shares a Spark adaptive execution engine built to address these challenges. It can handle task parallelism, join conversion, and data skew dynamically during runtime, guaranteeing the best plan is chosen using runtime statistics. Read more.

1:50pm–2:30pm Thursday, March 28, 2019

Performant time series data management and analytics with Postgres

Location: 2006

Secondary topics: Streaming, realtime analytics, and IoT

Matvey Arye (TimescaleDB)

Average rating:

(3.75, 4 ratings)

Matvey Arye offers an overview of two newly released features of TimescaleDB—automated adaptation of time-partitioning intervals and continuous aggregations in near real time—and discusses how these capabilities ease time series data management. Along the way, he also shares real-world use cases, including TimescaleDB's use with other technologies such as Kafka. Read more.

1:50pm–2:30pm Thursday, March 28, 2019

Faster ML over joins of tables

Location: 2008

Secondary topics: Automation in data science and big data, Storage, Streaming, realtime analytics, and IoT

Arun Kumar (University of California, San Diego)

Average rating:

(4.00, 2 ratings)

Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python. Read more.

2:40pm–3:20pm Thursday, March 28, 2019

Apache Spark 2.4 and beyond

Location: 2001

Xiao Li (Databricks), Wenchen Fan (Databricks)

Average rating:

(3.25, 4 ratings)

Xiao Li and Wenchen Fan offer an overview of the major features and enhancements in Apache Spark 2.4 and give insight into upcoming releases. Then you'll get the chance to ask all your burning Spark questions. Read more.

2:40pm–3:20pm Thursday, March 28, 2019

Transforming behavioral analytics at Atlassian

Location: 2002

Secondary topics: Data Platforms, Data preparation, data governance, and data lineage

Rohan Dhupelia (Atlassian), Jimmy Li (Atlassian)

Average rating:

(4.67, 3 ratings)

Analytics is easy, but good analytics is hard. Atlassian knows this all too well. Rohan Dhupelia and Jimmy Li explain how the company's push to become truly data driven has transformed the way it thinks about behavioral analytics, from how it defined its events to how it ingests and analyzes them. Read more.

2:40pm–3:20pm Thursday, March 28, 2019

How to survive future data warehousing challenges with the help of a hybrid cloud

Location: 2004

Secondary topics: AI and Data technologies in the cloud

Eva Andreasson (Cloudera), Mark Brine (Cloudera), Michael Kohs (Cloudera)

Average rating:

(2.00, 3 ratings)

Michael Kohs, Eva Andreasson, and Mark Brine explain how Cloudera’s Finance Department used a hybrid model to speed up report delivery and reduce cost of end-of-quarter reporting. They also share guidelines for deploying modern data warehousing in a hybrid cloud environment, outlining when you should choose a private cloud service over a public one, the available options, and some dos and dont's. Read more.

2:40pm–3:20pm Thursday, March 28, 2019

Bullet: Querying streaming data in transit with sketches

Location: 2006

Secondary topics: Storage, Streaming, realtime analytics, and IoT

Akshai Sarma (Yahoo), Nathan Speidel (Yahoo)

Average rating:

(3.67, 3 ratings)

Akshai Sarma and Nathan Speidel offer an overview of Bullet, a scalable, pluggable, light multitenant query system on any data flowing through a streaming system without storing it. Bullet efficiently supports intractable operations like top K, count distincts, and windowing without any storage using sketch-based algorithms. Read more.

2:40pm–3:20pm Thursday, March 28, 2019

Clusters in Kubernetes on a cluster: Building a multitenant environment for the field

Location: 2008

Secondary topics: AI and Data technologies in the cloud, Storage

Paul Curtis (Weaveworks)

Average rating:

(4.50, 2 ratings)

What do you do when your technology doesn’t easily fit on a single laptop and consists of many components? Paul Curtis explains how MapR Technologies rolled out a containerized, scalable, globally available, and easily updatable environment using a combination of Kubernetes to orchestrate, shared data fabric to store and persist, and AppLariat to provide the user interface. Read more.

2:40pm–3:20pm Thursday, March 28, 2019

Building and scaling a security detection platform: A Netflix Original

Location: 2024

Secondary topics: Data Platforms, Media, Marketing, Advertising, Security and Privacy

John Bennett (Netflix), Siamac Mirzaie (Netflix)

Average rating:

(3.33, 3 ratings)

Data has become a foundational pillar for security teams operating in organizations of all shapes and sizes. This new norm has created a need for platforms that enable engineers to harness data for various security purposes. John Bennett and Siamac Mirzaie offer an overview of Netflix's internal platform for quickly deploying data-based detection capabilities in the corporate environment. Read more.

3:50pm–4:30pm Thursday, March 28, 2019

Scaling Apache Spark on Kubernetes at Lyft

Location: 2001

Secondary topics: Data Integration and Data Pipelines, Data Platforms

Li Gao (Lyft), Bill Graham (Lyft)

Average rating:

(4.00, 2 ratings)

Li Gao and Bill Graham discuss the challenges the Lyft team faced and solutions they developed to support Apache Spark on Kubernetes in production and at scale. Read more.

3:50pm–4:30pm Thursday, March 28, 2019

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics

Location: 2002

Secondary topics: AI and Data technologies in the cloud, Storage, Streaming, realtime analytics, and IoT

Igor Canadi (Rockset), Dhruba Borthakur (Rockset)

Average rating:

(4.00, 1 rating)

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud-resources and write optimized storage engines. Read more.

3:50pm–4:30pm Thursday, March 28, 2019

Database migrations don't have to be painful, but the road will be bumpy

Location: 2004

Secondary topics: Data Platforms

Adrian Lungu (Adobe), Serban Teodorescu (Adobe)

Average rating:

(4.75, 4 ratings)

Adrian Lungu and Serban Teodorescu explain how—inspired by the green-blue deployment technique—the Adobe Audience Manager team developed an active-passive database migration procedure that allows them to test database clusters in production, minimizing the risks without compromising the innovation. Read more.

3:50pm–4:30pm Thursday, March 28, 2019

Data science at Deutsche Telekom: Predicting global travel patterns and network demand

Location: 2006

Secondary topics: AI and machine learning in the enterprise, Data Platforms, Security and Privacy

Vaclav Surovec (Deutsche Telekom), Gabor Kotalik (Deutsche Telekom)

Average rating:

(4.00, 1 rating)

Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management. Read more.

3:50pm–4:30pm Thursday, March 28, 2019

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

Location: 2008

Secondary topics: Storage, Streaming, realtime analytics, and IoT

Yuan Zhou (Intel), haodong tang (Intel), Jian Zhang (Intel)

Average rating:

(3.33, 3 ratings)

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMOF and explain how it improves Spark analytics performance. Read more.

3:50pm–4:30pm Thursday, March 28, 2019

Real-time monitoring of Twitter's network infrastructure with Heron

Location: 2024

Secondary topics: Data Integration and Data Pipelines, Security and Privacy, Streaming, realtime analytics, and IoT

J Delange (Twitter), N Lu (Twitter)

Average rating:

(2.67, 3 ratings)

Julien Delange and Neng Lu explain how Twitter uses the Heron stream processing engine to monitor and analyze its network infrastructure—implementing a new data pipeline that ingests multiple sources and processes about 1 billion tuples to detect network issues and generate usage statistics. Join in to learn the key technologies used, the architecture, and the challenges Twitter faced. Read more.

4:40pm–5:20pm Thursday, March 28, 2019

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am

Location: 2001

Secondary topics: Automation in data science and big data

Holden Karau (Independent), Rachel Warren (Salesforce Einstein)

Average rating:

(4.60, 5 ratings)

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure deamons involves providing the correct set of magic numbers (a.k.a. tuning) or our jobs may be eaten by Cthulhu. Holden Karau and Rachel Warren explore auto-tuning jobs using systems like Apache BEAM, Mahout, and internal Spark ML jobs as workloads—including new settings in 2.4. Read more.

4:40pm–5:20pm Thursday, March 28, 2019

Taming large state to join datasets for personalization

Location: 2002

Secondary topics: Data Integration and Data Pipelines, Data preparation, data governance, and data lineage, Media, Marketing, Advertising

Sonali Sharma (Netflix), Shriya Arora (Netflix)

Average rating:

(3.00, 2 ratings)

With so much data being generated in real time, what if we could combine all these high-volume data streams and provide near real-time feedback for model training, improving personalization and recommendations and taking the customer experience to a whole new level. Sonali Sharma and Shriya Arora explain how to do exactly that, using Flink's keyed state. Read more.

4:40pm–5:20pm Thursday, March 28, 2019

Apache Druid autoscale-out/in for streaming data ingestion on Kubernetes

Location: 2006

Secondary topics: AI and Data technologies in the cloud

Jinchul Kim (SK Telecom)

Average rating:

(2.17, 6 ratings)

Druid supports autoscaling for data ingestion, but it's only available on AWS EC2. You can't rely on the feature on your private cloud. Jinchul Kim demonstrates autoscale-out/in on Kubernetes, details the benefit on this approach, and discusses the development of Druid Helm charts, rolling updates, and custom metric usage for horizontal autoscaling. Read more.

4:40pm–5:20pm Thursday, March 28, 2019

Data processing at the speed of 100 Gbps using Apache Crail

Location: 2008

Secondary topics: Data Integration and Data Pipelines, Storage, Streaming, realtime analytics, and IoT

Patrick Stuedi (IBM Research)

Average rating:

(4.00, 1 rating)

Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark. Read more.

4:40pm–5:20pm Thursday, March 28, 2019

Model governance in the enterprise

Location: 2018

Secondary topics: Model lifecycle management

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Average rating:

(4.00, 1 rating)

Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them. Read more.

4:40pm–5:20pm Thursday, March 28, 2019

New directions in record linkage

Location: 2024

Secondary topics: Automation in data science and big data, Data preparation, data governance, and data lineage

Yves Thibaudeau (US Census Bureau)

Average rating:

(3.33, 3 ratings)

The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, there's been a lot of change in computing capabilities and new techniques, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications. Read more.

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data

Featured Speakers

Sponsorship Opportunities

Partner Opportunities

Contact Us