Data engineering and architecture: Big data conference & machine learning training

Wednesday March 7: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: San Jose Ballroom (salon 1&2) Strata Data Conference Keynotes
10:30am Morning break

Thursday March 8: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: San Jose Ballroom (salon 1&2) Strata Data Conference Keynotes
10:30am Morning break

9:00am–5:00pm Monday, March 5 & Tuesday, March 6, 2018

Real-time systems with Spark Streaming and Kafka

Location: 114

Jesse Anderson (Big Data Institute)

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments

Location: LL20 C

Mark Donsky (Okera), Andre Araujo (Cloudera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera)

Average rating:

(2.00, 1 rating)

New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, Andre Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Building your first big data application on AWS

Location: LL21 B

Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Ryan Nienhuis (Amazon Web Services), Randy Ridgley (Amazon Web Services)

Average rating:

(4.50, 2 ratings)

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Modern real-time streaming architectures

Location: 210 B/F

Secondary topics: Graphs and Time-series

Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (StreamNative), Arun Kejariwal (Independent)

Average rating:

(5.00, 2 ratings)

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Stream processing with Kafka

Location: 210 C/G

Tim Berglund (Confluent)

Average rating:

(4.36, 11 ratings)

Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

A deep dive into running data analytic workloads in the cloud

Location: 210 D/H

Jason Wang (Cloudera), Mala Ramakrishnan (Cloudera), Stefan Salandy (Cloudera), Aishwarya Venkataraman (Cloudera), Vinithra Varadharajan (Cloudera), Aaron Myers (Cloudera, Inc.)

Average rating:

(3.25, 4 ratings)

Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices. Read more.

9:00am–5:00pm Tuesday, March 6, 2018

Data Case Studies

Location: LL20 A

Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Matt Conners (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Jennie Shin (Kaiser Permanente), Valentin Bercovici (PencilDATA), Wayde Fleener (General Mills), Joe Dumoulin (Next IT), Jules Malin (GoPro), Taylor Martin (O'Reilly Media), Divya Ramachandran (Captricity)

Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

How to use Impala's query plan and profile to fix performance issues

Location: LL21 A

Juan Yu (Cloudera)

Average rating:

(4.75, 4 ratings)

Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses, shows how Impala optimizes queries, and explains how to identify performance bottlenecks through the query plan and profile and how to drive Impala to its full potential. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

Deploying deep learning with TensorFlow

Location: LL21 B

Ron Bodkin (Google), Brian Foo (Google)

Average rating:

(3.00, 2 ratings)

TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin and Brian Foo to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

Time series data: Architecture and use cases

Location: 210 B/F

Secondary topics: Graphs and Time-series

Ted Malaska (Capital One)

Average rating:

(2.80, 5 ratings)

If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

Streaming applications as microservices using Kafka, Akka Streams, and Kafka Streams

Location: 210 C/G

Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

Average rating:

(3.50, 2 ratings)

Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.

11:00am–11:40am Wednesday, March 7, 2018

Cloud, multicloud, and the data refinery

Location: LL21 C/D

Tom Fisher (MapR Technologies)

The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations. Read more.

11:00am–11:40am Wednesday, March 7, 2018

Accelerating development velocity of production ML systems with Docker

Location: LL21 E/F

Kinnary Jangla (Pinterest)

Average rating:

(2.25, 8 ratings)

Having trouble coordinating development of your production ML system across a team of developers? Microservices drifting and causing problems during debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how this affected the engineering productivity of the company's ML teams while increasing uptime and ease of deployment. Read more.

11:00am–11:40am Wednesday, March 7, 2018

Using machine learning to simplify Kafka operations

Location: 230 A

Secondary topics: Graphs and Time-series

Shivnath Babu (Duke University | Unravel Data Systems), Dhruv Goel (Microsoft)

Average rating:

(4.50, 2 ratings)

Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka. Read more.

11:00am–11:40am Wednesday, March 7, 2018

What's new in Hadoop 3.0

Location: 230 C

Daniel Templeton (Cloudera), Andrew Wang (Cloudera)

Average rating:

(4.67, 6 ratings)

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet. Read more.

11:00am–11:40am Wednesday, March 7, 2018

The future of ETL isn’t what it used to be

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Gwen Shapira (Confluent)

Average rating:

(4.93, 14 ratings)

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Powering robotics clouds with Alluxio

Location: LL21 C/D

Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)

Average rating:

(4.00, 1 rating)

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Machine learning versus machine learning in production

Location: LL21 E/F

Manu Mukerji (8x8)

Average rating:

(4.22, 9 ratings)

Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; production issues that arise when applying ML at scale in production; lessons learned; and more. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Streaming big data in the cloud: What to consider and why

Location: 230 A

Secondary topics: Graphs and Time-series

Bill Chambers (Databricks), Michael Armbrust (Databricks)

Average rating:

(4.60, 5 ratings)

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Metrics-driven tuning of Apache Spark at scale

Location: 230 C

Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)

Average rating:

(4.00, 4 ratings)

Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Big data, big problems: Predicting climate change

Location: 210 D/H

Ari Gesher (Kairos Aerospace)

A warming planet needs precise, localized predictions about the effects of climate change to support good long-term and medium-term economic decision making. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to predict and package climate predictions as data products. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Radically modular data ingestion APIs in Apache Beam

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Eugene Kirpichov (Google)

Average rating:

(4.75, 4 ratings)

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

20 Netflix-style principles and practices to get the most out of your data platform

Location: LL21 C/D

Kurt Brown (Netflix)

Average rating:

(4.19, 16 ratings)

Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Spark on Kubernetes: A case study from JD.com

Location: LL21 E/F

Zhen Fan (JD.com), Wei Ting Chen (Intel Corporate)

Average rating:

(4.00, 4 ratings)

Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Approximation data structures in streaming data processing

Location: 230 A

Debasish Ghosh (Lightbend)

Average rating:

(3.33, 3 ratings)

Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Vectorized query processing using Apache Arrow

Location: 230 C

Siddharth Teotia (Dremio)

Average rating:

(5.00, 1 rating)

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)

Average rating:

(4.25, 4 ratings)

When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Dogfooding data at Lyft

Location: LL21 C/D

Mark Grover (Lyft), Arup Malakar (Lyft)

Average rating:

(4.00, 2 ratings)

Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

DataOps: An Agile methodology for data-driven organizations

Location: LL21 E/F

Ellen Friedman (Independent)

Average rating:

(4.43, 7 ratings)

DataOps—a culture and practice for building data-intensive applications, including machine learning pipelines—expands DevOps philosophy to include data-heavy roles such as data engineering and data science. DataOps is based on cross-functional collaboration resulting in fast time to value and an agile workflow. Ellen Friedman offers an overview of DataOps and explains how to implement it. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

Location: 230 A

Henry Cai (Pinterest), Yi Yin (Pinterest)

Average rating:

(3.00, 1 rating)

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Taming deep learning

Location: LL20 C

Evan Sparks (Determined AI)

Average rating:

(5.00, 1 rating)

Deep learning has shown tremendous improvements in a number of areas and has justifiably generated enormous excitement. However, several key challenges—from prohibitive hardware requirements to immature software offerings—are impeding widespread enterprise adoption. Evan Sparks details fundamental challenges facing organizations looking to adopt deep learning and shares possible solutions. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Presto query gate: Identifying and stopping rogue queries

Location: 230 C

Ritesh Agrawal (Uber), Anirban Deb (Uber)

Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Semi-automated analytic pipeline creation and validation using active learning

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Sean Ma (Trifacta)

Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Real-time deep link analytics: The next stage of graph analytics

Location: Expo Hall 1

Secondary topics: Expo Hall, Graphs and Time-series

Yu Xu (TigerGraph)

Average rating:

(5.00, 2 ratings)

Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Pirelli Connesso: Where the road meets the cloud

Location: LL21 C/D

Carlo Torniai (Pirelli Tyre)

Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of contributions across cross-functional teams. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Personalization at scale: Mastering the challenges of personalization to create compelling user experiences

Location: LL21 E/F

Rahim Daya (Pinterest)

Average rating:

(3.50, 4 ratings)

Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Stream storage with Apache BookKeeper

Location: 230 A

Secondary topics: Graphs and Time-series

Sijie Guo (StreamNative)

Average rating:

(3.67, 3 ratings)

Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

NoSQL no more: SQL on Druid with Apache Calcite

Location: 230 C

Gian Merlino (Imply)

Average rating:

(4.00, 2 ratings)

Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Building a flexible ML pipeline at a B2B AI startup

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Dorna Bandari (Jetlore)

Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Leveraging live data to realize the smart cities vision

Location: Expo Hall 1

Secondary topics: Expo Hall

Arun Kejariwal (Independent), Roman Smolgovsky (MZ)

One of the key application domains leveraging live data is smart cities, but success depends on the availability of generic platforms that support high throughput and ultralow latency. Arun Kejariwal and Francois Orsini offer an overview of Satori's live data platform and walk you through a country-scale case study of its implementation. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

How to protect big data in a containerized environment

Location: LL21 C/D

Thomas Phelan (HPE BlueData)

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Better machine learning logistics with the rendezvous architecture

Location: LL21 E/F

Ted Dunning (MapR, now part of HPE)

Average rating:

(5.00, 1 rating)

Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot zero latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Streaming SQL to unify batch and stream processing: Theory and practice with Apache Flink at Uber

Location: 230 A

Fabian Hueske (data Artisans), Shuyi Chen (Uber)

Average rating:

(5.00, 1 rating)

Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Pipeline testing with Great Expectations

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Abe Gong (Superconductive Health), James Campbell (USG)

Average rating:

(5.00, 4 ratings)

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Classifying job execution using deep learning

Location: 230 C

Ash Munshi (Pepperdata)

Average rating:

(5.00, 1 rating)

Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

The future of ETL isn’t what it used to be

Location: 210 B/F

Secondary topics: Data Integration and Data Pipelines

Gwen Shapira (Confluent)

Average rating:

(5.00, 3 ratings)

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.

11:00am–11:40am Thursday, March 8, 2018

Operationalize deep learning: How to deploy and consume your LSTM networks for predictive maintenance scenarios

Location: LL21 C/D

Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)

Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance. Read more.

11:00am–11:40am Thursday, March 8, 2018

Analytics in the cloud: Building a modern cloud-based big data warehouse

Location: LL21 E/F

Greg Rahn (Cloudera)

Average rating:

(3.40, 5 ratings)

For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud. Read more.

11:00am–11:40am Thursday, March 8, 2018

Foundations of streaming SQL; or, How I learned to love stream and table theory

Location: 230 A

Tyler Akidau (Google)

Average rating:

(5.00, 4 ratings)

What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general. Read more.

11:00am–11:40am Thursday, March 8, 2018

The secret sauce behind LinkedIn's self-managing Kafka clusters

Location: 230 C

Jiangjie Qin (LinkedIn)

Average rating:

(4.00, 3 ratings)

LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention. Read more.

11:00am–11:40am Thursday, March 8, 2018

Kafka streaming applications with Akka Streams and Kafka Streams

Location: Expo Hall 1

Secondary topics: Expo Hall

Dean Wampler (Anyscale)

Average rating:

(5.00, 1 rating)

Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams, microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Distributed deep learning with containers on heterogeneous GPU clusters

Location: LL21 C/D

Dong Meng (MapR)

Average rating:

(3.33, 3 ratings)

Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage, and streams, with Kubernetes as an orchestration layer to manage the containers used to train and deploy deep learning models on GPU clusters. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Hive as a service

Location: LL21 E/F

Szehon Ho (Criteo), Pawel Szostek (Criteo)

Average rating:

(4.50, 2 ratings)

Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Effectively once, exactly once, and more in Heron

Location: 230 A

Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)

Average rating:

(4.00, 1 rating)

Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and demonstrate how your applications will benefit from using them. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Building a contacts graph from activity data

Location: 230 C

Secondary topics: Graphs and Time-series

Alexis Roos (Salesforce), Noah Burbank (Salesforce)

Average rating:

(3.00, 1 rating)

In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time and surface insights and recommendations using a graph that models contextual data. Read more.

11:50am–12:30pm Thursday, March 8, 2018

The state of Postgres

Location: LL20 B

Umur Cubukcu (Citus Data)

Average rating:

(4.00, 3 ratings)

PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you'll learn how PostgreSQL's extension APIs are fueling innovations in relational databases. Read more.

11:50am–12:30pm Thursday, March 8, 2018

20 Netflix-style principles and practices to get the most out of your data platform

Location: LL21 A

Kurt Brown (Netflix)

Average rating:

(5.00, 2 ratings)

Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Machine-learned model quality monitoring in fast data and streaming applications

Location: LL21 C/D

Emre Velipasaoglu (Lightbend)

Average rating:

(4.00, 1 rating)

Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Crafting data products for the augmented writing experience

Location: LL21 E/F

Chris Harland (Textio)

The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

TimescaleDB: Reengineering PostgreSQL as a time series database

Location: 230 A

Secondary topics: Graphs and Time-series

Michael Freedman (TimescaleDB)

Average rating:

(4.50, 4 ratings)

Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Playing well together: Big data beyond the JVM with Spark and friends

Location: 230 C

Holden Karau (Independent), Rachel Warren (Salesforce Einstein)

Average rating:

(3.40, 5 ratings)

Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka). Read more.

1:50pm–2:30pm Thursday, March 8, 2018

The real-time journey from raw streaming data to AI-based analytics

Location: Expo Hall 1

Secondary topics: Expo Hall, Graphs and Time-series

Roy Ben-Alta (Amazon Web Services), Ira Cohen (Anodot)

Average rating:

(5.00, 1 rating)

Many domains, such as mobile, web, the IoT, and ecommerce, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing those metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Data-driven ecosystems in the automotive industry

Location: LL20 B

Josef Viehhauser (BMW Group), Tobias Bürger (BMW Group)

Average rating:

(5.00, 1 rating)

The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside the organization. Josef Viehhauser and Tobias Bürger discuss the end-to-end relationship of data and models and share best practices for scaling applications in real-world environments. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Continuous delivery for NLP on Kubernetes: Lessons learned

Location: LL21 C/D

Michelle Casbon (Google)

Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Achieving GDPR compliance and data privacy using blockchain technology

Location: LL21 E/F

Ajay Kumar Mothukuri (Sapient), Vijay Agneeswaran (Walmart Labs)

Ajay Mothukuri and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR). Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Unified and elastic batch and stream processing with Pravega and Apache Flink

Location: 230 A

Secondary topics: Graphs and Time-series

Fabian Hueske (data Artisans), Flavio Junqueira (Dell EMC)

Average rating:

(3.33, 3 ratings)

Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack offers a new way of handling “everything as a stream,” combining unbounded streaming storage with a unified batch and streaming abstraction while dynamically accommodating workload variations. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Data reflections: Making data fast and easy to use without making copies

Location: 230 C

Tomer Shiran (Dremio), Jacques Nadeau (Dremio)

Average rating:

(5.00, 3 ratings)

Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Building ML and AI pipelines with Spark and TensorFlow

Location: Expo Hall 1

Secondary topics: Expo Hall

Chris Fregly (Amazon Web Services)

Average rating:

(5.00, 1 rating)

Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

HDFS on Kubernetes: Tech deep dive on locality and security

Location: LL21 C/D

Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)

Average rating:

(5.00, 1 rating)

There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Lyft's analytics pipeline: From Redshift to Apache Hive and Presto

Location: LL21 E/F

Shenghu Yang (Lyft)

Average rating:

(5.00, 1 rating)

Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Effectively once in Apache Pulsar, the next-generation messaging system

Location: 230 A

Matteo Merli (Streamlio)

Average rating:

(1.00, 1 rating)

Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Cuttlefish: Lightweight primitives for online tuning

Location: 230 C

Tomer Kaftan (University of Washington)

Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Big data insights equal big money: Stories from the trenches at GoDaddy

Location: 210 C/G

Felix Gorodishter (GoDaddy)

Average rating:

(3.00, 2 ratings)

GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email. Read more.

Data Engineering & Architecture

How to build an analytics infrastructure that unlocks the value of your data
