Data engineering and architecture: Big data conference & machine learning training

Wednesday Sep 12: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00am | Location: 3E Strata Data Conference Keynotes
10:50am Morning break

Thursday Sep 13: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00am | Location: 3E Strata Data Conference Keynotes
10:50am Morning break

9:00am–12:30pm Tuesday, 09/11/2018

Architecting a data platform for enterprise use

Location: 1A 06/07 Level: Intermediate

Secondary topics: Data Platforms

Mark Madsen (Teradata), Todd Walter (Archimedata)

Average rating:

(3.50, 10 ratings)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.

9:00am–12:30pm Tuesday, 09/11/2018

Stream processing with Kafka and KSQL

Location: 1E 07/08 Level: Intermediate

Tim Berglund (Confluent)

Average rating:

(4.33, 3 ratings)

Tim Berglund leads this solid introduction to Apache Kafka as a streaming data platform. You'll cover the internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams, then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka. Read more.

9:00am–12:30pm Tuesday, 09/11/2018

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step

Location: 1E 11 Level: Intermediate

Secondary topics: Data preparation, governance and privacy, Ethics and Privacy

Mark Donsky (Okera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)

Average rating:

(4.50, 2 ratings)

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR. Read more.

9:00am–12:30pm Tuesday, 09/11/2018

Designing modern streaming data applications

Location: 1E 12/13 Level: Intermediate

Secondary topics: Data Platforms

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(3.12, 8 ratings)

Arun Kejariwal and Karthik Ramasamy lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, covering messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. They also share case studies from IoT, gaming, and healthcare, along with their experience operating these systems at internet scale. Read more.

1:30pm–5:00pm Tuesday, 09/11/2018

Architecting a next-generation data platform

Location: 1A 06/07 Level: Advanced

Secondary topics: Data Platforms

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Average rating:

(3.12, 8 ratings)

Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform that leverages recent advancements in open source software, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL, as well as modern storage engines, to enable new forms of data processing and analytics. Read more.

1:30pm–5:00pm Tuesday, 09/11/2018

Hands-on Kafka streaming microservices with Akka Streams and Kafka Streams

Location: 1A 23/24 Level: Intermediate

Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

Average rating:

(3.67, 3 ratings)

Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. You'll also discover a few ML model serving ideas along the way. Read more.

1:30pm–5:00pm Tuesday, 09/11/2018

Apache Metron: Open source cybersecurity at scale

Location: 1E 06 Level: Intermediate

Carolyn Duby (Cloudera)

Carolyn Duby shows you how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable open source platform. After this interactive overview of the platform's major features, you'll be ready to analyze your own haystack back at the office. Read more.

1:30pm–5:00pm Tuesday, 09/11/2018

From training to serving: Deploying TensorFlow models with Kubernetes

Location: 1E 09 Level: Intermediate

Secondary topics: Model lifecycle management

Brian Foo (Google), Holden Karau (Independent), Jay Smith (Google)

Average rating:

(2.00, 7 ratings)

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Brian Foo, Jay Smith, and Holden Karau explain how to bring deep learning models from training to serving in a cloud production environment. You'll learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes. Read more.

1:30pm–5:00pm Tuesday, 09/11/2018

Building your first big data application on AWS

Location: 1E 12/13 Level: Intermediate

Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Bruno Faria (Amazon Web Services)

Average rating:

(2.86, 7 ratings)

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez, Radhika Ravirala, Paul Sears, and Bruno Faria walk you through building a big data application using a combination of open source technologies and AWS managed services. Read more.

1:30pm–5:00pm Tuesday, 09/11/2018

Running multidisciplinary big data workloads in the cloud

Location: 1E 14 Level: Intermediate

Sudhanshu Arora (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera), Brandon Freeman (Cloudera), Jason Wang (Cloudera), Shravan Pabba (Cloudera)

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud and integrate data engineering and data analytic workflows. You'll explore considerations and best practices for cloud data analytics pipelines and, along the way, see how to share metadata across workloads in a big data PaaS. Read more.

11:20am–12:00pm Wednesday, 09/12/2018

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber

Location: 1A 10 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines, Transportation and Logistics

Felix Cheung (Uber)

Average rating:

(4.60, 5 ratings)

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame. Read more.

11:20am–12:00pm Wednesday, 09/12/2018

Protecting sensitive data in huge datasets: Cloud tools you can use

Location: 1A 21/22 Level: Intermediate

Secondary topics: Ethics and Privacy

Felipe Hoffa (Google), Damien Desfontaines (Google | ETH Zürich)

Average rating:

(4.00, 1 rating)

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa and Damien Desfontaines explore how to handle massive public datasets, taking you from theory to real life as they showcase newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm. Read more.
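As a rough illustration of the two concepts named in this abstract (not the tools the speakers present), the sketch below measures k-anonymity and l-diversity on a tiny, made-up table. The column names, the sample records, and the helper functions `k_anonymity` and `l_diversity` are all hypothetical.

```python
from collections import Counter, defaultdict


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows that share the same quasi-identifier values."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


def l_diversity(rows, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found within any quasi-identifier group."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].add(row[sensitive])
    return min(len(values) for values in groups.values()) if groups else 0


# Hypothetical, already-generalized records (ZIP codes truncated, ages bucketed).
records = [
    {"zip": "100**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "100**", "age": "30-39", "diagnosis": "asthma"},
    {"zip": "112**", "age": "40-49", "diagnosis": "flu"},
    {"zip": "112**", "age": "40-49", "diagnosis": "flu"},
]

k = k_anonymity(records, ["zip", "age"])
l = l_diversity(records, ["zip", "age"], "diagnosis")
print(f"k-anonymity: {k}, l-diversity: {l}")
```

In this toy table every quasi-identifier combination appears at least twice, yet one group contains a single diagnosis value, which is the kind of gap l-diversity is meant to expose.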

11:20am–12:00pm Wednesday, 09/12/2018

The future of ETL isn’t what it used to be.

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines

Gwen Shapira (Confluent)

Average rating:

(4.00, 4 ratings)

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering. You'll learn how modern engineering organizations use Apache Kafka, microservices, and event streams to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.

11:20am–12:00pm Wednesday, 09/12/2018

DIY versus designer approaches to deploying data center infrastructure for machine learning and analytics

Location: 1E 09 Level: Beginner

Secondary topics: Data Platforms

Cory Minton (Dell EMC), Colm Moynihan (Cloudera)

Average rating:

(5.00, 1 rating)

Cory Minton and Colm Moynihan explain how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble. Read more.

1:15pm–1:55pm Wednesday, 09/12/2018

Why data scientists should love Linux containers

Location: 1A 08 Level: Beginner

Secondary topics: Model lifecycle management

William Benton (Red Hat)

Average rating:

(5.00, 2 ratings)

Containers are a hot technology for application developers, but they also provide key benefits for data scientists. William Benton details the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively. Read more.

1:15pm–1:55pm Wednesday, 09/12/2018

A data marketplace case study with the blockchain and advanced multitenant Hadoop in a smart open data platform

Location: 1A 21/22 Level: Intermediate

Secondary topics: Blockchain and decentralization, Data preparation, governance and privacy

Minh Chau Nguyen (ETRI), Heesun Won (ETRI)

Average rating:

(2.20, 5 ratings)

Minh Chau Nguyen and Heesun Won explain how to implement analytics services in data marketplace systems on a single Hadoop cluster across distributed data centers. The solution extends the overall architecture of the Hadoop ecosystem with the blockchain so that multiple tenants and authorized third parties can securely access data while still maintaining privacy, scalability, and reliability. Read more.

1:15pm–1:55pm Wednesday, 09/12/2018

Lessons learned building a scalable and extendable data pipeline for Call of Duty

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines

Yaroslav Tkachenko (Activision)

Average rating:

(4.67, 3 ratings)

What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse... wait, this looks like a lot of things. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision. Read more.

1:15pm–1:55pm Wednesday, 09/12/2018

Data governance: A big job that's getting bigger

Location: 1E 09 Level: Intermediate

Secondary topics: Data preparation, governance and privacy

Andrew Brust (Blue Badge Insights | ZDNet)

Average rating:

(4.50, 2 ratings)

Data governance has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging, and new products are rising to meet the challenge. Andrew Brust tracks data governance's past and present and offers a glimpse of the future. Read more.

1:15pm–1:55pm Wednesday, 09/12/2018

A comparative analysis of the fundamentals of AWS and Azure

Location: Expo Hall Level: Beginner

Jason Wang (Cloudera), Suraj Acharya (Cloudera), Tony Wu (Cloudera)

The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies now face the difficult decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. Jason Wang, Suraj Acharya, and Tony Wu compare the relative strengths and weaknesses of AWS and Azure. Read more.

2:05pm–2:45pm Wednesday, 09/12/2018

Building a recommendation engine

Location: 1A 10 Level: Beginner

Sophie Watson (Red Hat)

Average rating:

(3.50, 6 ratings)

Recommender systems enhance user experience and business revenue every day. Sophie Watson demonstrates how to develop a robust recommendation engine using a microservice architecture. Read more.

2:05pm–2:45pm Wednesday, 09/12/2018

Using the blockchain in the enterprise

Location: 1A 21/22 Level: Non-technical

Secondary topics: Blockchain and decentralization, Financial Services

Jim Scott (NVIDIA)

Average rating:

(2.67, 3 ratings)

Jim Scott details relevant use cases for blockchain-based solutions across a variety of industries, focusing on a suggested architecture to achieve high-transaction-rate private blockchains and decentralized applications backed by a blockchain. Along the way, Jim compares public and private blockchain architectures. Read more.

2:05pm–2:45pm Wednesday, 09/12/2018

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines

Danny Chen (Uber Technologies), Omkar Joshi (Uber), Eric Sayle (Uber Technologies)

Average rating:

(3.80, 5 ratings)

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works. Read more.

2:05pm–2:45pm Wednesday, 09/12/2018

What's the Hadoop-la about Kubernetes?

Location: 1E 09 Level: Advanced

Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)

Average rating:

(5.00, 1 rating)

Kubernetes (K8s)—the open source container orchestration system for modern big data workloads—is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s. Read more.

2:05pm–2:45pm Wednesday, 09/12/2018

MLflow: An open platform to simplify the machine learning lifecycle

Location: Expo Hall

Secondary topics: Model lifecycle management

Mani Parkhe (Databricks), Andrew Chen (Databricks)

Successfully building and deploying a machine learning model is difficult to do once. Enabling other data scientists to reproduce your pipeline, compare the results of different versions, track what's running where, and redeploy and rollback updated models is much harder. Mani Parkhe and Andrew Chen offer an overview of MLflow—a new open source project from Databricks that simplifies this process. Read more.

2:55pm–3:35pm Wednesday, 09/12/2018

Executive Briefing: Managing successful data projects—Technology selection and team building

Location: 1E 14 Level: Intermediate

Secondary topics: Machine Learning in the enterprise, Media, Marketing, Advertising

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Average rating:

(4.00, 3 ratings)

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. Ted Malaska and Jonathan Seidman share guidance and best practices to help technical leaders deliver successful projects from planning to implementation. Read more.

2:55pm–3:35pm Wednesday, 09/12/2018

Zipline: Airbnb's data management platform for machine learning

Location: 1A 21/22 Level: Intermediate

Secondary topics: Data Platforms, Retail and e-commerce

Varant Zanoyan (Airbnb)

Average rating:

(4.33, 6 ratings)

Zipline is Airbnb’s soon to be open-sourced data management platform specifically designed for ML use cases. It has taken the task of feature generation from months to days and offers features to support end-to-end data management for machine learning. Varant Zanoyan covers Zipline's architecture and dives into how it solves ML-specific problems. Read more.

2:55pm–3:35pm Wednesday, 09/12/2018

Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines

Mauricio Aristizabal (Impact)

Average rating:

(2.67, 3 ratings)

Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components. Read more.

2:55pm–3:35pm Wednesday, 09/12/2018

Clouds and containers: Case studies for big data

Location: 1E 09 Level: Beginner

Paul Curtis (Weaveworks)

Average rating:

(5.00, 2 ratings)

Once the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure that provides business insights? Paul Curtis explores three customer deployments that leverage the best of private clouds and containers to provide a flexible big data environment. Read more.

2:55pm–3:35pm Wednesday, 09/12/2018

Performant time series data management and analytics with Postgres

Location: Expo Hall Level: Intermediate

Michael Freedman (TimescaleDB)

Michael Freedman explains how to leverage Postgres for high-volume time series workloads using TimescaleDB, an open source time series database built as a Postgres plug-in. Michael covers the general architectural design principles and new time series data management features, including adaptive time partitioning and near-real-time continuous aggregations. Read more.

4:35pm–5:15pm Wednesday, 09/12/2018

Setting up a lightweight distributed caching layer using Apache Arrow

Location: 1A 10 Level: Intermediate

Jacques Nadeau (Dremio)

Average rating:

(5.00, 1 rating)

Jacques Nadeau leads a deep dive into a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture—including the cache life cycle, update patterns, cache cohesion, and appropriate use cases—learn how it all works, and see it in action. Read more.

4:35pm–5:15pm Wednesday, 09/12/2018

How to cost-effectively and reliably build infrastructure for machine learning

Location: 1A 21/22 Level: Beginner

Secondary topics: Data Platforms

Osman Sarood (Mist Systems)

Average rating:

(2.00, 1 rating)

Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million. Read more.

4:35pm–5:15pm Wednesday, 09/12/2018

Tracking data lineage at Stitch Fix

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines, Data preparation, governance and privacy

Neelesh Salian (Stitch Fix)

Average rating:

(1.33, 3 ratings)

Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases. Read more.

4:35pm–5:15pm Wednesday, 09/12/2018

AppNexus's stream-based control system for automated buying of digital ads

Location: 1E 07/08 Level: Intermediate

Brian Wu (AppNexus)

Average rating:

(5.00, 1 rating)

Automating the success of digital ad campaigns is complicated and comes with the risk of wasting the advertiser's budget or a trader's margin and time. Brian Wu describes the evolution of Inventory Discovery, a streaming control system of eligibility, prioritization, and real-time evaluation that helps digital advertisers hit their performance goals with AppNexus. Read more.

4:35pm–5:15pm Wednesday, 09/12/2018

Using machine learning to drive intelligence at the edge

Location: 1E 09 Level: Intermediate

Secondary topics: Model lifecycle management

Dave Shuman (Cloudera), Bryan Dean (Red Hat)

The focus on the IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing them back out to the edge. Dave Shuman and Bryan Dean explain how Cloudera and Red Hat executed this architecture at one of Europe's leading manufacturers and share a demo highlighting it. Read more.

4:35pm–5:15pm Wednesday, 09/12/2018

Architectural principles for building trusted, real-time, distributed IoT systems

Location: Expo Hall Level: Intermediate

Secondary topics: Blockchain and decentralization, Data Platforms

Dan Harple (Context Labs)

Dan Harple explains how distributed systems are being influenced by and are influencing operational, financial, and social impact requirements of a wide range of enterprises and how trust in these distributed systems is being challenged, elevated, and resolved by engineers and architects today. Read more.

5:25pm–6:05pm Wednesday, 09/12/2018

From flat files to deconstructed database: The evolution and future of the big data ecosystem

Location: 1A 10 Level: Intermediate

Julien Le Dem (WeWork)

Average rating:

(5.00, 1 rating)

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. Read more.

5:25pm–6:05pm Wednesday, 09/12/2018

Apache Kafka and the four challenges of production machine learning systems

Location: 1A 21/22 Level: Intermediate

Secondary topics: Model lifecycle management

Jay Kreps (Confluent)

Average rating:

(4.00, 2 ratings)

Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes, or customer experience. Jay Kreps explores some of the difficulties of building production machine learning systems and explains how Apache Kafka and stream processing can help. Read more.

5:25pm–6:05pm Wednesday, 09/12/2018

Circuit breakers to safeguard for garbage in, garbage out

Location: 1A 23/24 Level: Beginner

Secondary topics: Data Integration and Data Pipelines, Financial Services

Sandeep Uttamchandani (Intuit)

Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines, similar to the common design pattern used in service architectures, that detects and corrects problems and ensures always reliable insights. Read more.
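For readers unfamiliar with the pattern, here is a minimal sketch of what a circuit breaker for a data pipeline might look like. It is an assumption-laden illustration, not Intuit's implementation: the class `PipelineCircuitBreaker`, its quality checks, and the quarantine behavior are all hypothetical.

```python
class PipelineCircuitBreaker:
    """Hypothetical breaker: stops publishing batches once a quality check fails."""

    def __init__(self, checks):
        self.checks = checks      # mapping of check name -> function(batch) -> bool
        self.is_open = False      # open circuit means downstream publishing is blocked
        self.quarantined = []     # batches held back for inspection or repair

    def process(self, batch, publish):
        failed = [name for name, check in self.checks.items() if not check(batch)]
        if failed:
            self.is_open = True
        if self.is_open:
            # Hold the batch instead of letting suspect data reach consumers.
            self.quarantined.append((batch, failed))
            return False
        publish(batch)
        return True

    def reset(self):
        """Close the circuit after the underlying problem has been fixed."""
        self.is_open = False


# Illustrative checks on made-up records.
checks = {
    "non_empty": lambda batch: len(batch) > 0,
    "no_null_amounts": lambda batch: all(r.get("amount") is not None for r in batch),
}

published = []
breaker = PipelineCircuitBreaker(checks)
first = breaker.process([{"amount": 10}], published.append)     # passes checks
second = breaker.process([{"amount": None}], published.append)  # trips the breaker
third = breaker.process([{"amount": 5}], published.append)      # held while open
breaker.reset()
fourth = breaker.process([{"amount": 7}], published.append)     # flows again
```

Holding every batch while the circuit is open is one possible design choice; it trades freshness for the guarantee that consumers never see data that arrived after a known-bad batch without review.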

5:25pm–6:05pm Wednesday, 09/12/2018

Hudi: Unifying storage and serving for batch and near-real-time analytics

Location: 1E 07/08 Level: Beginner

Secondary topics: Data Integration and Data Pipelines

Nishith Agarwal (Uber), Balaji Varadarajan (Uber), Vinoth Chandar (Apache Hudi)

Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond. Read more.

5:25pm–6:05pm Wednesday, 09/12/2018

Introducing Iceberg: Tables designed for object stores

Location: 1E 09 Level: Intermediate

Owen O'Malley (Cloudera), Ryan Blue (Netflix)

Average rating:

(4.33, 3 ratings)

Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic additions, removals, or replacements of files, regardless of whether the data is stored in Avro, ORC, or Parquet. Read more.

11:20am–12:00pm Thursday, 09/13/2018

TonY: Native support of TensorFlow on Hadoop

Location: 1A 10 Level: Intermediate

Secondary topics: Data Platforms, Deep Learning

Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Zhe Zhang (LinkedIn)

Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop. Read more.

11:20am–12:00pm Thursday, 09/13/2018

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am

Location: 1A 21/22 Level: Intermediate

Holden Karau (Independent), Rachel Warren (Salesforce Einstein), Anya Bida (Salesforce)

Average rating:

(4.00, 2 ratings)

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads. Read more.

11:20am–12:00pm Thursday, 09/13/2018

Near-real-time anomaly detection at Lyft

Location: 1E 07/08 Level: Beginner

Secondary topics: Temporal data and time-series analytics, Transportation and Logistics

Thomas Weise (Lyft), Mark Grover (Lyft)

Average rating:

(2.50, 2 ratings)

Thomas Weise and Mark Grover explain how Lyft uses its streaming platform to detect and respond to anomalous events, using data science tools for machine learning and a process that allows for fast and predictable deployment. Read more.

11:20am–12:00pm Thursday, 09/13/2018

Data discovery and lineage: Integrating streaming data in the public cloud with on-prem, classic data stores, and heterogeneous schema types

Location: 1E 09 Level: Advanced

Secondary topics: Data Integration and Data Pipelines, Data preparation, governance and privacy, Media, Marketing, Advertising

Barbara Eckman (Comcast)

Average rating:

(4.33, 6 ratings)

Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro. Read more.

11:20am–12:00pm Thursday, 09/13/2018

Data at Netflix: See what’s next

Location: Expo Hall Level: Intermediate

Secondary topics: Data Platforms

Michelle Ufford (Netflix)

Average rating:

(4.40, 5 ratings)

Michelle Ufford shares some of the cool things Netflix is doing with data and the big bets the company is making on data infrastructure, covering workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, platform intelligence, and more. Read more.

1:10pm–1:50pm Thursday, 09/13/2018

Deep learning on YARN: Running distributed TensorFlow, MXNet, Caffe, and XGBoost on Hadoop clusters

Location: 1A 10 Level: Intermediate

Secondary topics: Data Platforms, Deep Learning, Model lifecycle management

Wangda Tan (Cloudera)

Average rating:

(4.50, 2 ratings)

In order to train deep learning and machine learning models, you must leverage applications such as TensorFlow, MXNet, Caffe, and XGBoost. Wangda Tan discusses new features in Apache Hadoop 3.x to better support deep learning workloads and demonstrates how to run these applications on YARN. Read more.

1:10pm–1:50pm Thursday, 09/13/2018

A/B testing at Uber: How we built a BYOM (bring your own metrics) platform

Location: 1A 21/22 Level: Intermediate

Secondary topics: Data Platforms, Transportation and Logistics

Milene Darnis (Uber)

Average rating:

(4.22, 9 ratings)

Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis explains how the team built a scalable and self-serve platform that lets users plug in any metric to analyze. Read more.

1:10pm–1:50pm Thursday, 09/13/2018

Case study: A Spark-based distributed simulation optimization architecture for portfolio optimization in retail banking

Location: 1A 23/24 Level: Intermediate

Kaushik Deka (Novantas), Ted Gibson (Novantas)

Average rating:

(4.50, 2 ratings)

Kaushik Deka and Ted Gibson share a large-scale optimization architecture in Spark for a consumer product portfolio optimization use case in retail banking. The architecture combines a simulator that distributes computation of complex real-world scenarios and a constraint optimizer that uses business rules as constraints to meet growth targets. Read more.

1:10pm–1:50pm Thursday, 09/13/2018

How Komatsu is improving mining efficiencies using the IoT and machine learning

Location: 1E 09 Level: Non-technical

Secondary topics: Transportation and Logistics

Shawn Terry (Komatsu Mining Corp)

Average rating:

(4.50, 2 ratings)

Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment to ultimately improve mine performance and efficiencies. Shawn Terry details the company's data journey and explains how it is using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment. Read more.

1:10pm–1:50pm Thursday, 09/13/2018

The state of Postgres

Location: Expo Hall Level: Beginner

Umur Cubukcu (Citus Data)

PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases. Read more.

2:00pm–2:40pm Thursday, 09/13/2018

Big data at speed

Location: 1A 06/07 Level: Intermediate

Secondary topics: Transportation and Logistics

Ted Malaska (Capital One), Mark Grover (Lyft)

Many details go into building a big data system for speed, from determining an acceptable latency for data access and where to store the data to solving multiregion problems, or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed. Read more.

2:00pm–2:40pm Thursday, 09/13/2018

Kubeflow explained: Portable machine learning on Kubernetes

Location: 1A 10 Level: Intermediate

Secondary topics: Model lifecycle management

Michelle Casbon (Google)

Average rating:

(5.00, 2 ratings)

Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project. Read more.

2:00pm–2:40pm Thursday, 09/13/2018

Aetna's advanced analytics platform, Data Fabric

Location: 1A 21/22 Level: Intermediate

Secondary topics: Data Platforms, Health and Medicine

Occhio Orsini (Aetna)

Occhio Orsini offers an overview of Aetna's Data Fabric platform. Join in to learn the needs and desires that led to the creation of the advanced analytics platform, explore the platform's architecture, technology, and capabilities, and understand the key technologies and capabilities that made it possible to build a hybrid solution across on-premises and cloud-hosted data centers. Read more.

2:00pm–2:40pm Thursday, 09/13/2018

Using big data to unlock the delivery of personalized, multilingual real-time chat services for global financial service organizations

Location: 1A 23/24 Level: Beginner

Secondary topics: Data Platforms, Financial Services

Timothy Walpole (BJSS)

Financial service clients demand increased data-driven personalization, faster insight-based decisions, and multichannel real-time access. Tim Walpole details how organizations can deliver real-time, vendor-agnostic, personalized chat services and explores issues around security, privacy, legal sign-off, data compliance, and how the internet of things can be used as a delivery platform. Read more.

2:00pm–2:40pm Thursday, 09/13/2018

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks

Location: 1E 09 Level: Beginner

Secondary topics: Data Platforms, Retail and e-commerce, Transportation and Logistics

Tao Huang (JD.com), Mang Zhang (JD.com), Bing Bai (JD.com)

Average rating:

(3.00, 1 rating)

Tao Huang, Mang Zhang, and Bing Bai explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average. Read more.

2:00pm–2:40pm Thursday, 09/13/2018

Building a high-performance model serving engine from scratch using Kubernetes, GPUs, Docker, Istio, and TensorFlow

Location: Expo Hall Level: Intermediate

Secondary topics: Model lifecycle management

Chris Fregly (Amazon Web Services)

Average rating:

(3.50, 2 ratings)

Chris Fregly details a full-featured, open source end-to-end TensorFlow model training and deployment system, using the latest advancements with Kubernetes, TensorFlow, and GPUs. Read more.

3:30pm–4:10pm Thursday, 09/13/2018

Scaling data infrastructure in the fashion world; or, “What is this? Business intelligence for ants?”

Location: 1E 10/11 Level: Non-technical

Secondary topics: Data Platforms, Media, Marketing, Advertising, Retail and e-commerce

Francesco Mucio (Francescomuc.io)

Average rating:

(3.50, 2 ratings)

Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead. Read more.

3:30pm–4:10pm Thursday, 09/13/2018

Managing data chaos in the world of microservices

Location: 1A 10 Level: Intermediate

Oleksii Kachaiev (Attendify)

Average rating:

(3.50, 2 ratings)

When we talk about microservices, we usually focus on the communication layer. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity, such as structural and semantic changes, knowledge sharing, and data discovery. Join Oleksii Kachaiev to explore emerging technologies created to tackle these challenges. Read more.

3:30pm–4:10pm Thursday, 09/13/2018

Self-service modern analytics on the GovCloud

Location: 1A 21/22 Level: Intermediate

Ramesh Krishnan (Lockheed Martin), Steven Morgan (Lockheed Martin)

Average rating:

(4.00, 1 rating)

Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from its information assets, the company is constantly exploring ways to enable effective self-service scenarios. Ramesh Krishnan and Steve Morgan discuss Lockheed Martin's journey into modern analytics and explore its analytics platform focused on leveraging AWS GovCloud. Read more.

3:30pm–4:10pm Thursday, 09/13/2018

Machine learning for nonstationary streaming data using Structured Streaming and StreamDM

Location: 1E 07/08 Level: Intermediate

Secondary topics: Temporal data and time-series analytics

Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drifts). Read more.

4:20pm–5:00pm Thursday, 09/13/2018

Infrastructure for deploying machine learning to production in large financial institutions: Lessons learned and best practices

Location: 1A 08 Level: Intermediate

Secondary topics: Financial Services, Model lifecycle management

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Large financial institutions have many data science teams (e.g., those for fraud, credit risk, and marketing), each often using a diverse set of tools to build predictive models. There are many challenges involved in productionizing these predictive AI models. Harish Doddi and Jerry Xu share challenges and lessons learned deploying AI models to production in large financial institutions. Read more.

4:20pm–5:00pm Thursday, 09/13/2018

The move to a modern data platform in the cloud: Pitfalls to avoid and best practices to follow

Location: 1A 10 Level: Intermediate

Amandeep Khurana (Okera)

Amandeep Khurana shares critical data management practices for easy and unified data access that meets security and regulatory compliance requirements, helping you avoid the pitfalls that could lead to complex, expensive architectures. Read more.

4:20pm–5:00pm Thursday, 09/13/2018

Building turnkey recommendations for 5% of internet video

Location: 1A 21/22 Level: Intermediate

Secondary topics: Deep Learning, Media, Marketing, Advertising, Recommendation Systems

Nir Yungster (JW Player), Kamil Sindi (JW Player)

JW Player—the world’s largest network-independent video platform, representing 5% of global internet video—provides on-demand recommendations as a service to thousands of media publishers. Nir Yungster and Kamil Sindi explain how the company is systematically improving model performance while navigating the many engineering challenges and unique needs of the diverse publishers it serves. Read more.

4:20pm–5:00pm Thursday, 09/13/2018

Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform

Location: 1A 23/24 Level: Beginner

Secondary topics: Data Integration and Data Pipelines

Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)

Average rating:

(4.50, 2 ratings)

Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices for and lessons learned about topics such as schema evolution and network architecture. Read more.

4:20pm–5:00pm Thursday, 09/13/2018

IoT edge processing with Apache NiFi, Apache MiNiFi, and multiple deep learning libraries

Location: 1E 07/08 Level: Beginner

Timothy Spann (Cloudera)

Average rating:

(4.00, 2 ratings)

Timothy Spann leads a hands-on deep dive into using Apache MiNiFi with Apache MXNet and other deep learning libraries on edge devices. Read more.

4:20pm–5:00pm Thursday, 09/13/2018

TuneIn: How to get your jobs tuned while you are sleeping

Location: 1E 09 Level: Intermediate

Manoj Kumar (LinkedIn), Pralabh Kumar (LinkedIn), Arpan Agrawal (LinkedIn)

Average rating:

(5.00, 1 rating)

Have you ever tuned a Spark or MR job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage. Read more.

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data
