Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools and frameworks (open source or proprietary), platforms (on-premises, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Tuesday Sep 11: Tutorials (Gold & Silver passes)
Wednesday Sep 12: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00am | Location: 3E
Strata Data Conference Keynotes
10:50am
Morning break
Thursday Sep 13: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00am | Location: 3E
Strata Data Conference Keynotes
10:50am
Morning break
9:00am–12:30pm Tuesday, 09/11/2018
Location: 1A 06/07 Level: Intermediate
Secondary topics:  Data Platforms
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Average rating: ***..
(3.50, 10 ratings)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.
9:00am–12:30pm Tuesday, 09/11/2018
Location: 1E 07/08 Level: Intermediate
Tim Berglund (Confluent)
Average rating: ****.
(4.33, 3 ratings)
Tim Berglund leads this solid introduction to Apache Kafka as a streaming data platform. You'll cover the internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams, then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka. Read more.
9:00am–12:30pm Tuesday, 09/11/2018
Location: 1E 11 Level: Intermediate
Secondary topics:  Data preparation, governance and privacy, Ethics and Privacy
Mark Donsky (Okera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)
Average rating: ****.
(4.50, 2 ratings)
New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR. Read more.
9:00am–12:30pm Tuesday, 09/11/2018
Location: 1E 12/13 Level: Intermediate
Secondary topics:  Data Platforms
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Average rating: ***..
(3.12, 8 ratings)
Arun Kejariwal and Karthik Ramasamy lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, covering messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. They also share case studies from IoT, gaming, and healthcare and their experience operating these systems at internet scale. Read more.
1:30pm–5:00pm Tuesday, 09/11/2018
Location: 1A 06/07 Level: Advanced
Secondary topics:  Data Platforms
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
Average rating: ***..
(3.12, 8 ratings)
Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics. Read more.
1:30pm–5:00pm Tuesday, 09/11/2018
Location: 1A 23/24 Level: Intermediate
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Average rating: ***..
(3.67, 3 ratings)
Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. You'll also discover a few ML model serving ideas along the way. Read more.
1:30pm–5:00pm Tuesday, 09/11/2018
Location: 1E 06 Level: Intermediate
Carolyn Duby (Hortonworks)
Carolyn Duby shows you how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable open source platform. After this interactive overview of the platform's major features, you'll be ready to analyze your own haystack back at the office. Read more.
1:30pm–5:00pm Tuesday, 09/11/2018
Location: 1E 09 Level: Intermediate
Secondary topics:  Model lifecycle management
Brian Foo (Google), Holden Karau (Google), Jay Smith (Google)
Average rating: **...
(2.00, 7 ratings)
TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Brian Foo, Jay Smith, and Holden Karau explain how to bring deep learning models from training to serving in a cloud production environment. You'll learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes. Read more.
1:30pm–5:00pm Tuesday, 09/11/2018
Location: 1E 12/13 Level: Intermediate
Jorge A. Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Bruno Faria (Amazon Web Services)
Average rating: **...
(2.86, 7 ratings)
Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez, Radhika Ravirala, Paul Sears, and Bruno Faria walk you through building a big data application using a combination of open source technologies and AWS managed services. Read more.
1:30pm–5:00pm Tuesday, 09/11/2018
Location: 1E 14 Level: Intermediate
Sudhanshu Arora (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera), Brandon Freeman (Cloudera), Jason Wang (Cloudera), Shravan Pabba (Cloudera)
Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud, integrate data engineering and data analytic workflows, and explore considerations and best practices for such pipelines. Along the way, you'll see how to share metadata across workloads in a big data PaaS. Read more.
11:20am–12:00pm Wednesday, 09/12/2018
Location: 1A 10 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines, Transportation and Logistics
Felix Cheung (Uber)
Average rating: ****.
(4.60, 5 ratings)
Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame. Read more.
11:20am–12:00pm Wednesday, 09/12/2018
Location: 1A 21/22 Level: Intermediate
Secondary topics:  Ethics and Privacy
Felipe Hoffa (Google), Damien Desfontaines (Google | ETH Zürich)
Average rating: ****.
(4.00, 1 rating)
Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa and Damien Desfontaines explore how to handle massive public datasets, taking you from theory to real life as they showcase newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm. Read more.
11:20am–12:00pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Gwen Shapira (Confluent)
Average rating: ****.
(4.00, 4 ratings)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering. You'll learn how modern engineering organizations use Apache Kafka, microservices, and event streams to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.
11:20am–12:00pm Wednesday, 09/12/2018
Location: 1E 09 Level: Beginner
Secondary topics:  Data Platforms
Cory Minton (Dell EMC), Colm Moynihan (Cloudera)
Average rating: *****
(5.00, 1 rating)
Cory Minton and Colm Moynihan explain how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble. Read more.
1:15pm–1:55pm Wednesday, 09/12/2018
Location: 1A 08 Level: Beginner
Secondary topics:  Model lifecycle management
William Benton (Red Hat)
Average rating: *****
(5.00, 2 ratings)
Containers are a hot technology for application developers, but they also provide key benefits for data scientists. William Benton details the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively. Read more.
1:15pm–1:55pm Wednesday, 09/12/2018
Location: 1A 21/22 Level: Intermediate
Secondary topics:  Blockchain and decentralization, Data preparation, governance and privacy
Minh Chau Nguyen (ETRI), Heesun Won (ETRI)
Average rating: **...
(2.20, 5 ratings)
Minh Chau Nguyen and Heesun Won explain how to implement analytics services in data marketplace systems on a single Hadoop cluster across distributed data centers. The solution extends the overall architecture of the Hadoop ecosystem with the blockchain so that multiple tenants and authorized third parties can securely access data while still maintaining privacy, scalability, and reliability. Read more.
1:15pm–1:55pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Yaroslav Tkachenko (Activision)
Average rating: ****.
(4.67, 3 ratings)
What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse... wait, this looks like a lot of things. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision. Read more.
1:15pm–1:55pm Wednesday, 09/12/2018
Location: 1E 09 Level: Intermediate
Secondary topics:  Data preparation, governance and privacy
Andrew J Brust (ZDNet | Blue Badge Insights)
Average rating: ****.
(4.50, 2 ratings)
Data governance has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging, and new products are rising to meet the challenge. Andrew Brust tracks data governance's past and present and offers a glimpse of the future. Read more.
1:15pm–1:55pm Wednesday, 09/12/2018
Location: Expo Hall Level: Beginner
Jason Wang (Cloudera), Suraj Acharya (Cloudera), Tony Wu (Cloudera)
The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies now face the difficult decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. Jason Wang, Suraj Acharya, and Tony Wu compare the relative strengths and weaknesses of AWS and Azure. Read more.
2:05pm–2:45pm Wednesday, 09/12/2018
Location: 1A 10 Level: Beginner
Sophie Watson (Red Hat)
Average rating: ***..
(3.50, 6 ratings)
Recommender systems enhance user experience and business revenue every day. Sophie Watson demonstrates how to develop a robust recommendation engine using a microservice architecture. Read more.
2:05pm–2:45pm Wednesday, 09/12/2018
Location: 1A 21/22 Level: Non-technical
Secondary topics:  Blockchain and decentralization, Financial Services
Jim Scott (MapR Technologies)
Average rating: **...
(2.67, 3 ratings)
Jim Scott details relevant use cases for blockchain-based solutions across a variety of industries, focusing on a suggested architecture to achieve high-transaction-rate private blockchains and decentralized applications backed by a blockchain. Along the way, Jim compares public and private blockchain architectures. Read more.
2:05pm–2:45pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Danny Chen (Uber Technologies), Omkar Joshi (Uber Technologies), Eric Sayle (Uber Technologies)
Average rating: ***..
(3.80, 5 ratings)
Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works. Read more.
2:05pm–2:45pm Wednesday, 09/12/2018
Location: 1E 09 Level: Advanced
Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)
Average rating: *****
(5.00, 1 rating)
Kubernetes (K8s)—the open source container orchestration system for modern big data workloads—is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s. Read more.
2:05pm–2:45pm Wednesday, 09/12/2018
Location: Expo Hall
Secondary topics:  Model lifecycle management
Mani Parkhe (Databricks), Andrew Chen (Databricks)
Successfully building and deploying a machine learning model is difficult to do once. Enabling other data scientists to reproduce your pipeline, compare the results of different versions, track what's running where, and redeploy and roll back updated models is much harder. Mani Parkhe and Andrew Chen offer an overview of MLflow, a new open source project from Databricks that simplifies this process. Read more.
2:55pm–3:35pm Wednesday, 09/12/2018
Location: 1E 14 Level: Intermediate
Secondary topics:  Machine Learning in the enterprise, Media, Marketing, Advertising
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
Average rating: ****.
(4.00, 3 ratings)
Creating a successful big data practice in your organization presents new challenges in managing projects and teams. Ted Malaska and Jonathan Seidman share guidance and best practices to help technical leaders deliver successful projects from planning to implementation. Read more.
2:55pm–3:35pm Wednesday, 09/12/2018
Location: 1A 21/22 Level: Intermediate
Secondary topics:  Data Platforms, Retail and e-commerce
Varant Zanoyan (Airbnb)
Average rating: ****.
(4.33, 6 ratings)
Zipline is Airbnb’s soon-to-be open-sourced data management platform specifically designed for ML use cases. It has cut the time needed for feature generation from months to days and offers features to support end-to-end data management for machine learning. Varant Zanoyan covers Zipline's architecture and dives into how it solves ML-specific problems. Read more.
2:55pm–3:35pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Average rating: **...
(2.67, 3 ratings)
Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components. Read more.
2:55pm–3:35pm Wednesday, 09/12/2018
Location: 1E 09 Level: Beginner
Paul Curtis (MapR Technologies)
Average rating: *****
(5.00, 2 ratings)
Once the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure to provide the business insights? Paul Curtis explores three customer deployments that leverage the best of private clouds and containers to provide a flexible big data environment. Read more.
2:55pm–3:35pm Wednesday, 09/12/2018
Location: Expo Hall Level: Intermediate
Michael Freedman (TimescaleDB)
Michael Freedman explains how to leverage Postgres for high-volume time series workloads using TimescaleDB, an open source time series database built as a Postgres plug-in. Michael covers the general architectural design principles and new time series data management features, including adaptive time partitioning and near-real-time continuous aggregations. Read more.
4:35pm–5:15pm Wednesday, 09/12/2018
Location: 1A 10 Level: Intermediate
Jacques Nadeau (Dremio)
Average rating: *****
(5.00, 1 rating)
Jacques Nadeau leads a deep dive into a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture—including the cache life cycle, update patterns, cache cohesion, and appropriate use cases—learn how it all works, and see it in action. Read more.
4:35pm–5:15pm Wednesday, 09/12/2018
Location: 1A 21/22 Level: Beginner
Secondary topics:  Data Platforms
Osman Sarood (Mist Systems)
Average rating: **...
(2.00, 1 rating)
Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million. Read more.
4:35pm–5:15pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines, Data preparation, governance and privacy
Average rating: *....
(1.33, 3 ratings)
Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases. Read more.
4:35pm–5:15pm Wednesday, 09/12/2018
Location: 1E 07/08 Level: Intermediate
Brian Wu (AppNexus)
Average rating: *****
(5.00, 1 rating)
Automating the success of digital ad campaigns is complicated and comes with the risk of wasting the advertiser's budget or a trader's margin and time. Brian Wu describes the evolution of Inventory Discovery, a streaming control system of eligibility, prioritization, and real-time evaluation that helps digital advertisers hit their performance goals with AppNexus. Read more.
4:35pm–5:15pm Wednesday, 09/12/2018
Location: 1E 09 Level: Intermediate
Secondary topics:  Model lifecycle management
Dave Shuman (Cloudera), Bryan Dean (Red Hat)
The focus on the IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing them back out to the edge. Dave Shuman and Bryan Dean explain how Cloudera and Red Hat executed this architecture at one of Europe's leading manufacturers, along with a demo highlighting this architecture. Read more.
4:35pm–5:15pm Wednesday, 09/12/2018
Location: Expo Hall Level: Intermediate
Secondary topics:  Blockchain and decentralization, Data Platforms
Dan Harple (Context Labs)
Dan Harple explains how distributed systems are being influenced by and are influencing operational, financial, and social impact requirements of a wide range of enterprises and how trust in these distributed systems is being challenged, elevated, and resolved by engineers and architects today. Read more.
5:25pm–6:05pm Wednesday, 09/12/2018
Location: 1A 10 Level: Intermediate
Julien Le Dem (WeWork)
Average rating: *****
(5.00, 1 rating)
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. Read more.
5:25pm–6:05pm Wednesday, 09/12/2018
Location: 1A 21/22 Level: Intermediate
Secondary topics:  Model lifecycle management
Jay Kreps (Confluent)
Average rating: ****.
(4.00, 2 ratings)
Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes, or customer experience. Jay Kreps explores some of the difficulties of building production machine learning systems and explains how Apache Kafka and stream processing can help. Read more.
5:25pm–6:05pm Wednesday, 09/12/2018
Location: 1A 23/24 Level: Beginner
Secondary topics:  Data Integration and Data Pipelines, Financial Services
Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines, similar to the common design pattern used in service architectures, that detects and corrects problems and ensures always reliable insights. Read more.
5:25pm–6:05pm Wednesday, 09/12/2018
Location: 1E 07/08 Level: Beginner
Secondary topics:  Data Integration and Data Pipelines
Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond. Read more.
5:25pm–6:05pm Wednesday, 09/12/2018
Location: 1E 09 Level: Intermediate
Owen O'Malley (Hortonworks), Ryan Blue (Netflix)
Average rating: ****.
(4.33, 3 ratings)
Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic additions, removals, or replacements of files, regardless of whether the data is stored in Avro, ORC, or Parquet. Read more.
11:20am–12:00pm Thursday, 09/13/2018
Location: 1A 10 Level: Intermediate
Secondary topics:  Data Platforms, Deep Learning
Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Zhe Zhang (LinkedIn)
Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop. Read more.
11:20am–12:00pm Thursday, 09/13/2018
Location: 1A 21/22 Level: Intermediate
Holden Karau (Google), Rachel Warren (Salesforce Einstein), Anya Bida (Salesforce)
Average rating: ****.
(4.00, 2 ratings)
Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads. Read more.
11:20am–12:00pm Thursday, 09/13/2018
Location: 1E 07/08 Level: Beginner
Secondary topics:  Temporal data and time-series analytics, Transportation and Logistics
Thomas Weise (Lyft), Mark Grover (Lyft)
Average rating: **...
(2.50, 2 ratings)
Thomas Weise and Mark Grover explain how Lyft uses its streaming platform to detect and respond to anomalous events, using data science tools for machine learning and a process that allows for fast and predictable deployment. Read more.
11:20am–12:00pm Thursday, 09/13/2018
Location: 1E 09 Level: Advanced
Secondary topics:  Data Integration and Data Pipelines, Data preparation, governance and privacy, Media, Marketing, Advertising
Barbara Eckman (Comcast)
Average rating: ****.
(4.33, 6 ratings)
Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro. Read more.
11:20am–12:00pm Thursday, 09/13/2018
Location: Expo Hall Level: Intermediate
Secondary topics:  Data Platforms
Michelle Ufford (Netflix)
Average rating: ****.
(4.40, 5 ratings)
Michelle Ufford shares some of the cool things Netflix is doing with data and the big bets the company is making on data infrastructure, covering workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, platform intelligence, and more. Read more.
1:10pm–1:50pm Thursday, 09/13/2018
Location: 1A 10 Level: Intermediate
Secondary topics:  Data Platforms, Deep Learning, Model lifecycle management
Wangda Tan (Hortonworks)
Average rating: ****.
(4.50, 2 ratings)
In order to train deep learning and machine learning models, you must leverage applications such as TensorFlow, MXNet, Caffe, and XGBoost. Wangda Tan discusses new features in Apache Hadoop 3.x to better support deep learning workloads and demonstrates how to run these applications on YARN. Read more.
1:10pm–1:50pm Thursday, 09/13/2018
Location: 1A 21/22 Level: Intermediate
Secondary topics:  Data Platforms, Transportation and Logistics
Milene Darnis (Uber)
Average rating: ****.
(4.22, 9 ratings)
Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis explains how the team built a scalable and self-serve platform that lets users plug in any metric to analyze. Read more.
1:10pm–1:50pm Thursday, 09/13/2018
Location: 1A 23/24 Level: Intermediate
Kaushik Deka (Novantas), Ted Gibson (Novantas)
Average rating: ****.
(4.50, 2 ratings)
Kaushik Deka and Ted Gibson share a large-scale optimization architecture in Spark for a consumer product portfolio optimization use case in retail banking. The architecture combines a simulator that distributes computation of complex real-world scenarios and a constraint optimizer that uses business rules as constraints to meet growth targets. Read more.
1:10pm–1:50pm Thursday, 09/13/2018
Location: 1E 09 Level: Non-technical
Secondary topics:  Transportation and Logistics
Shawn Terry (Komatsu Mining Corp)
Average rating: ****.
(4.50, 2 ratings)
Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment to ultimately improve mine performance and efficiencies. Shawn Terry details the company's data journey and explains how it is using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment. Read more.
1:10pm–1:50pm Thursday, 09/13/2018
Location: Expo Hall Level: Beginner
Umur Cubukcu (Citus Data)
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases. Read more.
2:00pm–2:40pm Thursday, 09/13/2018
Location: 1A 06/07 Level: Intermediate
Secondary topics:  Transportation and Logistics
Ted Malaska (Capital One), Mark Grover (Lyft)
Many details go into building a big data system for speed, from determining an acceptable latency for data access and deciding where to store the data to solving multiregion problems, or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed. Read more.
2:00pm–2:40pm Thursday, 09/13/2018
Location: 1A 10 Level: Intermediate
Secondary topics:  Model lifecycle management
Michelle Casbon (Google)
Average rating: *****
(5.00, 2 ratings)
Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project. Read more.
2:00pm–2:40pm Thursday, 09/13/2018
Location: 1A 21/22 Level: Intermediate
Secondary topics:  Data Platforms, Health and Medicine
Occhio Orsini (Aetna)
Occhio Orsini offers an overview of Aetna's Data Fabric platform. Join in to learn the needs and desires that led to the creation of the advanced analytics platform, explore the platform's architecture, technology, and capabilities, and understand the key technologies and capabilities that made it possible to build a hybrid solution across on-premises and cloud-hosted data centers. Read more.
2:00pm–2:40pm Thursday, 09/13/2018
Location: 1A 23/24 Level: Beginner
Secondary topics:  Data Platforms, Financial Services
Tim Walpole (BJSS)
Financial service clients demand increased data-driven personalization, faster insight-based decisions, and multichannel real-time access. Tim Walpole details how organizations can deliver real-time, vendor-agnostic, personalized chat services and explores issues around security, privacy, legal sign-off, data compliance, and how the internet of things can be used as a delivery platform. Read more.
2:00pm–2:40pm Thursday, 09/13/2018
Location: 1E 09 Level: Beginner
Secondary topics:  Data Platforms, Retail and e-commerce, Transportation and Logistics
Tao Huang (JD.com), Mang Zhang (JD.com), Bing Bai (JD.com)
Average rating: ***..
(3.00, 1 rating)
Tao Huang, Mang Zhang, and Bing Bai explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average. Read more.
2:00pm–2:40pm Thursday, 09/13/2018
Location: Expo Hall Level: Intermediate
Secondary topics:  Model lifecycle management
Chris Fregly (PipelineAI)
Average rating: ***..
(3.50, 2 ratings)
Chris Fregly details a full-featured, open source end-to-end TensorFlow model training and deployment system, using the latest advancements with Kubernetes, TensorFlow, and GPUs. Read more.
3:30pm–4:10pm Thursday, 09/13/2018
Location: 1E 10/11 Level: Non-technical
Secondary topics:  Data Platforms, Media, Marketing, Advertising, Retail and e-commerce
Francesco Mucio (Zalando SE)
Average rating: ***..
(3.50, 2 ratings)
Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead. Read more.
3:30pm–4:10pm Thursday, 09/13/2018
Location: 1A 10 Level: Intermediate
Oleksii Kachaiev (Attendify)
Average rating: ***..
(3.50, 2 ratings)
When we talk about microservices, we usually focus on the communication layer. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity, such as structural and semantic changes, knowledge sharing, and data discovery. Join Oleksii Kachaiev to explore emerging technologies created to tackle these challenges. Read more.
3:30pm–4:10pm Thursday, 09/13/2018
Location: 1A 21/22 Level: Intermediate
Ramesh Krishnan (Lockheed Martin), Steve Morgan (Lockheed Martin)
Average rating: ****.
(4.00, 1 rating)
Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from its information assets, the company is constantly exploring ways to enable effective self-service scenarios. Ramesh Krishnan and Steve Morgan discuss Lockheed Martin's journey into modern analytics and explore its analytics platform focused on leveraging AWS GovCloud. Read more.
3:30pm–4:10pm Thursday, 09/13/2018
Location: 1E 07/08 Level: Intermediate
Secondary topics:  Temporal data and time-series analytics
Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)
The StreamDM library provides the largest collection of data stream mining algorithms for Spark. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drifts). Read more.
4:20pm–5:00pm Thursday, 09/13/2018
Location: 1A 08 Level: Intermediate
Secondary topics:  Financial Services, Model lifecycle management
Harish Doddi (Datatron Technologies), Jerry Xu (Datatron Technologies)
Large financial institutions have many data science teams (e.g., those for fraud, credit risk, and marketing), each often using a diverse set of tools to build predictive models. There are many challenges involved in productionizing these predictive AI models. Harish Doddi and Jerry Xu share challenges and lessons learned deploying AI models to production in large financial institutions. Read more.
4:20pm–5:00pm Thursday, 09/13/2018
Location: 1A 10 Level: Intermediate
Amandeep Khurana shares critical data management practices for easy and unified data access that meets security and regulatory compliance, helping you avoid the pitfalls that could lead to complex, expensive architectures. Read more.
4:20pm–5:00pm Thursday, 09/13/2018
Location: 1A 21/22 Level: Intermediate
Secondary topics:  Deep Learning, Media, Marketing, Advertising, Recommendation Systems
Nir Yungster (JW Player), Kamil Sindi (JW Player)
JW Player—the world’s largest network-independent video platform, representing 5% of global internet video—provides on-demand recommendations as a service to thousands of media publishers. Nir Yungster and Kamil Sindi explain how the company is systematically improving model performance while navigating the many engineering challenges and unique needs of the diverse publishers it serves. Read more.
4:20pm–5:00pm Thursday, 09/13/2018
Location: 1A 23/24 Level: Beginner
Secondary topics:  Data Integration and Data Pipelines
Kenji Hayashida (Recruit Lifestyle Co., Ltd.), Toru Sasaki (NTT DATA Corporation)
Average rating: ****.
(4.50, 2 ratings)
Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices and lessons learned on topics such as schema evolution and network architecture. Read more.
4:20pm–5:00pm Thursday, 09/13/2018
Location: 1E 07/08 Level: Beginner
Timothy Spann (DZone)
Average rating: ****.
(4.00, 2 ratings)
Timothy Spann leads a hands-on deep dive into using Apache MiNiFi with Apache MXNet and other deep learning libraries on edge devices. Read more.
4:20pm–5:00pm Thursday, 09/13/2018
Location: 1E 09 Level: Intermediate
Manoj Kumar (LinkedIn), Pralabh Kumar (LinkedIn), Arpan Agrawal (LinkedIn)
Average rating: *****
(5.00, 1 rating)
Have you ever tuned a Spark or MR job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage. Read more.