Presented By O’Reilly and Cloudera

Make Data Work

21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools and frameworks (open source or proprietary), platforms (on-premises, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Monday-Tuesday 21-22 May: 2-Day Training (Platinum & Training passes)

Tuesday 22 May: Tutorials (Gold & Silver passes)

Wednesday 23 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 Strata Data Conference Keynotes (Location: Auditorium)
10:45 Morning break

Thursday 24 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 Strata Data Conference Keynotes (Location: Auditorium)
10:45 Morning break

9:00–17:00 Monday, 21 May & Tuesday, 22 May 2018

Real-time systems with Spark Streaming and Kafka

Location: Capital Suite 16

Jesse Anderson (Big Data Institute)

Average rating:

(5.00, 1 rating)

To handle real-time big data, you need to solve two difficult problems: How do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.

9:00–12:30 Tuesday, 22 May 2018

Modern real-time streaming architectures

Location: Capital Suite 8 Level: Intermediate

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)

Average rating:

(3.67, 3 ratings)

The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. Karthik Ramasamy, Arun Kejariwal, and Ivan Kelly walk you through state-of-the-art streaming frameworks, algorithms, and architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.

9:00–12:30 Tuesday, 22 May 2018

Running data analytic workloads in the cloud

Location: Capital Suite 13 Level: Intermediate

Eugene Fratkin (Cloudera), Vinithra Varadharajan (Cloudera), Mael Ropars (Cloudera), Jason Wang (Cloudera)

Average rating:

(5.00, 1 rating)

Vinithra Varadharajan, Jason Wang, Eugene Fratkin, and Mael Ropars detail new paradigms to effectively run production-level pipelines with minimal operational overhead. Join in to learn how to remove barriers to data discovery, metadata sharing, and access control. Read more.

9:00–12:30 Tuesday, 22 May 2018

Architecting a data platform for enterprise use

Location: Capital Suite 14 Level: Intermediate

Secondary topics: Data Platforms

Mark Madsen (Teradata), Todd Walter (Archimedata)

Average rating:

(4.29, 7 ratings)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.

13:30–17:00 Tuesday, 22 May 2018

Kafka streaming microservices with Akka Streams and Kafka Streams

Location: Capital Suite 8 Level: Intermediate

Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

Average rating:

(3.25, 4 ratings)

Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Along the way, Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.

13:30–17:00 Tuesday, 22 May 2018

Architecting a next-generation data platform

SOLD OUT

Location: Capital Suite 12 Level: Advanced

Secondary topics: Data Platforms

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Average rating:

(4.33, 3 ratings)

Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics. Read more.

11:15–11:55 Wednesday, 23 May 2018

The cloud is expensive, so build your own redundant Hadoop clusters.

Location: S11A Level: Intermediate

Stuart Pook (Criteo)

Average rating:

(4.40, 5 ratings)

Criteo has a production cluster of 2K nodes running over 300K jobs a day in the company's own data centers. These clusters were meant to provide a redundant solution to Criteo's storage and compute needs. Stuart Pook offers an overview of the project, shares challenges and lessons learned, and discusses Criteo's progress in building another cluster to survive the loss of a full DC. Read more.

11:15–11:55 Wednesday, 23 May 2018

Web analytics at scale with Druid at Naver

Location: S11B Level: Intermediate

Secondary topics: Data Platforms, Media, Advertising, Entertainment

Jason Heo (Naver), Dooyong Kim (Navercorp)

Average rating:

(3.00, 1 rating)

Naver.com is the largest search engine in Korea, with a 70% share of the Korean search market, and it handles billions of pages and events every day. Jason Heo and Dooyong Kim offer an overview of Naver's web analytics system, built with Druid. Read more.

11:15–11:55 Wednesday, 23 May 2018

Architecting data platforms for cybersecurity

Location: Capital Suite 7 Level: Intermediate

Secondary topics: Security and Privacy

Charaka Goonatilake (Panaseer)

Average rating:

(4.50, 2 ratings)

Data is becoming a crucial weapon to secure an organization against cyber threats. Charaka Goonatilake shares strategies for designing effective data platforms for cybersecurity using big data technologies, such as Spark and Hadoop, and explains how these platforms are being used in real-world examples of data-driven security. Read more.

11:15–11:55 Wednesday, 23 May 2018

Processing fast data with Apache Spark: A tale of two APIs

Location: Capital Suite 8/9 Level: Intermediate

Gerard Maas (Lightbend)

Average rating:

(4.00, 13 ratings)

Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences in key aspects of a streaming application, from the API user experience to dealing with time and with state and machine learning capabilities, and shares practical guidance on picking one or combining both to implement resilient streaming pipelines. Read more.

12:05–12:45 Wednesday, 23 May 2018

Using a global data fabric to run a mixed cloud deployment

Location: S11A Level: Beginner

Jim Scott (NVIDIA)

Average rating:

(4.00, 2 ratings)

Creating a business solution is a lot of work. Instead of building to run on a single cloud provider, it is far more cost effective to leverage the cloud as infrastructure as a service (IaaS). Jim Scott explains why a global data fabric is a requirement for running on all cloud providers simultaneously. Read more.

12:05–12:45 Wednesday, 23 May 2018

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks

Location: S11B Level: Beginner

Secondary topics: Data Platforms, E-commerce and Retail, Transportation and Logistics

Baolong Mao (JD.com), Yiran Wu (JD.com), Yupeng Fu (Alluxio)

Baolong Mao, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average. Read more.

12:05–12:45 Wednesday, 23 May 2018

Hadoop under attack: Securing data in a banking domain

Location: Capital Suite 7 Level: Intermediate

Secondary topics: Security and Privacy

Federico Leven (ReactoData)

Average rating:

(2.67, 3 ratings)

The apparent difficulty of managing Hadoop compared to more traditional and proprietary data products makes some companies wary of the Hadoop ecosystem, but managing security is becoming more accessible in the Hadoop space, particularly in the Cloudera stack. Federico Leven offers an overview of an end-to-end security deployment on Hadoop and the data and security governance policies implemented. Read more.

12:05–12:45 Wednesday, 23 May 2018

How BT delivers better broadband and TV using Spark and Kafka

Location: Capital Suite 8/9 Level: Beginner

Secondary topics: Telecom

Phillip Radley (BT)

Average rating:

(3.67, 3 ratings)

In the past year, British Telecom has added a streaming network analytics use case to its multitenant data platform. Phillip Radley demonstrates how the solution works and explains how it delivers better broadband and TV services, using Kafka and Spark on YARN and HDFS encryption. Read more.

14:05–14:45 Wednesday, 23 May 2018

Analytics in the cloud: Building a modern cloud-based big data warehouse

Location: S11A Level: Intermediate

Greg Rahn (Cloudera)

Average rating:

(3.29, 7 ratings)

For many organizations, the next big data warehouse will be in the cloud. Greg Rahn shares considerations for evaluating the cloud for analytics and big data warehousing, including different architectural approaches to optimize price and performance. Read more.

14:05–14:45 Wednesday, 23 May 2018

Audi's journey to an enterprise big data platform

Location: S11B Level: Intermediate

Secondary topics: Data Platforms, Transportation and Logistics

Carsten Herbe (Audi Business Innovation GmbH), Matthias Graunitz (Audi AG)

Average rating:

(4.33, 3 ratings)

Carsten Herbe and Matthias Graunitz detail Audi's journey from a Hadoop proof of concept to a multitenant enterprise platform, sharing lessons learned, the decisions Audi made, and how a number of use cases are implemented using the platform. Read more.

14:05–14:45 Wednesday, 23 May 2018

GPU-accelerated threat detection with GOAI

Location: Capital Suite 7 Level: Intermediate

Secondary topics: Security and Privacy

Joshua Patterson (NVIDIA), Chau Dang (NVIDIA)

Joshua Patterson and Mike Wendt explain how NVIDIA used GPU-accelerated open source technologies to improve its cyberdefense platforms by leveraging software from the GPU Open Analytics Initiative (GOAI) and how the company accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration. Read more.

14:05–14:45 Wednesday, 23 May 2018

Unlocking the world of stream processing with KSQL, the streaming SQL engine for Apache Kafka

Location: Capital Suite 8/9 Level: Beginner

Michael Noll (Confluent)

Average rating:

(4.67, 6 ratings)

Michael Noll offers an overview of KSQL, the open source streaming SQL engine for Apache Kafka, which makes it easy to get started with a wide range of real-time use cases, such as monitoring application behavior and infrastructure, detecting anomalies and fraudulent activities in data feeds, and real-time ETL. Read more.

14:05–14:45 Wednesday, 23 May 2018

Time for a new relation: Going from RDBMS to a graph database

Location: Expo Hall Level: Intermediate

Secondary topics: Time Series and Graphs

Patrick McFadin (DataStax)

Average rating:

(5.00, 2 ratings)

Graph databases are becoming mainstream. Patrick McFadin explains how to use the knowledge you have gained from your years of working with relational databases in this brave new world. There are many similarities but also some significant differences that can open up completely new use cases. If you're deciding whether to take the plunge into graph databases, this is the talk for you. Read more.

14:55–15:35 Wednesday, 23 May 2018

Data science across data sources with Apache Arrow

Location: S11A Level: Intermediate

Tomer Shiran (Dremio)

Average rating:

(3.50, 2 ratings)

It's often impractical for organizations to physically consolidate all data into one system. Tomer Shiran offers an overview of Apache Arrow, an open source columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real time, simplifying and accelerating data access without having to copy all data into one location. Read more.

14:55–15:35 Wednesday, 23 May 2018

Elastic map matching using Cloudera Altus and Apache Spark

Location: S11B Level: Beginner

Secondary topics: Transportation and Logistics

Timo Graen (Volkswagen AG), Robert Neumann (Ultra Tendency)

Average rating:

(3.50, 2 ratings)

Map-matching applications exist in almost every telematics use case and are therefore crucial to all car manufacturers. Timo Graen and Robert Neumann detail the architecture behind Volkswagen Commercial Vehicle’s Altus-based map-matching application and lead a live demo featuring a map-matching job in Altus. Read more.

14:55–15:35 Wednesday, 23 May 2018

The ultimate data scientist's playground: Building a multipetabyte analytic infrastructure for cyber defense

Location: Capital Suite 7 Level: Intermediate

Lee Blum (Verint Systems)

Lee Blum offers an overview of Verint's large-scale cyber-defense system built to serve its data scientists with versatile analytic operations on petabytes of data and trillions of records, covering the company's extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system’s overall results. Read more.

14:55–15:35 Wednesday, 23 May 2018

Multi-data center and multitenant durable messaging with Apache Pulsar

Location: Capital Suite 8/9 Level: Intermediate

Ivan Kelly (Streamlio)

Average rating:

(3.00, 2 ratings)

Ivan Kelly offers an overview of Apache Pulsar, a durable, distributed messaging system, underpinned by Apache BookKeeper, that provides the enterprise features necessary to guarantee that your data is where it should be and only accessible by those who should have access. Read more.

16:35–17:15 Wednesday, 23 May 2018

Making stateless containers reliable and available even with stateful applications

Location: S11A Level: Intermediate

Paul Curtis (Weaveworks)

Average rating:

(4.00, 2 ratings)

The flexibility advantage conferred by containers depends on their ephemeral nature, so it’s useful to keep containers stateless. However, many applications require state—access to a scalable persistence layer that supports real mutable files, tables, and streams. Paul Curtis demonstrates how to make containerized applications reliable, available, and performant, even with stateful applications. Read more.

16:35–17:15 Wednesday, 23 May 2018

Improving DevOps and QA efficiency using machine learning and NLP methods

Location: S11B Level: Intermediate

Secondary topics: Text and Language processing and analysis

Ran Taig (Dell), Omer Sagi (Dell)

Average rating:

(2.00, 1 rating)

DevOps and QA engineers spend a significant amount of time investigating reoccurring issues. These issues are often represented by large configuration and log files, so the process of investigating whether two issues are duplicates can be a very tedious task. Ran Taig and Omer Sagi outline a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues. Read more.

16:35–17:15 Wednesday, 23 May 2018

How to protect big data in a containerized environment

Location: Capital Suite 7 Level: Non-technical

Secondary topics: Security and Privacy

Thomas Phelan (HPE BlueData)

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE), but TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them. Read more.

16:35–17:15 Wednesday, 23 May 2018

Kafka in jail: Running Kafka in container-orchestrated clusters

Location: Capital Suite 8/9 Level: Intermediate

Sean Glover (Lightbend)

Average rating:

(2.50, 2 ratings)

Kafka is best suited to run close to the metal on dedicated machines in static clusters, but these clusters are quickly becoming extinct. Companies want mixed-use clusters that take advantage of every resource available. Sean Glover offers an overview of leading Kafka implementations on DC/OS and Kubernetes to explore how reliably they run Kafka in container-orchestrated clusters. Read more.

16:35–17:15 Wednesday, 23 May 2018

Data-driven ecosystems in the automotive industry

Location: Expo Hall Level: Intermediate

Tobias Bürger (BMW Group), Joshua Görner (BMW AG)

Average rating:

(5.00, 1 rating)

The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside of the organization. Tobias Bürger and Joshua Görner discuss the E-to-E relationship of data and models and share best practices for scaling applications in real-world environments. Read more.

16:35–17:15 Wednesday, 23 May 2018

The eAGLE accelerator: How to speed up migrations from legacy ETL to big data implementations

Location: Capital Suite 2/3 Level: Intermediate

Enric Biosca Trias (everis), Angel Valencia (everis)

Average rating:

(2.00, 2 ratings)

Enric Biosca offers an overview of the eAGLE accelerator, which speeds up migration processes from legacy ETL to big data implementations by enabling auditing, lineage, and translation of legacy code for big data. Along the way, Enric demonstrates how graph and automatic translation technologies help companies reduce their migration times. Read more.

17:25–18:05 Wednesday, 23 May 2018

Practical advice for driving down the cost of cloud big data platforms

Location: S11A Level: Beginner

Christopher Royles (Cloudera)

Average rating:

(4.00, 1 rating)

Big data and cloud deployments return huge benefits in flexibility and economics but can also result in runaway costs and failed projects. Drawing on his production experience, Christopher Royles shares tips and best practices for determining initial sizing, strategic planning, and longer-term operation, helping you deliver an efficient platform, reduce costs, and implement a successful project. Read more.

17:25–18:05 Wednesday, 23 May 2018

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am

Location: S11B Level: Intermediate

Holden Karau (Independent), Rachel Warren (Salesforce Einstein)

Average rating:

(4.00, 2 ratings)

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads. Read more.

17:25–18:05 Wednesday, 23 May 2018

Security, governance, and cloud analytics, oh my!

Location: Capital Suite 7 Level: Beginner

Secondary topics: Security and Privacy

Nikki Rouda (Cloudera), Nick Curcuru (Mastercard)

Average rating:

(4.00, 2 ratings)

Having so many cloud-based analytics services available is a dream come true. However, it's a nightmare to manage proper security and governance across all those different services. Nikki Rouda and Nick Curcuru share advice on how to minimize the risk and effort in protecting and managing data for multidisciplinary analytics and explain how to avoid the hassle and extra cost of siloed approaches. Read more.

17:25–18:05 Wednesday, 23 May 2018

Stream processing for the practitioner: Blueprints for common stream processing use cases with Apache Flink

Location: Capital Suite 8/9 Level: Intermediate

Aljoscha Krettek (Ververica)

Average rating:

(4.67, 3 ratings)

Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes ("application blueprints") for stream processing drawn from real-world use cases with Apache Flink. Read more.

17:25–18:05 Wednesday, 23 May 2018

Batch and real-time processing in LINE's log analysis platform

Location: Capital Suite 2/3 Level: Beginner

Wataru Yukawa (LINE)

LINE—one of the most popular messaging applications in Asia—offers many services, such as its news application. These services sometimes depend on real-time processing. Wataru Yukawa offers an overview of LINE's web tracking system, which consists of the JavaScript SDK, NGINX, Fluentd, Kafka, Elasticsearch, and Hadoop, and explains how it helps with batch and real-time processing. Read more.

11:15–11:55 Thursday, 24 May 2018

Improving ad hoc and production workflows at Stitch Fix

Location: S11A Level: Intermediate

Secondary topics: Data Platforms, E-commerce and Retail

Neelesh Salian (Stitch Fix)

Average rating:

(1.00, 1 rating)

Neelesh Srinivas Salian offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way. Read more.

11:15–11:55 Thursday, 24 May 2018

Big data, big quality: Data quality at Spotify

Location: S11B Level: Intermediate

Secondary topics: Data Integration and Data Pipelines sessions, Data Platforms, Media, Advertising, Entertainment

Irene Gonzálvez (Spotify)

Average rating:

(3.88, 8 ratings)

Irene Gonzálvez shares Spotify's process for ensuring data quality, covering why and how the company became aware of its importance, the products it has developed, and future strategy. Read more.

11:15–11:55 Thursday, 24 May 2018

Accelerating development velocity of production ML systems with Docker

Location: Capital Suite 7 Level: Intermediate

Secondary topics: Data Platforms, Managing and Deploying Machine Learning, Media, Advertising, Entertainment

Kinnary Jangla (Pinterest)

Average rating:

(3.00, 5 ratings)

Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of its ML teams while increasing uptime and ease of deployment. Read more.

11:15–11:55 Thursday, 24 May 2018

You’re doing it wrong: How Zoomdata rearchitected streaming

Location: Capital Suite 8/9 Level: Beginner

Secondary topics: Visualization, Design, and UX

Erin Recachinas (Zoomdata)

Average rating:

(4.00, 2 ratings)

The value of real-time streaming analytics with historical data is immense. Big data application Zoomdata updates historical dashboards in real time without complex reaggregations, but streaming in the age of the IoT requires handling of data in volumes not seen in traditional feeds. Erin Recachinas explains how Zoomdata moved to a scalable microservice architecture for streaming sources. Read more.

12:05–12:45 Thursday, 24 May 2018

Setting up a lightweight distributed caching layer using Apache Arrow

Location: S11A Level: Advanced

Jacques Nadeau (Dremio)

Average rating:

(4.00, 3 ratings)

Jacques Nadeau offers an overview of a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture, learn how data science, analytical, and custom applications can all leverage the cache simultaneously, and see a live demo. Read more.

12:05–12:45 Thursday, 24 May 2018

Big data at speed

Location: S11B Level: Intermediate

Secondary topics: Transportation and Logistics

Mark Grover (Lyft), Ted Malaska (Capital One)

Average rating:

(5.00, 6 ratings)

Many details go into building a big data system for speed, from determining an acceptable latency for data access and deciding where to store the data to solving multiregion problems—or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed. Read more.

12:05–12:45 Thursday, 24 May 2018

Deep learning with TensorFlow and Spark using GPUs and Docker containers

Location: Capital Suite 7 Level: Beginner

Secondary topics: Managing and Deploying Machine Learning

Nanda Vijaydev (BlueData), Thomas Phelan (HPE BlueData)

Average rating:

(4.17, 6 ratings)

In the past, you needed a high-end proprietary stack for advanced machine learning, but today, you can use open source machine learning and deep learning algorithms available with distributed computing technologies like Apache Spark and GPUs. Nanda Vijaydev and Thomas Phelan demonstrate how to deploy a TensorFlow and Spark with NVIDIA CUDA stack on Docker containers in a multitenant environment. Read more.

12:05–12:45 Thursday, 24 May 2018

Autonomous ETL with materialized views

Location: Capital Suite 8/9 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines sessions

Adesh Rao (Qubole), Abhishek Somani (Qubole)

Average rating:

(3.00, 2 ratings)

Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness. Read more.

14:05–14:45 Thursday, 24 May 2018

Why knowledge graphs are important to finance

Location: S11A Level: Intermediate

Haikal Pribadi (GRAKN.AI)

Average rating:

(3.50, 2 ratings)

Haikal Pribadi explains why knowledge graphs (KGs) are important for AI systems in the finance sector and details how they are being used to detect and uncover new knowledge, specifically for risk analysis, fraud detection, and GDPR use cases. Read more.

14:05–14:45 Thursday, 24 May 2018

Bringing AI to BI: Microsoft's road to automated business incident monitoring and diagnostics with Project Kensho

Location: S11B Level: Intermediate

Secondary topics: Data Platforms, Time Series and Graphs

Tony Xing (Microsoft), Bixiong Xu (Microsoft)

Average rating:

(2.00, 1 rating)

Tony Xing and Bixiong Xu offer an overview of Project Kensho, Microsoft's one-stop shop for business incident monitoring and automated insights. Tony and Bixiong cover the technology's evolution, the architecture, the algorithms, and the benefits and the trade-offs. Along the way, they share a case study on Bing ads key metrics monitoring and automated diagnostic insights. Read more.

14:05–14:45 Thursday, 24 May 2018

Continuous delivery and machine learning

Location: Capital Suite 7 Level: Beginner

Secondary topics: Managing and Deploying Machine Learning

Guillaume Salou (OVH)

Average rating:

(3.00, 5 ratings)

Guillaume Salou shares OVH's approach to continuous deployment of machine learning models, which involved building a full stack of automated machine learning. Automated machine learning allows the company to rebuild models efficiently and keep models up to date with fresh data brought by its data convergence tool. Read more.

14:05–14:45 Thursday, 24 May 2018

Complex event processing with Apache Flink

Location: Capital Suite 8/9 Level: Intermediate

Kostas Kloudas (data Artisans)

Average rating:

(2.25, 4 ratings)

Complex event processing (CEP) helps detect patterns over continuous streams of data. DNA sequencing, fraud detection, shipment tracking with specific characteristics (e.g., contaminated goods), and user activity analysis fall into this category. Kostas Kloudas offers an overview of Flink's CEP library and explains the benefits of its integration with Flink. Read more.

14:05–14:45 Thursday, 24 May 2018

Scaling the AI hierarchy of needs with TensorFlow, Spark, and Hops

Location: Capital Suite 13 Level: Beginner

Jim Dowling (Logical Clocks)

Average rating:

(5.00, 2 ratings)

Distributed deep learning can increase the productivity of AI practitioners and reduce time to market for training models. Hadoop can fulfill a crucial role as a unified feature store and resource management platform for distributed deep learning. Jim Dowling offers an introduction to writing distributed DL applications, covering TensorFlow and Apache Spark frameworks that make distribution easy. Read more.

14:55–15:35 Thursday, 24 May 2018

Mixing causal consistency and asynchronous replication for large Neo4j clusters

Location: S11A Level: Intermediate

Secondary topics: Time Series and Graphs

Jim Webber (Neo4j)

Average rating:

(5.00, 3 ratings)

Jim Webber details how Neo4j mixes the strongly consistent Raft protocol with asynchronous log shipping to provide a strong consistency guarantee, causal consistency, which means you can always at least read your own writes, even in very large multi-data center clusters. Read more.

14:55–15:35 Thursday, 24 May 2018

ClickFox: Customer journey analytics powered by OpenStack and Cloudera

Location: S11B Level: Intermediate

Secondary topics: Data Platforms

Alvin Heib (Cloudera), Guy Leroux (Atos)

Alvin Heib and Guy Leroux offer an overview of ClickFox, a platform able to cope with high-performance analytical needs, from bits and bytes to solving a customer's needs, covering the platform's virtualization, big data, and analytical layers. Read more.

14:55–15:35 Thursday, 24 May 2018

Machine learning platform lifecycle management

Location: Capital Suite 7 Level: Intermediate

Secondary topics: Financial Services, Managing and Deploying Machine Learning

Hope Wang (Intuit)

Average rating:

(4.00, 3 ratings)

A machine learning platform is not just the sum of its parts; the key is how it supports the model lifecycle end to end. Hope Wang explains how to manage various artifacts and their associations, automate deployment to support the lifecycle of a model, and build a cohesive machine learning platform. Read more.

14:55–15:35 Thursday, 24 May 2018

Radically modular data ingestion APIs in Apache Beam

Location: Capital Suite 8/9 Level: Advanced

Secondary topics: Data Integration and Data Pipelines sessions

Eugene Kirpichov (Google)

Average rating:

(4.50, 2 ratings)

Apache Beam offers users a novel programming model in which the classic batch-streaming dichotomy is erased and ships with a rich set of I/O connectors to popular storage systems. Eugene Kirpichov explains why Beam has made these connectors flexible and modular—a key component of which is Splittable DoFn, a novel programming model primitive that unifies data ingestion between batch and streaming. Read more.

14:55–15:35 Thursday, 24 May 2018

Improving computer vision models at scale

Location: Capital Suite 2/3 Level: Intermediate

Marton Balassi (Cloudera), Mirko Kämpf (Cloudera), Jan Kunigk (Cloudera)

Average rating:

(5.00, 2 ratings)

Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. Marton Balassi, Mirko Kämpf, and Jan Kunigk share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable. Read more.

16:35–17:15 Thursday, 24 May 2018

Learning how to design automatically updating AI with Apache Kafka and Deeplearning4j

Location: S11A Level: Beginner

Jason Bell (Independent Speaker)

Jason Bell offers an overview of a self-learning knowledge system that uses Apache Kafka and Deeplearning4j to accept data, apply training to a neural network, and output predictions. Jason covers the system design, the rationale behind it, and the implications of using streaming data with deep learning and artificial intelligence. Read more.

16:35–17:15 Thursday, 24 May 2018

You call it data lake; we call it Data Historian.

Location: S11B Level: Intermediate

Secondary topics: Data Platforms

Naghman Waheed (Bayer Crop Science), Brian Arnold (Bayer)

Average rating:

(4.50, 2 ratings)

There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security. Read more.

16:35–17:15 Thursday, 24 May 2018

DevOps at ING Analytics: Combining data engineering with data operations

Location: Capital Suite 7 Level: Intermediate

Giuseppe D'alessio (ING Group)

Average rating:

(3.25, 4 ratings)

Giuseppe D'alessio details ING's DevOps journey, covering its impact on people, processes and tools, best practices, and pitfalls. Giuseppe concludes with a concrete example of using analytics and streaming technology on real-time applications. Read more.

16:35–17:15 Thursday, 24 May 2018

Stream scaling in Pravega

Location: Capital Suite 8/9 Level: Intermediate

Flavio Junqueira (Dell EMC)

Stream processing is in the spotlight. Enabling low-latency insights and actions out of continuously generated data is compelling to a number of application domains, and the ability to adapt to workload variations is critical to many applications. Flavio Junqueira explores Pravega, a stream store that scales streams automatically and enables applications to scale downstream by signaling changes. Read more.

16:35–17:15 Thursday, 24 May 2018

Human-in-the-loop data science with Jupyter widgets

Location: Capital Suite 14 Level: Intermediate

Pascal Bugnion (ASI Data Science)

Jupyter widgets let you create lightweight, interactive graphical interfaces directly in Jupyter notebooks. Pascal Bugnion demonstrates how to use Jupyter widgets to implement human-in-the-loop machine learning with highly interactive user interfaces. Read more.

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

©2018, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com