Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools and frameworks (open source or proprietary), platforms (on-premises, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

9:00–12:30 Tuesday, 22 May 2018
Location: Capital Suite 13 Level: Intermediate
Mala Ramakrishnan (Cloudera), Eugene Fratkin (Cloudera), Mark Samson (Cloudera)
The cloud enables solutions built on single multipurpose clusters that offer hyperscale storage decoupled from elastic, on-demand computing. Mala Ramakrishnan, Eugene Fratkin, and Mark Samson detail new paradigms to effectively run production-level pipelines with minimal operational overhead. Join in to learn how to remove barriers to data discovery, metadata sharing, and access control.
9:00–12:30 Tuesday, 22 May 2018
Location: Capital Suite 14 Level: Intermediate
Mark Madsen (Third Nature)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. Mark Madsen explores design assumptions and principles and walks you through a reference architecture to use as you work to unify your analytics infrastructure.
9:00–17:00 Tuesday, 22 May 2018
Location: Capital Suite 4
Paul Lashmet (Arcadia Data), Konrad Sippel (Deutsche Börse), Paul Damien Lynn (Nordea), Olaf Hein (ORDIX AG), Mikheil Nadareishvili (TBC Bank)
From analyzing risk and detecting fraud to predicting payments and improving customer experience, take a deep dive into the ways data technologies are transforming the financial industry.
13:30–17:00 Tuesday, 22 May 2018
Location: Capital Suite 13 Level: Advanced
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.
11:15–11:55 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Stuart Pook (Criteo)
Criteo has a production cluster of 2,000 nodes running over 300,000 jobs a day and a backup cluster of 1,200 nodes. These clusters live in Criteo's own data centres, as the cloud is more expensive, and were meant to provide a redundant solution to Criteo's storage and compute needs. Stuart Pook explains the project, what went wrong, and Criteo's progress in building another cluster to survive the loss of a full data centre.
11:15–11:55 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
Jason Heo (Navercorp), Dooyong Kim (Navercorp)
Naver.com, the largest search engine in Korea, holds 70% of the Korean search engine market and serves several billion page views per day. Jason Heo and Dooyong Kim explain how their team built a web analytics system with Druid at scale, covering the architecture, techniques for speeding it up, Spark on Druid, and how they extended Druid.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Security and Privacy
Charaka Goonatilake (Panaseer)
Data is becoming a crucial weapon to secure an organization against cyber threats. Charaka Goonatilake shares strategies for designing effective data platforms for cybersecurity using big data technologies, such as Spark and Hadoop, and explains how these platforms are being used in real-world examples of data-driven security.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Gerard Maas (Lightbend)
Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences in key aspects of a streaming application, from the API user experience to dealing with time and with state and machine learning capabilities, and shares practical guidance on picking one or combining both to implement resilient streaming pipelines.
11:15–11:55 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Ihab Ilyas (University of Waterloo | Tamr)
Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas provides insight into various techniques and discusses how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.
12:05–12:45 Wednesday, 23 May 2018
Location: S11A Level: Beginner
Jim Scott (MapR Technologies)
Creating a business solution is a lot of work. Instead of building to run on a single cloud provider, it is far more cost-effective to leverage the cloud as infrastructure as a service (IaaS). Jim Scott explains why a global data fabric is a requirement for running on all cloud providers simultaneously, including multi-master, active-active environments with full support for disaster management.
12:05–12:45 Wednesday, 23 May 2018
Location: S11B Level: Beginner
JD.com uses Alluxio to support ad hoc and real-time stream computing; JDPresto on Alluxio, for example, has delivered a 10x performance improvement on average. This session covers how JD.com uses Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component.
12:05–12:45 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Beginner
In the past 12 months, British Telecom has added a streaming network analytics use case to its multi-tenant data platform. This presentation shows how the solution works and is used to deliver better broadband and TV services, explaining how Kafka, Spark on YARN, and HDFS encryption have been used to transform a mature Hadoop platform.
14:05–14:45 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Greg Rahn (Cloudera)
For many organizations, the next big data warehouse will be in the cloud. Greg Rahn shares considerations for evaluating the cloud for analytics and big data warehousing, including different architectural approaches to optimize price and performance.
14:05–14:45 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
Carsten Herbe (Audi Business Innovation GmbH), Matthias Graunitz (Audi AG)
This talk charts Audi's journey from a first Hadoop PoC to a multi-tenant enterprise platform. Carsten Herbe and Matthias Graunitz share the experience gained along the way, explain the decisions they had to make, and show how some use cases are implemented on the platform.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Security and Privacy
Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA)
Joshua Patterson and Mike Wendt explain how NVIDIA used GPU-accelerated open source technologies to improve its cyberdefense platforms by leveraging software from the GPU Open Analytics Initiative (GOAI) and how the company accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Beginner
Michael Noll (Confluent)
Michael Noll introduces KSQL, the open source streaming SQL engine for Apache Kafka. KSQL makes it easy to get started with a wide range of real-time use cases, such as monitoring application behavior and infrastructure, detecting anomalies and fraudulent activities in data feeds, and real-time ETL. Michael covers how to get up and running with KSQL and explores the under-the-hood details of how it all works.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Manas Ranjan Kar (Episource)
Manas Ranjan Kar explains how Episource builds deep learning frameworks and architectures to summarize a medical chart and extract medical coding opportunities and their dependencies in order to recommend the best possible ICD-10 codes. This required not only a wide variety of deep learning algorithms to account for natural language variations but also fairly complex in-house training data creation exercises.
14:05–14:45 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Eran Avidan (Intel)
Deep learning is revolutionizing many domains within computer vision, but real-time analysis remains challenging. To address this, Eran Avidan and his team constructed a novel architecture that enables real-time analysis of high-resolution streaming video. The solution is a fully asynchronous system based on Redis, Docker, and TensorFlow that nonetheless gives the user the impression of real-time video feed analysis.
14:55–15:35 Wednesday, 23 May 2018
Location: S11A Level: Advanced
Jacques Nadeau (Dremio)
Jacques Nadeau offers an overview of a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture, learn how data science, analytical, and custom applications can all leverage the cache simultaneously, and see a live demo.
14:55–15:35 Wednesday, 23 May 2018
Location: S11B Level: Beginner
Dr.-Ing. Michael Nolting (Volkswagen Commercial Vehicles)
Map matching applications exist in almost every telematics use case and are therefore crucial to all car manufacturers. Michael Nolting details the architecture behind Volkswagen Commercial Vehicles' Altus-based map matching application and closes with a live demo featuring the map matching job in Altus.
14:55–15:35 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Lee Blum (Verint Systems)
Drawing on an actual complex case study, Lee Blum shares how Verint built its large-scale cyberdefense system to serve data scientists with versatile analytic operations on petabytes of data and trillions of records, discussing the extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system's overall results.
16:35–17:15 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Paul Curtis (MapR Technologies)
The flexibility advantage conferred by containers depends on their ephemeral nature, so it’s useful to keep containers stateless. However, many applications require state, and thus access to a scalable persistence layer that supports real mutable files, tables, and streams. Paul Curtis shows how to make containerized applications reliable, available, and performant, even when they are stateful.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Non-technical
Secondary topics:  Security and Privacy
Thomas Phelan (BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage; issues that are only compounded when running on Docker containers. This session will discuss these challenges and how to overcome them.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Sean Glover (Lightbend)
Kafka is best suited to run close to the metal on dedicated machines in statically defined clusters, but what are the pros and cons of running containerized Kafka in the age of mixed-use clusters? Sean Glover covers techniques for running Kafka while also supporting service migration in shared resource environments such as DC/OS (Mesos) and Kubernetes.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 13 Level: Intermediate
Olga Ermolin (MLS Listings)
Aggregation of geospecific real estate databases results in duplicate entries for properties located near geographical boundaries. Olga Ermolin details an approach for identifying duplicate entries via analysis of the images that accompany real estate listings, leveraging a transfer learning Siamese architecture based on the VGG-16 CNN topology.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 14 Level: Beginner
Bargava Subramanian (Independent), Amit Kapoor (narrativeVIZ Consulting)
Visualisation for data science requires an interactive visualisation setup that works at scale. Bargava Subramanian and Amit Kapoor explore the key architectural design considerations for such a system and illustrate, using real-life examples, the four key tradeoffs in this design space: rendering for data scale, computation for interaction speed, adaptability to data complexity, and responsiveness to data velocity.
16:35–17:15 Wednesday, 23 May 2018
Location: Capital Suite 17 Level: Beginner
Mark Madsen (Third Nature)
If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian discuss the trade-offs between a number of architectures that provide self-service access to data.
17:25–18:05 Wednesday, 23 May 2018
Location: S11A Level: Beginner
Christopher Royles (Cloudera)
Big data and cloud deployments return huge benefits in flexibility and economics, but they can also result in runaway costs and failed projects. Based on practical production experience, Christopher Royles covers everything from initial sizing and strategic planning through to longer-term operation, focusing on delivering an efficient platform, reducing costs, and ensuring a successful project.
17:25–18:05 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
This talk focuses on the compute infrastructure used by the large data science team at Stitch Fix, taking a look at the architecture and the interacting tools within the ecosystem and discussing the challenges overcome along the way.
11:15–11:55 Thursday, 24 May 2018
Location: S11A Level: Intermediate
Holden Karau (Google), Rachel Warren (Independent), Anya Bida (Alpine Data)
Apache Spark is an amazing distributed system, but part of the bargain we've all made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning), or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida look at auto-tuning jobs using both historical and live job information, using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
11:15–11:55 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Irene Gonzálvez (Spotify)
Irene Gonzálvez shares Spotify's process for ensuring data quality, covering why and how the company became aware of its importance, the products it has developed, and future strategy.
11:15–11:55 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Non-technical
Radim Řehůřek (RARE Technologies Ltd.)
Radim Řehůřek shares lessons learned and tips for successful R&D in applied data science. You'll learn the primary gaps between the academic and industry skill sets, what businesses should look out for when applying cutting-edge research in practice, what researchers can do to increase the impact of their research, and what companies can do to promote, reward, and nurture good quality ML research.
12:05–12:45 Thursday, 24 May 2018
Location: S11A Level: Intermediate
Mark Grover (Lyft), Ted Malaska (Blizzard Entertainment)
A lot of details go into building a big data system for speed: what a respectable latency to data access is, how to solve the multi-region problem, where to store the data, how to know what data you have, and where stream processing fits in. Mark Grover and Ted Malaska walk through their experiences and lessons learned from seeing implementations in the wild.
12:05–12:45 Thursday, 24 May 2018
Location: S11B Level: Beginner
Nanda Vijaydev (BlueData), Thomas Phelan (BlueData)
In the past, advanced machine learning techniques were only possible with a high-end proprietary stack. Today, you can use open source machine learning and deep learning algorithms with distributed computing technologies like Apache Spark and GPUs. Nanda Vijaydev and Thomas Phelan focus on how to deploy TensorFlow and Spark, with the NVIDIA CUDA stack, on Docker containers in a multi-tenant environment.
12:05–12:45 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Adesh Rao (Qubole), Abhishek Somani (Qubole)
Adesh Rao and Abhishek Somani present a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness.
14:05–14:45 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Beginner
This session presents how solving the problem of continuous deployment of machine learning models led to building a full stack of automated machine learning. Automated machine learning makes it possible to rebuild models efficiently and keep them up to date with fresh data brought in by a data convergence tool. It also offers model management, keeping the history of models and their performance.
14:05–14:45 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Intermediate
Moty Fania (Intel)
Moty Fania shares Intel IT's experience implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time streaming and online actuation. Moty highlights the key learnings from this work with a thorough review of the platform's architecture.
14:05–14:45 Thursday, 24 May 2018
Location: S11A Level: Intermediate
Jim Webber (Neo4j)
Jim Webber explains how Neo4j mixes the strongly consistent Raft protocol with asynchronous log shipping to provide a strong consistency guarantee, causal consistency, which means you can always at least read your own writes, even in very large multi-data-center clusters.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Hope Wang (Intuit)
There is increasing demand for developing and scaling machine learning capabilities. A machine learning platform comprises multiple phases that are iterative and overlap with one another. Hope Wang explains how to manage various artifacts and their associations and how to automate deployment in order to support the lifecycle of a model and build a cohesive machine learning platform.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Advanced
Eugene Kirpichov (Google)
Apache Beam offers users a novel programming model in which the classic batch/streaming dichotomy is erased, and it ships with a rich set of IO connectors to popular storage systems. Eugene Kirpichov describes Beam's philosophy for making these connectors flexible and modular, a key component of which is Splittable DoFn, a novel programming model primitive that unifies data ingestion between batch and streaming.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 10/11 Level: Intermediate
Olivia Klose (Microsoft), Elena Terenzi (Microsoft)
Olivia Klose and Elena Terenzi offer an overview of a collaboration between Microsoft and the Royal Holloway University that applied deep learning to locate illegal small-scale mines in Ghana using satellite imagery, scaled training using Kubernetes, and investigated the mines' impact on surrounding populations and environment.
14:55–15:35 Thursday, 24 May 2018
Location: Capital Suite 15/16 Level: Non-technical
Simon Chan (Salesforce)
The promises of AI are great, but taking the steps to implement AI within an enterprise is challenging. The secret behind enterprise AI success often traces back to the underlying platform that accelerates AI development at scale. Based on years of experience helping executives establish AI product strategies, Simon Chan walks through the AI platform journey that is right for your business.
16:35–17:15 Thursday, 24 May 2018
Location: S11A Level: Beginner
Jason Bell (MastodonC)
Using Apache Kafka and DeepLearning4J, Jason Bell presents the design and implementation of a self-learning knowledge system, the design rationale behind it, and the implications of using streaming data with deep learning and artificial intelligence.
16:35–17:15 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Naghman Waheed (Monsanto), Brian Arnold (Monsanto)
The last few years have seen a number of tools appear on the market that make it easy to implement a data lake. However, most lack the essential features needed to prevent the data lake from turning into a data swamp. Naghman Waheed and Brian Arnold explain how Monsanto's data platform engineering team built a platform that can ingest, store, and provide access to datasets without compromising ease of use, governance, or security.
16:35–17:15 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Flavio Junqueira (Dell EMC)
Stream processing is in the spotlight: enabling low-latency insights and actions from continuously generated data is compelling in a number of application domains. Critical to many such applications is the ability to adapt to workload variations, such as daily cycles. Flavio Junqueira presents Pravega, a stream store that scales streams automatically and enables applications to scale downstream by signaling changes.