Presented By O’Reilly and Cloudera
Make Data Work
21–22 May 2018: Training
22–24 May 2018: Tutorials & Conference
London, UK

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools, and frameworks (open source or proprietary), platforms (on-premise, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Monday-Tuesday 21-22 May: 2-Day Training (Platinum & Training passes)
Tuesday 22 May: Tutorials (Gold & Silver passes)
Wednesday 23 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:45
Morning break
Thursday 24 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:45
Morning break
9:00 - 17:00 Monday, 21 May & Tuesday, 22 May
Location: Capital Suite 16
Jesse Anderson (Big Data Institute)
Average rating: *****
(5.00, 1 rating)
To handle real-time big data, you need to solve two difficult problems: How do you ingest that much data, and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.
9:00 - 12:30 Tuesday, 22 May 2018
Location: Capital Suite 8 Level: Intermediate
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)
Average rating: ***..
(3.67, 3 ratings)
The need for instant data-driven insights has led to the proliferation of messaging and streaming frameworks. Karthik Ramasamy, Arun Kejariwal, and Ivan Kelly walk you through state-of-the-art streaming frameworks, algorithms, and architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.
9:00 - 12:30 Tuesday, 22 May 2018
Location: Capital Suite 13 Level: Intermediate
Eugene Fratkin (Cloudera), Vinithra Varadharajan (Cloudera), Mael Ropars (Cloudera), Jason Wang (Cloudera)
Average rating: *****
(5.00, 1 rating)
Vinithra Varadharajan, Jason Wang, Eugene Fratkin, and Mael Ropars detail new paradigms to effectively run production-level pipelines with minimal operational overhead. Join in to learn how to remove barriers to data discovery, metadata sharing, and access control. Read more.
9:00 - 12:30 Tuesday, 22 May 2018
Location: Capital Suite 14 Level: Intermediate
Secondary topics:  Data Platforms
Mark Madsen (Teradata), Todd Walter (Archimedata)
Average rating: ****.
(4.29, 7 ratings)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.
13:30 - 17:00 Tuesday, 22 May 2018
Location: Capital Suite 8 Level: Intermediate
Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)
Average rating: ***..
(3.25, 4 ratings)
Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Along the way, Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.
13:30 - 17:00 Tuesday, 22 May 2018
SOLD OUT
Location: Capital Suite 12 Level: Advanced
Secondary topics:  Data Platforms
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
Average rating: ****.
(4.33, 3 ratings)
Using Customer 360 and the IoT as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics. Read more.
11:15 - 11:55 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Stuart Pook (Criteo)
Average rating: ****.
(4.40, 5 ratings)
Criteo has a production cluster of 2K nodes running over 300K jobs a day in the company's own data centers. These clusters were meant to provide a redundant solution to Criteo's storage and compute needs. Stuart Pook offers an overview of the project, shares challenges and lessons learned, and discusses Criteo's progress in building another cluster to survive the loss of a full DC. Read more.
11:15 - 11:55 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Platforms, Media, Advertising, Entertainment
Jason Heo (Naver), Dooyong Kim (Navercorp)
Average rating: ***..
(3.00, 1 rating)
Naver.com is the largest search engine in Korea, with a 70% share of the Korean search market, and it handles billions of pages and events every day. Jason Heo and Dooyong Kim offer an overview of Naver's web analytics system, built with Druid. Read more.
11:15 - 11:55 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Security and Privacy
Charaka Goonatilake (Panaseer)
Average rating: ****.
(4.50, 2 ratings)
Data is becoming a crucial weapon to secure an organization against cyber threats. Charaka Goonatilake shares strategies for designing effective data platforms for cybersecurity using big data technologies, such as Spark and Hadoop, and explains how these platforms are being used in real-world examples of data-driven security. Read more.
11:15 - 11:55 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Gerard Maas (Lightbend)
Average rating: ****.
(4.00, 13 ratings)
Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences in key aspects of a streaming application, from the API user experience to dealing with time and with state and machine learning capabilities, and shares practical guidance on picking one or combining both to implement resilient streaming pipelines. Read more.
12:05 - 12:45 Wednesday, 23 May 2018
Location: S11A Level: Beginner
Jim Scott (NVIDIA)
Average rating: ****.
(4.00, 2 ratings)
Creating a business solution is a lot of work. Instead of building to run on a single cloud provider, it is far more cost effective to leverage the cloud as infrastructure as a service (IaaS). Jim Scott explains why a global data fabric is a requirement for running on all cloud providers simultaneously. Read more.
12:05 - 12:45 Wednesday, 23 May 2018
Location: S11B Level: Beginner
Secondary topics:  Data Platforms, E-commerce and Retail, Transportation and Logistics
Baolong Mao (JD.com), Yiran Wu (JD.com), Yupeng Fu (Alluxio)
Mao Baolong, Yiran Wu, and Yupeng Fu explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average. Read more.
12:05 - 12:45 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Security and Privacy
Federico Leven (ReactoData)
Average rating: **...
(2.67, 3 ratings)
The apparent difficulty of managing Hadoop compared to more traditional and proprietary data products makes some companies wary of the Hadoop ecosystem, but managing security is becoming more accessible in the Hadoop space, particularly in the Cloudera stack. Federico Leven offers an overview of an end-to-end security deployment on Hadoop and the data and security governance policies implemented. Read more.
12:05 - 12:45 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Beginner
Secondary topics:  Telecom
Average rating: ***..
(3.67, 3 ratings)
In the past year, British Telecom has added a streaming network analytics use case to its multitenant data platform. Phillip Radley demonstrates how the solution works and explains how it delivers better broadband and TV services, using Kafka and Spark on YARN and HDFS encryption. Read more.
14:05 - 14:45 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Greg Rahn (Cloudera)
Average rating: ***..
(3.29, 7 ratings)
For many organizations, the next big data warehouse will be in the cloud. Greg Rahn shares considerations for evaluating the cloud for analytics and big data warehousing, including different architectural approaches to optimize price and performance. Read more.
14:05 - 14:45 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Platforms, Transportation and Logistics
Carsten Herbe (Audi Business Innovation GmbH), Matthias Graunitz (Audi AG)
Average rating: ****.
(4.33, 3 ratings)
Carsten Herbe and Matthias Graunitz detail Audi's journey from a Hadoop proof of concept to a multitenant enterprise platform, sharing lessons learned, the decisions Audi made, and how a number of use cases are implemented using the platform. Read more.
14:05 - 14:45 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Security and Privacy
Joshua Patterson (NVIDIA), Chau Dang (NVIDIA)
Joshua Patterson and Mike Wendt explain how NVIDIA used GPU-accelerated open source technologies to improve its cyberdefense platforms by leveraging software from the GPU Open Analytics Initiative (GOAI) and how the company accelerated anomaly detection with more efficient machine learning models, faster deployment, and more granular data exploration. Read more.
14:05 - 14:45 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Beginner
Michael Noll (Confluent)
Average rating: ****.
(4.67, 6 ratings)
Michael Noll offers an overview of KSQL, the open source streaming SQL engine for Apache Kafka, which makes it easy to get started with a wide range of real-time use cases, such as monitoring application behavior and infrastructure, detecting anomalies and fraudulent activities in data feeds, and real-time ETL. Read more.
14:05 - 14:45 Wednesday, 23 May 2018
Location: Expo Hall Level: Intermediate
Secondary topics:  Time Series and Graphs
Tags: us
Patrick McFadin (DataStax)
Average rating: *****
(5.00, 2 ratings)
Graph databases are becoming mainstream. Patrick McFadin explains how to use the knowledge you have gained from your years of working with relational databases in this brave new world. There are many similarities but also some significant differences that can open up completely new use cases. If you're deciding whether to take the plunge into graph databases, this is the talk for you. Read more.
14:55 - 15:35 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Tomer Shiran (Dremio)
Average rating: ***..
(3.50, 2 ratings)
It's often impractical for organizations to physically consolidate all data into one system. Tomer Shiran offers an overview of Apache Arrow, an open source columnar, in-memory data representation that enables analytical systems and data sources to exchange and process data in real time, simplifying and accelerating data access without having to copy all data into one location. Read more.
14:55 - 15:35 Wednesday, 23 May 2018
Location: S11B Level: Beginner
Secondary topics:  Transportation and Logistics
Timo Graen (Volkswagen AG), Robert Neumann (Ultra Tendency)
Average rating: ***..
(3.50, 2 ratings)
Map-matching applications exist in almost every telematics use case and are therefore crucial to all car manufacturers. Timo Graen and Robert Neumann detail the architecture behind Volkswagen Commercial Vehicle’s Altus-based map-matching application and lead a live demo featuring a map matching job in Altus. Read more.
14:55 - 15:35 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Intermediate
Lee Blum (Verint Systems)
Lee Blum offers an overview of Verint's large-scale cyber-defense system built to serve its data scientists with versatile analytic operations on petabytes of data and trillions of records, covering the company's extremely challenging use case, decision considerations, major design challenges, tips and tricks, and the system’s overall results. Read more.
14:55 - 15:35 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Ivan Kelly (Streamlio)
Average rating: ***..
(3.00, 2 ratings)
Ivan Kelly offers an overview of Apache Pulsar, a durable, distributed messaging system, underpinned by Apache BookKeeper, that provides the enterprise features necessary to guarantee that your data is where it should be and only accessible by those who should have access. Read more.
16:35 - 17:15 Wednesday, 23 May 2018
Location: S11A Level: Intermediate
Paul Curtis (Weaveworks)
Average rating: ****.
(4.00, 2 ratings)
The flexibility advantage conferred by containers depends on their ephemeral nature, so it’s useful to keep containers stateless. However, many applications require state—access to a scalable persistence layer that supports real mutable files, tables, and streams. Paul Curtis demonstrates how to make containerized applications reliable, available, and performant, even with stateful applications. Read more.
16:35 - 17:15 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Text and Language processing and analysis
Ran Taig (Dell), Omer Sagi (Dell)
Average rating: **...
(2.00, 1 rating)
DevOps and QA engineers spend a significant amount of time investigating reoccurring issues. These issues are often represented by large configuration and log files, so the process of investigating whether two issues are duplicates can be a very tedious task. Ran Taig and Omer Sagi outline a solution that leverages NLP and machine learning algorithms to automatically identify duplicate issues. Read more.
16:35 - 17:15 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Non-technical
Secondary topics:  Security and Privacy
Thomas Phelan (HPE BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE), but TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them. Read more.
16:35 - 17:15 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Sean Glover (Lightbend)
Average rating: **...
(2.50, 2 ratings)
Kafka is best suited to run close to the metal on dedicated machines in static clusters, but these clusters are quickly becoming extinct. Companies want mixed-use clusters that take advantage of every resource available. Sean Glover offers an overview of leading Kafka implementations on DC/OS and Kubernetes to explore how reliably they run Kafka in container-orchestrated clusters. Read more.
16:35 - 17:15 Wednesday, 23 May 2018
Location: Expo Hall Level: Intermediate
Tobias Burger (BMW Group), Joshua Goerner (BMW AG)
Average rating: *****
(5.00, 1 rating)
The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside of the organization. Tobias Bürger and Joshua Görner discuss the E-to-E relationship of data and models and share best practices for scaling applications in real-world environments. Read more.
16:35 - 17:15 Wednesday, 23 May 2018
Location: Capital Suite 2/3 Level: Intermediate
Enric Biosca Trias (everis), Angel Valencia (everis)
Average rating: **...
(2.00, 2 ratings)
Enric Biosca offers an overview of the eAGLE accelerator, which speeds up migration processes from legacy ETL to big data implementations by enabling auditing, lineage, and translation of legacy code for big data. Along the way, Enric demonstrates how graph and automatic translation technologies help companies reduce their migration times. Read more.
17:25 - 18:05 Wednesday, 23 May 2018
Location: S11A Level: Beginner
Christopher Royles (Cloudera)
Average rating: ****.
(4.00, 1 rating)
Big data and cloud deployments return huge benefits in flexibility and economics but can also result in runaway costs and failed projects. Drawing on his production experience, Christopher Royles shares tips and best practices for determining initial sizing, strategic planning, and longer-term operation, helping you deliver an efficient platform, reduce costs, and implement a successful project. Read more.
17:25 - 18:05 Wednesday, 23 May 2018
Location: S11B Level: Intermediate
Holden Karau (Independent), Rachel Warren (Salesforce Einstein)
Average rating: ****.
(4.00, 2 ratings)
Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads. Read more.
17:25 - 18:05 Wednesday, 23 May 2018
Location: Capital Suite 7 Level: Beginner
Secondary topics:  Security and Privacy
Nikki Rouda (Cloudera), Nick Curcuru (Mastercard)
Average rating: ****.
(4.00, 2 ratings)
Having so many cloud-based analytics services available is a dream come true. However, it's a nightmare to manage proper security and governance across all those different services. Nikki Rouda and Nick Curcuru share advice on how to minimize the risk and effort in protecting and managing data for multidisciplinary analytics and explain how to avoid the hassle and extra cost of siloed approaches. Read more.
17:25 - 18:05 Wednesday, 23 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Aljoscha Krettek (Ververica)
Average rating: ****.
(4.67, 3 ratings)
Aljoscha Krettek offers an overview of the modern stream processing space, details the challenges posed by stateful and event-time-aware stream processing, and shares core archetypes (“application blueprints”) for stream processing drawn from real-world use cases with Apache Flink. Read more.
17:25 - 18:05 Wednesday, 23 May 2018
Location: Capital Suite 2/3 Level: Beginner
Wataru Yukawa (LINE)
LINE—one of the most popular messaging applications in Asia—offers many services, such as its news application. These services sometimes depend on real-time processing. Wataru Yukawa offers an overview of LINE's web tracking system, which consists of the JavaScript SDK, NGINX, Fluentd, Kafka, Elasticsearch, and Hadoop, and explains how it helps with batch and real-time processing. Read more.
11:15 - 11:55 Thursday, 24 May 2018
Location: S11A Level: Intermediate
Secondary topics:  Data Platforms, E-commerce and Retail
Neelesh Salian (Stitch Fix)
Average rating: *....
(1.00, 1 rating)
Neelesh Srinivas Salian offers an overview of the compute infrastructure used by the data science team at Stitch Fix, covering the architecture, tools within the larger ecosystem, and the challenges that the team overcame along the way. Read more.
11:15 - 11:55 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines sessions, Data Platforms, Media, Advertising, Entertainment
Irene Gonzálvez (Spotify)
Average rating: ***..
(3.88, 8 ratings)
Irene Gonzálvez shares Spotify's process for ensuring data quality, covering why and how the company became aware of its importance, the products it has developed, and future strategy. Read more.
11:15 - 11:55 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Data Platforms, Managing and Deploying Machine Learning, Media, Advertising, Entertainment
Kinnary Jangla (Pinterest)
Average rating: ***..
(3.00, 5 ratings)
Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of its ML teams while increasing uptime and ease of deployment. Read more.
11:15 - 11:55 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Beginner
Secondary topics:  Visualization, Design, and UX
Erin Recachinas (Zoomdata)
Average rating: ****.
(4.00, 2 ratings)
The value of real-time streaming analytics with historical data is immense. Big data application Zoomdata updates historical dashboards in real time without complex reaggregations, but streaming in the age of the IoT requires handling of data in volumes not seen in traditional feeds. Erin Recachinas explains how Zoomdata moved to a scalable microservice architecture for streaming sources. Read more.
12:05 - 12:45 Thursday, 24 May 2018
Location: S11A Level: Advanced
Jacques Nadeau (Dremio)
Average rating: ****.
(4.00, 3 ratings)
Jacques Nadeau offers an overview of a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture, learn how data science, analytical, and custom applications can all leverage the cache simultaneously, and see a live demo. Read more.
12:05 - 12:45 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Transportation and Logistics
Mark Grover (Lyft), Ted Malaska (Capital One)
Average rating: *****
(5.00, 6 ratings)
Many details go into building a big data system for speed, from determining a respectable latency until data access and where to store the data to solving multiregion problems—or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed. Read more.
12:05 - 12:45 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Beginner
Secondary topics:  Managing and Deploying Machine Learning
Nanda Vijaydev (BlueData), Thomas Phelan (HPE BlueData)
Average rating: ****.
(4.17, 6 ratings)
In the past, you needed a high-end proprietary stack for advanced machine learning, but today, you can use open source machine learning and deep learning algorithms available with distributed computing technologies like Apache Spark and GPUs. Nanda Vijaydev and Thomas Phelan demonstrate how to deploy a TensorFlow and Spark with NVIDIA CUDA stack on Docker containers in a multitenant environment. Read more.
12:05 - 12:45 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines sessions
Adesh Rao (Qubole), Abhishek Somani (Qubole)
Average rating: ***..
(3.00, 2 ratings)
Adesh Rao and Abhishek Somani share a framework for materialized views in SQL-on-Hadoop engines that automatically suggests, creates, uses, invalidates, and refreshes views created on top of data for optimal performance and strict correctness. Read more.
14:05 - 14:45 Thursday, 24 May 2018
Location: S11A Level: Intermediate
Haikal Pribadi (GRAKN.AI)
Average rating: ***..
(3.50, 2 ratings)
Haikal Pribadi explains why knowledge graphs (KGs) are important for AI systems in the finance sector and details how they are being used to detect and uncover new knowledge, specifically for risk analysis, fraud detection, and GDPR use cases. Read more.
14:05 - 14:45 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Platforms, Time Series and Graphs
Tony Xing (Microsoft), Bixiong Xu (Microsoft)
Average rating: **...
(2.00, 1 rating)
Tony Xing and Bixiong Xu offer an overview of Project Kensho, Microsoft's one-stop shop for business incident monitoring and automated insights. Tony and Bixiong cover the technology's evolution, the architecture, the algorithms, and the benefits and the trade-offs. Along the way, they share a case study on Bing ads key metrics monitoring and automated diagnostic insights. Read more.
14:05 - 14:45 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Beginner
Secondary topics:  Managing and Deploying Machine Learning
Average rating: ***..
(3.00, 5 ratings)
Guillaume Salou shares OVH's approach to continuous deployment of machine learning models, which involved building a full stack of automated machine learning. Automated machine learning allows the company to rebuild models efficiently and keep models up to date with fresh data brought by its data convergence tool. Read more.
14:05 - 14:45 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Kostas Kloudas (data Artisans)
Average rating: **...
(2.25, 4 ratings)
Complex event processing (CEP) helps detect patterns over continuous streams of data. DNA sequencing, fraud detection, shipment tracking with specific characteristics (e.g., contaminated goods), and user activity analysis fall into this category. Kostas Kloudas offers an overview of Flink's CEP library and explains the benefits of its integration with Flink. Read more.
14:05 - 14:45 Thursday, 24 May 2018
Location: Capital Suite 13 Level: Beginner
Jim Dowling (Logical Clocks)
Average rating: *****
(5.00, 2 ratings)
Distributed deep learning can increase the productivity of AI practitioners and reduce time to market for training models. Hadoop can fulfill a crucial role as a unified feature store and resource management platform for distributed deep learning. Jim Dowling offers an introduction to writing distributed DL applications, covering TensorFlow and Apache Spark frameworks that make distribution easy. Read more.
14:55 - 15:35 Thursday, 24 May 2018
Location: S11A Level: Intermediate
Secondary topics:  Time Series and Graphs
Jim Webber (Neo4j)
Average rating: *****
(5.00, 3 ratings)
Jim Webber details how Neo4j mixes the strongly consistent Raft protocol with async log shipping and provides a strong consistency guarantee: causal, which means you can always at least read your writes, even in very large multi-data-center clusters. Read more.
14:55 - 15:35 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Platforms
Alvin Heib (Cloudera), Guy Leroux (Atos)
Alvin Heib and Guy Leroux offer an overview of ClickFox, a platform able to cope with high-performance analytical needs, from bits and bytes to solving a customer's needs, covering the platform's virtualization, big data, and analytical layers. Read more.
14:55 - 15:35 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Secondary topics:  Financial Services, Managing and Deploying Machine Learning
Hope Wang (Intuit)
Average rating: ****.
(4.00, 3 ratings)
A machine learning platform is not just the sum of its parts; the key is how it supports the model lifecycle end to end. Hope Wang explains how to manage various artifacts and their associations, automate deployment to support the lifecycle of a model, and build a cohesive machine learning platform. Read more.
14:55 - 15:35 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Advanced
Secondary topics:  Data Integration and Data Pipelines sessions
Eugene Kirpichov (Google)
Average rating: ****.
(4.50, 2 ratings)
Apache Beam offers users a novel programming model in which the classic batch-streaming dichotomy is erased and ships with a rich set of I/O connectors to popular storage systems. Eugene Kirpichov explains why Beam has made these connectors flexible and modular—a key component of which is Splittable DoFn, a novel programming model primitive that unifies data ingestion between batch and streaming. Read more.
14:55 - 15:35 Thursday, 24 May 2018
Location: Capital Suite 2/3 Level: Intermediate
Marton Balassi (Cloudera), Mirko Kämpf (Cloudera), Jan Kunigk (Cloudera)
Average rating: *****
(5.00, 2 ratings)
Rigorous improvement of an image recognition model often requires multiple iterations of eyeballing outliers, inspecting statistics of the output labels, then modifying and retraining the model. Marton Balassi, Mirko Kämpf, and Jan Kunigk share a solution that automates the process of running the model on the testing data and populating an index of the labels so they become searchable. Read more.
16:35 - 17:15 Thursday, 24 May 2018
Location: S11A Level: Beginner
Jason Bell (Independent Speaker)
Jason Bell offers an overview of a self-learning knowledge system that uses Apache Kafka and Deeplearning4j to accept data, apply training to a neural network, and output predictions. Jason covers the system design and the rationale behind it and the implications of using streaming data with deep learning and artificial intelligence. Read more.
16:35 - 17:15 Thursday, 24 May 2018
Location: S11B Level: Intermediate
Secondary topics:  Data Platforms
Naghman Waheed (Bayer Crop Science), Brian Arnold (Bayer)
Average rating: ****.
(4.50, 2 ratings)
There are a number of tools that make it easy to implement a data lake. However, most lack the essential features that prevent your data lake from turning into a data swamp. Naghman Waheed and Brian Arnold offer an overview of Monsanto's Data Historian platform, which can ingest, store, and access datasets without compromising ease of use, governance, or security. Read more.
16:35 - 17:15 Thursday, 24 May 2018
Location: Capital Suite 7 Level: Intermediate
Giuseppe D'alessio (ING Group)
Average rating: ***..
(3.25, 4 ratings)
Giuseppe D'alessio details ING's DevOps journey, covering its impact on people, processes and tools, best practices, and pitfalls. Giuseppe concludes with a concrete example of using analytics and streaming technology on real-time applications. Read more.
16:35 - 17:15 Thursday, 24 May 2018
Location: Capital Suite 8/9 Level: Intermediate
Flavio Junqueira (Dell EMC)
Stream processing is in the spotlight. Enabling low-latency insights and actions out of continuously generated data is compelling to a number of application domains, and the ability to adapt to workload variations is critical to many applications. Flavio Junqueira explores Pravega, a stream store that scales streams automatically and enables applications to scale downstream by signaling changes. Read more.
16:35 - 17:15 Thursday, 24 May 2018
Location: Capital Suite 14 Level: Intermediate
Pascal Bugnion (ASI Data Science)
Jupyter widgets let you create lightweight, interactive graphical interfaces directly in Jupyter notebooks. Pascal Bugnion demonstrates how to use Jupyter widgets to implement human-in-the-loop machine learning with highly interactive user interfaces. Read more.