Presented by O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Martin Görner (Google)
With TensorFlow, deep machine learning has transitioned from an area of research into mainstream software engineering. Martin Görner walks you through building and training a neural network that recognizes handwritten digits with >99% accuracy using Python and TensorFlow.
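To make the session's starting point concrete, here is a minimal sketch in TensorFlow 1.x-style Python (an assumption-laden sketch, not Görner's actual notebook): a single-layer softmax classifier for MNIST. On its own this reaches only about 92% accuracy; the session builds on it with hidden and convolutional layers to pass 99%.

```python
# Minimal softmax classifier for MNIST, TensorFlow 1.x style.
# A sketch of the tutorial's starting point, not Martin Görner's actual code.
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

x = tf.placeholder(tf.float32, [None, 784])   # flattened 28x28 images
y_ = tf.placeholder(tf.float32, [None, 10])   # one-hot digit labels

W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)

cross_entropy = -tf.reduce_mean(tf.reduce_sum(y_ * tf.log(y), axis=1))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

correct = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(1000):
        batch_x, batch_y = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_x, y_: batch_y})
    print(sess.run(accuracy, feed_dict={x: mnist.test.images,
                                        y_: mnist.test.labels}))
```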
Iñaki Puigdollers (Social Point)
Low cost, big impact: this is what data science can bring to your business. Iñaki Puigdollers explores how the analytics department changed Social Point games, creating an even better gaming experience and business.
Mark Donsky (Cloudera), André Araujo (Cloudera), Mubashir Kazia (Cloudera), Syed Rafice (Cloudera)
Mark Donsky, André Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.
Ziya Ma (Intel)
Ziya Ma outlines the challenges of applying machine learning and deep learning at scale and shares solutions that Intel has enabled for customers and partners.
Paul Brook (Dell EMC)
Reliance only upon traditional data could lead to a catastrophic decision. Paul Brook explores the music industry to show how using modern information points, derived from technologies within the Hadoop framework, brings new data that changes the way a business or organization makes decisions.
Luke Han (Kyligence)
Apache Kylin is rapidly being adopted around the world—especially in China. Luke Han explores how various industries use Apache Kylin, sharing why these companies choose Apache Kylin (a technology comparison), how they use Apache Kylin (their production deployment pattern), and most importantly, the resulting business impact.
Jonathan Seidman (Cloudera), Mark Grover (Cloudera), Ted Malaska (Blizzard)
Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, and Mark Grover explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
Victor Zabalza (ASI Data Science)
Data exploration usually entails making endless one-use exploratory plots. Victor Zabalza shares a Python package based on dask execution graphs and interactive visualization in Jupyter widgets built to overcome this drudge work. Victor offers an overview of the tool and explains how it was built and why it will become essential in the first steps of every data science project.
The class of big data computations known as distributed merge trees was built to aggregate user information across multiple data sources in the media domain. Vijay Srinivas Agneeswaran explores prototypes built on top of Apache HAWQ, Druid, and Kinetica, a GPU database. Results show that Kinetica on a single AWS g2.8xlarge node outperformed clusters of HAWQ and Druid nodes.
Mark Donsky (Cloudera), Vikas Singh (Cloudera)
Big data needs governance—not just for compliance but also for data scientists. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start, especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Vikas Singh share a step-by-step approach to kickstart your big data governance.
Wael Elrifai (Pentaho)
Wael Elrifai leads a journey through the design and implementation of a predictive maintenance platform for Hitachi Rail. The industrial internet, the IoT, data science, and big data make for an exciting ride.
Pranav Rastogi (Microsoft)
Pranav Rastogi explains how to simplify your big data solutions with Datameer, AtScale, Dataiku, and StreamSets on Microsoft’s Azure HDInsight, a cloud Spark and Hadoop service for the enterprise. Join in to learn practical information that will enable faster time to insights for you and your business.
Ben Sharma (Zaloni)
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started.
Daniel Bäurer (inovex GmbH), Sascha Askani (inovex GmbH)
Multiple challenges arise if distributed applications are provisioned in a containerized environment. Daniel Bäurer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes.
Radhika Rangarajan (Intel)
Radhika Rangarajan explains how Intel works with its users to build deep learning-powered big data analytics applications (object detection, image recognition, NLP, etc.) using BigDL.
Jim Scott (MapR Technologies)
The cloud is becoming pervasive, but it isn’t always full of rainbows. Defining a strategy that works for your company or for your use cases is critical to ensuring success. Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations.
Arturo Bayo (Synergic Partners), Alvaro Fernandez Velando (Santander Spain)
Arturo Bayo and Alvaro Fernandez Velando explain how a data hub strategy helps clarify data sharing and governance in an organization and share one way to implement a data hub architecture using big data technology and resources that are already established in the enterprise.
Yishay Carmiel (Spoken Communications)
For years, people have been talking about the great promise of conversational AI. Recently, deep learning has taken us a few steps further toward achieving tangible goals, making a big impact on technologies like speech recognition and natural language processing. Yishay Carmiel offers an overview of the impact of deep learning, recent breakthroughs, and challenges for the future.
Antonio Alvarez (Santander Group), Lidia Crespo (Santander UK)
Successful organizations are becoming increasingly Agile, and the autonomy and empowerment that Agile brings create new, active modes of engagement. Data governance, however, is still very much a centralized task that only CDOs and data owners actively care about. Antonio Alvarez and Lidia Crespo outline a more engaging and active method of data governance: data citizenship.
Drawing on use cases from Trifacta customers, the speaker explains how to leverage data wrangling solutions in the insurance industry to streamline, strengthen, and improve data analytics initiatives on Hadoop.
Adam Grzywaczewski offers an overview of the types of analytical problems that can be solved using AI and shares a set of heuristics that can be used to evaluate the feasibility of analytical AI projects. Adam then covers the computational requirements for the deep learning training process, leaving you with the key tools you need to initiate an analytical AI project.
Laura Frolich (Think Big, A Teradata Company)
Laura Frolich explores applications of deep learning in companies—looking at practical examples of assessing the opportunity for AI, phased adoption, and lessons going from research to prototype to scaled production deployment—and discusses the future of enterprise AI.
Eric Tilenius (BlueTalon)
Many businesses will have to address EU GDPR as they deploy big data projects. This is an opportunity to rethink data security and deploy a flexible policy framework adapted to big data and regulations. Eric Tilenius explains how consistent visibility and control at a granular level across data domains can address both security and GDPR compliance.
Aurélie Pols (Mind Your Privacy)
The EU's General Data Protection Regulation is an ambitious legal project to reinstate the rights of "data subjects" within an increasingly lucrative data ecosystem. Aurélie Pols explores the legal obligations on companies and their respective interpretations and looks at how scale and integrity will be safeguarded in the data we increasingly base decisions upon in the long term.
Mark Madsen (Third Nature)
In 2007, a computer game company decided to jump ahead of competitors by capturing and using data created during online gaming, but it wasn't prepared to deal with the data management and process challenges stemming from distributed devices creating data. Mark Madsen shares a case study that explores the oversights, failures, and lessons the company learned along its journey.
Kim Nilsson (Pivigo)
More organizations are becoming aware of the value of data and want to get started and scaled up as quickly as possible. But how? Is it possible to get something useful done in five weeks? Kim Nilsson shares her experiences, both good and bad, delivering over 80 five-week data science projects to over 50 organizations, as well as some concrete tips on how to become a data star organization.
Yingsong Zhang (ASI Data Science)
There are occasions when the labels on data are insufficient. In such situations, semisupervised learning can be of great practical value. Yingsong Zhang explores illustrative examples of how to come up with creative solutions derived from textbook approaches.
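For a flavor of the textbook starting point (a minimal sketch with made-up toy data, not Zhang's code), scikit-learn's LabelSpreading propagates a handful of known labels to the unlabeled points, which are marked with -1:

```python
# Semisupervised learning on toy data: only 10 of 200 points are labeled.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.semi_supervised import LabelSpreading

X, y = make_moons(n_samples=200, noise=0.1, random_state=0)
labels = np.full_like(y, -1)              # -1 marks "label unknown"
labels[:5], labels[-5:] = y[:5], y[-5:]   # keep only a handful of labels

model = LabelSpreading(kernel="knn", n_neighbors=7)
model.fit(X, labels)
print((model.transduction_ == y).mean())  # accuracy over all points
```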
Steve Touw (Immuta)
The global populace is asking for the IT industry to be held responsible for safeguarding individual data. Steve Touw examines some of the data privacy regulations that have arisen and covers design strategies to protect personally identifiable data while still enabling analytics.
Alberto Rey (easyJet PLC)
Many large organizations want to develop data science capabilities, but the traditional complexity and legacy of such companies don’t allow a fast and agile evolution toward data-driven decision making. EasyJet is working toward becoming completely data driven. Alberto Rey shares real-world examples on how easyJet is tackling the challenges of scaling up its analytics capabilities.
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.
If you have Hadoop clusters in research or an early-stage data lake and are considering strategic vision and goals, this session is for you. Phillip Radley explains how to run Hadoop as a shared service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably.
Aurélien Géron (Kiwisoft)
Collaborative filtering is great for recommendations, yet it suffers from the cold-start problem. New content with no views is ignored, and new users get poor recommendations. Aurélien Géron shares a solution: knowledge graphs. With a knowledge graph, you can truly understand your users' interests and make better, more relevant recommendations.
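A toy sketch of the idea (hypothetical data and scoring, not Géron's system): linking items to attribute nodes in a small networkx graph lets brand-new content be recommended through shared attributes, even with zero views.

```python
# Cold-start mitigation via a tiny knowledge graph.
import networkx as nx

g = nx.Graph()
g.add_edges_from([
    ("user:alice", "video:pasta_101"),        # viewing history
    ("video:pasta_101", "topic:cooking"),     # content metadata
    ("video:ramen_basics", "topic:cooking"),  # new upload, zero views
])

def recommend(graph, user):
    """Score unseen items by the number of attributes shared with seen items."""
    seen = set(graph.neighbors(user))
    scores = {}
    for item in seen:
        for attr in graph.neighbors(item):
            if attr == user:
                continue
            for candidate in graph.neighbors(attr):
                if candidate != item and candidate not in seen:
                    scores[candidate] = scores.get(candidate, 0) + 1
    return sorted(scores, key=scores.get, reverse=True)

print(recommend(g, "user:alice"))  # ['video:ramen_basics']
```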
Security is a large and growing concern in distributed systems, especially in the big data ecosystem, but it remains an underappreciated topic within the Spark framework itself. Neelesh Srinivas Salian explains how detailed knowledge of security setup and an awareness of the problems and issues to look out for can help an organization move forward in the right way.
Jack Norris (MapR Technologies)
Leading companies are integrating operations and analytics to make real-time adjustments to improve revenues, reduce costs, and mitigate risks. There are many aspects to digital transformation, but the timely delivery of actionable data is both a key enabler and an obstacle. Jack Norris explores how companies from Altitude Digital to Uber are transforming their businesses.
Xueyan Li (Qunar), Yupeng Fu (Alluxio)
Alluxio—the first memory-speed virtual distributed storage system in the world—unifies the data from various under storage systems and presents a global namespace to various computation frameworks. Xueyan Li and Yupeng Fu explore how Alluxio has led to performance improvements averaging 300x at service peak time on stream processing workloads at Qunar.
Calum Murray (Intuit)
As Intuit moves its SaaS platform from its own data centers to AWS, it will straddle both worlds for a period of time (and potentially indefinitely). Calum Murray looks at what straddling means to data and data systems.
Eddie Copeland (Nesta)
Eddie Copeland shares lessons learned from piloting the London Office of Data Analytics, a collaboration between the Greater London Authority, Nesta, and ASI Data Science that is exploring the potential of applying data analytics to reform public services.
Matthias Niehoff (codecentric AG)
Matthias Niehoff shares lessons learned working with Spark, Cassandra, and the Spark-Cassandra connector and best practices drawn from his work on multiple big and fast data projects, as well as challenges encountered along the way.
Darren Strange (Google)
Data analytics and machine learning are the drivers of the fourth industrial revolution. As technologists, we stand on the brink of incredible opportunity. Darren Strange explores the tremendous opportunity we have before us and asks, will we be pioneers creating new possibilities or will we hold on to the past?
Harry Powell (Barclays), Raffael Strassnig (Barclays)
Harry Powell and Raffael Strassnig demonstrate how to model unobserved customer preferences over businesses by thinking about transactional data as a bipartite graph and then computing a new similarity metric—the expected degrees of separation—to characterize the full graph.
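A toy illustration of the representation (hypothetical data; the expected-degrees-of-separation metric itself is the talk's subject and is not reproduced here): transactions become edges of a customer-merchant bipartite graph, and path length between merchants through shared customers gives a simple degrees-of-separation notion.

```python
# Transactional data as a bipartite graph of customers and merchants.
import networkx as nx

g = nx.Graph()
transactions = [
    ("cust_a", "coffee_shop"), ("cust_a", "bookstore"),
    ("cust_b", "bookstore"),   ("cust_b", "cinema"),
]
g.add_edges_from(transactions)

# coffee_shop -> cust_a -> bookstore: 2 hops, i.e. one shared customer apart
print(nx.shortest_path_length(g, "coffee_shop", "bookstore"))  # 2
print(nx.shortest_path_length(g, "coffee_shop", "cinema"))     # 4
```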
Phillip Radley (BT)
BT has adopted Hadoop as an enterprise platform for data processing and storage. Come talk to Phillip to find out how they did it—and what you can learn from their experiences.
Leah McGuire (Salesforce)
What if you had to build more models than there are data scientists in the world—a feat enterprise companies serving hundreds of thousands of businesses often have to do? Leah McGuire offers an overview of Salesforce's general-purpose machine-learning platform that automatically builds per-company optimized models for any given predictive problem at scale, beating out most hand-tuned models.
Alon Elishkov (Outbrain)
Migrating petabyte-scale Hadoop installations to a new cluster with hundreds of machines, several thousand jobs daily, and countless ecosystem integrations while maintaining a stable production environment is a challenging task. Alon Elishkov discusses the techniques and tools Outbrain has developed to achieve this goal.
Nikolay Manchev offers an overview of the restricted Boltzmann machine, a type of neural network with a wide range of applications, and shares his experience using it on Hadoop (MapReduce and Spark) to process unstructured and semistructured data at scale.
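For a feel of the model on a single machine (a minimal scikit-learn sketch; the talk's focus is scaling RBMs on MapReduce and Spark), BernoulliRBM learns a hidden feature representation of small digit images:

```python
# Restricted Boltzmann machine learning binary features from digit images.
from sklearn.datasets import load_digits
from sklearn.neural_network import BernoulliRBM

X = load_digits().data
X = (X - X.min()) / (X.max() - X.min())   # RBM expects values in [0, 1]

rbm = BernoulliRBM(n_components=64, learning_rate=0.05,
                   n_iter=20, random_state=0)
rbm.fit(X)
hidden = rbm.transform(X)                 # learned feature representation
print(hidden.shape)                       # (1797, 64)
```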
Tristan Stevens (Cloudera)
Vodafone UK’s new SIEM system relies on Apache Flume and Apache Kafka to ingest over 1 million events per second. Tristan Stevens discusses the architecture, deployment, and performance-tuning techniques that enable the system to perform at IoT scale on modest hardware and at a very low cost.
Mark Madsen (Third Nature)
Building a data lake involves more than installing and using Hadoop. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen discusses hidden design assumptions, reviews design principles to apply when building multiuse data infrastructure, and provides a reference architecture.
Grace Huang (Pinterest)
Grace Huang shares lessons learned from running and interpreting machine-learning experiments and outlines launch considerations that enable sustainable, long-term ecosystem health.
Rekha Joshi (Intuit)
Performance and security are often at loggerheads. Rekha Joshi explains why and offers a deep dive into how performance and security are managed in some of the most intense and critical data platform services at Intuit.
Wojciech Biela (Teradata), Łukasz Osipiuk (Teradata)
Wojciech Biela and Łukasz Osipiuk offer an introduction to Presto, an open source distributed analytical SQL engine that enables users to run interactive queries over their datasets stored in various data sources, and explore its applications in various big data problems.
M. C. Srivas (Uber)
M. C. Srivas covers the technologies underpinning the big data architecture at Uber and explores some of the real-time problems Uber needs to solve to make ride sharing as smooth and ubiquitous as running water, explaining how they are related to real-time big data analytics.
Kamran Yousaf (Redis Labs)
Kamran Yousaf explains how to substantially accelerate and radically simplify common practices in machine learning, such as running a trained model in production to meet real-time expectations, using Redis modules that natively store and execute common models generated by Spark ML and TensorFlow algorithms.
Robin Senge (inovex GmbH)
Reliable prediction is the ability of a predictive model to explicitly measure the uncertainty involved in a prediction without feedback. Robin Senge shares two approaches to measure different types of uncertainty involved in a prediction.
Michael Noll (Confluent)
Michael Noll explains how Apache Kafka helps you radically simplify your data processing architectures by building normal applications to serve your real-time processing needs rather than building clusters or similar special-purpose infrastructure—while still benefiting from properties typically associated exclusively with cluster technologies.
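Kafka Streams itself is a Java library; as a rough Python analogue of the "normal application" idea (broker address and topic names are hypothetical), a plain consume-transform-produce loop with kafka-python shows how stream processing can live inside an ordinary long-running process rather than a special-purpose cluster:

```python
# An ordinary application doing real-time processing against Kafka.
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("events", bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for msg in consumer:                  # runs like any normal service process
    enriched = msg.value.upper()      # stand-in for real transformation logic
    producer.send("events-enriched", enriched)
```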
Douglas Ashton (Mango Solutions), Aimee Gott (Mango Solutions), Mark Sellors (Mango Solutions)
R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Aimee Gott, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.
Fabian Hueske (data Artisans)
Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics—standard SQL and the LINQ-style Table API—discussing their semantics and showcasing their usage.
Sanjay Mathur (Silicon Valley Data Science)
Deep learning is white-hot at the moment, but why does it matter? Developers are usually the first to understand why some technologies cause more excitement than others. Sanjay Mathur relates this insider knowledge, providing a tour through the hottest emerging data technologies of 2017 to explain why they’re exciting in terms of both new capabilities and the new economies they bring.
Aurélie Pols (Mind Your Privacy)
You may have heard that a piece of legislation called the GDPR is looming. Aurélie Pols draws a broad philosophical picture of the data ecosystem we are all a part of before homing in on the right to data portability, hopefully empowering you to reclaim your data subject rights.
Shree Dandekar (Honeywell)
A digital twin is a virtual model of a product or service that allows analysis of data and monitoring of systems to avert problems before they occur—and even plan for the future by using simulations. Shree Dandekar explores a new cloud-based service from the Honeywell Connected Plant that provides industrial users with around-the-clock monitoring of plant data and rigorous simulations.
Miriam Redi (Bell Labs Cambridge, UK)
Miriam Redi explores the invisible side of visual data, investigating how machine learning can detect subjective properties of images and videos, such as beauty, creativity, sentiment, style, and more curious characteristics. Miriam shows how these detectors can be applied in the context of web media search, advertising, and social media.
Nicolas Poggi (Barcelona Supercomputing-Microsoft Research Center)
Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as baseline.
Sriskandarajah Suhothayan (WSO2), Roland Major (Transport for London)
Transport for London (TfL) and WSO2 have been working together on broader integration projects focused on getting the most efficient use out of London transport. Roland Major and Sriskandarajah Suhothayan explain how TfL and WSO2 bring together a wide range of data from multiple disconnected systems to understand current and predicted transport network status.
Aljoscha Krettek (data Artisans)
Apache Beam's new State API brings scalability and consistency to fine-grained stateful processing while remaining portable to any Beam runner. Aljoscha Krettek introduces the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.
Tim O'Reilly (O'Reilly Media)
The history of technology shows that while new technology has always destroyed jobs, it has also created new ones, in part because it makes things that were previously too expensive cheap enough to expand demand. Tim O'Reilly explains how AI will make currently unthinkable things possible. If we put it to work properly, it can lead to prosperity.
Anthony Goldbloom (Kaggle)
Kaggle is a community of almost a million data scientists, who have built more than two million machine-learning models while participating in Kaggle competitions. Data scientists come to Kaggle to learn, collaborate, and develop the state of the art in machine learning. Anthony Goldbloom shares lessons learned from top performers in the Kaggle community.