Presented By O'Reilly and Cloudera
Make Data Work
22–23 May 2017: Training
23–25 May 2017: Tutorials & Conference
London, UK

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Iñaki Puigdollers (Social Point)
Low cost, big impact: this is what data science can bring to your business. Iñaki Puigdollers explores how the analytics department changed Social Point games, creating an even better gaming experience and business.
Mark Donsky (Cloudera), Andre Araujo (Cloudera), Mubashir Kazia (Cloudera), Syed Rafice (Cloudera)
Mark Donsky, André Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.
Ziya Ma (Intel Corp)
Ziya Ma outlines the challenges for applying machine learning and deep learning at scale and shares solutions that Intel has enabled for customers and partners.
Paul Brook (Dell EMC)
Reliance only upon traditional data could lead to a catastrophic decision. Paul Brook explores the music industry to show how using modern information points, derived from technologies within the Hadoop framework, brings new data that changes the way a business or organization makes decisions.
Luke Han (Kyligence)
Apache Kylin is rapidly being adopted around the world—especially in China. Luke Han explores how various industries use Apache Kylin, sharing why these companies choose Apache Kylin (a technology comparison), how they use Apache Kylin (their production deployment pattern), and most importantly, the resulting business impact.
Jonathan Seidman (Cloudera), Mark Grover (Cloudera), Ted Malaska (Blizzard)
Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, and Mark Grover explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
Mark Donsky (Cloudera), Vikas Singh (Cloudera)
Big data needs governance—not just for compliance but also for data scientists. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start, especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Vikas Singh share a step-by-step approach to kickstart your big data governance.
Wael Elrifai (Pentaho)
Wael Elrifai leads a journey through the design and implementation of a predictive maintenance platform for Hitachi Rail. The industrial internet, the IoT, data science, and big data make for an exciting ride.
Ben Sharma (Zaloni)
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started.
Daniel Bäurer (inovex GmbH), Sascha Askani (inovex GmbH)
Multiple challenges arise if distributed applications are provisioned in a containerized environment. Daniel Bäurer and Sascha Askani share a solution for distributed storage in cloud-native environments using Spark on Kubernetes.
Radhika Rangarajan (Intel)
Radhika Rangarajan explains how Intel works with its users to build deep learning-powered big data analytics applications (object detection, image recognition, NLP, etc.) using BigDL.
Yishay Carmiel (Spoken Communications)
For years, people have been talking about the great promise of conversation AI. Recently, deep learning has taken us a few steps further toward achieving tangible goals, making a big impact on technologies like speech recognition and natural language processing. Yishay Carmiel offers an overview of the impact of deep learning, recent breakthroughs, and challenges for the future.
Antonio Alvarez (Santander Group), Lidia Crespo (Santander UK)
Successful organizations are becoming increasingly Agile, and the autonomy and empowerment that Agile brings create new, active modes of engagement. Data governance, however, is still very much a centralized task that only CDOs and data owners actively care about. Antonio Alvarez and Lidia Crespo outline a more engaging and active method of data governance: data citizenship.
Laura Frolich (Think Big, A Teradata Company)
Laura Frolich explores applications of deep learning in companies—looking at practical examples of assessing the opportunity for AI, phased adoption, and lessons going from research to prototype to scaled production deployment—and discusses the future of enterprise AI.
Eric Tilenius (BlueTalon)
Many businesses will have to address EU GDPR as they deploy big data projects. This is an opportunity to rethink data security and deploy a flexible policy framework adapted to big data and regulations. Eric Tilenius explains how consistent visibility and control at a granular level across data domains can address both security and GDPR compliance.
Mark Madsen (Third Nature)
In 2007, a computer game company decided to jump ahead of competitors by capturing and using data created during online gaming, but it wasn't prepared to deal with the data management and process challenges stemming from distributed devices creating data. Mark Madsen shares a case study that explores the oversights, failures, and lessons the company learned along its journey.
Kim Nilsson (Pivigo)
More organizations are becoming aware of the value of data and want to get started and scaled up as quickly as possible. But how? Is it possible to get something useful done in five weeks? Kim Nilsson shares her experiences, both good and bad, delivering over 80 five-week data science projects to over 50 organizations, as well as some concrete tips on how to become a data star organization.
Steve Touw (Immuta)
The global populace is asking for the IT industry to be held responsible for the safeguarding of individual data. Steve Touw examines some of the data privacy regulations that have arisen and covers design strategies to protect personally identifiable data while still enabling analytics.
Alberto Rey (easyJet PLC)
Many large organizations want to develop data science capabilities, but the traditional complexity and legacy of such companies don’t allow a fast and agile evolution toward data-driven decision making. EasyJet is working toward becoming completely data driven. Alberto Rey shares real-world examples on how easyJet is tackling the challenges of scaling up its analytics capabilities.
Trent Gray-Donald (IBM), Gil Vernik (IBM)
Trent Gray-Donald and Gil Vernik explain the challenges of current Hadoop and Apache Spark integration with object stores and discuss Stocator, an open source object store connector that overcomes these shortcomings by leveraging object store semantics. Compared to native Hadoop connectors, Stocator provides close to a 100% speedup for DFSIO on Hadoop and a 500% speedup for Terasort on Spark.
Phillip Radley (BT)
If you have Hadoop clusters in research or an early-stage data lake and are considering strategic vision and goals, this session is for you. Phillip Radley explains how to run Hadoop as a shared service, providing an enterprise-wide data platform hosting hundreds of projects securely and predictably.
Aurélien Géron (Kiwisoft)
Collaborative filtering is great for recommendations, yet it suffers from the cold-start problem. New content with no views is ignored, and new users get poor recommendations. Aurélien Géron shares a solution: knowledge graphs. With a knowledge graph, you can truly understand your users' interests and make better, more relevant recommendations.
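The cold-start gap Géron describes is easy to see in a few lines of NumPy. This is a hypothetical toy illustration, not code from the talk: item-based collaborative filtering scores a brand-new item at zero because it has no interaction history, while side information of the kind a knowledge graph supplies (here reduced to simple topic tags) still yields a usable similarity.

```python
import numpy as np

# Rows = users, columns = items; 1 means the user viewed the item.
views = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
])
new_item = np.zeros(3)  # a just-published item: no views yet

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))

# Pure collaborative filtering: the new item's view vector is all zeros,
# so its similarity to every existing item is zero and it is never recommended.
cf_sims = [cosine(views[:, j], new_item) for j in range(3)]
print(cf_sims)  # [0.0, 0.0, 0.0]

# Knowledge-graph-style side information: items tagged with topics.
# The new item shares a topic with item 2, so content similarity is non-zero.
topics = np.array([
    [1, 0],  # item 0: topic A
    [1, 1],  # item 1: topics A and B
    [0, 1],  # item 2: topic B
])
new_topics = np.array([0, 1])  # new item: topic B
kg_sims = [cosine(topics[j], new_topics) for j in range(3)]
print([round(s, 2) for s in kg_sims])  # [0.0, 0.71, 1.0]
```

A production knowledge graph would of course relate entities far richer than flat topic tags, but the mechanism is the same: similarity computed from what an item *is*, not only from who has interacted with it.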
Security has been a large and growing aspect of distributed systems, specifically in the big data ecosystem, but it's an underappreciated topic within the Spark framework itself. Neelesh Srinivas Salian explains how detailed knowledge of setting up security and an awareness of what to be looking out for in terms of problems and issues can help an organization move forward in the right way.
Jack Norris (MapR Technologies)
Leading companies are integrating operations and analytics to make real-time adjustments to improve revenues, reduce costs, and mitigate risks. There are many aspects to digital transformation, but the timely delivery of actionable data is both a key enabler and an obstacle. Jack Norris explores how companies from Altitude Digital to Uber are transforming their businesses.
Xueyan Li (Qunar), Yupeng Fu (Alluxio)
Alluxio—the first memory-speed virtual distributed storage system in the world—unifies the data from various under storage systems and presents a global namespace to various computation frameworks. Xueyan Li and Yupeng Fu explore how Alluxio has delivered performance improvements averaging 300x at service peak time on stream processing workloads at Qunar.
Calum Murray (Intuit)
As Intuit moves its SaaS platform from its own data centers to AWS, it will straddle both worlds for a period of time (and potentially indefinitely). Calum Murray looks at what straddling means to data and data systems.
Eddie Copeland (Nesta)
Eddie Copeland shares lessons learned from piloting the London Office of Data Analytics, a collaboration between the Greater London Authority, Nesta, and ASI Data Science that is exploring the potential of applying data analytics to reform public services.
Darren Strange (Google)
Data analytics and machine learning are the drivers of the fourth industrial revolution. As technologists, we stand on the brink of incredible opportunity. Darren Strange explores the tremendous opportunity we have before us and asks, will we be pioneers creating new possibilities or will we hold on to the past?
Phillip Radley (BT)
BT has adopted Hadoop as an enterprise platform for data processing and storage. Come talk to Phillip to find out how they did it—and what you can learn from their experiences.
Nikolay Manchev offers an overview of the restricted Boltzmann machine, a type of neural network with a wide range of applications, and shares his experience using it on Hadoop (MapReduce and Spark) to process unstructured and semistructured data at scale.
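The core of the restricted Boltzmann machine Manchev discusses is a two-layer network trained with contrastive divergence. Below is a minimal single-machine NumPy sketch of one-step contrastive divergence (CD-1) on hypothetical toy data; it is not the talk's Hadoop/Spark implementation, just the underlying update rule that would be distributed at scale.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    def __init__(self, n_visible, n_hidden):
        self.W = rng.normal(0, 0.1, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)  # visible-unit biases
        self.b_h = np.zeros(n_hidden)   # hidden-unit biases

    def train_step(self, v0, lr=0.1):
        # Positive phase: hidden probabilities given the data.
        ph0 = sigmoid(v0 @ self.W + self.b_h)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        # Negative phase: one Gibbs step back to visible, then hidden.
        pv1 = sigmoid(h0 @ self.W.T + self.b_v)
        ph1 = sigmoid(pv1 @ self.W + self.b_h)
        # CD-1 updates: data statistics minus model statistics.
        self.W += lr * (v0.T @ ph0 - pv1.T @ ph1) / len(v0)
        self.b_v += lr * (v0 - pv1).mean(axis=0)
        self.b_h += lr * (ph0 - ph1).mean(axis=0)

# Toy binary data: two repeated 6-bit patterns.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 20, dtype=float)
rbm = RBM(n_visible=6, n_hidden=2)
for _ in range(500):
    rbm.train_step(data)

# Reconstructing the first pattern should recover its on/off structure.
recon = sigmoid(sigmoid(data[:1] @ rbm.W + rbm.b_h) @ rbm.W.T + rbm.b_v)
print(recon.round(1))
```

Scaling this up is mostly a matter of computing the positive- and negative-phase statistics over data partitions and aggregating the gradient, which is what makes MapReduce and Spark natural hosts for it.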
Tristan Stevens (Cloudera)
Vodafone UK’s new SIEM system relies on Apache Flume and Apache Kafka to ingest over 1 million events per second. Tristan Stevens discusses the architecture, deployment, and performance-tuning techniques that enable the system to perform at IoT scale on modest hardware and at a very low cost.
Mark Madsen (Third Nature)
Building a data lake involves more than installing and using Hadoop. The goal in most organizations is to build multiuse data infrastructure that is not subject to past constraints. Mark Madsen discusses hidden design assumptions, reviews design principles to apply when building multiuse data infrastructure, and provides a reference architecture.
Grace Huang (Pinterest)
Grace Huang shares lessons learned from running and interpreting machine-learning experiments and outlines launch considerations that enable sustainable, long-term ecosystem health.
Rekha Joshi (Intuit)
Performance and security are often at loggerheads. Rekha Joshi explains why and offers a deep dive into how performance and security are managed in some of the most intense and critical data platform services at Intuit.
M. C. Srivas (Uber)
M. C. Srivas covers the technologies underpinning the big data architecture at Uber and explores some of the real-time problems Uber needs to solve to make ride sharing as smooth and ubiquitous as running water, explaining how they are related to real-time big data analytics.
Robin Senge (inovex GmbH)
Reliable prediction is the ability of a predictive model to explicitly measure the uncertainty involved in a prediction without feedback. Robin Senge shares two approaches to measure different types of uncertainty involved in a prediction.
Douglas Ashton (Mango Solutions), Aimee Gott (Mango Solutions), Mark Sellors (Mango Solutions)
R is a top contender for statistics and machine learning, but Spark has emerged as the leader for in-memory distributed data analysis. Douglas Ashton, Aimee Gott, and Mark Sellors introduce Spark, cover data manipulation with Spark as a backend to dplyr and machine learning via MLlib, and explore RStudio's sparklyr package, giving you the power of Spark without having to leave your R session.
Fabian Hueske (data Artisans)
Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics—standard SQL and the LINQ-style Table API—discussing their semantics and showcasing their usage.
Aurélie Pols (Mind Your Group by Mind Your Privacy)
You may have heard that a piece of legislation called the GDPR is looming. Aurélie Pols draws a broad philosophical picture of the data ecosystem we are all a part of before homing in on the right to data portability, hopefully empowering you to reclaim your data subject rights.
Miriam Redi (Bell Labs Cambridge, UK)
Miriam Redi explores the invisible side of visual data, investigating how machine learning can detect subjective properties of images and videos, such as beauty, creativity, sentiment, style, and more curious characteristics. Miriam shows how these detectors can be applied in the context of web media search, advertising, and social media.
Nicolas Poggi (Barcelona Supercomputing-Microsoft Research Center)
Nicolas Poggi evaluates the out-of-the-box support for Spark and compares the offerings, reliability, scalability, and price-performance from major PaaS providers, including Azure HDInsight, Amazon Web Services EMR, Google Dataproc, and Rackspace Cloud Big Data, with an on-premises commodity cluster as baseline.
Sriskandarajah Suhothayan (WSO2), Roland Major (Transport for London)
Transport for London (TfL) and WSO2 have been working together on broader integration projects focused on getting the most efficient use out of London transport. Roland Major and Sriskandarajah Suhothayan explain how TfL and WSO2 bring together a wide range of data from multiple disconnected systems to understand current and predicted transport network status.
Tim O'Reilly (O'Reilly Media)
The history of technology shows that while new technology has always destroyed jobs, it has also created new ones, in part because it makes things that were previously too expensive cheap enough to expand demand. Tim O'Reilly explains how AI will make currently unthinkable things possible. If we put it to work properly, it can lead to prosperity.
Anthony Goldbloom (Kaggle)
Kaggle is a community of almost a million data scientists, who have built more than two million machine-learning models while participating in Kaggle competitions. Data scientists come to Kaggle to learn, collaborate, and develop the state of the art in machine learning. Anthony Goldbloom shares lessons learned from top performers in the Kaggle community.