Presented By O'Reilly and Cloudera
Make Data Work
Dec 4–5, 2017: Training
Dec 5–7, 2017: Tutorials & Conference

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide. They also outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving it using Elasticsearch in the cloud.
Paco Nathan (
Paco Nathan explains how O'Reilly employs AI, from the obvious (chatbots, case studies about other firms) to the less so (using AI to show the structure of content in detail, enhance search and recommendations, and guide editors for gap analysis, assessment, pathing, etc.). Approaches include vector embedding search, summarization, TDA for content gap analysis, and speech-to-text to index video.
Bas Geerdink (Aizonic)
Bas Geerdink explains why and how ING is becoming more and more data-driven, sharing use cases, architecture, and technology choices along the way.
Victor Chua (StarHub Ltd)
The rise of densely populated, highly built-up smart cities around the globe has stretched the capabilities of current 2D visualization techniques. With the advent of drones, IoT devices, and indoor geolocation, next-gen 3D visualizations are beginning to address this challenge. Victor Chua explores how SmartHub is gearing up for a 3D future to support cutting-edge data analytics.
Peng Meng (Intel)
Apache Spark ML and MLlib are hugely popular in the big data ecosystem, and Intel has been deeply involved in Spark from a very early stage. Peng Meng outlines the methodology behind Intel's work on Spark ML and MLlib optimization and shares a case study on boosting the performance of Spark MLlib ALS by 60x in a production environment.
Arun Veettil (Skellam AI)
Arun Veettil shares his experience and lessons learned developing a customized, enterprise-level NLP platform to replace a leading text analytics vendor platform.
Wei Chen (Intel), Zhaojuan Bian (Intel)
Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.
John Mertic (Linux Foundation), Cupid Chan (4C Decision)
John Mertic and Cupid Chan share real end-user perspectives from companies like GE on how they are using big data tools, challenges they face, and where they are looking to focus investments—all from a vendor-neutral viewpoint.
Yousun Jeong (SK Telecom)
Data transfer is one of the most pressing problems for telecom companies, as cost increases in tandem with the growing data requirements. Yousun Jeong details how SKT has dealt with this problem.
Danielle Dean (iRobot), Wee Hyong Tok (Microsoft)
Transfer learning enables you to use pretrained deep neural networks (e.g., AlexNet, ResNet, and Inception V3) and adapt them for custom image classification tasks. Danielle Dean and Wee Hyong Tok walk you through the basics of transfer learning and demonstrate how you can use the technique to bootstrap the building of custom image classifiers.
Xianyan Jia (Intel), Zhenhua Wang (
Xianyan Jia and Zhenhua Wang explore deep learning applications built successfully with BigDL. They also teach you how to develop fast prototypes with BigDL's off-the-shelf deep learning toolkit and build end-to-end deep learning applications with flexibility and scalability using BigDL on Spark.
Melanie Johnston-Hollitt (Victoria University of Wellington)
Keynote with Melanie Johnston-Hollitt
Graham Dumpleton (Red Hat)
Jupyter notebooks provide a rich interactive environment for working with data. Running a single notebook is easy, but what if you need to provide a platform for many users at the same time? Graham Dumpleton demonstrates how to use JupyterHub to run a highly scalable environment for hosting Jupyter notebooks in education and business.
Yu-Xi Lim (Teralytics), Michal Wegrzyn (Teralytics)
Yu-Xi Lim and Michal Wegrzyn outline a high-throughput distributed software pattern capable of processing event streams in real time. At its core, the pattern relies on functional reactive programming idioms to shard and splice state fragments, ensuring high horizontal scalability, reliability, and high availability.
Amit Das (Think Analytics India)
Access to credit in emerging markets is impeded by issues around identity verification, risk assessment and monitoring, and the costs of underwriting and collections. At the core of it all is a lack of data. Amit Das explains how accessing alternate data, real-time risk monitoring and data access solutions, and smart analytics is changing the lending landscape in India.
Yiqun Hu (Singapore Power)
Energy usage is a significant part of daily life, so the ability to monitor this use offers a number of benefits, from cost savings to improved safety. A key challenge is the lack of labeled data. Yiqun Hu shares a new solution: an RNN-based network trained to learn good features from unlabeled data.
Ricky Barron (InfoStrategy)
To many organizations, big data analytics is still a solution looking for a problem. Ricky Barron shares practical methods for getting the best out of your big data analytics capability and explains why establishing an "insights group" can improve the bottom line, drive performance, optimize processes, and create new data-driven products and solutions.
Prateek Nagaria (The Data Team)
Most data scientists use traditional methods of forecasting, such as exponential smoothing or ARIMA, to forecast product demand. However, when the product experiences several periods of zero demand, approaches such as Croston may provide better accuracy than these traditional methods. Prateek Nagaria compares traditional and Croston methods in R on intermittent demand time series.
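For readers unfamiliar with Croston's method, here is a minimal Python sketch of the basic idea. It is an illustration only, not the speaker's code (the session uses R). The function name `croston`, the default `alpha`, and the choice to initialize from the first nonzero demand are assumptions; published variants differ in initialization and bias correction.

```python
def croston(demand, alpha=0.1):
    """Return a Croston-style forecast of average demand per period.

    Nonzero demand sizes and the intervals between them are smoothed
    separately, and both are updated only in periods with demand.
    """
    size = None        # smoothed nonzero demand size
    interval = None    # smoothed number of periods between demands
    periods_since = 1  # periods elapsed since the last nonzero demand

    for d in demand:
        if d > 0:
            if size is None:
                # Assumed initialization: use the first observed demand.
                size, interval = float(d), float(periods_since)
            else:
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1

    if size is None:
        return 0.0  # no demand observed at all
    return size / interval


# Demand of 6 units arriving every third period.
forecast = croston([0, 0, 6, 0, 0, 6, 0, 0, 6])
```

Simple exponential smoothing applied to the same series would drift toward zero between demands and jump after each one, whereas this estimate stays at the average rate per period.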
Ofir Sharony (MyHeritage)
What are the most important considerations for shipping billions of daily events to analysis? Ofir Sharony shares MyHeritage's journey to find a reliable and efficient way to achieve real-time analytics. Along the way, Ofir compares several data loading techniques, helping you make better choices when building your next data pipeline.
Carme Artigas (Synergic Partners)
The concept of smart cities has evolved from sensored urban centers to platform ecosystems that combine data with new technologies such as the IoT, the cloud, and AI. Carme Artigas explores the challenges and opportunities of evolving from smart cities to intelligent societies.
Mark Donsky (Okera), Steven Ross (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Steven Ross and Mark Donsky outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
Yufeng Guo (Google)
Yufeng Guo walks you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng takes you from a conceptual overview all the way to building complex classifiers and explains how you can apply deep learning to complex problems in science and industry.
Vickye Jain (ZS Associates), Raghav Sharma (ZS Associates)
Vickye Jain and Raghav Sharma explain how they built a very high-performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance.
Paco Nathan (
Human-in-the-loop is an approach which has been used for simulation, training, UX mockups, etc. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions get referred to human experts.
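The pattern described above, automated handling with exceptions referred to human experts, can be sketched in a few lines of Python. This is a hypothetical illustration and not code from the session. The function names, the `(item, label, confidence)` tuple layout, and the `0.8` threshold are all assumptions.

```python
def route_predictions(predictions, threshold=0.8):
    """Split (item, label, confidence) triples into two queues.

    Confident predictions are accepted automatically; the rest are
    referred to a human expert for review.
    """
    automated, needs_review = [], []
    for item, label, confidence in predictions:
        if confidence >= threshold:
            automated.append((item, label))
        else:
            needs_review.append(item)
    return automated, needs_review


def most_uncertain(predictions, k=1):
    """Uncertainty sampling: return the k least confident items,
    which are the most useful ones for an expert to label next."""
    ranked = sorted(predictions, key=lambda p: p[2])
    return [item for item, _, _ in ranked[:k]]


# Hypothetical model output: (item id, predicted label, confidence).
preds = [("a", "spam", 0.95), ("b", "ham", 0.55), ("c", "spam", 0.80)]
automated, needs_review = route_predictions(preds)
```

Labels supplied by the experts for the referred items would then be fed back into training, which is what makes the loop "active".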
Ajey Gore (GO-JEK)
Drawing on his experience at GO-JEK, Ajey Gore explains how the impossible can be made possible with technology and data insights.
Joshua Bloom (GE Digital)
The ongoing digitization of the industrial-scale machines that power and enable human activity is itself a major global transformation. Joshua Bloom explains why the real revolution—in efficiencies and in improved and saved lives—will happen when machine learning automation and insights are properly coupled to the complex systems of industrial data.
Bargava Subramanian (Binaize), Amit Kapoor (narrativeVIZ)
One of the challenges of traditional data visualizations is that they are static and bounded by limited physical/pixel space. Interactive visualizations allow us to move beyond this limitation by adding layers of interaction. Bargava Subramanian and Amit Kapoor teach the art and science of creating interactive data visualizations.
Wataru Yukawa (LINE)
Data is a very important asset to LINE, one of the most popular messaging applications in Asia. Wataru Yukawa explains how LINE gets the most out of its data using a Hadoop data lake and an in-house log analysis platform.
Anand Chitipothu (rorodata)
There are many challenges to deploying machine learning models in production, including managing multiple versions of models, maintaining staging and production models, keeping track of model performance, logging, and scaling. Anand Chitipothu explores the tools, techniques, and system architecture of a cloud platform built to solve these challenges and the new opportunities it opens up.
Kira Radinsky (eBay | Technion)
Kira Radinsky offers an overview of a system that jointly mines 10 years of nation-wide medical records of more than 1.5 million people and extracts medical knowledge from Wikipedia to provide guidance about drug repurposing—the process of applying known drugs in new ways to treat diseases.
Gaurav Godhwani (Open Budgets India, Centre for Budget and Governance Accountability)
Most of India’s budget documents aren’t easily accessible. Those published online are mostly available as unstructured PDFs, making it difficult to search, analyze, and use this crucial data. Gaurav Godhwani discusses the process of creating Open Budgets India and making India’s budgets open, usable, and easy to comprehend.
Le Zhang (Microsoft), Graham Williams (Microsoft)
R has long been criticized for its limitations on scalable data analytics. What's needed is an R-centric paradigm that enables data scientists to elastically harness cloud resources of manifold computing capability for large-scale data analytics. Le Zhang and Graham Williams demonstrate how to operationalize an E2E enterprise-grade pipeline for big data analytics—all within R.
Graham Gear (Cloudera)
How can we drive more data pipelines, advanced analytics, and machine learning models into production? How can we do this both faster and more reliably? Graham Gear draws on real-world processes and systems to explain how it's possible to apply continuous delivery techniques to advanced analytics, realizing business value earlier and more safely.
Ben Lorica (O'Reilly)
Machine learning models are becoming increasingly widely used and deployed. Ben Lorica explains how to guard against flaws and failures in your machine learning deployments.
Pascale Fung (The Hong Kong University of Science and Technology)
Keynote with Pascale Fung
Mick Hollison (Cloudera), Cesar Delgado (Apple)
Twenty years ago, a company implored us to “think different” about personal computers. Today, Apple continues to live and breathe that legacy. It’s evident in the machine learning and analytics architectures that power many of the company’s most innovative applications. Cesar Delgado joins Mick Hollison to discuss how Apple is using its big data stack and expertise to solve non-data problems.
Mark Donsky (Okera), Syed Rafice (Cloudera)
Smart cities and the electricity smart grid have become leading examples of the IoT, in which distributed sensors describe mission-critical behavior by generating billions of metrics daily. Mark Donsky and Syed Rafice show how smart utilities and cities rely on Hadoop to capture, analyze, and harness this data to increase safety, availability, and efficiency across the entire electricity grid.
Felipe Hoffa (Google)
Organizations waste hours on endless discussions, and people lose sleep over internet debates. Can big data change this? Google Cloud is here to help. Felipe Hoffa explains that solid data-based conclusions are possible when stakeholders have easy access to analyze all relevant data.
Steve Leonard (SGInnovate)
Steve Leonard details how Singapore is bringing together ambitious and capable individuals and teams to imagine, start, build, and scale technology that can solve the world’s toughest challenges.
Isaac Reyes (DataSeer)
Isaac Reyes explores the art and science of data storytelling, covering the essential elements of a good data story, chart design and why it matters, the Gestalt principles of visual perception and how they can be used to tell better stories with data, and how to make over a poor visualization.
Amr Awadallah (Cloudera)
We are witnessing a new revolution in data—the age of decision automation. Amr Awadallah explains the historic importance of this next wave in automation and highlights the foundational capabilities required to enable it: machine learning and analytics optimized for the cloud.
Wai Chee Yau (Zendesk), Jeffrey Theobald (Zendesk)
Simply building a successful machine learning model is extremely challenging, and just as much effort is needed to turn that model into a customer-facing product. Drawing on their experience working on Zendesk's article recommendation product, Wai Yau and Jeffrey Theobald discuss design challenges and real-world problems you may encounter when building a machine learning product at scale.
Benjamin Wright-Jones (Microsoft), Simon Lidberg (Microsoft)
As organizations turn to data-driven strategies, they are also increasingly exploring the creation of a data science or analytic center of excellence (COE). Benjamin Wright-Jones and Simon Lidberg outline the building blocks of a center of excellence and describe the value for organizations embarking on data-driven strategies.
Aki Ariga (Cloudera)
Aki Ariga explains how to put your machine learning model into production, discusses common issues and obstacles you may encounter, and shares best practices and typical architecture patterns for deploying ML models, with example designs from the Hadoop and Spark ecosystem using Cloudera Data Science Workbench.
Grace Tang (Uber)
Being a data-driven company means that we have to move fast and fail often. But how do we learn to not only be proud of our failures but also turn these fails into wins? Grace Tang explains how to set up experiments so that negative results become epic wins, saving your team time, effort, and money, instead of just being swept under the carpet.