Bargava Subramanian and Harjinder Mistry share data engineering and machine learning strategies for building an efficient real-time recommendation engine when the transaction data is both big and wide. They also outline a novel way of generating frequent patterns using collaborative filtering and matrix factorization on Apache Spark and serving the results with Elasticsearch in the cloud.
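To illustrate the matrix factorization at the heart of such an engine, here is a toy single-machine sketch of alternating least squares in plain NumPy; it is not the Spark implementation the talk covers, and the rank, regularization, and sample matrix are illustrative assumptions.

```python
import numpy as np

def als(ratings, k=2, n_iters=20, reg=0.1):
    """Minimal alternating least squares on a dense user-item matrix.
    Zeros are treated as missing (unobserved) entries."""
    n_users, n_items = ratings.shape
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, k))  # item factors
    mask = ratings > 0
    for _ in range(n_iters):
        # Fix V, solve a small regularized least-squares problem per user.
        for u in range(n_users):
            idx = mask[u]
            if idx.any():
                A = V[idx].T @ V[idx] + reg * np.eye(k)
                b = V[idx].T @ ratings[u, idx]
                U[u] = np.linalg.solve(A, b)
        # Fix U, solve per item.
        for i in range(n_items):
            idx = mask[:, i]
            if idx.any():
                A = U[idx].T @ U[idx] + reg * np.eye(k)
                b = U[idx].T @ ratings[idx, i]
                V[i] = np.linalg.solve(A, b)
    return U, V

# Tiny example: two taste clusters; zeros are unknown ratings to predict.
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
U, V = als(R)
pred = U @ V.T  # predicted scores, including the missing entries
```

On Spark, the same idea runs at scale via `pyspark.ml.recommendation.ALS`, which distributes these per-user and per-item solves across the cluster.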
Paco Nathan explains how O'Reilly employs AI, from the obvious (chatbots, case studies about other firms) to the less so (using AI to show the structure of content in detail, enhance search and recommendations, and guide editors for gap analysis, assessment, pathing, etc.). Approaches include vector embedding search, summarization, TDA for content gap analysis, and speech-to-text to index video.
Bas Geerdink explains why and how ING is becoming increasingly data-driven, sharing use cases, architecture, and technology choices along the way.
Apache Spark ML and MLlib are hugely popular in the big data ecosystem, and Intel has been deeply involved in Spark from a very early stage. Peng Meng outlines the methodology behind Intel's work on Spark ML and MLlib optimization and shares a case study on boosting the performance of Spark MLlib ALS by 60x in JD.com’s production environment.
Arun Veettil shares his experience and lessons learned developing a customized, enterprise-level NLP platform to replace a leading text analytics vendor platform.
Kudu is designed to fill the gap between HDFS and HBase. However, designing a Kudu-based cluster presents a number of challenges. Wei Chen and Zhaojuan Bian share a real-world use case from the automobile industry to explain how to design a Kudu-based E2E system. They also discuss key indicators to tune Kudu and OS parameters and how to select the best hardware components for different scenarios.
Data transfer is one of the most pressing problems for telecom companies, as costs rise in tandem with growing data requirements. Yousun Jeong details how SKT has dealt with this problem.
Transfer learning enables you to use pretrained deep neural networks (e.g., AlexNet, ResNet, and Inception V3) and adapt them for custom image classification tasks. Danielle Dean and Wee Hyong Tok walk you through the basics of transfer learning and demonstrate how you can use the technique to bootstrap the building of custom image classifiers.
Xianyan Jia and Zhenhua Wang explore deep learning applications built successfully with BigDL. They also teach you how to develop fast prototypes with BigDL's off-the-shelf deep learning toolkit and build end-to-end deep learning applications with flexibility and scalability using BigDL on Spark.
Keynote with Melanie Johnston-Hollitt
Jupyter notebooks provide a rich interactive environment for working with data. Running a single notebook is easy, but what if you need to provide a platform for many users at the same time? Graham Dumpleton demonstrates how to use JupyterHub to run a highly scalable environment for hosting Jupyter notebooks in education and business.
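As a minimal sketch of what a multi-user JupyterHub deployment can look like, a `jupyterhub_config.py` might read as follows; the authenticator, spawner, image, and resource limits here are illustrative assumptions, not the specific setup the talk describes.

```python
# jupyterhub_config.py -- a minimal multi-user sketch (choices are assumptions)
c.JupyterHub.bind_url = 'http://0.0.0.0:8000'

# Authenticate against local system accounts (swap in OAuth/LDAP as needed).
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

# Launch each user's notebook server in its own container
# (requires the dockerspawner package).
c.JupyterHub.spawner_class = 'dockerspawner.DockerSpawner'
c.DockerSpawner.image = 'jupyter/scipy-notebook'

# Per-user resource caps so one user cannot starve the others.
c.Spawner.mem_limit = '2G'
c.Spawner.cpu_limit = 1.0
```

Whether the memory and CPU limits are actually enforced depends on the spawner; container-based spawners such as DockerSpawner and KubeSpawner honor them.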
Yu-Xi Lim and Michal Wegrzyn outline a high-throughput distributed software pattern capable of processing event streams in real time. At its core, the pattern relies on functional reactive programming idioms to shard and splice state fragments, ensuring high horizontal scalability, reliability, and high availability.
To many organizations, big data analytics is still a solution looking for a problem. Ricky Barron shares practical methods for getting the best out of your big data analytics capability and explains why establishing an "insights group" can improve the bottom line, drive performance, optimize processes, and create new data-driven products and solutions.
Most data scientists use traditional methods of forecasting, such as exponential smoothing or ARIMA, to forecast product demand. However, when a product experiences several periods of zero demand, approaches such as Croston's method may provide better accuracy than these traditional methods. Prateek Nagaria compares traditional methods and Croston's method in R on intermittent demand time series.
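To give a rough sense of the idea, here is a plain-Python sketch of Croston's classic method (not the R code compared in the talk; the smoothing constant and demand series are illustrative): nonzero demand sizes and the intervals between them are smoothed separately, and the per-period forecast is their ratio.

```python
def croston(demand, alpha=0.1):
    """Croston's method for intermittent demand (minimal sketch).

    Smooths nonzero demand sizes (z) and inter-demand intervals (p)
    separately with simple exponential smoothing; the one-step-ahead
    per-period forecast is z / p."""
    z = p = None  # smoothed demand size and smoothed interval
    q = 1         # periods elapsed since the last nonzero demand
    fc = []
    for d in demand:
        # Forecast is issued before observing the current period.
        fc.append(0.0 if z is None else z / p)
        if d > 0:
            if z is None:             # initialize on first nonzero demand
                z, p = float(d), float(q)
            else:
                z = alpha * d + (1 - alpha) * z
                p = alpha * q + (1 - alpha) * p
            q = 1
        else:
            q += 1
    return fc

demand = [0, 3, 0, 0, 4, 0, 2, 0, 0, 0, 5]  # intermittent series
fc = croston(demand)
```

Unlike exponential smoothing applied directly, the forecast does not decay toward zero during runs of zero demand, which is exactly the property that helps on intermittent series.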
Keynote by Bruno Fernandez-Ruiz
What are the most important considerations when shipping billions of daily events for analysis? Ofir Sharony shares MyHeritage's journey to find a reliable and efficient way to achieve real-time analytics. Along the way, Ofir compares several data loading techniques, helping you make better choices when building your next data pipeline.
The concept of smart cities has evolved from sensor-equipped urban centers to platform ecosystems that combine data with new technologies such as the IoT, the cloud, and AI. Carme Artigas explores the challenges and opportunities of evolving from smart cities to intelligent societies.
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Steven Ross and Mark Donsky outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
Yufeng Guo walks you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng takes you from a conceptual overview all the way to building complex classifiers and explains how you can apply deep learning to complex problems in science and industry.
Vickye Jain and Raghav Sharma explain how they built a very high-performance data processing platform powered by Spark that balances the considerations of extreme performance, speed of development, and cost of maintenance.
Human-in-the-loop is an approach that has been used for simulation, training, UX mockups, and more. A more recent design pattern is emerging for human-in-the-loop (HITL) as a way to manage teams working with machine learning (ML). A variant of semi-supervised learning called _active learning_ allows for mostly automated processes based on ML, where exceptions are referred to human experts.
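The selection step of such an active learning loop can be sketched in a few lines; this is a generic uncertainty-sampling illustration under assumed names and thresholds, not a specific system from the talk. Predictions the model is confident about flow through automatically, while low-confidence cases are routed to human experts.

```python
import numpy as np

def uncertainty_sample(probs, n=1, threshold=0.6):
    """Pick the examples the model is least sure about (minimal sketch).

    probs: (n_samples, n_classes) predicted class probabilities.
    Returns up to n indices whose top-class confidence falls below
    `threshold`, ordered most-uncertain first -- these are the
    exceptions referred to human experts for labeling."""
    confidence = probs.max(axis=1)
    uncertain = np.where(confidence < threshold)[0]
    order = np.argsort(confidence[uncertain])
    return uncertain[order][:n]

probs = np.array([[0.95, 0.05],   # confident: handled automatically
                  [0.55, 0.45],   # near the decision boundary
                  [0.20, 0.80],   # confident
                  [0.51, 0.49]])  # most uncertain
picked = uncertainty_sample(probs, n=2)
```

The human-provided labels for the picked examples then feed back into the training set, so the model improves fastest exactly where it is weakest.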
Drawing on his experience at GO-JEK, Ajey Gore explains how the impossible can be made possible with technology and data insights.
The ongoing digitization of the industrial-scale machines that power and enable human activity is itself a major global transformation. Joshua Bloom explains why the real revolution—in efficiencies and in improved and saved lives—will happen when machine learning automation and insights are properly coupled to the complex systems of industrial data.
One of the challenges with traditional data visualizations is that they are static and bounded by limited physical/pixel space. Interactive visualizations allow us to move beyond this limitation by adding layers of interaction. Bargava Subramanian and Amit Kapoor teach the art and science of creating interactive data visualizations.
Data is a very important asset to LINE, one of the most popular messaging applications in Asia. Wataru Yukawa explains how LINE gets the most out of its data using a Hadoop data lake and an in-house log analysis platform.
There are many challenges to deploying machine learning models in production, including managing multiple versions of models, maintaining staging and production models, keeping track of model performance, logging, and scaling. Anand Chitipothu explores the tools, techniques, and system architecture of a cloud platform built to solve these challenges and the new opportunities it opens up.
Kira Radinsky offers an overview of a system that jointly mines 10 years of nationwide medical records of more than 1.5 million people and extracts medical knowledge from Wikipedia to provide guidance about drug repurposing—the process of applying known drugs in new ways to treat diseases.
Gaurav Godhwani (Open Budgets India, Centre for Budget and Governance Accountability)
Most of India’s budget documents aren’t easily accessible. Those published online are mostly available as unstructured PDFs, making it difficult to search, analyze, and use this crucial data. Gaurav Godhwani discusses the process of creating Open Budgets India and making India’s budgets open, usable, and easy to comprehend.
R has long been criticized for its limitations on scalable data analytics. What's needed is an R-centric paradigm that enables data scientists to elastically harness cloud resources of manifold computing capability for large-scale data analytics. Le Zhang and Graham Williams demonstrate how to operationalize an E2E enterprise-grade pipeline for big data analytics—all within R.
How can we drive more data pipelines, advanced analytics, and machine learning models into production? How can we do this both faster and more reliably? Graham Gear draws on real-world processes and systems to explain how it's possible to apply continuous delivery techniques to advanced analytics, realizing business value earlier and more safely.
Machine learning models are becoming increasingly widely used and deployed. Ben Lorica explains how to guard against flaws and failures in your machine learning deployments.
Pascale Fung (The Hong Kong University of Science and Technology)
Keynote with Pascale Fung
Twenty years ago, a company implored us to “think different” about personal computers. Today, Apple continues to live and breathe that legacy. It’s evident in the machine learning and analytics architectures that power many of the company’s most innovative applications. Cesar Delgado joins Mick Hollison to discuss how Apple is using its big data stack and expertise to solve non-data problems.
Smart cities and the electricity smart grid have become leading examples of the IoT, in which distributed sensors describe mission-critical behavior by generating billions of metrics daily. Mark Donsky and Syed Rafice show how smart utilities and cities rely on Hadoop to capture, analyze, and harness this data to increase safety, availability, and efficiency across the entire electricity grid.
Organizations waste hours on endless discussions, and people lose sleep over internet debates. Can big data change this? Google Cloud is here to help. Felipe Hoffa explains that solid data-based conclusions are possible when stakeholders have easy access to analyze all relevant data.
Steve Leonard details how Singapore is bringing together ambitious and capable individuals and teams to imagine, start, build, and scale technology that can solve the world’s toughest challenges.
We are witnessing a new revolution in data—the age of decision automation. Amr Awadallah explains the historic importance of this next wave in automation and highlights the foundational capabilities required to enable it: machine learning and analytics optimized for the cloud.
Building a successful machine learning model is extremely challenging in itself, and just as much effort is needed to turn that model into a customer-facing product. Drawing on their experience working on Zendesk's article recommendation product, Wai Yau and Jeffrey Theobald discuss design challenges and real-world problems you may encounter when building a machine learning product at scale.
As organizations turn to data-driven strategies, they are also increasingly exploring the creation of a data science or analytic center of excellence (COE). Benjamin Wright-Jones and Simon Lidberg outline the building blocks of a center of excellence and describe the value for organizations embarking on data-driven strategies.
Aki Ariga explains how to put your machine learning model into production, discusses common issues and obstacles you may encounter, and shares best practices and typical architecture patterns for deploying ML models, with example designs from the Hadoop and Spark ecosystem using Cloudera Data Science Workbench.
Being a data-driven company means that we have to move fast and fail often. But how do we learn not only to be proud of our failures but also to turn these failures into wins? Grace Tang explains how to set up experiments so that negative results become epic wins, saving your team time, effort, and money, instead of just being swept under the carpet.