Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Sophie Watson (Red Hat)
Recommender systems enhance user experience and business revenue every day. Sophie Watson demonstrates how to develop a robust recommendation engine using a microservice architecture.
Francesca Lazzeri (Microsoft), Jaya Mathew (Microsoft)
With the growing buzz around data science, many professionals want to learn how to become a data scientist—the role Harvard Business Review called the "sexiest job of the 21st century." Francesca Lazzeri and Jaya Mathew explain what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.
Moty Fania (Intel), Sergei Kom (Intel)
Moty Fania and Sergei Kom share their experience and lessons learned implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming, and online actuation.
Milene Darnis (Uber)
Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis explains how the team built a scalable and self-serve platform that lets users plug in any metric to analyze.
Ankit Jain (Uber)
Personalization is a common theme in social networks and ecommerce businesses. Personalization at Uber involves an understanding of how each driver and rider is expected to behave on the platform. Ankit Jain explains how Uber employs deep learning using LSTMs and its huge database to understand and predict the behavior of each and every user on the platform.
DD Dasgupta (Cisco)
DD Dasgupta explores the exciting development of the edge-cloud continuum, which is redefining business models and technology strategies while creating a vast array of new applications that will power the digital age. The continuum is also destroying what we know about the centralized data centers and cloud computing infrastructures that were so vital to the success of the previous computing eras.
Andrew Montalenti ( )
What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how has gathered a unique dataset of news reading sessions of billions of devices, peaking at over two million sessions per minute on thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content.
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.
Mike Tung (Diffbot)
Mike Tung offers an overview of available open source and commercial knowledge graphs and explains how consumer and business applications are already taking advantage of them to provide intelligent experiences and enhanced business efficiency. Mike then discusses what's coming in the future.
Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)
Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices for and lessons learned about topics such as schema evolution and network architecture.
Atul Kale (Airbnb), Xiaohan Zeng (Airbnb)
Atul Kale and Xiaohan Zeng offer an overview of Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Built on Python, Spark, and Kubernetes, Bighead integrates popular libraries like TensorFlow, XGBoost, and PyTorch and is designed be used in modular pieces.
Joshua Laurito (Squarespace)
Joshua Laurito explores systems Squarespace built for acquiring and enforcing consistency on obtained data and for inferring conclusions from a company’s marketing and product initiatives. Joshua discusses the intricacies of gathering and evaluating marketing and user data, from raising awareness to driving purchases, and shares results of previous analyses.
Jonathan Ellis (DataStax)
Is open source Apache Cassandra still relevant in an era of hosted cloud databases? Jonathan Ellis discusses Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.
Paul Curtis (MapR Technologies)
Once the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure to provide the business insights? Paul Curtis explores three customer deployments that leverage the best of the private clouds and containers to provide a flexible big data environment.
Roger Barga (Amazon Web Services), Sudipto Guha (Amazon Web Services), Kapil Chhabra (Amazon Web Services )
Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.
Andrew J Brust (ZDNet | Blue Badge Insights)
Data governance has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging, and new products are rising to meet the challenge. Andrew Brust tracks data governance's past and present and offers a glimpse of the future.
Swetha Machanavajhala (Microsoft), Xiaoyong Zhu (Microsoft)
In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million in this world who are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.
Wangda Tan (Hortonworks)
In order to train deep learning and machine learning models, you must leverage applications such as TensorFlow, MXNet, Caffe, and XGBoost. Wangda Tan discusses new features in Apache Hadoop 3.x to better support deep learning workloads and demonstrates how to run these applications on YARN.
Dr. Vijay Srinivas Agneeswaran (SapientRazorfish), Abhishek Kumar (SapientRazorfish)
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.
James Dreiss (Reuters)
James Dreiss discusses the challenges in building a content recommendation system for one of the largest news sites in the world, The particularities of the system include developing a scrolling newsfeed and the use of document vectors for semantic representation of content.
Sanjeev Mohan (Gartner)
If the last few years were spent proving the value of data lakes, the emphasis now is to monetize the big data architecture investments. The rallying cry is to onboard new workloads efficiently. But how do you do so if you don’t know what data is in the lake, the level of its quality, or the trustworthiness of models? Sanjeev Mohan explains why data governance is the linchpin to success.
Mark Donsky (Okera), Steven Ross (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) went into effect for firms doing business in the EU, but many companies still aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
Creating a successful big data practice in your organization presents new challenges in managing projects and teams. Ted Malaska and Jonathan Seidman share guidance and best practices to help technical leaders deliver successful projects from planning to implementation.
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
JF Gagne (Element AI)
JF Gagne explains why the CIO is going to need a broader mandate in the company to better align their AI training and outcomes with business goals and compliance. This mandate should include an AI governance team that is well staffed and deeply established in the company, in order to catch biases that can develop from faulty goals or flawed data.
Andreea Kremm (Netex Group), Mohammed Ibraaz Syed (UCLA)
Narrative economics studies the impact of popular narratives and stories on economic fluctuations in the context of human interests and emotions. Andreea Kremm and Mohammed Ibraaz Syed describe the use of emotion analysis, entity relationship extraction, and topic modeling in modeling narratives from written human communication.
Julien Le Dem (WeWork)
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.
Janet Forbes, Danielle Leighton, and Lindsay Brin lead a primer on crafting well-conceived data science projects that uncover valuable business insights. Using case studies and hands-on skills development, Janet, Danielle, and Lindsay walk you through essential techniques for effecting real business change.
Mark Donsky (Okera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)
New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR.
Shawn Terry (Komatsu Mining Corp)
Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment to ultimately improve mine performance and efficiencies. Shawn Terry details the company's data journey and explains how it is using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment.
Osman Sarood (Mist Systems)
Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million.
John Thuma (Arcadia Data)
Forget about the fake news; data and analytics in politics is what drives elections. John Thuma shares ethical dilemmas he faced while proposing analytical solutions to the RNC and DNC. Not only did he help causes he disagreed with, but he also armed politicians with real-time data to manipulate voters.
Ian Brooks (Hortonworks)
The power of big data continues to modernize traditional industries, including healthcare. Ian Brooks explains how to implement intelligent preventive screening for conditions by applying electronic medical records (EMR) to predictive analytics via supervised machine learning techniques.
Harish Doddi (Datatron Technologies), Jerry Xu (Datatron Technologies)
Large financial institutions have many data science teams (e.g., those for fraud, credit risk, and marketing), each often using diverse set of tools to build predictive models. There are many challenges involved in productionizing these predictive AI models. Harish Doddi and Jerry Xu share challenges and lessons learned deploying AI models to production in large financial institutions.
Owen O'Malley (Hortonworks), Ryan Blue (Netflix)
Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities such as partition pruning, schema evolution, atomic additions, removal, or replacements of files regardless of whether the data is stored in Avro, ORC, or Parquet.
Timothy Spann (DZone)
Timothy Spann leads a hands-on deep dive into using Apache MiniFi with Apache MXNet and other deep learning libraries on edge devices.
Kevin Lu (PayPal), MAULIN VASAVADA (PayPal), Na Yang (PayPal)
PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing.
Michelle Casbon (Google)
Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project.
Yaroslav Tkachenko (Activision)
What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse. . .wait, this looks like a lot of things. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision.
Vartika Singh (Cloudera), Alan Silva (Cloudera), Alex Bleakley (Cloudera), Steven Totman (Cloudera), Mirko Kämpf (Cloudera), Syed Nasar (Cloudera)
Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.
Drew Paroski (MemSQL), Aatif Din (Fanatics)
Today’s successful businesses utilize data better than their competitors; however, data sprawl and inefficient data infrastructure restrict what’s possible. Blending the best of the past with the software innovations of today will solve future data challenges. Drew Paroski shares how to develop modern database applications without sacrificing cost savings, data familiarity, and flexibility.
Oleksii Kachaiev (Attendify)
When we talk about microservices, we usually focus on the communication layer. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity, such as structural and semantic changes, knowledge sharing, and data discovery. Join Alexey Kachayev to explore emerging technologies created to tackle these challenges.
Ben Lorica (O'Reilly Media)
As companies begin adopting machine learning, important considerations, including fairness, transparency, privacy, and security, need to be accounted for. Ben Lorica offers an overview of recent tools for building privacy-preserving and secure machine learning products and services.
Jennifer Lim (Cerner)
The use of data throughout Cerner had taxed the company's legacy operational data store, data warehouse, and enterprise reporting pipeline to the point where it would no longer scale to meet needs. Jennifer Lim explains how Cerner modernized its corporate data platform with the use of a hybrid cloud architecture.
Hilary Mason (Cloudera Fast Forward Labs)
Machine learning and artificial intelligence are exciting technologies, but real value comes from marrying those capabilities with the right business problems. Hilary Mason explores the current state of these technologies, investigates what's coming next in applied machine learning, and explains how to identify and execute on the right business opportunities at the right time.
Les McMonagle (BlueTalon)
Privacy by design is a fundamentally important approach to achieving compliance with GDPR and other data privacy or data protection regulations. Les McMonagle outlines how organizations can save time and money while improving data security and regulatory compliance and dramatically reduce the risk of a data breach or expensive penalties for noncompliance.
Julia Angwin (ProPublica)
Algorithms are increasingly arbiters of forgiveness. Julia Angwin discusses what she has learned about forgiveness in her series of articles on algorithmic accountability and the lessons we all need to learn for the coming AI future.
Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components.
Amro Alkhatib (National Health Insurance Company-Daman)
Processing claims is central to every insurance business. Amro Alkhatib shares a successful business case for automating claims processing, from idea to production. The machine learning-based claim automation model uses NLP methods on non-text data and allows auditable automated claims decisions to be made.
Bruno Gonçalves (JPMorgan Chase & Co.)
Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Join Bruno Gonçalves to learn how to use recurrent neural networks to model and forecast time series and discover the advantages and disadvantages of recurrent neural networks with respect to more traditional approaches.
Katharina Warzel (EveryMundo)
Airlines want to know what happens after a user interacts with their websites. Do they convert? Do they close the browser and come back later? Airlines traditionally have depended on analytics tools to prove value. Katharina Warzel explores how to implement a client-independent end-to-end tracking system.
Ramesh Krishnan (lmco), Steve Morgan (Lockheed Martin)
Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from its information assets, the company is constantly exploring ways to enable effective self-service scenarios. Ramesh Krishnan and Steve Morgan discuss Lockheed Martin's journey into modern analytics and explore its analytics platform focused on leveraging AWS GovCloud.
Amber Case (MIT Media Lab)
Amber Case outlines several methods that product designers and managers can use to improve everyday interactions through an understanding and application of sound design.
Ted Dunning (MapR)
There’s real value in big data and more waiting when you add real-time, but to get the payoff, you need successful deployments of your AI and data-intensive applications. You need to be ready with your current applications in production but must have an architecture and infrastructure that are ready for the next ones as well. Ted Dunning explores how others have fared in this journey.
Data scientists are hard to hire. But too often, companies struggle to find the right talent only to make avoidable mistakes that cause their best data scientists to leave. From org structure and leadership to tooling, infrastructure, and more, Michelangelo D'Agostino shares concrete (and inexpensive) tips for keeping your data scientists engaged, productive, and adding business value.
Ben Sharma (Zaloni)
Once, a company could live 60-70 years on the S&P 500. Now it averages 15 years. If companies were people, this would be an epidemic on par with the Black Plague. But the same things that dragged humanity out of that dark age can drag companies out of this one.
Ryan Blue (Netflix), Daniel Weeks (Netflix)
In the last few years, Netflix's data warehouse has grown to more than 100 PB in S3. Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3.
Anupam Singh (Cloudera), Brian Coyne (PNC)
Data volumes don’t translate to business value. What matters is your data platform’s ability to support unprecedented numbers of business users and use cases. Anupam Singh and Brian Coyne look at some of the challenges posed by data-hungry organizations and share new techniques to extract meaningful insights at the speed of today’s modern business.
Cassie Kozyrkov (Google)
Why do businesses fail at machine learning despite its tremendous potential and the excitement it generates? Is the answer always in data, algorithms, and infrastructure, or is there a subtler problem? Will things improve in the near future? Let's talk about some lessons learned at Google and what they mean for applied data science.
Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases.
Han Yang (Cisco Systems)
Data is the lifeblood of an enterprise, and it's being generated everywhere. To overcome the challenges of data gravity, data analytics, including machine learning, is best done where the data is located: ubiquitous machine learning. Han Yang explains how to overcome the challenges of machine learning everywhere.
Tim Walpole (BJSS)
Financial service clients demand increased data-driven personalization, faster insight-based decisions, and multichannel real-time access. Tim Walpole details how organizations can deliver real-time, vendor-agnostic, personalized chat services and explores issues around security, privacy, legal sign-off, data compliance, and how the internet of things can be used as a delivery platform.
IBM Analytics’s Dinesh Nirmal solves school lunch and the struggle to keep ahead of regulations. With AI tech like deep learning and NLG, supplying meals to California’s kids leaps from enriching metadata for compliance to actionable insights for the business.
Patty Ryan (Microsoft), CY Yam (Microsoft), Elena Terenzi (Microsoft)
Large online fashion retailers must efficiently maintain catalogues of millions of items. Due to human error, it's not unusual that some items have duplicate entries. Since manually trawling such a large catalogue is next to impossible, how can you find these entries? Patty Ryan, CY Yam, and Elena Terenzi explain how they applied deep learning for image segmentation and background removal.
Fabian Hueske (data Artisans)
Fabian Hueske discusses why SQL is a great approach to unify batch and stream processing. He gives an update on Apache Flink's SQL support and shares some interesting use cases from large-scale production deployments. Finally, Fabian presents Flink's new query service that enables users and applications to submit streaming and batch SQL queries and retrieve low-latency updated results.
William Benton (Red Hat)
Containers are a hot technology for application developers, but they also provide key benefits for data scientists. William Benton details the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively.
Varant Zanoyan (Airbnb)
Zipline is Airbnb’s soon to be open-sourced data management platform specifically designed for ML use cases. It has taken the task of feature generation from months to days and offers features to support end-to-end data management for machine learning. Varant Zanoyan covers Zipline's architecture and dives into how it solves ML-specific problems.