Speaker Slides: Big data conference & machine learning training

Sophie Watson (Red Hat)

Download slides (PDF)

Recommender systems enhance user experience and business revenue every day. Sophie Watson demonstrates how to develop a robust recommendation engine using a microservice architecture.

A day in the life of a data scientist: How do we train our teams to get started with AI?

Francesca Lazzeri (Microsoft), Jaya Susan Mathew (Microsoft)

Download slides (PDF)

With the growing buzz around data science, many professionals want to learn how to become a data scientist—the role Harvard Business Review called the "sexiest job of the 21st century." Francesca Lazzeri and Jaya Mathew explain what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.

A high-performance system for deep learning inference and visual inspection

Moty Fania (Intel), Sergei Kom (Intel)

Download slides (PDF)

Moty Fania and Sergei Kom share their experience and lessons learned implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming, and online actuation.

A/B testing at Uber: How we built a BYOM (bring your own metrics) platform

Milene Darnis (Uber)

Download slides (PDF)

Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis explains how the team built a scalable and self-serve platform that lets users plug in any metric to analyze.

Achieving personalization with LSTMs

Ankit Jain (Uber)

Download slides (PDF)

Personalization is a common theme in social networks and ecommerce businesses. Personalization at Uber involves an understanding of how each driver and rider is expected to behave on the platform. Ankit Jain explains how Uber employs deep learning using LSTMs and its huge database to understand and predict the behavior of each and every user on the platform.

Agile for data science teams

Jennifer Prendki (Figure Eight)

Download slides (PDF)

Agile methodologies have been widely successful for software engineering teams but seem inappropriate for data science teams, because data science is part engineering, part research. Jennifer Prendki demonstrates how, with a minimum amount of tweaking, data science managers can adapt Agile techniques and establish best practices to make their teams more efficient.

AI, ML, and the IoT will destroy the data center and the cloud (just not in the way you think) (sponsored by Cisco)

DD Dasgupta (Cisco)

Watch the keynote

DD Dasgupta explores the exciting development of the edge-cloud continuum, which is redefining business models and technology strategies while creating a vast array of new applications that will power the digital age. The continuum is also destroying what we know about the centralized data centers and cloud computing infrastructures that were so vital to the success of the previous computing eras.

Analytics maturity: Industry trends and financial impacts

Bill Franks (International Institute For Analytics)

Download slides (PDF)

Drawing on a recent study of the analytics maturity level of large enterprises by the International Institute for Analytics, Bill Franks discusses how maturity varies by industry, shares key steps organizations can take to move up the maturity scale, and explains how the research correlates analytics maturity with a wide range of success metrics, including financial and reputational measures.

Applying petabyte-scale analytics and machine learning to billions of news reading sessions

Andrew Montalenti (Parse.ly )

Download slides (PDF)

What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how Parse.ly has gathered a unique dataset of news reading sessions of billions of devices, peaking at over two million sessions per minute on thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content.

Architecting a data platform for enterprise use

Mark Madsen (Teradata), Todd Walter (Archimedata)

Download slides (PDF)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Architecting a next-generation data platform

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Download slides (PDF)

Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.

Automating business processes with large-scale knowledge graphs

Mike Tung (Diffbot)

Download slides (PDF)

Mike Tung offers an overview of available open source and commercial knowledge graphs and explains how consumer and business applications are already taking advantage of them to provide intelligent experiences and enhanced business efficiency. Mike then discusses what's coming in the future.

Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from a multiple services with Apache Kafka and Google Cloud Platform

Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)

Download slides (PDF)

Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices for and lessons learned about topics such as schema evolution and network architecture.

Bighead: Airbnb's end-to-end machine learning platform

Atul Kale (Airbnb), Xiaohan Zeng (Airbnb)

Download slides (PDF)

Atul Kale and Xiaohan Zeng offer an overview of Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Built on Python, Spark, and Kubernetes, Bighead integrates popular libraries like TensorFlow, XGBoost, and PyTorch and is designed be used in modular pieces.

Building it beautiful: Analyzing the effectiveness of platform products and marketing at scale

Josh Laurito (Squarespace)

Download slides (PDF)

Joshua Laurito explores systems Squarespace built for acquiring and enforcing consistency on obtained data and for inferring conclusions from a company’s marketing and product initiatives. Joshua discusses the intricacies of gathering and evaluating marketing and user data, from raising awareness to driving purchases, and shares results of previous analyses.

Cassandra versus cloud databases

Jonathan Ellis (DataStax)

Download slides (PDF)

Is open source Apache Cassandra still relevant in an era of hosted cloud databases? Jonathan Ellis discusses Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.

Circuit breakers to safeguard for garbage in, garbage out

Sandeep Uttamchandani (Intuit)

Download slides (PDF)

Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines, similar to the common design pattern used in service architectures, that detects and corrects problems and ensures always reliable insights.

Clouds and containers: Case studies for big data

Paul Curtis (Weaveworks)

Download slides (PPTX)

Once the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure to provide the business insights? Paul Curtis explores three customer deployments that leverage the best of the private clouds and containers to provide a flexible big data environment.

Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda)

Mathew Lodge (Anaconda)

Download slides (PDF)

The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Welcome to the future. Containers and Kubernetes make great language-agnostic distributed computing clusters: it's just as easy to deploy Python as it is Java. Mathew Lodge shows you how.

Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda)

Mathew Lodge (Anaconda)

Download slides (PDF)

The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Welcome to the future. Containers and Kubernetes make great language-agnostic distributed computing clusters: it's just as easy to deploy Python as it is Java. Mathew Lodge shows you how.

Continuous machine learning over streaming data: The story continues.

Roger Barga (Amazon Web Services), Sudipto Guha (Amazon Web Services), Kapil Chhabra (Amazon Web Services )

Download slides (1-PDF)

Download slides (2-PDF)

Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.

Data discovery and lineage: Integrating streaming data in the public cloud with on-prem, classic data stores, and heterogeneous schema types

Barbara Eckman (Comcast)

Download slides (PDF)

Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro.

Data governance: A big job that's getting bigger

Andrew Brust (Blue Badge Insights | ZDNet)

Download slides (PPTX)

Data governance has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging, and new products are rising to meet the challenge. Andrew Brust tracks data governance's past and present and offers a glimpse of the future.

Data operations problems created by deep learning and how to fix them (sponsored by MapR)

Jim Scott (NVIDIA)

Download slides (PDF)

Drawing on his experience working with customers across many industries, including chemical sciences, healthcare, and oil and gas, Jim Scott details the major impediments to successful completion of deep learning projects and solutions while walking you through a customer use case.

Deep learning on audio in Azure to detect sounds in real time

Swetha Machanavajhala (Microsoft), Xiaoyong Zhu (Microsoft)

Download slides (PDF)

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million in this world who are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.

Deep learning on YARN: Running distributed TensorFlow, MXNet, Caffe, and XGBoost on Hadoop clusters

Wangda Tan (Cloudera)

Download slides (PPTX)

In order to train deep learning and machine learning models, you must leverage applications such as TensorFlow, MXNet, Caffe, and XGBoost. Wangda Tan discusses new features in Apache Hadoop 3.x to better support deep learning workloads and demonstrates how to run these applications on YARN.

Deep learning-based search and recommendation systems using TensorFlow

Vijay Agneeswaran (Walmart Labs), Abhishek Kumar (Publicis Sapient)

View slides

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.

Deep learning: Assessing analytics project feasibility and requirements (sponsored by NVIDIA)

Ward Eldred (NVIDIA)

Download slides (PDF)

Ward Eldred offers an overview of the types of analytical problems that can be solved using deep learning and shares a set of heuristics that can be used to evaluate the feasibility of analytical AI projects.

DIY versus designer approaches to deploying data center infrastructure for machine learning and analytics

Cory Minton (Dell EMC), Colm Moynihan (Cloudera)

View slides

Cory Minton and Colm Moynihan explain how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble.

Document vectors in the wild: Building a content recommendation system for Reuters.com

James Dreiss (Reuters)

Download slides (PDF)

James Dreiss discusses the challenges in building a content recommendation system for one of the largest news sites in the world, Reuters.com. The particularities of the system include developing a scrolling newsfeed and the use of document vectors for semantic representation of content.

Enacting Data Subject Access Rights for GDPR with data services and data management

Jean-Michel Franco (Talend)

Download slides (PDF)

GDPR is more than another regulation to be handled by your back office. Enacting the GDPR's Data Subject Access Rights (DSAR) requires practical actions. Jean-Michel Franco outlines the practical steps to deploy governed data services.

Executive Briefing: Best practices for human in the loop—The business case for active learning

Paco Nathan (derwen.ai)

View slides

Deep learning works well when you have large labeled datasets, but not every team has those assets. Paco Nathan offers an overview of active learning, an ML variant that incorporates human-in-the-loop computing. Active learning focuses input from human experts, leveraging intelligence already in the system, and provides systematic ways to explore and exploit uncertainty in your data.

Executive Briefing: Enhance your data lake with comprehensive data governance to improve adoption and meet compliance needs

Sanjeev Mohan (Gartner)

Download slides (PDF)

If the last few years were spent proving the value of data lakes, the emphasis now is to monetize the big data architecture investments. The rallying cry is to onboard new workloads efficiently. But how do you do so if you don’t know what data is in the lake, the level of its quality, or the trustworthiness of models? Sanjeev Mohan explains why data governance is the linchpin to success.

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations

Mark Donsky (Okera), Steven Ross (Cloudera)

Download slides (PDF)

In May 2018, the General Data Protection Regulation (GDPR) went into effect for firms doing business in the EU, but many companies still aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.

Executive Briefing: Managing successful data projects—Technology selection and team building

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Download slides (PDF)

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. Ted Malaska and Jonathan Seidman share guidance and best practices to help technical leaders deliver successful projects from planning to implementation.

Executive Briefing: What you need to know about fast data

Dean Wampler (Anyscale)

View slides

Streaming data systems, so called "fast data," promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler shares what you need to know to exploit fast data successfully.

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it

David Talby (Pacific AI)

Download slides (PDF)

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

Feet on the ground, head in the clouds (sponsored by AtScale)

Mark Stange-Tregear (Ebates)

Download slides (PPTX)

Interested in how Ebates is using a hybrid on-premises and cloud implementation to scale out its centralized business intelligence and data hub? Mark Stange-Tregear shares the history, business context, and technical plan around Ebates’s hybrid Hadoop-AWS cloud approach.

From chaos to insight: Automatically derive value from your user-generated content

Stephanie Fischer (datanizing GmbH)

Download slides (PDF)

Whether customer emails, product reviews, company wikis, or support communities, user-generated content (UGC) as a form of unstructured text is everywhere, and it’s growing exponentially. Stephanie Fischer explains how to discover meaningful insights from the UGC of a famous New York discussion forum.

From data governance to AI governance: The CIO's new role

JF Gagne (Element AI)

Download slides (PDF)

JF Gagne explains why the CIO is going to need a broader mandate in the company to better align their AI training and outcomes with business goals and compliance. This mandate should include an AI governance team that is well staffed and deeply established in the company, in order to catch biases that can develop from faulty goals or flawed data.

From emotion analysis and topic extraction to narrative modeling

Andreea Kremm (Netex Group), Mohammed Ibraaz Syed (UCLA)

Download slides (PDF)

Narrative economics studies the impact of popular narratives and stories on economic fluctuations in the context of human interests and emotions. Andreea Kremm and Mohammed Ibraaz Syed describe the use of emotion analysis, entity relationship extraction, and topic modeling in modeling narratives from written human communication.

From flat files to deconstructed database: The evolution and future of the big data ecosystem

Julien Le Dem (WeWork)

View slides

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

From theory to data product: Applying data science methods to effect business change

Janet Forbes (T4G), Danielle Leighton (T4G), Lindsay Brin (T4G)

Download slides (1-PDF)

Download slides (2-PDF)

Janet Forbes, Danielle Leighton, and Lindsay Brin lead a primer on crafting well-conceived data science projects that uncover valuable business insights. Using case studies and hands-on skills development, Janet, Danielle, and Lindsay walk you through essential techniques for effecting real business change.

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step

Mark Donsky (Okera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)

Download slides (PDF)

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR.

Governing your cloud-based enterprise data lake (sponsored by Zaloni)

Ben Sharma (Zaloni), Selwyn Collaco (TMX)

View slides

Selwyn Collaco and Ben Sharma share insights from their real-world experience and discuss best practices for architecture, technology, data management, and governance to enable centralized data services and explain how to leverage the Zaloni Data Platform (ZDP), an integrated self-service data platform, to operationalize the enterprise data lake .

How Komatsu is improving mining efficiencies using the IoT and machine learning

Shawn Terry (Komatsu Mining Corp)

Download slides (1-PDF)

Download slides (2-PDF)

Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment to ultimately improve mine performance and efficiencies. Shawn Terry details the company's data journey and explains how it is using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment.

How to cost-effectively and reliably build infrastructure for machine learning

Osman Sarood (Mist Systems)

Download slides (PDF)

Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million.

Hudi: Unifying storage and serving for batch and near-real-time analytics

Nishith Agarwal (Uber), Balaji Varadarajan (Uber), Vinoth Chandar (Apache Hudi)

Download slides (PPTX)

Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second-generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

If you thought politics was dirty, you should see the analytics behind it.

John Thuma (Arcadia Data)

View slides

Forget about the fake news; data and analytics in politics is what drives elections. John Thuma shares ethical dilemmas he faced while proposing analytical solutions to the RNC and DNC. Not only did he help causes he disagreed with, but he also armed politicians with real-time data to manipulate voters.

Improving patient screening by applying predictive analytics to electronic medical records.

Ian Brooks (Cloudera)

Download slides (PPTX)

The power of big data continues to modernize traditional industries, including healthcare. Ian Brooks explains how to implement intelligent preventive screening for conditions by applying electronic medical records (EMR) to predictive analytics via supervised machine learning techniques.

Infrastructure for deploying machine learning to production in large financial institutions: Lessons learned and best practices

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Download slides (PDF)

Large financial institutions have many data science teams (e.g., those for fraud, credit risk, and marketing), each often using diverse set of tools to build predictive models. There are many challenges involved in productionizing these predictive AI models. Harish Doddi and Jerry Xu share challenges and lessons learned deploying AI models to production in large financial institutions.

Introducing Iceberg: Tables designed for object stores

Owen O'Malley (Cloudera), Ryan Blue (Netflix)

Download slides (PDF)

Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities such as partition pruning, schema evolution, atomic additions, removal, or replacements of files regardless of whether the data is stored in Avro, ORC, or Parquet.

IoT edge processing with Apache NiFi, Apache MiniFi, and multiple deep learning libraries

TIMOTHY SPANN (Cloudera)

Download slides (ZIP)

Timothy Spann leads a hands-on deep dive into using Apache MiniFi with Apache MXNet and other deep learning libraries on edge devices.

Job recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL

Guoqiong Song (Intel), Wenjing Zhan (Talroo), Jacob Eisinger (Talroo )

Download slides (PDF)

Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? Guoqiong Song, Wenjing Zhan, and Jacob Eisinger demonstrate how to leverage distributed deep learning framework BigDL on Apache Spark to predict a candidate’s probability of applying to specific jobs based on their résumé.

Kafka at PayPal: Enabling 400 billion messages a day

Kevin Lu (PayPal), Maulin Vasavada (PayPal), Na Yang (PayPal)

Download slides (PPTX)

PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing.

Kubeflow explained: Portable machine learning on Kubernetes

Michelle Casbon (Google)

Download slides (PDF)

Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project.

Lessons learned building a scalable and extendable data pipeline for Call of Duty

Yaroslav Tkachenko (Activision)

Download slides (PDF)

What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse. . .wait, this looks like a lot of things. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision.

Leveraging Spark and deep learning frameworks to understand data at scale

Vartika Singh (Cloudera), Alan Silva (Cloudera), Alex Bleakley (Cloudera), Steven Totman (Cloudera), Mirko Kämpf (Cloudera), Syed Nasar (Cloudera)

Download slides (1-PDF)

Download slides (2-ODP)

Download slides (3-PDF)

Download slides (4-PPTX)

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.

Leveraging the best of the past to power a better future (sponsored by MemSQL)

Drew Paroski (MemSQL), Aatif Din (Fanatics)

Watch the keynote

Today’s successful businesses utilize data better than their competitors; however, data sprawl and inefficient data infrastructure restrict what’s possible. Blending the best of the past with the software innovations of today will solve future data challenges. Drew Paroski shares how to develop modern database applications without sacrificing cost savings, data familiarity, and flexibility.

Machine learning for nonstationary streaming data using Structured Streaming and StreamDM

Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)

View slides

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specially for nonstationary streams (i.e., those with concept drifts).

Managing data chaos in the world of microservices

Oleksii Kachaiev (Attendify)

Download slides (PDF)

When we talk about microservices, we usually focus on the communication layer. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity, such as structural and semantic changes, knowledge sharing, and data discovery. Join Alexey Kachayev to explore emerging technologies created to tackle these challenges.

Managing data science in the enterprise

Joshua Poduska (Domino Data Lab), Patrick Harrison (S&P Global)

Download slides (PDF)

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Joshua Poduska and Patrick Harrison detail how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage

Managing risk in machine learning

Ben Lorica (O'Reilly)

Watch the keynote

As companies begin adopting machine learning, important considerations, including fairness, transparency, privacy, and security, need to be accounted for. Ben Lorica offers an overview of recent tools for building privacy-preserving and secure machine learning products and services.

Mapping India

S Anand (Gramener)

Download slides (PPTX)

Answering simple questions about India's geography can be a nightmare. Official shape files are not publicly available. Worse, each ministry uses their own maps. But an active group of volunteers are crafting open maps. Anand S explains what it takes for a grass-roots initiative to transform a country's data infrastructure.

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

Danny Chen (Uber Technologies), Omkar Joshi (Uber), Eric Sayle (Uber Technologies)

Download slides (PDF)

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works.

MLflow: An open platform to simplify the machine learning lifecycle

Mani Parkhe (Databricks), Andrew Chen (Databricks)

Download slides (PPTX)

Successfully building and deploying a machine learning model is difficult to do once. Enabling other data scientists to reproduce your pipeline, compare the results of different versions, track what's running where, and redeploy and rollback updated models is much harder. Mani Parkhe and Andrew Chen offer an overview of MLflow—a new open source project from Databricks that simplifies this process.

Modernizing operational architecture with big data: Creating and implementing a modern data strategy

Jennifer Lim (Cerner)

Download slides (PPT)

The use of data throughout Cerner had taxed the company's legacy operational data store, data warehouse, and enterprise reporting pipeline to the point where it would no longer scale to meet needs. Jennifer Lim explains how Cerner modernized its corporate data platform with the use of a hybrid cloud architecture.

Practical ML today and tomorrow

Hilary Mason (Cloudera Fast Forward Labs)

Watch the keynote

Machine learning and artificial intelligence are exciting technologies, but real value comes from marrying those capabilities with the right business problems. Hilary Mason explores the current state of these technologies, investigates what's coming next in applied machine learning, and explains how to identify and execute on the right business opportunities at the right time.

Predicting residential occupancy and hot water usage from high-frequency, multivector utilities data

Cris Lowery (Baringa Partners), Marc Warner (ASI)

Download slides (PDF)

In EU households, heating and hot water alone account for 80% of energy usage. Cristobal Lowery and Marc Warner explain how future home energy management systems could improve their energy efficiency by predicting resident needs through utilities data, with a particular focus on the key data features, the need for data compression, and the data quality challenges.

Privacy by design: Building in data privacy and protection versus bolting it on later

Les McMonagle (BlueTalon)

Download slides (PPTX)

Privacy by design is a fundamentally important approach to achieving compliance with GDPR and other data privacy or data protection regulations. Les McMonagle outlines how organizations can save time and money while improving data security and regulatory compliance and dramatically reduce the risk of a data breach or expensive penalties for noncompliance.

Progress for big data in Kubernetes

Ted Dunning (MapR, now part of HPE)

Download slides (PPTX)

Stateful containers are a well-known anti-pattern, but the standard solution—managing state in a separate storage tier—is costly and complex. Recent developments have changed things dramatically for the better. In particular, you can now manage a high-performance software-defined-storage tier entirely in Kubernetes. Ted Dunning describes what's new and how it makes big data easier on Kubernetes.

Quantifying forgiveness

Julia Angwin (ProPublica)

Watch the keynote

Algorithms are increasingly arbiters of forgiveness. Julia Angwin discusses what she has learned about forgiveness in her series of articles on algorithmic accountability and the lessons we all need to learn for the coming AI future.

Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned

Mauricio Aristizabal (Impact)

Download slides (BIN)

Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components.

Real-time automated claim processing: The surprising utility of NLP methods on non-text data

Amro Alkhatib (National Health Insurance Company-Daman)

Download slides (PPTX)

Processing claims is central to every insurance business. Amro Alkhatib shares a successful business case for automating claims processing, from idea to production. The machine learning-based claim automation model uses NLP methods on non-text data and allows auditable automated claims decisions to be made.

Recurrent neural networks for time series analysis

Bruno Goncalves (Data For Science)

Download slides (PDF)

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Join Bruno Gonçalves to learn how to use recurrent neural networks to model and forecast time series and discover the advantages and disadvantages of recurrent neural networks with respect to more traditional approaches.

Running multidisciplinary big data workloads in the cloud

Sudhanshu Arora (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera), Brandon Freeman (Cloudera), Jason Wang (Cloudera), Shravan Pabba (Cloudera)

Download slides (PDF)

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud and integrate data engineering and data analytic workflows and explore considerations and best practices for data analytics pipelines in the cloud. Along the way, you'll see how to share metadata across workloads in a big data PaaS.

Scaling data infrastructure in the fashion world; or, “What is this? Business intelligence for ants?”

Francesco Mucio (Francescomuc.io)

Download slides (PDF)

Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead.

Self-reliant, secure, end-to-end data, activity, and revenue analytics: A roadmap for the airline industry

Katharina Warzel (EveryMundo)

Download slides (PDF)

Airlines want to know what happens after a user interacts with their websites. Do they convert? Do they close the browser and come back later? Airlines traditionally have depended on analytics tools to prove value. Katharina Warzel explores how to implement a client-independent end-to-end tracking system.

Self-service modern analytics on the GovCloud

Ramesh Krishnan (lmco), Steven Morgan (Lockheed Martin)

Download slides (ZIP)

Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from its information assets, the company is constantly exploring ways to enable effective self-service scenarios. Ramesh Krishnan and Steve Morgan discuss Lockheed Martin's journey into modern analytics and explore its analytics platform focused on leveraging AWS GovCloud.

Sound design and the future of experience

Amber Case (MIT Media Lab)

Watch the keynote

Amber Case outlines several methods that product designers and managers can use to improve everyday interactions through an understanding and application of sound design.

The answer to life, the universe, and everything: But can you get that into production? (sponsored by MapR)

Ted Dunning (MapR, now part of HPE)

Watch the keynote

There’s real value in big data and more waiting when you add real-time, but to get the payoff, you need successful deployments of your AI and data-intensive applications. You need to be ready with your current applications in production but must have an architecture and infrastructure that are ready for the next ones as well. Ted Dunning explores how others have fared in this journey.

The care and feeding of data scientists: Concrete tips for retaining your data science team

Michelangelo D'Agostino (ShopRunner)

Download slides (PDF)

Data scientists are hard to hire. But too often, companies struggle to find the right talent only to make avoidable mistakes that cause their best data scientists to leave. From org structure and leadership to tooling, infrastructure, and more, Michelangelo D'Agostino shares concrete (and inexpensive) tips for keeping your data scientists engaged, productive, and adding business value.

The data imperative (sponsored by Zaloni)

Ben Sharma (Zaloni)

Watch the keynote

Once, a company could live 60-70 years on the S&P 500. Now it averages 15 years. If companies were people, this would be an epidemic on par with the Black Plague. But the same things that dragged humanity out of that dark age can drag companies out of this one.

The evolution of Netflix's S3 data warehouse

Ryan Blue (Netflix), Daniel Weeks (Netflix)

Download slides (PDF)

In the last few years, Netflix's data warehouse has grown to more than 100 PB in S3. Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3.

The future of data warehousing

Anupam Singh (Cloudera), brian coyne (PNC)

Watch the keynote

Data volumes don’t translate to business value. What matters is your data platform’s ability to support unprecedented numbers of business users and use cases. Anupam Singh and Brian Coyne look at some of the challenges posed by data-hungry organizations and share new techniques to extract meaningful insights at the speed of today’s modern business.

The Missing Piece

Cassie Kozyrkov (Google)

Watch the keynote

Why do businesses fail at machine learning despite its tremendous potential and the excitement it generates? Is the answer always in data, algorithms, and infrastructure, or is there a subtler problem? Will things improve in the near future? Let's talk about some lessons learned at Google and what they mean for applied data science.

Tracking data lineage at Stitch Fix

Neelesh Salian (Stitch Fix)

Download slides (PDF)

Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases.

TuneIn: How to get your jobs tuned while you are sleeping

Manoj Kumar (LinkedIn), Pralabh Kumar (LinkedIn), Arpan Agrawal (LinkedIn)

Download slides (PPTX)

Have you ever tuned a Spark or MR job? If the answer is yes, you already know how difficult it is to tune more than hundred parameters to optimize the resources used. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.

Ubiquitous machine learning (sponsored by Cisco)

Chiang Yang (Cisco)

Download slides (PDF)

Data is the lifeblood of an enterprise, and it's being generated everywhere. To overcome the challenges of data gravity, data analytics, including machine learning, is best done where the data is located: ubiquitous machine learning. Han Yang explains how to overcome the challenges of machine learning everywhere.

Using big data to unlock the delivery of personalized, multilingual real-time chat services for global financial service organizations

Timothy Walpole (BJSS)

Download slides (PDF)

Financial service clients demand increased data-driven personalization, faster insight-based decisions, and multichannel real-time access. Tim Walpole details how organizations can deliver real-time, vendor-agnostic, personalized chat services and explores issues around security, privacy, legal sign-off, data compliance, and how the internet of things can be used as a delivery platform.

Using machine learning to drive intelligence at the edge

Dave Shuman (Cloudera), Bryan Dean (Red Hat)

Download slides (PDF)

The focus on the IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing them back out to the edge. Dave Shuman and Bryan Dean explain how Cloudera and Red Hat executed this architecture at one of Europe's leading manufacturers, along with a demo highlighting this architecture.

Using the blockchain in the enterprise

Jim Scott (NVIDIA)

Download slides (PDF)

Jim Scott details relevant use cases for blockchain-based solutions across a variety of industries, focusing on a suggested architecture to achieve high-transaction-rate private blockchains and decentralized applications backed by a blockchain. Along the way, Jim compares public and private blockchain architectures.

Wait. . .pizza is a vegetable? Decoding regulations using machine learning (sponsored by IBM)

Dinesh Nirmal (IBM)

Watch the keynote

IBM Analytics’s Dinesh Nirmal solves school lunch and the struggle to keep ahead of regulations. With AI tech like deep learning and NLG, supplying meals to California’s kids leaps from enriching metadata for compliance to actionable insights for the business.

What's the Hadoop-la about Kubernetes?

Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)

Download slides (PPTX)

Kubernetes (K8s)—the open source container orchestration system for modern big data workloads—is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s.

When Tiramisu meets online fashion retail

Patty Ryan (Microsoft), CY Yam (Microsoft), Elena Terenzi (Microsoft)

Download slides (PPTX)

Large online fashion retailers must efficiently maintain catalogues of millions of items. Due to human error, it's not unusual that some items have duplicate entries. Since manually trawling such a large catalogue is next to impossible, how can you find these entries? Patty Ryan, CY Yam, and Elena Terenzi explain how they applied deep learning for image segmentation and background removal.

Why and how to leverage the power and simplicity of SQL on Apache Flink

Fabian Hueske (Ververica)

Download slides (PDF)

Fabian Hueske discusses why SQL is a great approach to unify batch and stream processing. He gives an update on Apache Flink's SQL support and shares some interesting use cases from large-scale production deployments. Finally, Fabian presents Flink's new query service that enables users and applications to submit streaming and batch SQL queries and retrieve low-latency updated results.

Why data scientists should love Linux containers

William Benton (Red Hat)

Download slides (PDF)

Containers are a hot technology for application developers, but they also provide key benefits for data scientists. William Benton details the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively.

Zipline: Airbnb's data management platform for machine learning

Varant Zanoyan (Airbnb)

Download slides (PDF)

Zipline is Airbnb’s soon to be open-sourced data management platform specifically designed for ML use cases. It has taken the task of feature generation from months to days and offers features to support end-to-end data management for machine learning. Varant Zanoyan covers Zipline's architecture and dives into how it solves ML-specific problems.

Speaker slides & video

Sponsorship Opportunities

Partner Opportunities

Contact Us