Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Kurt Brown (Netflix)
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.
Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Somit Gupta (Microsoft), Paul Raff (Microsoft)
Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.
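The core computation behind evaluating an A/B test like those described above can be sketched as a two-proportion z-test on conversion rates. This is a minimal illustration with invented numbers, not Microsoft's platform.

```python
# A minimal sketch of the statistic behind an A/B test: a two-proportion
# z-test comparing conversion rates between control and treatment.
# All numbers below are made up for illustration.
import math

def two_proportion_z(success_a, total_a, success_b, total_b):
    """Return the z statistic for the difference in conversion rates."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    p_pool = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / se

# Control: 1,000 of 20,000 users converted; treatment: 1,150 of 20,000.
z = two_proportion_z(1000, 20000, 1150, 20000)
print(round(z, 2))  # prints 3.33; |z| > 1.96 suggests significance at the 5% level
```

Real experimentation platforms layer much more on top (variance reduction, sequential testing, metric guardrails), but this is the statistical primitive underneath.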
Kevin Huiskes (Intel), Radhika Rangarajan (Intel)
Advanced analytics and AI workloads require a scalable and optimized architecture, from hardware and storage to software and applications. Kevin Huiskes and Radhika Rangarajan share best practices for accelerating analytics and AI and explain how businesses globally are leveraging Intel’s technology portfolio, along with optimized frameworks and libraries, to build AI workloads at scale.
Sergey Ermolin (Intel), Shivaram Venkataraman (Microsoft Research)
The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.
Kinnary Jangla (Pinterest)
Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems during debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.
Ajay Kumar Mothukuri (Sapient), Vijay Agneeswaran (Walmart Labs)
Ajay Mothukuri and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR).
Or Herman-Saffar (Dell), Ran Taig (Dell EMC)
What if we could predict when and where crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly published dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future.
Adam Ahringer (Disney-ABC TV Digital Media)
Adam Ahringer explains how Disney-ABC TV leverages Amazon Kinesis and MemSQL to provide real-time insights based on user telemetry as well as the platform for traditional data warehousing activities.
Brooke Wenig (Databricks)
Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
Debasish Ghosh (Lightbend)
Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures.
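One of the sublinear-space structures a talk like this typically covers is the count-min sketch, which approximates per-item counts over an unbounded stream in a fixed-size table. The sketch below is an illustrative toy (width, depth, and the hash scheme are arbitrary choices), not the speaker's implementation.

```python
# A minimal count-min sketch: approximate per-item frequency counts over
# a stream using O(width * depth) space, regardless of stream length.
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash-derived column per row; md5 keeps this self-contained.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # Never undercounts: true count <= estimate, with a probabilistic
        # bound on the overcount that shrinks as width/depth grow.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for word in ["spark"] * 100 + ["flink"] * 3:
    cms.add(word)
print(cms.estimate("spark"))  # at least 100; usually exactly 100
```

The one-sided error bound (estimates never undercount) is exactly the kind of probabilistic guarantee these structures trade space for.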
Santosh Rao (NetApp)
Santosh Rao explores the architecture of a data pipeline from edge to core to cloud, across various data sources and processing engines, and explains how to build a solution architecture that enables businesses to maximize competitive differentiation by unifying data insights in compelling yet efficient ways.
Jiao(Jennie) Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)
Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.
Wayde Fleener (General Mills)
Decision makers are busy. Businesses can hire people to analyze data for them, but most companies are resource constrained and can’t hire a small army to look through all their data. Wayde Fleener explains how General Mills implemented automation to enable decision makers to quickly focus on the metrics that matter and cut through everything else that does not.
Divya Ramachandran (Captricity)
Divya Ramachandran explains how top insurance companies have used handwriting transcription powered by deep learning to achieve a more than 70% reduction in daily operational processing time, develop a best-in-industry predictive model for assessing mortality risk from decades of archived forms, and enable a smarter claims leakage review, which led to a 10x ROI in its first year.
Joseph Bradley (Databricks)
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.
Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)
Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.
Felix Gorodishter (GoDaddy)
GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email.
Mauro Damo (Dell EMC), Wei Lin (Dell EMC)
Image recognition classification of diseases will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identify bladder cancer in patients using unsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.
Katie Malone (Civis Analytics), Skipper Seabold (Civis Analytics)
A huge challenge for data science managers is determining priorities for their teams, which often have more good ideas than they have time. Katie Malone and Skipper Seabold share a framework that their large and diverse data science team uses to identify, discuss, select, and manage data science projects for a fast-moving startup.
Simon Hughes, Yuri Bykov
Simon Hughes and Yuri Bykov's team recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon and Yuri offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production.
Ash Munshi (Pepperdata)
Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and represent a first approach to classifying apps based on multivariate time series.
Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)
Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably in distributed graph databases over the web while allowing it to be rapidly accessed.
Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.
Nancy Lublin (Crisis Text Line)
Nancy Lublin shares insights from Crisis Text Line.
Tomer Kaftan (University of Washington)
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.
Jennifer Webb (SuprFanz)
Jennifer Webb explains how cloud-based marketing company SuprFanz uses data science techniques and graph theory with Neo4j to generate live event attendance from social media platforms, email, and SMS.
Alex Smola (Amazon)
Alex Smola shares lessons learned from Amazon SageMaker, an integrated framework for handling all stages of analysis. SageMaker uses open source components such as Jupyter, Docker containers, and Python, along with well-established deep learning frameworks such as Apache MXNet and TensorFlow, for an easy-to-learn workflow.
Marcin Pilarczyk (Ryanair)
Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management of a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions.
Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)
Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.
Abhishek Kumar (Publicis Sapient), Vijay Agneeswaran (Walmart Labs)
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.
Natalie Evans Harris (BrightHive)
Natalie Evans Harris explores the Community Principles on Ethical Data Practices (CPEDP), a community-driven code of ethics for data collection, sharing, and utilization that provides people in the data science community a standard set of easily digestible, recognizable principles for guiding their behaviors.
Ron Bodkin (Google), Brian Foo (Google)
TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin and Brian Foo to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
Andrea Pasqua (Uber), Anny Chen (Uber)
Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.
Eric Colson (Stitch Fix)
While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate by data science and explains why companies must now think very differently about the role and placement of data science in the organization.
Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)
Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset.
Dong Meng (MapR)
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters.
Matteo Merli (Streamlio)
Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty.
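The idea behind a deduplication layer like the one described above can be sketched in a few lines: the broker remembers the highest sequence ID accepted per producer and silently drops redelivered messages. This is an illustration of the concept only, not Pulsar's implementation; all names are hypothetical.

```python
# A toy dedup layer: producers attach monotonically increasing sequence
# IDs, and retries of an already-accepted ID are dropped, turning
# at-least-once delivery into effectively-once processing.
class DedupLayer:
    def __init__(self):
        self.last_seq = {}   # producer id -> highest sequence id accepted
        self.log = []        # messages actually applied

    def receive(self, producer_id, seq_id, payload):
        if seq_id <= self.last_seq.get(producer_id, -1):
            return False     # duplicate retry: drop it
        self.last_seq[producer_id] = seq_id
        self.log.append(payload)
        return True

layer = DedupLayer()
layer.receive("p1", 0, "a")
layer.receive("p1", 1, "b")
layer.receive("p1", 1, "b")  # producer retried after a timeout
print(layer.log)  # prints ['a', 'b'] -- the retry was deduplicated
```

The hard parts in a real system are making `last_seq` durable and consistent across broker failover, which is what the talk's "guaranteed accuracy" claim is about.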
Stephen O'Sullivan (Data Whisperers)
Stephen O'Sullivan takes you along the data science journey, from onboarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team.
Mark Madsen (Teradata), Shant Hovsepian (Arcadia Data)
There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian outline the trade-offs between a number of architectures that provide self-service access to data and discuss the pros and cons of architectures, deployment strategies, and examples of BI on big data.
Mark Donsky (Okera), Steven Ross (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
This talk provides a quick overview of graph theory and the graph algorithms available in Neo4j.
Katie Malone (Civis Analytics)
The 2012 Obama campaign ran the first personalized presidential campaign in history. The data team was made up of people from diverse backgrounds who embraced data science in service of the goal. Civis Analytics emerged from this team and today enables organizations to use the same methods outside politics. Katie Malone shares lessons learned from these experiences for building effective teams.
Tendu Yogurtcu (Syncsort)
Chefs must be able to trust the authenticity, quality, and origin of their ingredients; data analysts must be able to do the same of their data—and what happens to it along the way. Tendü Yoğurtçu explains how to seamlessly track the lineage and quality of your data—on and off the cluster, on-premises or in the cloud—to deliver meaningful insights and meet regulatory compliance requirements.
Mark Donsky (Okera), Andre Araujo (Cloudera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera)
New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, Andre Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR.
Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)
There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support.
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.
Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
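The offset-management pattern the abstract describes can be sketched schematically: commit offsets to durable storage only after processing, and on restart resume from the last committed offset. Kafka client details are deliberately elided here; the class and store below are hypothetical stand-ins that mimic the control flow.

```python
# A schematic of offset checkpointing: offsets are committed to a
# durable store after a batch is processed, so a restarted consumer
# resumes exactly where the previous run left off.
class CheckpointedConsumer:
    def __init__(self, store):
        self.store = store                       # durable offset store

    def starting_offset(self, topic, partition):
        return self.store.get((topic, partition), 0)

    def process_batch(self, topic, partition, messages):
        offset = self.starting_offset(topic, partition)
        results = []
        for msg in messages[offset:]:
            results.append(msg.upper())          # stand-in for real work
            offset += 1
        self.store[(topic, partition)] = offset  # commit after processing
        return results

store = {}
consumer = CheckpointedConsumer(store)
consumer.process_batch("events", 0, ["a", "b"])
# Simulate a crash and restart: a new consumer reuses the same store.
restarted = CheckpointedConsumer(store)
print(restarted.process_batch("events", 0, ["a", "b", "c"]))  # prints ['C']
```

Committing after processing gives at-least-once behavior on a crash mid-batch; committing offsets atomically with results is what upgrades this to exactly-once state restoration.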
Thomas Phelan (HPE BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them.
Juan Yu (Cloudera)
Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses and how Impala optimizes queries and explains how to identify performance bottlenecks through query plans and profiles and how to drive Impala to its full potential.
Pramit Choudhary
Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if...and...else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation.
Sergey Ermolin (Intel), Suqiang Song (Mastercard)
Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach.
Ajey Gore (GO-JEK)
Ajey Gore details GO-JEK's evolution from a small bike-hailing startup to a technology-focused unicorn in the areas of transportation, lifestyle, payments, and social enterprise and explains how the company is focusing its attention beyond urban Indonesia to impact more than a million people across the country's rural areas.
In the era of device addiction, the attention economy, and surveillance capitalism, this initiative will enable new levels of organizational data transparency and user data control through the formation of a meta-community focused on existing and new open source projects throughout the modern data pipeline (streaming, queuing, transformation, integration, database, analytics, postprocessing, storage, networking).
Dean Wampler (Anyscale)
Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams, microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.
Seth Stephens-Davidowitz (New York Times)
Seth Stephens-Davidowitz explains how to use Google searches to uncover behaviors or attitudes that may be hidden from traditional surveys, such as racism, sexuality, child abuse, and abortion.
Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)
Deploying machine learning models and deep learning models in production is hard. Harish Doddi and Jerry Xu outline the enterprise data science lifecycle, covering how production model deployment flow works, challenges, best practices, and lessons learned. Along the way, they explain why monitoring models in production should be mandatory.
Chris Chapo (Gap Inc.)
Chris Chapo walks you through real-world examples of companies that are driving transformational change by leveraging data science and analytics, paying particular attention to established organizations where these capabilities are newer concepts.
Shenghu Yang (Lyft)
Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency.
Manu Mukerji (8x8)
Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; production issues that arise when applying ML at scale in production; lessons learned; and more.
Delip Rao (AI Foundation), Brian McMahan (Wells Fargo)
PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications.
Ram Sriharsha (Databricks)
How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity.
Li Fan (Pinterest)
Li Fan shares insights into how Pinterest improves products based on usage and explains how the company is using AI to predict what’s in an image, what a user wants, and what they’ll want next, answering subjective questions better than machines or humans alone could.
Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)
Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (StreamNative), Arun Kejariwal (Independent)
Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.
Henry Cai (Pinterest), Yi Yin (Pinterest)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.
Goodman Gu (Cogito)
Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker.
Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)
Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.
Dinesh Nirmal (IBM)
Machine learning research and incubation projects are everywhere, but less common, and far more valuable, is the innovation unlocked once you bring machine learning out of research and into production. Dinesh Nirmal explains how real-world machine learning reveals assumptions embedded in business processes and in the models themselves that cause expensive and time-consuming misunderstandings.
Holden Karau (Independent), Rachel B Warren (Salesforce Einstein)
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Ben Lorica (O'Reilly)
Ben Lorica shares emerging security best practices for business intelligence, machine learning, and mobile computing products and explores new tools, methods, and products that can help ease the way for companies interested in deploying secure and privacy-preserving analytics.
Anne Buff (SAS)
Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement.
Eugene Kirpichov (Google)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn.
April Chen (Civis Analytics)
Which of your ad campaigns lead to the most sales? In the absence of A/B testing, marketers often turn to simple touch attribution models. April Chen details the shortcomings of these models and proposes a new approach that uses matching methods from causal inference to more accurately measure marketing effectiveness.
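Matching from causal inference, as referenced above, can be illustrated in miniature: pair each ad-exposed user with the most similar unexposed user on a confounder and compare outcomes. This is a toy nearest-neighbor sketch with invented data, far simpler than what the talk covers.

```python
# A toy matching estimator: for each exposed user, find the unexposed
# user with the closest confounding score (e.g., prior engagement) and
# average the outcome differences over the exposed group.
def matched_lift(exposed, unexposed):
    """exposed/unexposed: lists of (score, converted) tuples."""
    lift = 0.0
    for score, converted in exposed:
        # Nearest-neighbor match on the confounding score.
        _, control_converted = min(unexposed, key=lambda u: abs(u[0] - score))
        lift += converted - control_converted
    return lift / len(exposed)

exposed = [(0.9, 1), (0.5, 1), (0.2, 0)]      # saw the campaign
unexposed = [(0.85, 1), (0.45, 0), (0.25, 0)]  # did not
print(matched_lift(exposed, unexposed))  # average effect among the exposed
```

Unlike last-touch attribution, which would credit the campaign for every exposed conversion, matching nets out conversions that similar unexposed users would have made anyway.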
Mike Prorock
Mike Prorock offers an overview of a game-changing climate awareness solution that combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market. The solution enables growers to monitor areas of concern, providing immediate benefits to crop yield, supply costs, farm labor overhead, and water consumption.
Zhen Fan, Wei Ting Chen (Intel Corporation)
Zhen Fan and Wei Ting Chen explain how their company uses Spark on Kubernetes in a production environment and why it chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.
Ian Cook (Cloudera)
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.
Billy Liu (Kyligence)
As organizations look to scale their analytics capability, the need to grow beyond a traditional data warehouse becomes critical, and cloud-based solutions allow more flexibility while being more cost efficient. Billy Liu offers an overview of Kyligence Cloud, a managed Apache Kylin online service designed to speed up mission-critical analytics at web scale for big data.
Janelle Shane
On her blog, Janelle Shane posts the results of neural network experiments gone delightfully wrong. But machine learning mistakes can also be very embarrassing or even dangerous. Using silly datasets as examples, Janelle talks about some ways that algorithms fail.
Tim Berglund (Confluent)
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.
Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)
Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.
Fabian Hueske (data Artisans), Shuyi Chen (Uber)
Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.
Valentin Bercovici (PencilDATA)
Valentin Bercovici explores the challenges in securing, maintaining, and repairing the dynamic, heterogeneous software supply chain for modern self-driving cars, from levels 0 to 5. Along the way, Valentin reviews implementation options, from centralized certificate authority-based architectures to decentralized blockchains networked over a fleet of cars.
Stephanie Beben (Booz Allen Hamilton)
How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Stephanie Beben shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.”
Jiangjie Qin (LinkedIn)
LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.
Umur Cubukcu (Citus Data)
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you'll learn how PostgreSQL's extension APIs are fueling innovations in relational databases.
Tobias Ternstrom (Microsoft)
The emergence of the cloud combined with open source software ushered in an explosive use of a broad range of technologies. Tobias Ternstrom explains why you should step back and attempt to objectively evaluate the problem you are trying to solve before choosing the tool to fix it.
Brian Karfunkel (Pinterest)
When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users.
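The core of such an estimate — comparing the cohorts week by week rather than only at the end of the experiment — can be sketched in a few lines. The retention numbers below are invented for illustration; the talk describes Pinterest's actual methodology:

```python
def weekly_lift(control, treatment):
    """Week-by-week difference in retention rate between treatment
    and control cohorts, so an experiment whose benefit accrues
    slowly still gets credit for its long-term impact."""
    return [t - c for c, t in zip(control, treatment)]

# toy retention curves: the change helps mostly in later weeks,
# which an end-of-experiment snapshot alone would undercount
control   = [0.50, 0.40, 0.35]   # fraction of cohort active each week
treatment = [0.50, 0.44, 0.42]
lift = weekly_lift(control, treatment)
```

A single aggregate metric at week one would report zero lift here, while the per-week view shows the impact growing over time.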
Jeff Dean (Google)
The Google Brain team conducts research on difficult problems in artificial intelligence and builds large-scale computer systems for machine learning research, both of which have been applied to dozens of Google products. Jeff Dean highlights some of Google Brain's projects with an eye toward how they can be used to solve challenging problems.
Ann Nguyen (Whole Whale)
Power Poetry is the largest online platform for young poets, with over 350K users. Ann Nguyen explains how Power Poetry is extending the learning potential with machine learning and covers the technical elements of the Poetry Genome, a series of ML tools to analyze and break down similarity scores of the poems added to the site.
Siddharth Teotia (Dremio)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.
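The core idea of vectorized execution — operating on a column of values per call instead of one row per call — can be illustrated without Arrow at all. This toy comparison shows only the control-flow difference; the real wins come from tight loops over Arrow's contiguous columnar buffers, which compilers can turn into SIMD:

```python
def sum_row_at_a_time(rows, col):
    # Volcano-style iteration: one call and one value per row
    total = 0
    for row in rows:
        total += row[col]
    return total

def sum_vectorized(batches, col):
    # vectorized style: one call per batch, with a tight inner
    # loop over a contiguous column of values
    total = 0
    for batch in batches:
        total += sum(batch[col])
    return total

rows = [{"x": i} for i in range(10)]
batches = [{"x": list(range(5))}, {"x": list(range(5, 10))}]
```

Both produce the same answer; the batched form simply amortizes per-tuple overhead across a whole column chunk.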
Noah Gift (UC Davis)
Noah Gift uses data science and machine learning to explore NBA team valuation and attendance as well as individual player performance. Questions include: What drives the valuation of teams (attendance, the local real estate market, etc.)? Does winning bring more fans to games? Does salary correlate with social media performance?
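Questions like "does winning bring more fans to games?" reduce to correlation and regression. A minimal sketch with invented numbers (not data from the talk):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

wins       = [20, 35, 50, 60]               # hypothetical season win totals
attendance = [14000, 16000, 18500, 19000]   # hypothetical average attendance
r = pearson(wins, attendance)
```

A strong positive `r` on toy numbers like these is suggestive, not conclusive — which is exactly why the real analysis also controls for factors like the local real estate market.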
William Vambenepe (Google)
William Vambenepe explains how a pivot toward machine learning and artificial intelligence has created clearer separation among clouds than ever before. William walks you through an interesting use case of machine learning in action and discusses the central role AI will play in big data analysis moving forward.
Daniel Templeton (Cloudera), Andrew Wang (Cloudera)
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
Baron Schwartz (VividCortex)
Anomaly detection is white hot in the monitoring industry, yet many don't really understand or care about it, while others keep repeating the same patterns. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view.
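One of the patterns that gets reimplemented again and again is thresholding on a moving average. A minimal sketch of that baseline — an exponentially weighted mean and deviation, with all parameter values invented:

```python
def ewma_anomalies(series, alpha=0.3, threshold=5.0):
    """Flag points whose deviation from an exponentially weighted
    moving average exceeds `threshold` times the EWM deviation --
    the simple baseline many monitoring products rebuild."""
    mean, dev, flags = series[0], 0.0, []
    for x in series:
        delta = abs(x - mean)
        flags.append(dev > 0 and delta > threshold * dev)
        mean = alpha * x + (1 - alpha) * mean   # update after testing x
        dev = alpha * delta + (1 - alpha) * dev
    return flags

metrics = [10, 12, 10, 11, 12, 10, 11, 50]    # toy series with a spike at the end
flags = ewma_anomalies(metrics)
```

The baseline catches the obvious spike but, as the talk argues, tuning `alpha` and `threshold` per metric is exactly where this approach stops scaling.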
Patrick Harrison (S&P Global)
Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more.
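The mechanics involved can be boiled down to a single training step. Below is a toy sketch of the update at the heart of skip-gram with negative sampling, word2vec's most common variant; the vector values and learning rate are invented:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def sgns_step(center, context, label, lr=0.5):
    """One SGD step of skip-gram with negative sampling: nudge both
    vectors so that sigmoid(center . context) moves toward `label`
    (1 = word pair observed in a context window, 0 = sampled negative)."""
    score = sum(c * x for c, x in zip(center, context))
    grad = sigmoid(score) - label  # dLoss/dScore for logistic loss
    new_center = [c - lr * grad * x for c, x in zip(center, context)]
    new_context = [x - lr * grad * c for c, x in zip(center, context)]
    return new_center, new_context

# repeated positive updates pull the two toy vectors together
v, u = [0.1, -0.2], [0.05, 0.1]
for _ in range(100):
    v, u = sgns_step(v, u, 1)
```

Run over billions of real word pairs, this same push-together/pull-apart dynamic is what leaves semantically similar words near each other in the embedding space.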
Thomas Miller (Northwestern University)
Sports analytics today is more than a matter of analyzing box scores and play-by-play statistics. With detailed on-field or on-court data from every game, sports teams face challenges in data management, data engineering, and analytics. Thomas Miller details the challenges faced by a Major League Baseball team as it sought competitive advantage through data science and deep learning.
Joe Dumoulin (Next IT)
AI is transformative for business, but it’s not magic; it’s data. Joe Dumoulin shares how Next IT's global enterprise customers have transformed their businesses with AI solutions and outlines how companies should build AI strategies, utilize data to develop and evolve conversational intelligence and business intents, and ultimately increase ROI.