Data science and machine learning: Big data conference & machine learning training

Wednesday March 7: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am Strata Data Conference Keynotes | Location: Salon 1&2
10:30am Morning break

Thursday March 8: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am Strata Data Conference Keynotes | Location: Salon 1&2
10:30am Morning break

9:00am–5:00pm Monday, March 5 & Tuesday, March 6

Apache Spark programming

Location: 212 A-B

Brooke Wenig (Databricks)

Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs. Read more.

9:00am–5:00pm Monday, March 5 & Tuesday, March 6

Data science and machine learning with Apache Spark

Location: 212 C

Brian Bloechle (Cloudera), Glynn Durham (Cloudera)

Average rating:

(5.00, 1 rating)

Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib. Read more.

9:00am–5:00pm Monday, March 5 & Tuesday, March 6

Machine learning with PyTorch

Location: 111

Delip Rao (AI Foundation), Brian McMahan (Wells Fargo)

Average rating:

(5.00, 1 rating)

PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems. Read more.

9:00am–5:00pm Monday, March 5 & Tuesday, March 6

Machine learning with TensorFlow

Location: San Jose Ballroom (salon 1&2), Marriott

Robert Schroll (The Data Incubator)

The TensorFlow library enables the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs, making it ideal for implementing neural networks and other machine learning algorithms. Robert Schroll demonstrates TensorFlow's capabilities and walks you through building machine learning models on real-world data. Read more.

9:00am–5:00pm Monday, March 5 & Tuesday, March 6

Hands-on data science with Python

Location: Willow Glen (1&2), Marriott

Zachary Glassman (The Data Incubator)

Zachary Glassman demonstrates how to build intelligent business applications using machine learning, taking you through each step in developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend your knowledge by building two applications from real-world datasets. Read more.

9:00am–5:00pm Tuesday, March 6, 2018

Spark camp: Apache Spark 2.0 for analytics and text mining with Spark ML

Location: LL20 D

Joseph Kambourakis (Databricks)

Join Joseph Kambourakis for an introduction to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Learning PyTorch by building a recommender system

Location: LL21 A

Secondary topics: Graphs and Time-series

Mo Patel (Independent), Neejole Patel (Virginia Tech)

Average rating:

(2.50, 4 ratings)

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Using R and Python for scalable data science, machine learning, and AI

Location: LL21 C/D

Mario Inchiosa (Microsoft), Vanja Paunic (Microsoft), Robert Horton (Microsoft), Debraj GuhaThakurta (Microsoft), Ali-Kazim Zaidi (Microsoft), Tomas Singliar (Microsoft), John-Mark Agosta (Microsoft)

Average rating:

(4.00, 4 ratings)

R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Getting started with TensorFlow

Location: LL21 E/F

Martin Görner (Google)

Average rating:

(5.00, 3 ratings)

Martin Görner walks you through training and deploying a machine learning system using the popular open source library TensorFlow. Martin takes you from a conceptual overview all the way to building complex classifiers and explains how you can apply deep learning to complex problems in science and industry. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Big data analytics and machine learning techniques to drive and grow business

Location: 210 A/E

Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)

Average rating:

(4.44, 9 ratings)

Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn. Read more.

9:00am–5:00pm Tuesday, March 6, 2018

Data Case Studies

Location: LL20 A

Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Matt Conners (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Jennie Shin (Kaiser Permanente), Valentin Bercovici (PencilDATA), Wayde Fleener (General Mills), Joe Dumoulin (Next IT), Jules Malin (GoPro), Taylor Martin (O'Reilly Media), Divya Ramachandran (Captricity)

Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.

9:00am–5:00pm Tuesday, March 6, 2018

Media and Ad Tech Day

Location: LL20 B

David Boyle (Audience Strategies), Violeta Hennessey (Warner Bros.), April Chen (Civis Analytics), Sridhar Alla (BlueWhale), Noah Gift (UC Davis), Blake Irvine (Netflix), Kevin Lyons (Nielsen Marketing Cloud), Jennifer Webb (SuprFanz), Rizwan Patel (Caesars Entertainment), Anthony Accardo (Disney), Amanda Gerdes (Blizzard Entertainment), Aneesh Karve (Quilt), Pete Skomoroch (Workday)

Hear from innovators in ad tech, measurement, automation, and audience engagement about where the media industry is today—and where it's likely to go next. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

Natural language understanding at scale with spaCy and Spark NLP

Location: LL20 C

David Talby (Pacific AI), Claudiu Branzan (Accenture), Alex Thomas (John Snow Labs)

Average rating:

(5.00, 1 rating)

Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

A/B testing at scale: Accelerating software innovation

Location: LL21 C/D

Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Somit Gupta (Microsoft), Paul Raff (Microsoft)

Average rating:

(4.00, 3 ratings)

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

Deep learning-based search and recommendation systems using TensorFlow

Location: LL21 E/F

Abhishek Kumar (Publicis Sapient), Vijay Agneeswaran (Walmart Labs)

Average rating:

(4.00, 3 ratings)

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a deep learning recommender system based on intent prediction, drawing on a real-world implementation for an ecommerce client. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

Custom interactive visualizations and dashboards for one billion datapoints on a laptop in 30 lines of Python

Location: 210 D/H

James Bednar (Anaconda), Philipp Rudiger (Anaconda)

Average rating:

(4.50, 2 ratings)

Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints—all in just 30 lines of Python code. Read more.

9:30am–9:40am Wednesday, March 7, 2018

Privacy in the age of machine learning

Location: Grand Ballroom 220

Ben Lorica (O'Reilly)

Average rating:

(4.00, 8 ratings)

Ben Lorica shares emerging security best practices for business intelligence, machine learning, and mobile computing products and explores new tools, methods, and products that can help ease the way for companies interested in deploying secure and privacy-preserving analytics. Read more.

11:00am–11:40am Wednesday, March 7, 2018

Interpretable machine learning products

Location: LL20 A

Mike Lee Williams (Cloudera Fast Forward Labs)

Average rating:

(4.86, 7 ratings)

Interpretable models result in more accurate, safer, and more profitable machine learning products. But interpretability can be hard to ensure. Michael Lee Williams explores the growing business case for interpretability and its concrete applications, including churn, finance, and healthcare. Along the way, Michael offers an overview of the open source, model-agnostic tool LIME. Read more.

11:00am–11:40am Wednesday, March 7, 2018

Breaking up the block: Using heterogenous population modeling to drive growth

Location: LL20 C

Daniel Lurie (1989)

All successful startups thrive on tight product-market fit, which can produce homogenous initial user bases. To become the next big thing, your user base will need to diversify, and your product must change to accommodate new needs. Daniel Lurie explains how Pinterest leverages external data to measure racial and income diversity in its user base and changed user modeling to drive growth. Read more.

11:00am–11:40am Wednesday, March 7, 2018

How does a big data professional get started with AI?

Location: LL20 D

Wee Hyong Tok (Microsoft), Danielle Dean (iRobot)

Average rating:

(3.50, 2 ratings)

Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI. Read more.

11:00am–11:40am Wednesday, March 7, 2018

The current state of TensorFlow and where it's headed in 2018

Location: LL21 B

Rajat Monga (Google)

Average rating:

(4.40, 5 ratings)

Rajat Monga offers an overview of TensorFlow's progress and adoption in 2017 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas. Read more.

11:00am–11:40am Wednesday, March 7, 2018

Being smarter than dinosaurs: How NASA uses deep learning for planetary defense

Location: Expo Hall 1

Secondary topics: Expo Hall

Siddha Ganju (NVIDIA)

Siddha Ganju explains how the FDL lab at NASA uses artificial intelligence to improve and automate the identification of meteors in meteor shower images at above human-level performance, to recover known meteor shower streams, and to characterize previously unknown meteor showers using orbital data. The research is aimed at providing more warning time for long-period comet impacts. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Deploying and monitoring interactive machine learning applications with Clipper

Location: LL20 A

Dan Crankshaw (UC Berkeley RISELab)

Average rating:

(4.25, 4 ratings)

Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Who are we? The largest-scale study of professional data scientists

Location: LL20 C

Miryung Kim (UCLA), Muhammad Gulzar (UCLA)

Average rating:

(3.50, 2 ratings)

Even though we know that there are more data scientists in the workforce today, neither what those data scientists actually do nor what we even mean by data scientists has been studied quantitatively. Miryung Kim and Muhammad Gulzar share the results of a large-scale survey of 793 professional data scientists and detail several trends about data scientists in the software engineering context. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Want to build a better chatbot? Start with your data.

Location: LL20 D

Andrew Mattarella-Micke (Intuit)

Average rating:

(5.00, 1 rating)

When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Andrew Mattarella-Micke shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices he's learned along the way. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Accelerating deep learning on Apache Spark using BigDL with coarse-grained scheduling

Location: LL21 B

Sergey Ermolin (Intel), Shivaram Venkataraman (Microsoft Research)

Average rating:

(3.00, 1 rating)

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Big data, big problems: Predicting climate change

Location: 210 D/H

Ari Gesher (Kairos Aerospace)

A warming planet needs precise, localized predictions about the effects of climate change to support good long-term and medium-term economic decision making. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to generate climate predictions and package them as data products. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Spark NLP in action: Improving patient flow forecasting at Kaiser Permanente

Location: Expo Hall 1

Secondary topics: Expo Hall

David Talby (Pacific AI), Santosh Kulkarni (Kaiser Permanente)

Average rating:

(3.50, 2 ratings)

David Talby and Santosh Kulkarni explain how Kaiser Permanente uses the open source NLP library for Apache Spark to tackle one of the most common challenges with applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Machine learning to tackle industrial data fusion

Location: LL20 A

Secondary topics: Graphs and Time-series

Alexandra Gunderson (Arundo Analytics)

Average rating:

(5.00, 1 rating)

Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Writing distributed graph algorithms

Location: LL20 C

Secondary topics: Graphs and Time-series

Andrew Ray (Sam’s Club Technology)

Average rating:

(3.00, 3 ratings)

Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Spark ML optimization at Intel: A case study

Location: LL20 D

Weisheng Xie (Orange Financial), Peng Meng (Intel)

Average rating:

(5.00, 1 rating)

Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on with Spark ML and introduce the methodology behind Intel's work on Spark ML optimization. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Deep credit risk ranking with LSTM

Location: LL21 B

Secondary topics: Graphs and Time-series

Kyle Grove (Teradata)

Average rating:

(5.00, 5 ratings)

Kyle Grove explains how Teradata and some of the world’s largest financial institutions are innovating credit risk ranking with deep learning techniques and AnalyticOps. With the AnalyticOps framework, these organizations have built models with increased accuracy to drive more profitable lending decisions while being explainable to regulators. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Lessons learned deploying machine learning and deep learning models in production at major tech companies

Location: Expo Hall 1

Secondary topics: Expo Hall

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Average rating:

(4.00, 3 ratings)

Deploying machine learning models and deep learning models in production is hard. Harish Doddi and Jerry Xu outline the enterprise data science lifecycle, covering how production model deployment flow works, challenges, best practices, and lessons learned. Along the way, they explain why monitoring models in production should be mandatory. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Why nobody cares about your anomaly detection

Location: LL20 A

Secondary topics: Graphs and Time-series

Baron Schwartz (VividCortex)

Average rating:

(4.80, 5 ratings)

Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Taming deep learning

Location: LL20 C

Evan Sparks (Determined AI)

Average rating:

(5.00, 1 rating)

Deep learning has shown tremendous improvements in a number of areas and has justifiably generated enormous excitement. However, several key challenges—from prohibitive hardware requirements to immature software offerings—are impeding widespread enterprise adoption. Evan Sparks details fundamental challenges facing organizations looking to adopt deep learning and shares possible solutions. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Best practices for productionizing Apache Spark MLlib models

Location: LL20 D

Joseph Bradley (Databricks)

Average rating:

(5.00, 1 rating)

Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Deep learning for domain-specific entity extraction from unstructured text

Location: LL21 B

Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)

Average rating:

(3.50, 2 ratings)

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Machine learning applications for the industrial internet

Location: LL20 A

Secondary topics: Graphs and Time-series

Joseph Richards (GE Digital)

Average rating:

(5.00, 1 rating)

Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Code Property Graph: A modern, queryable data storage for source code

Location: LL20 C

Secondary topics: Graphs and Time-series

Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)

Average rating:

(4.00, 1 rating)

Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing it to be rapidly accessed. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Transforming a machine learning prototype to a deployable solution leveraging Spark in healthcare

Location: LL20 D

Rachita Chandra (IBM Watson Health)

Average rating:

(3.00, 1 rating)

Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Detecting time series anomalies at Uber scale with recurrent neural networks

Location: LL21 B

Secondary topics: Graphs and Time-series

Andrea Pasqua (Uber), Anny Chen (Uber)

Average rating:

(4.60, 5 ratings)

Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Continuous machine learning over streaming data

Location: LL20 A

Secondary topics: Graphs and Time-series

Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)

Average rating:

(5.00, 8 ratings)

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Fast and effective natural language understanding

Location: LL20 C

Mike Conover (SkipFlag)

Average rating:

(5.00, 4 ratings)

Mike Conover offers an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Distributed clinical models: Inference without sharing patient data

Location: LL20 D

Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)

Average rating:

(3.00, 2 ratings)

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one built from the entire dataset. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Improving user-merchant propensity modeling using neural collaborative filtering and wide and deep models on Spark BigDL at scale

Location: LL21 B

Sergey Ermolin (Intel), Suqiang Song (Mastercard)

Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Small pieces, loosely joined: A skater's code

Location: Expo Hall 1

Secondary topics: Expo Hall

Rodney Mullen (Almost Skateboards)

Average rating:

(5.00, 2 ratings)

The essence of modern skating is learning tricks that couple with specific terrain. Activision’s video game franchise testifies to the nearly endless possibilities. Rodney Mullen offers a nuanced look at how skaters nudge the endpoints of disparate submovements to create new combinations that may shine a different light on ideas in machine learning—plus it’s a lot of fun. Read more.

11:00am–11:40am Thursday, March 8, 2018

The limits of inference: What data scientists can learn from the reproducibility crisis in science

Location: LL20 A

Clare Gollnick (NS1)

Average rating:

(4.86, 7 ratings)

At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project. Read more.

11:00am–11:40am Thursday, March 8, 2018

Explaining machine learning models

Location: LL20 C

Evan Kriminger (ZestFinance)

Average rating:

(4.40, 5 ratings)

What does it mean to explain a machine learning model, and why is it important? Mike Ruberry offers an overview of several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques presents a different perspective, and their clever application can reveal new insights and solve business requirements. Read more.

11:00am–11:40am Thursday, March 8, 2018

Data science at Slack

Location: LL20 D

Josh Wills (Slack)

Average rating:

(4.00, 3 ratings)

Josh Wills describes recent data science and machine learning projects at Slack. Read more.

11:00am–11:40am Thursday, March 8, 2018

Using computer vision to combat stolen credit card fraud

Location: LL21 B

Karthik Ramasamy (Google), Lenny Evans (Uber)

Average rating:

(5.00, 1 rating)

Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities. Read more.

11:00am–11:40am Thursday, March 8, 2018

Graph analysis of 200,000 tweets from Russian Twitter trolls

Location: LL20 B

Secondary topics: Graphs and Time-series

Ryan Boyd (Neo4j)

Average rating:

(5.00, 1 rating)

Ryan Boyd explains how he and his team reconstructed a subset of the Twitter network of Russian troll accounts and applied graph analytics to the data using the Neo4j graph database to uncover how these accounts were spreading fake news. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Failed experiments in infrastructure security analytics and lessons learned from fixing them

Location: LL20 A

Secondary topics: Graphs and Time-series

Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))

Average rating:

(4.00, 1 rating)

How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Enough data engineering for a data scientist; or, How I learned to stop worrying and love the data scientists

Location: LL20 C

Stephen O'Sullivan (Data Whisperers)

Average rating:

(4.25, 4 ratings)

Stephen O'Sullivan takes you along the data science journey, from onboarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Approaching the pricing problem at Lyft

Location: LL20 D

Ashivni Shekhawat (Lyft)

Average rating:

(3.00, 3 ratings)

Ashivni Shekhawat explains how Lyft uses a mix of online learning, optimization, and control theory to operate its ride-sharing marketplace at an efficient price point. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark

Location: LL21 B

Jiao (Jennie) Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Magellan: Scalable and fast geospatial analytics

Location: LL20 A

Ram Sriharsha (Databricks)

Average rating:

(4.75, 4 ratings)

How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Humans versus the machines: Using human-based computation to improve machine learning

Location: LL20 C

Veronica Mapes (Pinterest), Garner Chung (Pinterest)

Average rating:

(5.00, 3 ratings)

Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

The science of patchy data

Location: LL20 D

Jennifer Prendki (Figure Eight)

Average rating:

(3.00, 1 rating)

Jennifer Prendki explains how to develop machine learning models even when the data is protected by privacy and compliance laws and cannot be used without anonymization, covering techniques ranging from contextual bandits to document vector representation. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

The real-time journey from raw streaming data to AI-based analytics

Location: Expo Hall 1

Secondary topics: Expo Hall, Graphs and Time-series

Roy Ben-Alta (Amazon Web Services), Ira Cohen (Anodot)

Average rating:

(5.00, 1 rating)

Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing those metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

sparklyr, implyr, and more: dplyr interfaces to large-scale data

Location: LL20 A

Ian Cook (Cloudera)

Average rating:

(4.75, 4 ratings)

The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Human in the loop: Bayesian rules enabling explainable AI

Location: LL20 C

Pramit Choudhary (h2o.ai)

Average rating:

(5.00, 3 ratings)

Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if...and else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Building career advisory tools for the tech sector using machine learning

Location: LL20 D

Simon Hughes (Dice.com), Yuri Bykov (Dice.com)

Average rating:

(4.00, 1 rating)

Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Word embeddings under the hood: How neural networks learn from language

Location: LL21 B

Patrick Harrison (S&P Global)

Average rating:

(4.33, 3 ratings)

Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Building ML and AI pipelines with Spark and TensorFlow

Location: Expo Hall 1

Secondary topics: Expo Hall

Chris Fregly (Amazon Web Services)

Average rating:

(5.00, 1 rating)

Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

HDFS on Kubernetes: Tech deep dive on locality and security

Location: LL21 C/D

Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)

Average rating:

(5.00, 1 rating)

There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Cataloging the visible universe through Bayesian inference at petascale in Julia

Location: LL20 A

Keno Fischer (Julia Computing)

Julia is rapidly becoming a popular language at the forefront of scientific discovery. Keno Fischer explores one of the most ambitious use cases for Julia: using machine learning to derive a catalog of astronomical objects from multiterabyte astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Not your parents' machine learning: How to ship an XGBoost churn prediction app in under four weeks

Location: LL20 D

Goodman Gu (Cogito)

Average rating:

(5.00, 3 ratings)

Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Using deep learning to solve challenging problems

Location: LL21 B

Jeff Dean (Google)

Average rating:

(4.89, 9 ratings)

The Google Brain team conducts research on difficult problems in artificial intelligence and builds large-scale computer systems for machine learning research, both of which have been applied to dozens of Google products. Jeff Dean highlights some of Google Brain's projects with an eye toward how they can be used to solve challenging problems. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Big data insights equal big money: Stories from the trenches at GoDaddy

Location: 210 C/G

Felix Gorodishter (GoDaddy)

Average rating:

(3.00, 2 ratings)

GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email. Read more.
