Schedule: Big data and data science in the cloud sessions: Big data conference & machine learning training

9:00am–5:00pm Monday, March 5 & Tuesday, March 6

Real-time systems with Spark Streaming and Kafka

Location: 114

Jesse Anderson (Big Data Institute)

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Learning PyTorch by building a recommender system

Location: LL21 A

Secondary topics: Graphs and Time-series

Mo Patel (Independent), Neejole Patel (Virginia Tech)

Average rating:

(2.50, 4 ratings)

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Building your first big data application on AWS

Location: LL21 B

Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Ryan Nienhuis (Amazon Web Services), Randy Ridgley (Amazon Web Services)

Average rating:

(4.50, 2 ratings)

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

A deep dive into running data analytic workloads in the cloud

Location: 210 D/H

Jason Wang (Cloudera), Mala Ramakrishnan (Cloudera), Stefan Salandy (Cloudera), Aishwarya Venkataraman (Cloudera), Vinithra Varadharajan (Cloudera), Aaron Myers (Cloudera, Inc.)

Average rating:

(3.25, 4 ratings)

Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices. Read more.

9:00am–5:00pm Tuesday, March 6, 2018

Data Case Studies

Location: LL20 A

Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Matt Conners (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Jennie Shin (Kaiser Permanente), Valentin Bercovici (PencilDATA), Wayde Fleener (General Mills), Joe Dumoulin (Next IT), Jules Malin (GoPro), Taylor Martin (O'Reilly Media), Divya Ramachandran (Captricity)

Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

A/B testing at scale: Accelerating software innovation

Location: LL21 C/D

Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Somit Gupta (Microsoft), Paul Raff (Microsoft)

Average rating:

(4.00, 3 ratings)

Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year. Read more.

11:00am–11:40am Wednesday, March 7, 2018

Cloud, multicloud, and the data refinery

Location: LL21 C/D

Tom Fisher (MapR Technologies)

The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations. Read more.

11:00am–11:40am Wednesday, March 7, 2018

How does a big data professional get started with AI?

Location: LL20 D

Wee Hyong Tok (Microsoft), Danielle Dean (iRobot)

Average rating:

(3.50, 2 ratings)

Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Powering robotics clouds with Alluxio

Location: LL21 C/D

Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)

Average rating:

(4.00, 1 rating)

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Machine learning versus machine learning in production

Location: LL21 E/F

Manu Mukerji (8x8)

Average rating:

(4.22, 9 ratings)

Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; production issues that arise when applying ML at scale in production; lessons learned; and more. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Streaming big data in the cloud: What to consider and why

Location: 230 A

Secondary topics: Graphs and Time-series

Bill Chambers (Databricks), Michael Armbrust (Databricks)

Average rating:

(4.60, 5 ratings)

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Accelerating deep learning on Apache Spark using BigDL with coarse-grained scheduling

Location: LL21 B

Sergey Ermolin (Intel), Shivaram Venkataraman (Microsoft Research)

Average rating:

(3.00, 1 rating)

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Big data, big problems: Predicting climate change

Location: 210 D/H

Ari Gesher (Kairos Aerospace)

A warming planet needs precise, localized predictions about the effects of climate change to make good long-term and medium-term economic decisions. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to predict and package climate predictions as data products. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Radically modular data ingestion APIs in Apache Beam

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Eugene Kirpichov (Google)

Average rating:

(4.75, 4 ratings)

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Spark on Kubernetes: A case study from JD.com

Location: LL21 E/F

Zhen Fan (JD.com), Wei Ting Chen (Intel Corporate)

Average rating:

(4.00, 4 ratings)

Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Machine learning to tackle industrial data fusion

Location: LL20 A

Secondary topics: Graphs and Time-series

Alexandra Gunderson (Arundo Analytics)

Average rating:

(5.00, 1 rating)

Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Vectorized query processing using Apache Arrow

Location: 230 C

Siddharth Teotia (Dremio)

Average rating:

(5.00, 1 rating)

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

Location: 230 A

Henry Cai (Pinterest), Yi Yin (Pinterest)

Average rating:

(3.00, 1 rating)

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Why nobody cares about your anomaly detection

Location: LL20 A

Secondary topics: Graphs and Time-series

Baron Schwartz (VividCortex)

Average rating:

(4.80, 5 ratings)

Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Best practices for productionizing Apache Spark MLlib models

Location: LL20 D

Joseph Bradley (Databricks)

Average rating:

(5.00, 1 rating)

Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Deep learning for domain-specific entity extraction from unstructured text

Location: LL21 B

Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)

Average rating:

(3.50, 2 ratings)

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Presto query gate: Identifying and stopping rogue queries

Location: 230 C

Ritesh Agrawal (Uber), Anirban Deb (Uber)

Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Semi-automated analytic pipeline creation and validation using active learning

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Sean Ma (Trifacta)

Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Real-time deep link analytics: The next stage of graph analytics

Location: Expo Hall 1

Secondary topics: Expo Hall, Graphs and Time-series

Yu Xu (TigerGraph)

Average rating:

(5.00, 2 ratings)

Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Pirelli Connesso: Where the road meets the cloud

Location: LL21 C/D

Carlo Torniai (Pirelli Tyre)

Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of contributions across cross-functional teams. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Code Property Graph: A modern, queryable data storage for source code

Location: LL20 C

Secondary topics: Graphs and Time-series

Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)

Average rating:

(4.00, 1 rating)

Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing them to be rapidly accessed. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Distributed clinical models: Inference without sharing patient data

Location: LL20 D

Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)

Average rating:

(3.00, 2 ratings)

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one built from the entire dataset. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Improving user-merchant propensity modeling using neural collaborative filtering and wide and deep models on Spark BigDL at scale

Location: LL21 B

Sergey Ermolin (Intel), Suqiang Song (Mastercard)

Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Pipeline testing with Great Expectations

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Abe Gong (Superconductive Health), James Campbell (USG)

Average rating:

(5.00, 4 ratings)

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test. Read more.

11:00am–11:40am Thursday, March 8, 2018

Analytics in the cloud: Building a modern cloud-based big data warehouse

Location: LL21 E/F

Greg Rahn (Cloudera)

Average rating:

(3.40, 5 ratings)

For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud. Read more.

11:00am–11:40am Thursday, March 8, 2018

Understanding metadata

Location: 210 C/G

Secondary topics: Graphs and Time-series

Michael Schrenk (Self-Employed)

Average rating:

(4.00, 5 ratings)

Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Distributed deep learning with containers on heterogeneous GPU clusters

Location: LL21 C/D

Dong Meng (MapR)

Average rating:

(3.33, 3 ratings)

Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as orchestration layer to manage containers to train and deploy deep learning models using GPU clusters. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Hive as a service

Location: LL21 E/F

Szehon Ho (Criteo), Pawel Szostek (Criteo)

Average rating:

(4.50, 2 ratings)

Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark

Location: LL21 B

Jiao (Jennie) Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Magellan: Scalable and fast geospatial analytics

Location: LL20 A

Ram Sriharsha (Databricks)

Average rating:

(4.75, 4 ratings)

How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Continuous delivery for NLP on Kubernetes: Lessons learned

Location: LL21 C/D

Michelle Casbon (Google)

Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Data reflections: Making data fast and easy to use without making copies

Location: 230 C

Tomer Shiran (Dremio), Jacques Nadeau (Dremio)

Average rating:

(5.00, 3 ratings)

Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Building ML and AI pipelines with Spark and TensorFlow

Location: Expo Hall 1

Secondary topics: Expo Hall

Chris Fregly (Amazon Web Services)

Average rating:

(5.00, 1 rating)

Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Lyft's analytics pipeline: From Redshift to Apache Hive and Presto

Location: LL21 E/F

Shenghu Yang (Lyft)

Average rating:

(5.00, 1 rating)

Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Not your parents' machine learning: How to ship an XGBoost churn prediction app in under four weeks

Location: LL20 D

Goodman Gu (Cogito)

Average rating:

(5.00, 3 ratings)

Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Cuttlefish: Lightweight primitives for online tuning

Location: 230 C

Tomer Kaftan (University of Washington)

Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Big data insights equal big money: Stories from the trenches at GoDaddy

Location: 210 C/G

Felix Gorodishter (GoDaddy)

Average rating:

(3.00, 2 ratings)

GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email. Read more.
