Sessions: Big data conference & machine learning training

11:00am–11:40am Wednesday, March 7, 2018

Cloud, multicloud, and the data refinery

Location: LL21 C/D

Tom Fisher (MapR Technologies)

The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations.

11:00am–11:40am Wednesday, March 7, 2018

Accelerating development velocity of production ML systems with Docker

Location: LL21 E/F

Kinnary Jangla (Pinterest)

Average rating:

(2.25, 8 ratings)

Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems during debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.

11:00am–11:40am Wednesday, March 7, 2018

Using machine learning to simplify Kafka operations

Location: 230 A

Secondary topics: Graphs and Time-series

Shivnath Babu (Duke University | Unravel Data Systems), Dhruv Goel (Microsoft)

Average rating:

(4.50, 2 ratings)

Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.

11:00am–11:40am Wednesday, March 7, 2018

Interpretable machine learning products

Location: LL20 A

Mike Lee Williams (Cloudera Fast Forward Labs)

Average rating:

(4.86, 7 ratings)

Interpretable models result in more accurate, safer, and more profitable machine learning products. But interpretability can be hard to ensure. Michael Lee Williams explores the growing business case for interpretability and its concrete applications, including churn, finance, and healthcare. Along the way, Michael offers an overview of the open source, model-agnostic tool LIME.

11:00am–11:40am Wednesday, March 7, 2018

Breaking up the block: Using heterogenous population modeling to drive growth

Location: LL20 C

Daniel Lurie (1989)

All successful startups thrive on tight product-market fit, which can produce homogenous initial user bases. To become the next big thing, your user base will need to diversify, and your product must change to accommodate new needs. Daniel Lurie explains how Pinterest leverages external data to measure racial and income diversity in its user base and changed user modeling to drive growth.

11:00am–11:40am Wednesday, March 7, 2018

How does a big data professional get started with AI?

Location: LL20 D

Wee Hyong Tok (Microsoft), Danielle Dean (iRobot)

Average rating:

(3.50, 2 ratings)

Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI.

11:00am–11:40am Wednesday, March 7, 2018

The current state of TensorFlow and where it's headed in 2018

Location: LL21 B

Rajat Monga (Google)

Average rating:

(4.40, 5 ratings)

Rajat Monga offers an overview of TensorFlow's progress and adoption in 2017 before looking ahead to the areas of importance in the future (performance, usability, and ubiquity) and the efforts TensorFlow is making in those areas.

11:00am–11:40am Wednesday, March 7, 2018

What's new in Hadoop 3.0

Location: 230 C

Daniel Templeton (Cloudera), Andrew Wang (Cloudera)

Average rating:

(4.67, 6 ratings)

Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.

11:00am–11:40am Wednesday, March 7, 2018

Executive Briefing: BI on big data

Location: 210 A/E

Mark Madsen (Teradata), Shant Hovsepian (Arcadia Data)

Average rating:

(3.29, 7 ratings)

There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian outline the trade-offs between a number of architectures that provide self-service access to data and discuss the pros and cons of architectures, deployment strategies, and examples of BI on big data.

11:00am–11:40am Wednesday, March 7, 2018

Progressive data governance for emerging technologies

Location: 210 C/G

Anne Buff (SAS)

Average rating:

(4.50, 2 ratings)

Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement.

11:00am–11:40am Wednesday, March 7, 2018

Bladder cancer diagnosis using deep learning

Location: 210 D/H

Mauro Damo (Dell EMC), Wei Lin (Dell EMC)

Average rating:

(3.50, 2 ratings)

Image-recognition-based classification of diseases will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identify bladder cancer in patients using unsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.

11:00am–11:40am Wednesday, March 7, 2018

The future of ETL isn’t what it used to be

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Gwen Shapira (Confluent)

Average rating:

(4.93, 14 ratings)

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.

11:00am–11:40am Wednesday, March 7, 2018

Digital transformation demands faster, more productive data science (sponsored by DataScience.com)

Location: LL20 B

Ian Swanson (DataScience.com)

Average rating:

(4.50, 2 ratings)

Ian Swanson shares strategies for leading more productive data science teams, along with steps you can take today to meet growing demands for AI and machine learning use cases.

11:00am–11:40am Wednesday, March 7, 2018

Being smarter than dinosaurs: How NASA uses deep learning for planetary defense

Location: Expo Hall 1

Secondary topics: Expo Hall

Siddha Ganju (NVIDIA)

Siddha Ganju explains how the FDL lab at NASA uses artificial intelligence to improve and automate the identification of meteors above human-level performance using meteor shower images, to recover known meteor shower streams, and to characterize previously unknown meteor showers using orbital data. The research is aimed at providing more warning time for long-period comet impacts.

11:00am–11:40am Wednesday, March 7, 2018

Architecting an edge-to-cloud data pipeline to unify multiple data sources and processing engines (sponsored by NetApp)

Location: LL21 A

Santosh Rao (NetApp)

Average rating:

(5.00, 1 rating)

Santosh Rao explores the architecture of a data pipeline from edge to core to cloud and across various data sources and processing engines and explains how to build a solution architecture that enables businesses to maximize competitive differentiation by unifying data insights in compelling yet efficient ways.

11:00am–11:40am Wednesday, March 7, 2018

Focus on your business: Case studies on building data solutions that meet your needs (sponsored by Microsoft)

Location: 230 B

Tobias Ternstrom (Microsoft)

Average rating:

(4.00, 1 rating)

Tobias Ternstrom leads a deep dive into case studies from three Microsoft customers who put technology before solutions. Tobias examines the decisions that brought them there and outlines how they got back on track and solved their business problems.

11:00am–11:40am Wednesday, March 7, 2018

Data and ethics: Brainstorming session

Location: 210 B/F

Natalie Evans Harris (BrightHive)

Average rating:

(5.00, 2 ratings)

Join Natalie Evans Harris for a brainstorming session on data and ethics. You'll cover the current Community Principles on Ethical Data Practices (CPEDP) and next steps, existing tools that support ethical data practices, how the community can support the needs of the individual, and whether or not the community needs to be held accountable to regulations (or something more like fiduciary duty).

11:50am–12:30pm Wednesday, March 7, 2018

Powering robotics clouds with Alluxio

Location: LL21 C/D

Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)

Average rating:

(4.00, 1 rating)

Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.

11:50am–12:30pm Wednesday, March 7, 2018

Machine learning versus machine learning in production

Location: LL21 E/F

Manu Mukerji (8x8)

Average rating:

(4.22, 9 ratings)

Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; production issues that arise when applying ML at scale in production; lessons learned; and more.

11:50am–12:30pm Wednesday, March 7, 2018

Streaming big data in the cloud: What to consider and why

Location: 230 A

Secondary topics: Graphs and Time-series

Bill Chambers (Databricks), Michael Armbrust (Databricks)

Average rating:

(4.60, 5 ratings)

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.

11:50am–12:30pm Wednesday, March 7, 2018

Deploying and monitoring interactive machine learning applications with Clipper

Location: LL20 A

Dan Crankshaw (UC Berkeley RISELab)

Average rating:

(4.25, 4 ratings)

Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes.

11:50am–12:30pm Wednesday, March 7, 2018

Who are we? The largest-scale study of professional data scientists

Location: LL20 C

Miryung Kim (UCLA), Muhammad Gulzar (UCLA)

Average rating:

(3.50, 2 ratings)

Even though we know that there are more data scientists in the workforce today, neither what those data scientists actually do nor what we even mean by data scientists has been studied quantitatively. Miryung Kim and Muhammad Gulzar share the results of a large-scale survey of 793 professional data scientists and detail several trends about data scientists in the software engineering context.

11:50am–12:30pm Wednesday, March 7, 2018

Want to build a better chatbot? Start with your data.

Location: LL20 D

Andrew Mattarella-Micke (Intuit)

Average rating:

(5.00, 1 rating)

When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Andrew Mattarella-Micke shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices he's learned along the way.

11:50am–12:30pm Wednesday, March 7, 2018

Accelerating deep learning on Apache Spark using BigDL with coarse-grained scheduling

Location: LL21 B

Sergey Ermolin (Intel), Shivaram Venkataraman (Microsoft Research)

Average rating:

(3.00, 1 rating)

The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.

11:50am–12:30pm Wednesday, March 7, 2018

Metrics-driven tuning of Apache Spark at scale

Location: 230 C

Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)

Average rating:

(4.00, 4 ratings)

Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.

11:50am–12:30pm Wednesday, March 7, 2018

Executive Briefing: The conversational AI revolution

Location: 210 A/E

Yishay Carmiel (IntelligentWire)

Average rating:

(4.00, 3 ratings)

One of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean and to take particular actions based on that information. This goal is the essence of conversational AI. Yishay Carmiel explores the latest breakthroughs and revolutions in this field and the challenges still to come.

11:50am–12:30pm Wednesday, March 7, 2018

Managing data science at scale

Location: 210 C/G

Matthew Granade (Domino Data Lab)

Average rating:

(2.00, 1 rating)

Predictive analytics and artificial intelligence have become critical competitive capabilities. Yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale.

11:50am–12:30pm Wednesday, March 7, 2018

Big data, big problems: Predicting climate change

Location: 210 D/H

Ari Gesher (Kairos Aerospace)

A warming planet needs precise, localized predictions about the effects of climate change to support good long-term and medium-term economic decision making. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to predict and package climate predictions as data products.

11:50am–12:30pm Wednesday, March 7, 2018

Radically modular data ingestion APIs in Apache Beam

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Eugene Kirpichov (Google)

Average rating:

(4.75, 4 ratings)

Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn.

11:50am–12:30pm Wednesday, March 7, 2018

Spark NLP in action: Improving patient flow forecasting at Kaiser Permanente

Location: Expo Hall 1

Secondary topics: Expo Hall

David Talby (Pacific AI), Santosh Kulkarni (Kaiser Permanente)

Average rating:

(3.50, 2 ratings)

David Talby and Santosh Kulkarni explain how Kaiser Permanente uses the open source NLP library for Apache Spark to tackle one of the most common challenges with applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline.

11:50am–12:30pm Wednesday, March 7, 2018

Journey to digital (sponsored by IBM)

Location: LL21 A

Seth Dobrin, PhD (IBM)

Average rating:

(3.00, 1 rating)

Companies that want to become truly digital must take a journey of three steps: data transformation, data science transformation, and digital transformation. This also requires transforming the business with machine learning to fundamentally change the relationship with customers. Seth Dobrin explains the detailed steps along the way to digital transformation, and the pitfalls.

11:50am–12:30pm Wednesday, March 7, 2018

Analytics in real time, the (Grey's) anatomy of event streaming (sponsored by MemSQL)

Location: 230 B

Adam Ahringer (Disney-ABC TV Digital Media)

Average rating:

(3.20, 5 ratings)

Adam Ahringer explains how Disney-ABC TV leverages Amazon Kinesis and MemSQL to provide real-time insights based on user telemetry as well as the platform for traditional data warehousing activities.

11:50am–12:30pm Wednesday, March 7, 2018

The four elements of modern analytics (sponsored by MicroStrategy)

Location: LL20 B

Vijay Kotu (Oath)

Average rating:

(4.67, 3 ratings)

Vijay Kotu details how Oath is using MicroStrategy to combine elements of data science, enterprise mobility, information design, and data lakes in its transformation into an intelligent enterprise.

11:50am–12:30pm Wednesday, March 7, 2018

Speed up mission-critical analytics in the cloud (sponsored by Kyligence)

Location: 210 B/F

Billy Liu (Kyligence)

As organizations look to scale their analytics capability, the need to grow beyond a traditional data warehouse becomes critical, and cloud-based solutions allow more flexibility while being more cost efficient. Billy Liu offers an overview of Kyligence Cloud, a managed Apache Kylin online service designed to speed up mission-critical analytics at web scale for big data.

1:50pm–2:30pm Wednesday, March 7, 2018

20 Netflix-style principles and practices to get the most out of your data platform

Location: LL21 C/D

Kurt Brown (Netflix)

Average rating:

(4.19, 16 ratings)

Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.

1:50pm–2:30pm Wednesday, March 7, 2018

Spark on Kubernetes: A case study from JD.com

Location: LL21 E/F

Zhen Fan (JD.com), Wei Ting Chen (Intel Corporate)

Average rating:

(4.00, 4 ratings)

Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.

1:50pm–2:30pm Wednesday, March 7, 2018

Approximation data structures in streaming data processing

Location: 230 A

Debasish Ghosh (Lightbend)

Average rating:

(3.33, 3 ratings)

Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures.

1:50pm–2:30pm Wednesday, March 7, 2018

Machine learning to tackle industrial data fusion

Location: LL20 A

Secondary topics: Graphs and Time-series

Alexandra Gunderson (Arundo Analytics)

Average rating:

(5.00, 1 rating)

Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources.

1:50pm–2:30pm Wednesday, March 7, 2018

Writing distributed graph algorithms

Location: LL20 C

Secondary topics: Graphs and Time-series

Andrew Ray (Sam’s Club Technology)

Average rating:

(3.00, 3 ratings)

Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions.

1:50pm–2:30pm Wednesday, March 7, 2018

Spark ML optimization at Intel: A case study

Location: LL20 D

Weisheng Xie (Orange Financial), Peng Meng (Intel)

Average rating:

(5.00, 1 rating)

Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on with Spark ML and introduce the methodology behind Intel's work on Spark ML optimization.

1:50pm–2:30pm Wednesday, March 7, 2018

Deep credit risk ranking with LSTM

Location: LL21 B

Secondary topics: Graphs and Time-series

Kyle Grove (Teradata)

Average rating:

(5.00, 5 ratings)

Kyle Grove explains how Teradata and some of the world’s largest financial institutions are innovating credit risk ranking with deep learning techniques and AnalyticOps. With the AnalyticOps framework, these organizations have built models with increased accuracy to drive more profitable lending decisions while being explainable to regulators.

1:50pm–2:30pm Wednesday, March 7, 2018

Vectorized query processing using Apache Arrow

Location: 230 C

Siddharth Teotia (Dremio)

Average rating:

(5.00, 1 rating)

Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.

1:50pm–2:30pm Wednesday, March 7, 2018

Executive Briefing: Building effective heterogeneous data communities—Driving organizational outcomes with broad-based data science

Location: 210 A/E

Frances Haugen (Pinterest), Patrick Phelps (Pinterest)

Average rating:

(4.67, 3 ratings)

Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.

1:50pm–2:30pm Wednesday, March 7, 2018

The rise of big data governance: Insight on this emerging trend from active open source initiatives

Location: 210 C/G

John Mertic (Linux Foundation), Maryna Strelchuk (ING)

John Mertic and Maryna Strelchuk detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share how companies like ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative.

1:50pm–2:30pm Wednesday, March 7, 2018

Reinventing healthcare: Early detection of Alzheimer’s disease with deep learning

Location: 210 D/H

Ayin Vala (DeepMD | Foundation for Precision Medicine)

Average rating:

(4.33, 3 ratings)

Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient.

1:50pm–2:30pm Wednesday, March 7, 2018

How to build leakproof stream processing pipelines with Apache Kafka and Apache Spark

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)

Average rating:

(4.25, 4 ratings)

When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.

1:50pm–2:30pm Wednesday, March 7, 2018

Accelerating analytics and AI from the edge to the cloud (sponsored by Intel)

Location: 230 B

Kevin Huiskes (Intel), Radhika Rangarajan (Intel)

Advanced analytics and AI workloads require a scalable and optimized architecture, from hardware and storage to software and applications. Kevin Huiskes and Radhika Rangarajan share best practices for accelerating analytics and AI and explain how businesses globally are leveraging Intel’s technology portfolio, along with optimized frameworks and libraries, to build AI workloads at scale.

1:50pm–2:30pm Wednesday, March 7, 2018

Lessons learned deploying machine learning and deep learning models in production at major tech companies

Location: Expo Hall 1

Secondary topics: Expo Hall

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Average rating:

(4.00, 3 ratings)

Deploying machine learning models and deep learning models in production is hard. Harish Doddi and Jerry Xu outline the enterprise data science lifecycle, covering how production model deployment flow works, challenges, best practices, and lessons learned. Along the way, they explain why monitoring models in production should be mandatory.

1:50pm–2:30pm Wednesday, March 7, 2018

Data at scale and speed: Real-world use cases (sponsored by MapR)

Location: LL20 B

Ted Dunning (MapR, now part of HPE)

Average rating:

(4.67, 3 ratings)

Getting value from data at large scale and on a variety of time scales is hard. True, it's not as hard as it used to be, but you still don’t win by default. Ted Dunning explains why it takes good design, the right technology, and a pragmatic approach to succeed.

2:40pm–3:20pm Wednesday, March 7, 2018

Dogfooding data at Lyft

Location: LL21 C/D

Mark Grover (Lyft), Arup Malakar (Lyft)

Average rating:

(4.00, 2 ratings)

Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving.

2:40pm–3:20pm Wednesday, March 7, 2018

DataOps: An Agile methodology for data-driven organizations

Location: LL21 E/F

Ellen Friedman (Independent)

Average rating:

(4.43, 7 ratings)

DataOps, a culture and practice for building data-intensive applications (including machine learning pipelines), expands DevOps philosophy to include data-heavy roles such as data engineering and data science. DataOps is based on cross-functional collaboration resulting in fast time to value and an agile workflow. Ellen Friedman offers an overview of DataOps and explains how to implement it.

2:40pm–3:20pm Wednesday, March 7, 2018

Moving the needle of the pin: Streaming hundreds of terabytes of pins from MySQL to S3/Hadoop continuously

Location: 230 A

Henry Cai (Pinterest), Yi Yin (Pinterest)

Average rating:

(3.00, 1 rating)

With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.

2:40pm–3:20pm Wednesday, March 7, 2018

Why nobody cares about your anomaly detection

Location: LL20 A

Secondary topics: Graphs and Time-series

Baron Schwartz (VividCortex)

Average rating:

(4.80, 5 ratings)

Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view.

2:40pm–3:20pm Wednesday, March 7, 2018

Taming deep learning

Location: LL20 C

Evan Sparks (Determined AI)

Average rating:

(5.00, 1 rating)

Deep learning has shown tremendous improvements in a number of areas and has justifiably generated enormous excitement. However, several key challenges, from prohibitive hardware requirements to immature software offerings, are impeding widespread enterprise adoption. Evan Sparks details fundamental challenges facing organizations looking to adopt deep learning and shares possible solutions.

2:40pm–3:20pm Wednesday, March 7, 2018

Best practices for productionizing Apache Spark MLlib models

Location: LL20 D

Joseph Bradley (Databricks)

Average rating:

(5.00, 1 rating)

Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.

2:40pm–3:20pm Wednesday, March 7, 2018

Deep learning for domain-specific entity extraction from unstructured text

Location: LL21 B

Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)

Average rating:

(3.50, 2 ratings)

Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.

2:40pm–3:20pm Wednesday, March 7, 2018

Presto query gate: Identifying and stopping rogue queries

Location: 230 C

Ritesh Agrawal (Uber), Anirban Deb (Uber)

Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money.

2:40pm–3:20pm Wednesday, March 7, 2018

Executive Briefing: Artificial intelligence—The next digital frontier?

Location: 210 A/E

Michael Chui (McKinsey Global Institute)

Average rating:

(4.83, 6 ratings)

After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge.

2:40pm–3:20pm Wednesday, March 7, 2018

Building a data science idea factory: How to prioritize the portfolio of a large, diverse, and opinionated data science team

Location: 210 C/G

Katie Malone (Civis Analytics), Skipper Seabold (Civis Analytics)

Average rating:

(5.00, 1 rating)

A huge challenge for data science managers is determining priorities for their teams, which often have more good ideas than they have time. Katie Malone and Skipper Seabold share a framework that their large and diverse data science team uses to identify, discuss, select, and manage data science projects for a fast-moving startup. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

AI-powered crime prediction

Location: 210 D/H

Or Herman-Saffar (Dell), Ran Taig (Dell EMC)

Average rating:

(1.67, 3 ratings)

What if we could predict when and where crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly published dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Semi-automated analytic pipeline creation and validation using active learning

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Sean Ma (Trifacta)

Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Real-time deep link analytics: The next stage of graph analytics

Location: Expo Hall 1

Secondary topics: Expo Hall, Graphs and Time-series

Yu Xu (TigerGraph)

Average rating:

(5.00, 2 ratings)

Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Managing the intelligent data pipeline and the connected enterprise (sponsored by Hitachi Vantara)

Location: LL21 A

Chuck Yarbrough (Hitachi Vantara)

Intelligently managing the data pipeline is the key to driving business acceleration and reducing costs. Chuck Yarbrough outlines ways to gain control over the data pipeline. Along the way, you’ll learn how cloud, big data, and machine learning models intersect and how streaming and cloud integration can help create the connected enterprise. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

The Snowflake data warehouse: How Sharethrough analyzes petabytes of event data in a SQL database (sponsored by Snowflake)

Location: 230 B

Dave Abercrombie (Sharethrough)

Average rating:

(3.50, 2 ratings)

Dave Abercrombie explains how Sharethrough used Snowflake to build an analytic and reporting platform that handles petabyte-scale data with ease. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Building machine learning systems for scale: Amazon insights and best practices (sponsored by Amazon Web Services)

Location: LL20 B

Guy Ernest (Amazon Web Services)

Average rating:

(4.50, 4 ratings)

Amazon SageMaker is a platform to build, train, and deploy machine learning models at any scale. Guy Ernest explores the scalable algorithms that SageMaker provides, distributed training with Apache MXNet and TensorFlow, automatic tuning of hyperparameters, and model deployments. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Pirelli Connesso: Where the road meets the cloud

Location: LL21 C/D

Carlo Torniai (Pirelli Tyre)

Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of contributions across cross-functional teams. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Personalization at scale: Mastering the challenges of personalization to create compelling user experiences

Location: LL21 E/F

Rahim Daya (Pinterest)

Average rating:

(3.50, 4 ratings)

Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Stream storage with Apache BookKeeper

Location: 230 A

Secondary topics: Graphs and Time-series

Sijie Guo (StreamNative)

Average rating:

(3.67, 3 ratings)

Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Machine learning applications for the industrial internet

Location: LL20 A

Secondary topics: Graphs and Time-series

Joseph Richards (GE Digital)

Average rating:

(5.00, 1 rating)

Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Code Property Graph: A modern, queryable data storage for source code

Location: LL20 C

Secondary topics: Graphs and Time-series

Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)

Average rating:

(4.00, 1 rating)

Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing them to be rapidly accessed. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Transforming a machine learning prototype to a deployable solution leveraging Spark in healthcare

Location: LL20 D

Rachita Chandra (IBM Watson Health)

Average rating:

(3.00, 1 rating)

Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Detecting time series anomalies at Uber scale with recurrent neural networks

Location: LL21 B

Secondary topics: Graphs and Time-series

Andrea Pasqua (Uber), Anny Chen (Uber)

Average rating:

(4.60, 5 ratings)

Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

NoSQL no more: SQL on Druid with Apache Calcite

Location: 230 C

Gian Merlino (Imply)

Average rating:

(4.00, 2 ratings)

Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Executive Briefing: Managing successful data projects—Technology selection and team building

Location: 210 A/E

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Average rating:

(4.67, 3 ratings)

Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Make data work: A VC panel discussion on prospectives and trends

Location: 210 C/G

Moderated by:

Lisha Li (Amplify Partners)

Panelists:

Katherine Boyle (General Catalyst), Wayne Hu (SignalFire), Andrew Parker (Spark Capital), Brandon Reeves (Lux Capital)

Average rating:

(4.00, 1 rating)

To anticipate who will succeed and to invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, top-tier VCs look over the horizon to consider the big trends in how data is being put to work in startups and share what they think the field will look like in a few years (or more). Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

If you can’t measure it, you can’t improve it: How reporting and experimentation fuel product innovation at LinkedIn

Location: 210 D/H

Kapil Surlaker (LinkedIn), Ya Xu (LinkedIn)

Average rating:

(5.00, 3 ratings)

Metrics measurement and experimentation play crucial roles in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Building a flexible ML pipeline at a B2B AI startup

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Dorna Bandari (Jetlore)

Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Leveraging live data to realize the smart cities vision

Location: Expo Hall 1

Secondary topics: Expo Hall

Arun Kejariwal (Independent), Roman Smolgovsky (MZ)

One of the key application domains leveraging live data is smart cities, but success depends on the availability of generic platforms that support high throughput and ultralow latency. Arun Kejariwal and Francois Orsini offer an overview of Satori's live data platform and walk you through a country-scale case study of its implementation. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

BI and big data convergence in modern cloud architecture (sponsored by Arcadia Data)

Location: LL20 B

Terry McFadden (P&G)

Procter & Gamble relies heavily on data, particularly for BI. Running compute where the data lives is critical for performance, and the company has found added benefits to this architecture, which complements its Hadoop and BI needs. Terry McFadden offers an overview of P&G's modern analytics architecture and explains how it differs from traditional approaches. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Winning the big data war pays big dividends for Wargaming (sponsored by SAS)

Location: 230 B

Alexander Ryabov (Wargaming), Jonathan Crow (Wargaming)

Alexander Ryabov and Jonathan Crow explain how Wargaming is winning the battle for bigger profits in the virtual world of online gaming using a best-in-class business intelligence solution to equip its business units with decision-making tools. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

How to protect big data in a containerized environment

Location: LL21 C/D

Thomas Phelan (HPE BlueData)

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Better machine learning logistics with the rendezvous architecture

Location: LL21 E/F

Ted Dunning (MapR, now part of HPE)

Average rating:

(5.00, 1 rating)

Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot zero latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Streaming SQL to unify batch and stream processing: Theory and practice with Apache Flink at Uber

Location: 230 A

Fabian Hueske (data Artisans), Shuyi Chen (Uber)

Average rating:

(5.00, 1 rating)

Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Continuous machine learning over streaming data

Location: LL20 A

Secondary topics: Graphs and Time-series

Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)

Average rating:

(5.00, 8 ratings)

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Fast and effective natural language understanding

Location: LL20 C

Mike Conover (SkipFlag)

Average rating:

(5.00, 4 ratings)

Mike Conover offers an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Distributed clinical models: Inference without sharing patient data

Location: LL20 D

Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)

Average rating:

(3.00, 2 ratings)

Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Improving user-merchant propensity modeling using neural collaborative filtering and wide and deep models on Spark BigDL at scale

Location: LL21 B

Sergey Ermolin (Intel), Suqiang Song (Mastercard)

Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Executive Briefing: Legal best practices for making data work

Location: 210 A/E

Alysa Z. Hutnik (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)

Average rating:

(5.00, 1 rating)

Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.” Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

The mathematical corporation: A new leadership mindset for the machine intelligence era

Location: 210 C/G

Stephanie Beben (Booz Allen Hamilton)

How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Stephanie Beben shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.” Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

How to avoid pitfalls when reasoning with data

Location: 210 D/H

Derek Ruths (CAI)

Unreasonable sales forecasts, badly overstocked inventory, misguided investments . . . bad analyses happen all the time, leading to bad decisions and costing businesses millions of dollars. Derek Ruths shares the five most common issues that lead to bad data-informed thinking. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Pipeline testing with Great Expectations

Location: 212 A-B

Secondary topics: Data Integration and Data Pipelines

Abe Gong (Superconductive Health), James Campbell (USG)

Average rating:

(5.00, 4 ratings)

Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Classifying job execution using deep learning

Location: 230 C

Ash Munshi (Pepperdata)

Average rating:

(5.00, 1 rating)

Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Small pieces, loosely joined: A skater's code

Location: Expo Hall 1

Secondary topics: Expo Hall

Rodney Mullen (Almost Skateboards)

Average rating:

(5.00, 2 ratings)

The essence of modern skating is learning tricks that couple with specific terrain. Activision’s video game franchise testifies to the nearly endless possibilities. Rodney Mullen offers a nuanced look at how skaters nudge the endpoints of disparate submovements to create new combinations that may shine a different light on ideas in machine learning—plus it’s a lot of fun. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Bringing AI into the IoT (sponsored by SAS)

Location: 230 B

Evan Guarnaccia (SAS)

Average rating:

(3.00, 1 rating)

As the internet of things grows, there is an increasing need for sophisticated but lightweight analytics at the edge. Evan Guarnaccia walks you through a multiphase analytics approach to IoT data, analyzing data at rest to discover patterns of interest and develop analytical models that can be easily deployed into a streaming analytics engine out at the edge, in the fog, or in the cloud. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

The future of ETL isn’t what it used to be

Location: 210 B/F

Secondary topics: Data Integration and Data Pipelines

Gwen Shapira (Confluent)

Average rating:

(5.00, 3 ratings)

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.

11:00am–11:40am Thursday, March 8, 2018

Operationalize deep learning: How to deploy and consume your LSTM networks for predictive maintenance scenarios

Location: LL21 C/D

Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)

Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance. Read more.

11:00am–11:40am Thursday, March 8, 2018

Analytics in the cloud: Building a modern cloud-based big data warehouse

Location: LL21 E/F

Greg Rahn (Cloudera)

Average rating:

(3.40, 5 ratings)

For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud. Read more.

11:00am–11:40am Thursday, March 8, 2018

Foundations of streaming SQL; or, How I learned to love stream and table theory

Location: 230 A

Tyler Akidau (Google)

Average rating:

(5.00, 4 ratings)

What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general. Read more.

11:00am–11:40am Thursday, March 8, 2018

The limits of inference: What data scientists can learn from the reproducibility crisis in science

Location: LL20 A

Clare Gollnick (NS1)

Average rating:

(4.86, 7 ratings)

At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project. Read more.

11:00am–11:40am Thursday, March 8, 2018

Explaining machine learning models

Location: LL20 C

Evan Kriminger (ZestFinance)

Average rating:

(4.40, 5 ratings)

What does it mean to explain a machine learning model, and why is it important? Mike Ruberry offers an overview of several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques presents a different perspective, and their clever application can reveal new insights and solve business requirements. Read more.

11:00am–11:40am Thursday, March 8, 2018

Data science at Slack

Location: LL20 D

Josh Wills (Slack)

Average rating:

(4.00, 3 ratings)

Josh Wills describes recent data science and machine learning projects at Slack. Read more.

11:00am–11:40am Thursday, March 8, 2018

Using computer vision to combat stolen credit card fraud

Location: LL21 B

Karthik Ramasamy (Google), Lenny Evans (Uber)

Average rating:

(5.00, 1 rating)

Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities. Read more.

11:00am–11:40am Thursday, March 8, 2018

The secret sauce behind LinkedIn's self-managing Kafka clusters

Location: 230 C

Jiangjie Qin (LinkedIn)

Average rating:

(4.00, 3 ratings)

LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention. Read more.

11:00am–11:40am Thursday, March 8, 2018

Executive Briefing: Machine learning—Why you need it, why it's hard, and what to do about it

Location: 210 A/E

Mike Olson (Cloudera)

Average rating:

(4.75, 4 ratings)

Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production—including the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands—and outlines proven ways to meet them. Read more.

11:00am–11:40am Thursday, March 8, 2018

Understanding metadata

Location: 210 C/G

Secondary topics: Graphs and Time-series

Michael Schrenk (Self-Employed)

Average rating:

(4.00, 5 ratings)

Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers. Read more.

11:00am–11:40am Thursday, March 8, 2018

Fighting sex trafficking with data science

Location: 210 D/H

Ruben van der Dussen (Thorn)

Average rating:

(2.00, 1 rating)

Sugreev Chawla offers an overview of Spotlight, a tool created by Thorn, a nonprofit that uses technology to fight online child sexual exploitation. It allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking, using graph analysis, time series analysis, and NLP techniques to surface important networks of ads and characterize their behavior over time. Read more.

11:00am–11:40am Thursday, March 8, 2018

Kafka streaming applications with Akka Streams and Kafka Streams

Location: Expo Hall 1

Secondary topics: Expo Hall

Dean Wampler (Anyscale)

Average rating:

(5.00, 1 rating)

Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams, microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.

11:00am–11:40am Thursday, March 8, 2018

Ask Me Anything: Big data and machine learning techniques to drive and grow business

Location: 212 A-B

Burcu Baran (LinkedIn), Wei Di (LinkedIn)

Join Burcu Baran and Wei Di to discuss big data in business analytics, machine learning in business analytics, and achieving actionable insights from big data. Read more.

11:00am–11:40am Thursday, March 8, 2018

Graph analysis of 200,000 tweets from Russian Twitter trolls

Location: LL20 B

Secondary topics: Graphs and Time-series

Ryan Boyd (Neo4j)

Average rating:

(5.00, 1 rating)

Ryan Boyd explains how he and his team reconstructed a subset of the Twitter network of Russian troll accounts and applied graph analytics to the data using the Neo4j graph database to uncover how these accounts were spreading fake news. Read more.

11:00am–11:40am Thursday, March 8, 2018

Building the bridge from big data to machine learning and artificial intelligence (sponsored by Google Cloud)

Location: 230 B

Ryan Lippert (Google Cloud)

Average rating:

(5.00, 2 ratings)

If your company isn't good at analytics, it's not ready for AI. Ryan Lippert explains how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value. Read more.

11:00am–11:40am Thursday, March 8, 2018

The changing role of the CDO: Three keys for success (sponsored by MapR)

Location: LL21 A

Jim Scott (NVIDIA)

Average rating:

(2.00, 1 rating)

The value of data is not strictly a function of its size but rather is in the value that can be extracted from it. Jim Scott explains how to identify the right data to leverage to monitor the pulse of fast-changing business environments, the best way to integrate analytics into your business processes, and the importance of cross-application data flows. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Distributed deep learning with containers on heterogeneous GPU clusters

Location: LL21 C/D

Dong Meng (MapR)

Average rating:

(3.33, 3 ratings)

Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Hive as a service

Location: LL21 E/F

Szehon Ho (Criteo), Pawel Szostek (Criteo)

Average rating:

(4.50, 2 ratings)

Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Effectively once, exactly once, and more in Heron

Location: 230 A

Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)

Average rating:

(4.00, 1 rating)

Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and demonstrate how your applications will benefit from using them. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Failed experiments in infrastructure security analytics and lessons learned from fixing them

Location: LL20 A

Secondary topics: Graphs and Time-series

Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))

Average rating:

(4.00, 1 rating)

How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Enough data engineering for a data scientist; or, How I learned to stop worrying and love the data scientists

Location: LL20 C

Stephen O'Sullivan (Data Whisperers)

Average rating:

(4.25, 4 ratings)

Stephen O'Sullivan takes you along the data science journey, from onboarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Approaching the pricing problem at Lyft

Location: LL20 D

Ashivni Shekhawat (Lyft)

Average rating:

(3.00, 3 ratings)

Ashivni Shekhawat explains how Lyft uses a mix of online learning, optimization, and control theory to operate its ride-sharing marketplace at an efficient price point. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Automatic 3D MRI knee damage classification with 3D CNN using BigDL on Spark

Location: LL21 B

Jiao (Jennie) Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)

Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Building a contacts graph from activity data

Location: 230 C

Secondary topics: Graphs and Time-series

Alexis Roos (Salesforce), Noah Burbank (Salesforce)

Average rating:

(3.00, 1 rating)

In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time and surface insights and recommendations using a graph that models contextual data. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it

Location: 210 A/E

David Talby (Pacific AI)

Average rating:

(3.50, 4 ratings)

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Human in the loop: A design pattern for managing teams working with machine learning

Location: 210 C/G

Paco Nathan (derwen.ai)

Average rating:

(4.25, 4 ratings)

Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Architecting an open source enterprise data lake

Location: 210 D/H

Sagar Kewalramani (Cloudera)

Average rating:

(5.00, 2 ratings)

With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components. Read more.

11:50am–12:30pm Thursday, March 8, 2018

The state of Postgres

Location: LL20 B

Umur Cubukcu (Citus Data)

Average rating:

(4.00, 3 ratings)

PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you'll learn how PostgreSQL's extension APIs are fueling innovations in relational databases. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Ask Me Anything: Deep learning-based search and recommendation systems using TensorFlow

Location: 212 A-B

Vijay Agneeswaran (Walmart Labs), Abhishek Kumar (Publicis Sapient)

Join Vijay Srinivas Agneeswaran and Abhishek Kumar to discuss recommender systems—particularly deep learning-based recommender systems in TensorFlow—or ask any other questions you have about deep learning. Read more.

11:50am–12:30pm Thursday, March 8, 2018

20 Netflix-style principles and practices to get the most out of your data platform

Location: LL21 A

Kurt Brown (Netflix)

Average rating:

(5.00, 2 ratings)

Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Machine-learned model quality monitoring in fast data and streaming applications

Location: LL21 C/D

Emre Velipasaoglu (Lightbend)

Average rating:

(4.00, 1 rating)

Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Crafting data products for the augmented writing experience

Location: LL21 E/F

Chris Harland (Textio)

The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

TimescaleDB: Reengineering PostgreSQL as a time series database

Location: 230 A

Secondary topics: Graphs and Time-series

Michael Freedman (TimescaleDB)

Average rating:

(4.50, 4 ratings)

Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Magellan: Scalable and fast geospatial analytics

Location: LL20 A

Ram Sriharsha (Databricks)

Average rating:

(4.75, 4 ratings)

How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Humans versus the machines: Using human-based computation to improve machine learning

Location: LL20 C

Veronica Mapes (Pinterest), Garner Chung (Pinterest)

Average rating:

(5.00, 3 ratings)

Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

The science of patchy data

Location: LL20 D

Jennifer Prendki (Figure Eight)

Average rating:

(3.00, 1 rating)

Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Playing well together: Big data beyond the JVM with Spark and friends

Location: 230 C

Holden Karau (Independent), Rachel Warren (Salesforce Einstein)

Average rating:

(3.40, 5 ratings)

Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka). Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations

Location: 210 A/E

Mark Donsky (Okera), Steven Ross (Cloudera)

In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Workplace culture in the age of algorithmic management: The information networks Uber drivers built

Location: 210 C/G

Alex Rosenblat (Data & Society Research Institute)

Average rating:

(5.00, 1 rating)

Ride-hail drivers work alone, but they’re banding together online to compare notes, uncover new policies, and help each other navigate a workplace characterized by information scarcity. Alex Rosenblat explores how ride-hail workers are using online forums to create their own workplace culture as employment relationships grow more remote and algorithms replace human managers. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Lessons on driving data science and analytics transformation

Location: 210 D/H

Chris Chapo (Gap Inc.)

Average rating:

(4.20, 5 ratings)

Chris Chapo walks you through real-world examples of companies that are driving transformational change by leveraging data science and analytics, paying particular attention to established organizations where these capabilities are newer concepts. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

The real-time journey from raw streaming data to AI-based analytics

Location: Expo Hall 1

Secondary topics: Expo Hall, Graphs and Time-series

Roy Ben Alta (Amazon Web Services), Ira Cohen (Anodot)

Average rating:

(5.00, 1 rating)

Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Data-driven ecosystems in the automotive industry

Location: LL20 B

Josef Viehhauser (BMW Group), Tobias Burger (BMW Group)

Average rating:

(5.00, 1 rating)

The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside of the organization. Josef Viehhauser and Tobias Bürger discuss the E-to-E relationship of data and models and share best practices for scaling applications in real-world environments. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Ask Me Anything: Managing data science in the enterprise

Location: 212 A-B

Nick Elprin (Domino Data Lab)

Join Nick Elprin to discuss the challenges associated with evolving from random acts of data science to data science as a core competency; common pitfalls and best practices for implementing process, hiring people, and deploying diverse technology; designing and running data science organizations; and more. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

Harnessing the cloud to enable connected systems and self-service and accelerate business growth (sponsored by Talend)

Location: LL21 A

Jeff Smits (RingCentral)

Jeff Smits explains how RingCentral is utilizing the cloud, data integration, self-service, and APIs to harvest the immense potential of connected systems. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Continuous delivery for NLP on Kubernetes: Lessons learned

Location: LL21 C/D

Michelle Casbon (Google)

Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Achieving GDPR compliance and data privacy using blockchain technology

Location: LL21 E/F

Ajay Kumar Mothukuri (Sapient), Vijay Agneeswaran (Walmart Labs)

Ajay Mothukuri and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR). Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Unified and elastic batch and stream processing with Pravega and Apache Flink

Location: 230 A

Secondary topics: Graphs and Time-series

Fabian Hueske (data Artisans), Flavio Junqueira (Dell EMC)

Average rating:

(3.33, 3 ratings)

Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack offers an unprecedented way of handling “everything as a stream”: it includes unbounded streaming storage and a unified batch and streaming abstraction, and it dynamically accommodates workload variations in a novel way. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

sparklyr, implyr, and more: dplyr interfaces to large-scale data

Location: LL20 A

Ian Cook (Cloudera)

Average rating:

(4.75, 4 ratings)

The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Human in the loop: Bayesian rules enabling explainable AI

Location: LL20 C

Pramit Choudhary (h2o.ai)

Average rating:

(5.00, 3 ratings)

Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if...and else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Building career advisory tools for the tech sector using machine learning

Location: LL20 D

Simon Hughes (Dice.com), Yuri Bykov (Dice.com)

Average rating:

(4.00, 1 rating)

Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Word embeddings under the hood: How neural networks learn from language

Location: LL21 B

Patrick Harrison (S&P Global)

Average rating:

(4.33, 3 ratings)

Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Data reflections: Making data fast and easy to use without making copies

Location: 230 C

Tomer Shiran (Dremio), Jacques Nadeau (Dremio)

Average rating:

(5.00, 3 ratings)

Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Executive Briefing: The rise of the ecosystem

Location: 210 A/E

Anjali Thakur (Accenture)

Whether you are a technology or a services provider, understanding your value in the ecosystem and focusing on the right partners to reach your market goals is critical. Anjali Thakur shares examples of teaming models and leading practices for accelerating value from your ecosystem strategy. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Trapped by the present: Estimating long-term impact from A/B experiments

Location: 210 C/G

Brian Karfunkel (Pinterest)

Average rating:

(4.50, 2 ratings)

When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Detecting retail fraud with data wrangling and machine learning

Location: 210 D/H

Matt Derda (Trifacta), Harrison Lynch (Consensus Corporation)

Average rating:

(2.00, 1 rating)

Matt Derda and Harrison Lynch explain how Consensus leverages the combined power of data wrangling and machine learning to more efficiently identify and reduce retail fraud and how adopting data wrangling technology has helped Trifacta reduce time spent data wrangling from six weeks to one week. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

On-device deep learning: Trends, technologies, and challenges (sponsored by TalkingData)

Location: 230 B

Andreas Pfadler (TalkingData)

Andreas Pfadler offers an overview of current technological trends for on-device deep learning and edge computing. Along the way, Andreas explores major players and platforms and computational challenges and solutions. Andreas concludes with a discussion of TalkingData's vision for the future of mobile deep learning. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Get a farm-to-table view of your data: Track data lineage from source to analytics (sponsored by Syncsort)

Location: LL21 A

Tendu Yogurtcu (Syncsort)

Average rating:

(1.00, 1 rating)

Chefs must be able to trust the authenticity, quality, and origin of their ingredients; data analysts must be able to do the same of their data—and what happens to it along the way. Tendü Yoğurtçu explains how to seamlessly track the lineage and quality of your data—on and off the cluster, on-premises or in the cloud—to deliver meaningful insights and meet regulatory compliance requirements. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Building ML and AI pipelines with Spark and TensorFlow

Location: Expo Hall 1

Secondary topics: Expo Hall

Chris Fregly (Amazon Web Services)

Average rating:

(5.00, 1 rating)

Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Ask Me Anything: Streaming architectures and applications (Kafka, Spark, Akka, and microservices)

Location: 212 A-B

Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

Average rating:

(5.00, 1 rating)

Join Dean Wampler and Boris Lublinsky to discuss all things streaming, from architecture and implementation to streaming engines and frameworks. Be sure to bring your questions about techniques for serving machine learning models in production, traditional big data systems, or software architecture in general. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

When tests cry wolf (sponsored by Pure Storage)

Location: LL20 B

Ivan Jibaja (Pure Storage)

Pure Storage redefined QA testing. Using open source technologies like Spark and Kafka, the company deployed a streaming big data analytics pipeline that processes over 70 billion events per day to prioritize, classify, deduplicate, and understand test failures. Ivan Jibaja discusses use cases for big data analytics technologies, the underlying elastic infrastructure, and lessons learned. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

HDFS on Kubernetes: Tech deep dive on locality and security

Location: LL21 C/D

Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)

Average rating:

(5.00, 1 rating)

There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Lyft's analytics pipeline: From Redshift to Apache Hive and Presto

Location: LL21 E/F

Shenghu Yang (Lyft)

Average rating:

(5.00, 1 rating)

Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Effectively once in Apache Pulsar, the next-generation messaging system

Location: 230 A

Matteo Merli (Streamlio)

Average rating:

(1.00, 1 rating)

Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Cataloging the visible universe through Bayesian inference at petascale in Julia

Location: LL20 A

Keno Fischer (Julia Computing)

Julia is rapidly becoming a popular language at the forefront of scientific discovery. Keno Fischer explores one of the most ambitious use cases for Julia: using machine learning to derive a catalog of astronomical objects from multiterabyte astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Not your parents' machine learning: How to ship an XGBoost churn prediction app in under four weeks

Location: LL20 D

Goodman Gu (Cogito)

Average rating:

(5.00, 3 ratings)

Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Using deep learning to solve challenging problems

Location: LL21 B

Jeff Dean (Google)

Average rating:

(4.89, 9 ratings)

The Google Brain team conducts research on difficult problems in artificial intelligence and builds large-scale computer systems for machine learning research, both of which have been applied to dozens of Google products. Jeff Dean highlights some of Google Brain's projects with an eye toward how they can be used to solve challenging problems. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Cuttlefish: Lightweight primitives for online tuning

Location: 230 C

Tomer Kaftan (University of Washington)

Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Executive Briefing: What does an exec need to know about architecture and why

Location: 210 A/E

Jesse Anderson (Big Data Institute)

Average rating:

(4.00, 1 rating)

There's been an explosion of new architectures, but is this because engineers love new things or is there a good business reason for these changes? Jesse Anderson explores new architectures and the actual business problems they solve. You may find out that your team would be far more productive if you moved to these architectures. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Big data insights equal big money: Stories from the trenches at GoDaddy

Location: 210 C/G

Felix Gorodishter (GoDaddy)

Average rating:

(3.00, 2 ratings)

GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email. Read more.

4:20pm–5:00pm Thursday, March 8, 2018

Data-driven fuel management at Ryanair

Location: 210 D/H

Marcin Pilarczyk (Ryanair)

Average rating:

(5.00, 2 ratings)

Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management of a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions. Read more.
