Presented By
O’Reilly + Cloudera

Make Data Work

29 April–2 May 2019
London, UK

Schedule

Monday, 29/04/2019

7:30

7:30–9:00 Monday, 29/04/2019

Location: Capital Suite Foyer

Early morning break (1h 30m)

9:00

Hands-on data science with Python

9:00–17:00 Monday, 29/04/2019

Training

Data Science, Machine Learning & AI
Location: Capital Suite 1

Secondary topics: Data preparation, data governance, and data lineage

Robert Schroll (The Data Incubator)

Average rating:

(4.75, 4 ratings)

Robert Schroll walks you through all the steps of developing a machine learning pipeline from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and then extend these models into two applications from real-world datasets. All work will be done in Python. Read more.

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow

9:00–17:00 Monday, 29/04/2019

Training

Data Science, Machine Learning & AI
Location: Capital Suite 7

Secondary topics: Deep Learning

Ian Cook (Cloudera)

Average rating:

(4.33, 3 ratings)

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools. Read more.

AI for managers

9:00–17:00 Monday, 29/04/2019

Training

Strata Business Summit
Location: Capital Suite 16

Secondary topics: AI and machine learning in the enterprise

Nijma Khan (Faculty ai), Alberto Favaro (Faculty)

Average rating:

(1.86, 7 ratings)

Nijma Khan and Alberto Favaro offer a condensed introduction to key AI and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization. Read more.

Large-scale ML with MLflow, deep learning, and Apache Spark

9:00–17:00 Monday, 29/04/2019

Training

Data Science, Machine Learning & AI
Location: Capital Suite 17

Secondary topics: Deep Learning, Model lifecycle management

Amir Issaei (Databricks)

Average rating:

(5.00, 1 rating)

Join Amir Issaei to explore neural network fundamentals and learn how to build distributed Keras/TensorFlow models on top of Spark DataFrames. You'll use Keras, TensorFlow, Deep Learning Pipelines, and Horovod to build and tune models and MLflow to track experiments and manage the machine learning lifecycle. This course is taught entirely in Python. Read more.

Professional Kafka development

9:00–17:00 Monday, 29/04/2019

Training

Data Engineering and Architecture
Location: London Suite 2

Secondary topics: Data Integration and Data Pipelines, Streaming and realtime analytics

Jesse Anderson (Big Data Institute)

Average rating:

(5.00, 1 rating)

Jesse Anderson offers an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL. Read more.

Building a serverless big data application on AWS

9:00–17:00 Monday, 29/04/2019

Training

Data Engineering and Architecture
Location: London Suite 3

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Jorge Lopez (Amazon Web Services), Nikki Rouda (Amazon Web Services), Damon Cortesi (Amazon Web Services), Sven Hansen (Amazon Web Services), Manos Samatas (Amazon Web Services), Alket Memushaj (Amazon Web Services)

Average rating:

(3.50, 2 ratings)

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.

Machine learning from scratch in TensorFlow

9:00–17:00 Monday, 29/04/2019

Training

Data Science, Machine Learning & AI
Location: Capital Suite 9

Secondary topics: Deep Learning

Ana Hocevar (The Data Incubator)

Average rating:

(4.38, 8 ratings)

The TensorFlow library provides for the use of computational graphs, with automatic parallelization across resources. This architecture is ideal for implementing neural networks. Ana Hocevar offers an intro to TensorFlow's capabilities in Python, taking you from building machine learning algorithms piece by piece to using the Keras API provided by TensorFlow with several hands-on applications. Read more.

10:30

10:30–11:00 Monday, 29/04/2019

Location: Capital Suite Foyer

Morning break (30m)

12:30

12:30–13:30 Monday, 29/04/2019

Location: Capital Suite Foyer

Lunch (1h)

15:00

15:00–15:30 Monday, 29/04/2019

Location: Capital Suite Foyer

Afternoon break (30m)

Tuesday, 30/04/2019

7:30

7:30–9:00 Tuesday, 30/04/2019

Location: Capital Suite Foyer

Early morning coffee (1h 30m)

9:00

Architecting a data platform for enterprise use

9:00–12:30 Tuesday, 30/04/2019

Tutorial

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Platforms

Mark Madsen (Teradata), Todd Walter (Archimedata)

Average rating:

(3.71, 7 ratings)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.

Serverless machine learning with TensorFlow: Part I

9:00–12:30 Tuesday, 30/04/2019

Tutorial

Data Science, Machine Learning & AI
Location: Capital Suite 2/3

Secondary topics: AI and Data technologies in the cloud, Deep Learning

Melinda King (ROI Training)

Average rating:

(3.00, 8 ratings)

Melinda King offers an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts, and develop skills in developing, evaluating, and productionizing ML models. Read more.

Using AWS serverless technologies to analyze large datasets

9:00–12:30 Tuesday, 30/04/2019

Tutorial

Data Science, Machine Learning & AI
Location: Capital Suite 4

Secondary topics: AI and Data technologies in the cloud, Data preparation, data governance, and data lineage, Health and Medicine

Krishnan Saidapet (REAN Cloud, A Hitachi Vantara company)

Average rating:

(3.43, 7 ratings)

Krishnan Saidapet offers an overview of the latest big data and machine learning serverless technologies from Amazon Web Services (AWS) and leads a deep dive into using them to process and analyze two different datasets: the publicly available Bureau of Labor Statistics dataset and the Chest X-Ray Image Data dataset. Read more.

Findata Day

9:00–17:00 Tuesday, 30/04/2019

Location: Capital Suite 13

Alistair Croll (Solve For Interesting), Nicolette Bullivant (Santander UK Technology), Charlotte Werger (Van Lanschot Kempen), Daniel First (QuantumBlack), Yiannis Kanellopoulos (Code4Thought), Romi Mahajan (Quantarium), Rashed Iqbal (Investment and Development Office), Martin Leijen (Rabobank / Digital Transformation Office), Tal Doron (GigaSpaces), Chris Taggart (OpenCorporates), Jan Novotny (Deutsche Bank)

From analyzing risk and detecting fraud to predicting payments and improving customer experience, take a deep dive into the ways data technologies are transforming the financial industry. Read more.

Continuous intelligence: Moving machine learning into production reliably

9:00–12:30 Tuesday, 30/04/2019

Tutorial

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Model lifecycle management

Danilo Sato (ThoughtWorks), Christoph Windheuser (ThoughtWorks)

Average rating:

(4.31, 13 ratings)

Danilo Sato and Christoph Windheuser walk you through applying continuous delivery (CD), pioneered by ThoughtWorks, to data science and machine learning. Join in to learn how to make changes to your models while safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with a high frequency. Read more.

Data Case Studies

9:00–17:00 Tuesday, 30/04/2019

Location: Capital Suite 12

Paco Nathan (derwen.ai), Ganes Kesari (Gramener), Alicia Williams (Google), Semih Kumluk (Turkcell), Simon Moritz (Ericsson), Samuel Cristóbal (Innaxis), Volker Schnecke (Novo Nordisk), Julia Butter (Scout24), Cecilia Marchi (Jakala), Caroline Goulard (Dataveyes), Marc Rind (ADP), Juan Bengochea (Royal Caribbean Cruise Lines), Aaronpal Dhanda (EasyJet)

Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.

Foundations for successful data projects

9:00–12:30 Tuesday, 30/04/2019

Tutorial

Data Engineering and Architecture
Location: Capital Suite 8

Secondary topics: Financial Services

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Average rating:

(3.50, 12 ratings)

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects. Read more.

Getting ready for GDPR and CCPA: Securing and governing hybrid, cloud, and on-premises big data deployments

9:00–12:30 Tuesday, 30/04/2019

Tutorial

Data Engineering and Architecture
Location: Capital Suite 10

Secondary topics: Security and Privacy

Mark Donsky (Okera), Ifigeneia Derekli (Cloudera), Lars George (Okera), Michael Ernest (Dataiku)

Average rating:

(4.00, 2 ratings)

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Ifigeneia Derekli, Lars George, and Michael Ernest share hands-on best practices for meeting these challenges, with special attention paid to CCPA. Read more.

Real-time SQL stream processing at scale with Apache Kafka and KSQL

9:00–12:30 Tuesday, 30/04/2019

Tutorial

Data Engineering and Architecture
Location: Capital Suite 11

Secondary topics: Streaming and realtime analytics

Robin Moffatt (Confluent)

Average rating:

(5.00, 5 ratings)

Robin Moffatt walks you through the architectural reasoning for Apache Kafka and the benefits of real-time integration. You'll then build a streaming data pipeline using nothing but your bare hands, Kafka Connect, and KSQL. Read more.

Cross-cloud model training and serving with Kubeflow

9:00–12:30 Tuesday, 30/04/2019

Tutorial

Data Science, Machine Learning & AI
Location: Capital Suite 15

Secondary topics: AI and Data technologies in the cloud, Model lifecycle management

Holden Karau (Independent), Trevor Grant (IBM), Francesca Lazzeri (Microsoft)

Average rating:

(4.43, 7 ratings)

Holden Karau, Francesca Lazzeri, and Trevor Grant offer an overview of Kubeflow and walk you through using it to train and serve models across different cloud environments (and on-premises). You'll use a script to do the initial setup work, so you can jump (almost) straight into training a model on one cloud and then look at how to set up serving in another cluster/cloud. Read more.

10:30

10:30–11:00 Tuesday, 30/04/2019

Location: Capital Suite Foyer

Morning break (30m)

12:30

12:30–13:30 Tuesday, 30/04/2019

Location: Hall N11

Lunch (1h)

13:30

Architecture and algorithms for end-to-end streaming data processing

13:30–17:00 Tuesday, 30/04/2019

Tutorial

Data Engineering and Architecture, Streaming and IoT
Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Streaming and realtime analytics, Temporal data and time-series

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)

Average rating:

(3.00, 10 ratings)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage—for real-time data and algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams. Read more.

Time series forecasting with Azure Machine Learning

13:30–17:00 Tuesday, 30/04/2019

Tutorial

Data Science, Machine Learning & AI
Location: Capital Suite 2/3

Secondary topics: AI and Data technologies in the cloud, Deep Learning, Financial Services, Temporal data and time-series

Francesca Lazzeri (Microsoft), Aashish Bhateja (Microsoft)

Average rating:

(4.25, 4 ratings)

Time series modeling and forecasting is fundamentally important to various practical domains; in the past few decades, machine learning model-based forecasting has become very popular in both private and public decision-making processes. Francesca Lazzeri walks you through using Azure Machine Learning to build and deploy your time series forecasting models. Read more.

Running multidisciplinary big data workloads in the cloud

13:30–17:00 Tuesday, 30/04/2019

Tutorial

Data Engineering and Architecture
Location: Capital Suite 4

Secondary topics: AI and Data technologies in the cloud

Colm Moynihan (Cloudera), Jonathan Seidman (Cloudera), Michael Kohs (Cloudera)

Average rating:

(4.00, 2 ratings)

Moving to the cloud poses a number of challenges. Join Colm Moynihan, Jonathan Seidman, and Michael Kohs to explore cloud architecture and challenges and learn how to use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX. Read more.

Natural language understanding at scale with Spark NLP

13:30–17:00 Tuesday, 30/04/2019

Tutorial

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Deep Learning, Text and Language processing and analysis

Alexander Thomas (John Snow Labs), Claudiu Branzan (Accenture)

Average rating:

(4.00, 4 ratings)

Alex Thomas and Claudiu Branzan lead a hands-on introduction to scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working code base that you can change and improve. Read more.

Your data strategy: It should be concise, actionable, and understandable by business and IT

13:30–17:00 Tuesday, 30/04/2019

Tutorial

Strata Business Summit
Location: Capital Suite 8

Secondary topics: AI and machine learning in the enterprise

Peter Aiken (Data BluePrint | DAMA International | Virginia Commonwealth University)

Average rating:

(3.43, 14 ratings)

Peter Aiken offers a more operational perspective on the use of data strategy, which is especially useful for organizations just getting started with data. Read more.

Hands-on machine learning with Kafka-based streaming pipelines

13:30–17:00 Tuesday, 30/04/2019

Tutorial

Streaming and IoT
Location: Capital Suite 10

Secondary topics: Model lifecycle management, Streaming and realtime analytics

Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)

Average rating:

(4.20, 5 ratings)

Boris Lublinsky and Dean Wampler walk you through using ML in streaming data pipelines and doing periodic model retraining and low-latency scoring in live streams. You'll explore using Kafka as a data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques. Read more.

Serverless machine learning with TensorFlow: Part II

13:30–17:00 Tuesday, 30/04/2019

Tutorial

Data Science, Machine Learning & AI
Location: Capital Suite 11

Secondary topics: AI and Data technologies in the cloud, Deep Learning

Melinda King (ROI Training)

Average rating:

(3.12, 8 ratings)

Melinda King offers an introduction to designing and building machine learning models on Google Cloud Platform. Through a combination of presentations, demos, and hands-on labs, you’ll learn machine learning (ML) and TensorFlow concepts and develop skills in developing, evaluating, and productionizing ML models. Read more.

Learning Presto: SQL on anything

13:30–17:00 Tuesday, 30/04/2019

Tutorial

Data Engineering and Architecture
Location: Capital Suite 15

Secondary topics: AI and Data technologies in the cloud

Matt Fuller (Starburst)

Average rating:

(5.00, 2 ratings)

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today. Read more.

15:00

15:00–15:30 Tuesday, 30/04/2019

Location: Capital Suite Foyer

Afternoon break (30m)

17:00

Opening Reception

17:00–18:00 Tuesday, 30/04/2019

Event

Location: Expo Hall

Average rating:

(4.00, 2 ratings)

Join us after tutorials on Tuesday in the Expo Hall. Grab a drink and mingle with fellow Strata attendees while you check out all of the exhibitors. Read more.

Wednesday, 1/05/2019

8:00

8:00–9:00 Wednesday, 1/05/2019

Location: Level 0 - Blvd

Early morning coffee sponsored by Immuta (1h)

10:45

10:45–11:15 Wednesday, 1/05/2019

Location: Expo Hall

Morning break (30m)

11:15

Predicting real-time transaction fraud using supervised learning

11:15–11:55 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 17

Secondary topics: Deep Learning, Financial Services, Temporal data and time-series

Sami Niemi (Barclays)

Average rating:

(4.62, 16 ratings)

Predicting transaction fraud of debit and credit card payments in real time is an important challenge, which state-of-the-art supervised machine learning models can help to solve. Sami Niemi offers an overview of the solutions Barclays has been developing and testing and details how well models perform in a variety of situations, such as card-present and card-not-present debit and credit card transactions. Read more.

Recommending and searching at Spotify

11:15–11:55 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall (Capital Hall N24)

Secondary topics: Media, Marketing, Advertising, Retail and e-commerce

Mounia Lalmas (Spotify)

Average rating:

(4.16, 19 ratings)

Spotify's mission is "to match fans and artists in a personal and relevant way." Mounia Lalmas shares some of the (research) work the company is doing to achieve this, from using machine learning to metric validation, illustrated through examples within the context of home and search. Read more.

The Presto Cost-Based Optimizer for interactive SQL on anything

11:15–11:55 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Wojciech Biela (Starburst), Piotr Findeisen (Starburst)

Average rating:

(3.12, 8 ratings)

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3, Azure ADLS, RDBMS, NoSQL, etc.). Wojciech Biela and Piotr Findeisen offer an overview of the Cost-Based Optimizer (CBO) for Presto, which brings a great performance boost. Join in to learn about CBO internals, the motivating use cases, and observed improvements. Read more.

Protecting sensitive data in huge datasets: Cloud tools you can use

11:15–11:55 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: AI and Data technologies in the cloud, Open Data, Data Generation and Data Networks, Security and Privacy

Felipe Hoffa (Google)

Average rating:

(3.50, 4 ratings)

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa explores how to handle massive public datasets, taking you from theory to real life as he showcases newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm. Read more.

Model governance and model ops in the enterprise

11:15–11:55 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 2/3

Secondary topics: Model lifecycle management

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Average rating:

(5.00, 1 rating)

Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them. Read more.

India's data dilemma with India Stack

11:15–11:55 Wednesday, 1/05/2019

Session

Law and Ethics
Location: Capital Suite 4

Secondary topics: Ethics, Security and Privacy

Sundeep Reddy Mallu (Gramener)

Average rating:

(5.00, 4 ratings)

Answering the simple question of what rights Indian citizens have over their data is a nightmare. The rollout of India Stack technology-based solutions has added fuel to the fire. Sundeep Reddy Mallu explains, with on-the-ground examples, how businesses and citizens in India's booming digital economy are navigating the India Stack ecosystem while dealing with data privacy, security, and ethics. Read more.

Building a sales AI platform: Key principles and lessons learned

11:15–11:55 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: AI and machine learning in the enterprise, Data Platforms, Deep Learning, Text and Language processing and analysis

Moty Fania (Intel)

Average rating:

(3.83, 6 ratings)

Moty Fania shares his experience implementing a sales AI platform that handles processing of millions of website pages and sifts through millions of tweets per day. The platform is based on unique open source technologies and was designed for real-time data extraction and actuation. Read more.

Serverless for data and AI

11:15–11:55 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 10/11

Secondary topics: AI and Data technologies in the cloud

Avner Braverman (Binaris)

Average rating:

(2.71, 7 ratings)

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code. Read more.

Executive Briefing: From the edge to AI—Taking control of your data for fun and profit

11:15–11:55 Wednesday, 1/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: AI and Data technologies in the cloud, AI and machine learning in the enterprise, IoT and its applications

Mick Hollison (Cloudera)

Average rating:

(3.33, 3 ratings)

Managing your data securely is difficult, as is choosing the right machine learning tools and managing models and applications in compliance with regulation and law. Mick Hollison covers the risks and the issues that matter most and explains how to address them with an enterprise data cloud and by embracing your data center and the public cloud in combination. Read more.

Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale

11:15–11:55 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Deep Learning, Media, Marketing, Advertising, Text and Language processing and analysis

Alexander Thomas (John Snow Labs), Alexis Yelton (Indeed)

Average rating:

(4.67, 3 ratings)

Alexander Thomas and Alexis Yelton demonstrate how to use Spark NLP and Apache Spark to standardize semistructured text, illustrated by Indeed's standardization process for résumé content. Read more.

Building a secure and transparent ML pipeline using open source technologies

11:15–11:55 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Secondary topics: Ethics, Security and Privacy

Nick Pentreath (IBM)

Average rating:

(4.75, 4 ratings)

The application of AI algorithms in domains such as criminal justice, credit scoring, and hiring holds unlimited promise. At the same time, it raises legitimate concerns about algorithmic fairness. There's a growing demand for fairness, accountability, and transparency from machine learning (ML) systems. Nick Pentreath explains how to build just such a pipeline leveraging open source tools. Read more.

Implementing enterprise data management in industrial and scientific organizations

11:15–11:55 Wednesday, 1/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 12

Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)

Average rating:

(4.00, 9 ratings)

To succeed in implementing enterprise data management in industrial and scientific organizations and realize business value, the worlds of business data, facilities data, and scientific data—which have long been managed separately—must be brought together. Sun Maria Lehmann and Jane McConnell explore the cultural and organizational differences and the data management requirements to succeed. Read more.

Stream, stream, stream: Different streaming methods with Spark and Kafka

11:15–11:55 Wednesday, 1/05/2019

Session

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Media, Marketing, Advertising, Streaming and realtime analytics

Itai Yaffe (Nielsen)

Average rating:

(4.45, 11 ratings)

NMC (Nielsen Marketing Cloud) provides customers (both marketers and publishers) with real-time analytics tools to profile their target audiences. To achieve that, the company needs to ingest billions of events per day into its big data stores in a scalable, cost-efficient way. Itai Yaffe explains how NMC continuously transforms its data infrastructure to support these goals. Read more.

12:05

Sequence-to-sequence modeling for time series

12:05–12:45 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 17

Secondary topics: Deep Learning, Temporal data and time-series

Arun Kejariwal (Independent), Ira Cohen (Anodot)

Average rating:

(4.00, 5 ratings)

Sequence-to-sequence modeling (seq2seq) is now being used for applications based on time series data. Arun Kejariwal and Ira Cohen offer an overview of seq2seq and explore its early use cases. They then walk you through leveraging seq2seq modeling for these use cases, particularly with regard to real-time anomaly detection and forecasting. Read more.

Agile NLP workflows with spaCy and Prodigy

12:05–12:45 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall (Capital Hall N24)

Secondary topics: AI and machine learning in the enterprise, Text and Language processing and analysis

Matthew Honnibal (Explosion AI)

Average rating:

(4.00, 4 ratings)

Matthew Honnibal shares "one weird trick" that can give your NLP project a better chance of success: avoid a waterfall methodology where data definition, corpus construction, modeling, and deployment are performed as separate phases of work. Read more.

Running SQL-based workloads in the cloud at 20x–200x lower cost using Apache Arrow

12:05–12:45 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Jacques Nadeau (Dremio)

Average rating:

(4.75, 4 ratings)

Performance and cost are two important considerations in determining optimized solutions for SQL workloads in the cloud. Jacques Nadeau explains how to accelerate TPC workloads, invisible to client apps, and how to use Apache Arrow, Parquet, and Calcite to provide a scalable, high-performance solution optimized for cloud deployments while significantly reducing operational costs. Read more.

Leveraging metadata for automating delivery and operations of advanced data platforms

12:05–12:45 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: Automation in data science and big data, Data preparation, data governance, and data lineage

Peter Billen (Accenture)

Average rating:

(4.50, 6 ratings)

Peter Billen explains how to use metadata to automate delivery and operations of a data platform. By injecting automation into the delivery processes, you shorten the time to market while improving the quality of the initial user experience. Typical examples include data profiling and prototyping, test automation, continuous delivery and deployment, and automated code creation. Read more.

How retailers can leverage data to stay competitive in an ever-changing digital landscape (sponsored by Data Reply)

12:05–12:45 Wednesday, 1/05/2019

Session

Sponsored
Location: Capital Suite 2/3

Luca Piccolo (Data Reply), Michele Miraglia (Data Reply)

Average rating:

(3.33, 3 ratings)

Retailers are facing a daunting challenge: remaining competitive in an ever-changing landscape that is becoming increasingly digital—which requires them to overcome rifts in internal systems and seamlessly leverage their data to generate business value. Luca Piccolo and Michele Miraglia outline Data Reply's approach, distilled while supporting retailers in successfully tackling these challenges. Read more.

Is it possible to regulate machine learning? Dream versus R&D (sponsored by AXA)

12:05–12:45 Wednesday, 1/05/2019

Session

Sponsored
Location: Capital Suite 4

Marcin Detyniecki (AXA)

Average rating:

(4.60, 5 ratings)

Marcin Detyniecki offers an overview of the machine learning backend and its possible applications for the insurance business and other businesses based on the power of research merged with business. Read more.

The changing face of ETL: Event-driven architectures for data engineers

12:05–12:45 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: Data Integration and Data Pipelines

Robin Moffatt (Confluent)

Average rating:

(4.21, 14 ratings)

Robin Moffatt discusses the concepts of events, their relevance to software and data engineers, and their ability to unify architectures in a powerful way. Join in to learn why analytics, data integration, and ETL fit naturally into a streaming world. Along the way, Robin will lead a hands-on demonstration of these concepts in practice and offer commentary on the design choices made. Read more.

Responsible AI innovation

12:05–12:45 Wednesday, 1/05/2019

Session

Law and Ethics, Strata Business Summit
Location: Capital Suite 10/11

Secondary topics: AI and machine learning in the enterprise, Ethics

Laila Paszti (GTC Law Group PC & Affiliates)

Average rating:

(3.50, 2 ratings)

As companies commercialize novel applications of AI in areas such as finance, hiring, and public policy, there's concern that these automated decision-making systems may unconsciously duplicate social biases, with unintended societal consequences. Laila Paszti shares practical advice for companies to counteract such prejudices through a legal- and ethics-based approach to innovation. Read more.

Executive Briefing: Why managing machines is harder than you think

12:05–12:45 Wednesday, 1/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: AI and machine learning in the enterprise, Open Data, Data Generation and Data Networks

Pete Skomoroch (Workday)

Average rating:

(4.70, 10 ratings)

In the next decade, companies that understand how to apply machine intelligence will scale and win their markets. Others will fail to ship successful AI products that matter to customers. Pete Skomoroch details how to combine product design, machine learning, and executive strategy to create a business where every product interaction benefits from your investment in machine intelligence. Read more.

Dealing with data scarcity in natural language processing

12:05–12:45 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Text and Language processing and analysis

Yves Peirsman (NLP Town)

Average rating:

(4.57, 7 ratings)

In this age of big data, NLP professionals are all too often faced with a lack of data: written language is abundant, but labeled text is much harder to come by. Yves Peirsman outlines the most effective ways of addressing this challenge, from the semiautomatic construction of labeled training data to transfer learning approaches that reduce the need for labeled training examples. Read more.

Visually communicating statistical and machine learning methods

12:05–12:45 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI, Visualization and UX
Location: Capital Suite 15/16

Secondary topics: Visualization, Design, and UX

Michael Freeman (University of Washington)

Average rating:

(4.18, 11 ratings)

Statistical and machine learning techniques are only useful when they're understood by decision makers. While implementing these techniques is easier than ever, communicating about their assumptions and mechanics is not. Michael Freeman details a design process for crafting visual explanations of analytical techniques and communicating them to stakeholders. Read more.

Insights from engineering Europe's largest marketing platform for fashion

12:05–12:45 Wednesday, 1/05/2019

Session

Case studies, Strata Business Summit
Location: Capital Suite 12

Secondary topics: Data Platforms, Retail and e-commerce

Dirk Petzoldt (Zalando SE)

Average rating:

(4.18, 11 ratings)

Dirk Petzoldt shares a case study from Europe’s leading online fashion platform Zalando illustrating its journey to a scalable, personalized machine learning–based marketing platform. Read more.

Report card on streaming microservices

12:05–12:45 Wednesday, 1/05/2019

Session

Data Engineering and Architecture, Expo Hall, Streaming and IoT
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: Streaming and realtime analytics

Ted Dunning (MapR, now part of HPE)

Average rating:

(4.67, 6 ratings)

As a community, we have been pushing streaming architectures, particularly microservices, for several years now. But what are the results in the field? Ted Dunning shares several (anonymized) case histories, describing the good, the bad, and the ugly. In particular, Ted covers how several teams who were new to big data fared by skipping MapReduce and jumping straight into streaming. Read more.

12:45

Wednesday Topic Tables at Lunch

12:45–14:05 Wednesday, 1/05/2019

Event

Location: Expo Hall

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.

14:05

The unreasonable effectiveness of transfer learning on NLP

14:05–14:45 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 17

Secondary topics: Deep Learning, Text and Language processing and analysis

David Low (Pand.ai)

Average rating:

(3.57, 7 ratings)

Transfer learning has proven a tremendous success in computer vision, a result of the ImageNet competition. In the past few months, there have been several breakthroughs in natural language processing with transfer learning, namely ELMo, OpenAI Transformer, and ULMFiT. David Low demonstrates how to use transfer learning on an NLP application with SOTA accuracy. Read more.

Fair, privacy-preserving, and secure ML

14:05–14:45 Wednesday, 1/05/2019

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall (Capital Hall N24)

Secondary topics: Security and Privacy

Mikio Braun (Zalando)

Average rating:

(5.00, 3 ratings)

Mikio Braun explores techniques and concepts around fairness, privacy, and security when it comes to machine learning models. Read more.

Picking Parquet: Improved performance for selective queries in Impala, Hive, and Spark

14:05–14:45 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Anna Szonyi (Cloudera), Zoltán Borók-Nagy (Cloudera)

Average rating:

(4.20, 10 ratings)

The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. Anna Szonyi and Zoltán Borók-Nagy share the technical details of the design and its implementation along with practical tips to help data architects leverage these new capabilities in their schema design and performance results for common workloads. Read more.

Disrupting data discovery

14:05–14:45 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Mark Grover (Lyft)

Average rating:

(4.64, 11 ratings)

Mark Grover discusses how Lyft has reduced the time it takes to discover data by 10 times by building its own data portal, Amundsen. Mark gives a demo of Amundsen, leads a deep dive into its architecture, and discusses how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. Mark closes with a future roadmap, unsolved problems, and collaboration model. Read more.

Intelligent design patterns for cloud-based analytics and BI (sponsored by Arcadia Data)

14:05–14:45 Wednesday, 1/05/2019

Session

Sponsored
Location: Capital Suite 2/3

Shant Hovsepian (Arcadia Data)

Average rating:

(4.00, 4 ratings)

With cloud object storage, you may expect business intelligence (BI) applications to benefit from the scale of data and real-time analytics, but traditional BI in the cloud faces not-so-obvious challenges. Shant Hovsepian discusses considerations for service-oriented cloud design and shows how native cloud BI provides analytic depth, low cost, and high performance. Read more.

How a LiveData strategy breaks down barriers to overcome data gravity (sponsored by WANdisco)

14:05–14:45 Wednesday, 1/05/2019

Session

Sponsored
Location: Capital Suite 4

Joel Horwitz (WANdisco)

Joel Horwitz shares best practices WANdisco clients have taken to evolve their data architecture to become a LiveData company. Read more.

Building the data infrastructure for the internet of things at zettabyte scale

14:05–14:45 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: Data Platforms, IoT and its applications, Retail and e-commerce, Temporal data and time-series

Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)

Average rating:

(3.33, 3 ratings)

Jian Chang and Sanjian Chen share the architecture design and many detailed technology innovations of Alibaba TSDB, a state-of-the-art database for IoT data management, and discuss lessons learned from years of development and continuous improvement. Read more.

Using data for evil V: The AI strikes back

14:05–14:45 Wednesday, 1/05/2019

Session

Law and Ethics, Strata Business Summit
Location: Capital Suite 10/11

Secondary topics: AI and machine learning in the enterprise, Ethics

Duncan Ross (Times Higher Education), Francine Bennett (Mastodon C)

Average rating:

(4.83, 12 ratings)

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis (and possibly AI). Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse. Read more.

Executive Briefing: Big data in the era of heavy worldwide privacy regulations

14:05–14:45 Wednesday, 1/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: Data preparation, data governance, and data lineage, Security and Privacy

Mark Donsky (Okera), Nikki Rouda (Amazon Web Services)

Average rating:

(4.67, 3 ratings)

New privacy regulations, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), have implications for data management and analytics that can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance. Read more.

The evolution of data science skill sets: An analysis using exponential family embeddings

14:05–14:45 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Media, Marketing, Advertising, Text and Language processing and analysis

Maryam Jahanshahi (TapRecruit)

Average rating:

(4.00, 3 ratings)

Maryam Jahanshahi explores exponential family embeddings: methods that extend the idea behind word embeddings to other data types. You'll learn how TapRecruit used dynamic embeddings to understand how data science skill sets have transformed over the last three years, using its large corpus of job descriptions, and more generally, how these models can enrich analysis of specialized datasets. Read more.

Using machine learning for stock picking

14:05–14:45 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Secondary topics: Financial Services, Temporal data and time-series

Alun Biffin (Van Lanschot Kempen), David Dogon (Van Lanschot Kempen)

Average rating:

(4.45, 11 ratings)

Alun Biffin and David Dogon explain how machine learning revolutionized the stock-picking process for portfolio managers at Kempen Capital Management by filtering the vast small-cap investment universe down to a handful of optimal stocks. Read more.

An Innovation Architecture industrializes AI from PoCs to production

14:05–14:45 Wednesday, 1/05/2019

Session

Culture and organization
Location: Capital Suite 12

Teresa Tung (Accenture), Jean-Luc Chatelain (Accenture)

Average rating:

(2.33, 6 ratings)

Innovation is abundant as companies reimagine themselves as data-driven and AI-powered businesses. How do enterprises organize to move beyond numerous, often similar proofs of concept (PoCs) into production-quality products and services? Teresa Tung and Jean-Luc Chatelain explore Accenture’s Innovation Architecture, which manages PoCs and pilots through embedding into scalable, saleable solutions. Read more.

Nielsen presents: Fun with Kafka, Spark, and offset management

14:05–14:45 Wednesday, 1/05/2019

Session

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, Media, Marketing, Advertising, Streaming and realtime analytics

Simona Meriam (Nielsen)

Average rating:

(4.57, 7 ratings)

Simona Meriam explains how Nielsen Marketing Cloud (NMC) used to manage its Kafka consumer offsets against Spark-Kafka 0.8 consumer and why the company decided to upgrade from Spark-Kafka 0.8 to 0.10 consumer. Simona reviews the problems encountered during the upgrade and details the process that led to the solution. Read more.

14:55

Synthetic video generation: Why seeing should not always be believing

14:55–15:35 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 17

Secondary topics: Deep Learning, Media, Marketing, Advertising, Security and Privacy

Alexander Adam (Faculty)

Average rating:

(4.00, 1 rating)

The advent of "fake news" has led us to doubt the truth of online media, and advances in machine learning give us an even greater reason to question what we are seeing. Despite the many beneficial applications of this technology, it's also potentially very dangerous. Alex Adam explains how synthetic videos are created and how they can be detected. Read more.

TensorFlow for everyone

14:55–15:35 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall (Capital Hall N24)

Secondary topics: Deep Learning

Wolff Dobson (Google, Inc.)

Average rating:

(3.83, 6 ratings)

Wolff Dobson covers the latest in TensorFlow. Whether you're a beginner or are migrating from 1.x to 2.0, you'll learn the best ways to set up your model, feed your data to it, and distribute it for fast training. You'll also discover how TensorFlow has been recently upgraded to be more intuitive. Read more.

Improving Spark downscaling; Or, Not throwing away all of our work

14:55–15:35 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Holden Karau (Independent), Mikayla Konst (Google), Ben Sidhom (Google)

Average rating:

(3.75, 4 ratings)

As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. Holden Karau, Mikayla Konst, and Ben Sidhom explore approaches for improving the scale-down experience on open source cluster managers, covering everything from how to schedule jobs to the location of blocks and their impact. Read more.

Model serving via Pulsar functions

14:55–15:35 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: AI and Data technologies in the cloud, Model lifecycle management

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(3.00, 1 rating)

Arun Kejariwal and Karthik Ramasamy walk you through an architecture in which models are served in real time and updated, using Apache Pulsar, without restarting the application at hand. They then describe how to apply Pulsar functions to support two example use cases, sampling and filtering, and explore a concrete case study of the same. Read more.

Augment your recommender system with transfer learning on images (sponsored by Dataiku)

14:55–15:35 Wednesday, 1/05/2019

Session

Sponsored
Location: Capital Suite 2/3

Larry Orimoloye (Dataiku)

Average rating:

(1.00, 1 rating)

Recommender systems are tools that provide suggestions that best suit the customers' needs, even when they're not aware of it. Larry Orimoloye explains how Dataiku helped one of the world's leading vacation retailers drive customers toward better recommendations. Read more.

Build your own data lake with AWS Glue and Amazon Athena (sponsored by Amazon Web Services)

14:55–15:35 Wednesday, 1/05/2019

Session

Sponsored
Location: Capital Suite 4

Damon Cortesi (Amazon Web Services)

Average rating:

(4.71, 7 ratings)

Damon Cortesi demonstrates how to use AWS Glue and Amazon Athena to implement an end-to-end pipeline. Read more.

The Lyft data platform: Now and in the future

14:55–15:35 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: Data Integration and Data Pipelines, Data Platforms, Data preparation, data governance, and data lineage, Model lifecycle management, Security and Privacy, Transportation and Logistics

Mark Grover (Lyft), Deepak Tiwari (Lyft)

Average rating:

(4.69, 13 ratings)

Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future. Read more.

Integrated Business Intelligence Suite: How Uber built a platform to convert raw data into knowledge

14:55–15:35 Wednesday, 1/05/2019

Session

Law and Ethics, Strata Business Summit
Location: Capital Suite 10/11

Secondary topics: Data Platforms, Transportation and Logistics, Visualization, Design, and UX

Shailesh Chauhan (Uber)

Average rating:

(4.11, 9 ratings)

Shailesh Chauhan explains how Uber built its business intelligence platform, detailing why the company took a platform approach rather than adding features in a piecemeal fashion. Read more.

Executive Briefing: Overview of data governance

14:55–15:35 Wednesday, 1/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: AI and machine learning in the enterprise, Data preparation, data governance, and data lineage

Paco Nathan (derwen.ai)

Average rating:

(4.14, 7 ratings)

Effective data governance is foundational for AI adoption in enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, process, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa. Read more.

Solving data cleaning and unification using human-guided machine learning

14:55–15:35 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Data preparation, data governance, and data lineage

Ihab Ilyas (University of Waterloo)

Average rating:

(4.71, 7 ratings)

Last year, Ihab Ilyas covered two primary challenges in applying machine learning to data curation: entity consolidation and using probabilistic inference to suggest data repair for identified errors and anomalies. This year, he explores these limitations in greater detail and explains why data unification projects quickly require human-guided machine learning and a probabilistic model. Read more.

Explainable machine learning in fintech

14:55–15:35 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Secondary topics: Ethics, Financial Services, Health and Medicine

Eitan Anzenberg (Bill.com)

Average rating:

(4.50, 4 ratings)

Machine learning applications balance interpretability and performance. Linear models provide formulas to directly compare the influence of the input variables, while nonlinear algorithms produce more accurate models. Eitan Anzenberg explores a solution that utilizes what-if scenarios to calculate the marginal influence of features per prediction and compare with standardized methods such as LIME. Read more.

Signal processing, machine learning, and video tell the truth

14:55–15:35 Wednesday, 1/05/2019

Session

Case studies, Strata Business Summit
Location: Capital Suite 12

David Maman (Binah)

Average rating:

(3.67, 3 ratings)

David Maman demonstrates how the combination of a mere few minutes of video, signal processing, remote heart-rate monitoring, machine learning, and data science can identify a person’s emotions, health condition, and performance. Financial institutions and potential employers can now analyze whether you have good or bad intentions. Read more.

Processing 10M samples a second to drive smart maintenance in complex IIoT systems

14:55–15:35 Wednesday, 1/05/2019

Session

Data Engineering and Architecture, Expo Hall, Streaming and IoT
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, IoT and its applications, Streaming and realtime analytics

Geir Engdahl (Cognite), Daniel Bergqvist (Google)

Average rating:

(4.00, 2 ratings)

Geir Engdahl and Daniel Bergqvist explain how Cognite is developing IIoT smart maintenance systems that can process 10M samples a second from thousands of sensors. You'll explore an architecture designed for high performance, robust streaming sensor data ingest, and cost-effective storage of large volumes of time series data as well as best practices learned along the way. Read more.

15:35

15:35–16:35 Wednesday, 1/05/2019

Location: Expo Hall

Afternoon break (1h)

16:35

LSTM-based time series anomaly detection using Analytics Zoo for Spark and BigDL

16:35–17:15 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 17

Secondary topics: Deep Learning, Temporal data and time-series

Guoqiong Song (Intel)

Average rating:

(3.40, 5 ratings)

Collecting and processing massive time series data (e.g., logs and sensor readings) and detecting anomalies in real time are critical for many emerging smart systems, such as industrial, manufacturing, AIOps, and the IoT. Guoqiong Song explains how to detect anomalies in time series data using Analytics Zoo and BigDL at scale on a standard Spark cluster. Read more.

Opening the black box: Explainable AI (XAI)

16:35–17:15 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall (Capital Hall N24)

Secondary topics: Ethics, Security and Privacy

Maren Eckhoff (QuantumBlack)

Average rating:

(4.50, 4 ratings)

The success of machine learning algorithms in a wide range of domains has led to a desire to leverage their power in ever more areas. Maren Eckhoff discusses modern explainability techniques that increase the transparency of black box algorithms, drive adoption, and help manage ethical, legal, and business risks. Many of these methods can be applied to any model without limiting performance. Read more.

Scalability-aware autoscaling of a Spark application

16:35–17:15 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Anirudha Beria (Qubole), Rohit Karlupia (Qubole)

Average rating:

(3.67, 3 ratings)

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. Anirudha Beria and Rohit Karlupia explain how to measure the efficiency of autoscaling policies and discuss more efficient autoscaling policies, in terms of latency and costs. Read more.

Continuous intelligence: Keeping your AI application in production

16:35–17:15 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: Model lifecycle management

Arif Wider (ThoughtWorks), Emily Gorcenski (ThoughtWorks)

Average rating:

(3.90, 10 ratings)

Machine learning can be challenging to deploy and maintain. Any delays in moving models from research to production mean leaving your data scientists' best work on the table. Arif Wider and Emily Gorcenski explore continuous delivery (CD) for AI/ML along with case studies for applying CD principles to data science workflows. Read more.

Engineering ML to improve the shopping experience (sponsored by Zara Tech)

16:35–17:15 Wednesday, 1/05/2019

Session

Sponsored
Location: Capital Suite 2/3

Julio López (Inditex)

Julio López explains how Zara Tech uses indirect observation and ML engineering to augment the understanding of Zara's processes in order to improve them. Join in to learn how Zara Tech built and uses a Spark ML pipeline to provide KPIs to improve the shopping experience. Read more.

Data catalogs are changing the nature of working with data (sponsored by Alation)

16:35–17:15 Wednesday, 1/05/2019

Session

Sponsored
Location: Capital Suite 4

Debora Seys

Average rating:

(2.67, 6 ratings)

Deb Seys shares the results of a study that she oversaw at eBay in collaboration with the Kellogg School of Management at Northwestern University. Examining the work of 2,000 analysts and almost 80,000 queries, the study revealed that a data catalog can be used as a learning platform that increases analyst productivity and creates a more collaborative approach to discovery and innovation. Read more.

How do you evolve your data infrastructure?

16:35–17:15 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: Data Platforms, Data preparation, data governance, and data lineage, Retail and e-commerce

Neelesh Salian (Stitch Fix)

Average rating:

(4.25, 4 ratings)

Developing data infrastructure is not trivial; neither is changing it. It takes effort and discipline to make changes that can affect your team. Neelesh Salian discusses how Stitch Fix's data platform team maintains and innovates its infrastructure for the company's data scientists. Read more.

Empathy: The secret ingredient in the design of engaging data products and analytics tools

16:35–17:15 Wednesday, 1/05/2019

Session

Strata Business Summit, Visualization and UX
Location: Capital Suite 10/11

Secondary topics: Visualization, Design, and UX

Brian O'Neill (Designing for Analytics)

Average rating:

(3.60, 5 ratings)

Brian O'Neill explains how design is fundamentally improving the bottom line of business and can help data teams uncover the real problems and needs of customers and business stakeholders. Join in to learn and practice a key aspect of good design: how to properly interview stakeholders and users. Read more.

Executive Briefing: 5 things every executive should NOT know

16:35–17:15 Wednesday, 1/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Ellen Friedman (Independent)

Average rating:

(4.20, 5 ratings)

A surprising fact of modern technology is that not knowing some things can make you better at what you do. This isn’t just lack of distraction or being too delicate to face reality. It’s about separation of concerns, with a techno flavor. Ellen Friedman outlines five things that best practices with emerging technologies and new architectures allow us to not know, and why that’s important. Read more.

From BI to big data; Or, There and back again

16:35–17:15 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 14

Francesco Mucio (Francescomuc.io)

Average rating:

(4.43, 7 ratings)

Francesco Mucio shares the basic tools he and his team had to learn (or relearn) moving from the coziness of their database to the big world of Spark, cloud, distributed systems, and continuous applications. It was an unexpected journey that ended exactly where it started: with an SQL query. Read more.

A Magic 8 Ball for optimal cost and resource allocation for the big data stack

16:35–17:15 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Secondary topics: Automation in data science and big data, Temporal data and time-series

Shivnath Babu (Unravel Data Systems | Duke University), Alkis Simitsis (Micro Focus)

Average rating:

(5.00, 1 rating)

Cost and resource provisioning are critical components of the big data stack. Shivnath Babu and Alkis Simitsis detail how to build a Magic 8 Ball for the big data stack—a decomposable time series model for optimal cost and resource allocation that offers enterprises a glimpse into their future needs and enables effective and cost-efficient project and operational planning. Read more.

The vindication of big data: How Santander UK uses Hadoop to defend privacy

16:35–17:15 Wednesday, 1/05/2019

Session

Case studies, Strata Business Summit
Location: Capital Suite 12

Secondary topics: Data preparation, data governance, and data lineage, Financial Services, Security and Privacy

Maurício Lins (everis NTT DATA UK), Lidia Crespo (Santander UK)

Average rating:

(4.50, 4 ratings)

Big data is usually regarded as a menace to data privacy. But with data privacy principles and a customer-first mindset, it can be a game changer. Maurício Lins and Lidia Crespo explain how Santander UK applied this model to comply with GDPR, using graph technology, Hadoop, Spark, and Kudu to drive data obscuring, data portability, and machine learning exploration. Read more.

Deploying your real-time apps on thousands of servers and still being able to breathe

16:35–17:15 Wednesday, 1/05/2019

Session

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, Automation in data science and big data

Constantin Muraru (Adobe), Dan Popescu (Adobe)

Average rating:

(5.00, 2 ratings)

With the current crop of cloud providers, obtaining servers to run your real-time application has never been easier. But what happens when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast, reliable way, with minimal human intervention? Constantin Muraru and Dan Popescu tell you how to tackle this challenge. Read more.

17:25

Creating a data engineering culture

17:25–18:05 Wednesday, 1/05/2019

Session

Culture and organization, Strata Business Summit
Location: Capital Suite 17

Jesse Anderson (Big Data Institute)

Average rating:

(4.71, 7 ratings)

Jesse Anderson covers the most common reasons why data engineering teams fail and how to correct them, including ways to get your management to understand that data engineering is really complex and time-consuming. It is not data warehousing with new names. Management needs to understand that you can’t compare a data engineering team to the web development team, for example. Read more.

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber

17:25–18:05 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Secondary topics: Data Platforms, Transportation and Logistics

Felix Cheung (Uber)

Average rating:

(4.42, 12 ratings)

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame. Read more.

Information architecture for an enterprise data cloud

17:25–18:05 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: AI and Data technologies in the cloud, Data Platforms

Mark Samson (Cloudera), Phillip Radley (BT)

Average rating:

(5.00, 2 ratings)

It's now possible to build a modern data platform capable of storing, processing, and analyzing a wide variety of data across multiple public and private cloud platforms and on-premises data centers. Mark Samson and Phillip Radley outline an information architecture for such a platform, informed by working with multiple large organizations that have built such platforms over the last five years. Read more.

Augmented OLAP for big data from on-premises to multicloud (sponsored by Kyligence)

17:25–18:05 Wednesday, 1/05/2019

Session

Sponsored
Location: Capital Suite 2/3

Luke Han (Kyligence)

Average rating:

(2.00, 1 rating)

Augmenting data management and analytics platforms with artificial intelligence and machine learning is game changing for analysts, engineers, and other users. It enables companies to optimize their storage, speed, and spending. Luke Han explains how the Kyligence platform is evolving to the next level, with augmented capabilities such as intelligent modeling, smart pushdowns, and more. Read more.

Infinite retention using storage offloading with Apache Pulsar

17:25–18:05 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 4

Secondary topics: Streaming and realtime analytics

Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)

Karthik Ramasamy and Ivan Kelly discuss how Apache Pulsar provides infinite retention of events in topics: how the segment-oriented architecture allows unlimited topic growth, how you can keep costs down by using tiered storage, and how you can run ad hoc queries on the topic using SQL. Read more.

Mass production of AI solutions

17:25–18:05 Wednesday, 1/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: Data Platforms

Nate Keating (Google)

Average rating:

(4.00, 5 ratings)

AI will change how we live in the next 30 years, but it's still currently limited to a small group of companies. In order to scale the impact of AI across the globe, we need to reduce the cost of building AI solutions, but how? Nate Keating explains how to apply lessons learned from other industries—specifically, the automobile industry, which went through a similar cycle. Read more.

Science-fictional user interfaces

17:25–18:05 Wednesday, 1/05/2019

Session

Strata Business Summit, Visualization and UX
Location: Capital Suite 10/11

Secondary topics: Visualization, Design, and UX

Mars Geldard (University of Tasmania), Paris Buttfield-Addison (Secret Lab)

Average rating:

(4.67, 12 ratings)

Science fiction has been showcasing complex, AI-driven interfaces for decades. As TV, movies, and video games have become more capable of visualizing a possible future, the grandeur of these imagined science fictional interfaces has increased. Mars Geldard and Paris Buttfield-Addison investigate what we can learn from Hollywood UX. Is there a useful takeaway? Does sci-fi show the future of AI UX? Read more.

Executive Briefing: Using a domain knowledge graph to manage AI at scale

17:25–18:05 Wednesday, 1/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: AI and machine learning in the enterprise, Financial Services, Graph technologies and analytics

Teresa Tung (Accenture), Jean-Luc Chatelain (Accenture)

Average rating:

(2.67, 3 ratings)

How do enterprises move beyond one-off AI projects and make AI reusable at scale? Teresa Tung and Jean-Luc Chatelain explain how domain knowledge graphs—the technology behind today's internet search—can bring the same democratized experience to enterprise AI. They then explore other applications of knowledge graphs in oil and gas, financial services, and enterprise IT. Read more.

Federated learning: Machine learning with privacy on the edge

17:25–18:05 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Security and Privacy

Chris Wallace (Cloudera)

Average rating:

(5.00, 4 ratings)

Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. Chris Wallace discusses the algorithmic solutions and the product opportunities. Read more.

Reading China: Predicting policy change with machine learning

17:25–18:05 Wednesday, 1/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Secondary topics: Text and Language processing and analysis

Weifeng Zhong (Mercatus Center at George Mason University)

Average rating:

(4.75, 4 ratings)

Weifeng Zhong shares a machine learning algorithm built to “read” the People’s Daily (the official newspaper of the Communist Party of China) and predict changes in China’s policy priorities. The output of this algorithm, named the Policy Change Index (PCI) of China, turns out to be a leading indicator of the actual policy changes in China since 1951. Read more.

Why is it so hard to do AI for good?

17:25–18:05 Wednesday, 1/05/2019

Session

Law and Ethics, Strata Business Summit
Location: Capital Suite 12

Secondary topics: Ethics

Duncan Ross (Times Higher Education), Giselle Cory (DataKind UK)

Average rating:

(4.50, 4 ratings)

DataKind UK has been working in data for good since 2013, helping over 100 UK charities to do data science for the benefit of their users. Some of those projects have delivered above and beyond expectations; others haven't. Duncan Ross and Giselle Cory explain how to identify the right data for good projects and how this can act as a framework for avoiding the same problems across industry. Read more.

Mastering streaming and pipelines: Designing and supporting the nervous system of your company

17:25–18:05 Wednesday, 1/05/2019

Session

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: Data Integration and Data Pipelines, Financial Services, Streaming and realtime analytics

Ted Malaska (Capital One)

Average rating:

(4.12, 8 ratings)

The world of data is all about building the best path to support time and quality to value. 80% to 90% of the work is getting the data into the hands and tools that can create value. Ted Malaska takes you on a journey to investigate strategies and designs that can change the way your company looks at and approaches data. Read more.

18:05

Expo Hall Reception

18:05–19:05 Wednesday, 1/05/2019

Event

Location: Expo Hall

Unwind after a long day of sessions with small bites and drinks while networking with Strata attendees, exhibitors, and sponsors. Read more.

19:05

19:05–20:00 Wednesday, 1/05/2019

Location: On Your Own

Dinner (55m)

20:00

Data After Dark

20:00–22:00 Wednesday, 1/05/2019

Event

Location: Madison London: One New Change, St Paul’s, London

Average rating:

(5.00, 2 ratings)

Join us for Data After Dark, the official attendee party for Strata in London, which promises to be an unforgettable evening. Take in breathtaking views of London as you enjoy delicious food, drinks, and fun at the Madison London, near St. Paul's Cathedral. Don't miss it. Read more.

Thursday, 2/05/2019

8:00

8:00–9:00 Thursday, 2/05/2019

Privacy, identity, and autonomy in the age of big data and AI

10:15–10:35 Thursday, 2/05/2019

Keynote

Location: Auditorium

Secondary topics: Security and Privacy

Sandra Wachter (University of Oxford)

Average rating:

(4.65, 20 ratings)

Big data analytics and AI draw nonintuitive and unverifiable inferences about the behaviors, preferences, and lives of individuals. These inferences draw on diverse and feature-rich data of unpredictable value and create new opportunities for discriminatory, biased, and invasive decision making. Sandra Wachter discusses how this expands potential victims of discrimination and potential harm. Read more.

10:45

10:45–11:15 Thursday, 2/05/2019

Location: Expo Hall

Morning break (30m)

11:15

Deep learning for speech synthesis: The good news, the bad news, and the fake news

11:15–11:55 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 17

Secondary topics: Deep Learning, Graph technologies and analytics, Security and Privacy

Scott Stevenson (Faculty)

Average rating:

(5.00, 4 ratings)

Modern deep learning systems allow us to build speech synthesis systems with the naturalness of a human speaker. While there are myriad benevolent applications, this also ushers in a new era of fake news. Scott Stevenson explores the danger of such systems and details how deep learning can also be used to build countermeasures to protect against political disinformation. Read more.

How to keep ethical with machine learning

11:15–11:55 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall (Capital Hall N24)

Secondary topics: Ethics

Jerry Overton (DXC)

Average rating:

(5.00, 3 ratings)

Machine learning (ML) algorithms are good at learning new behaviors but bad at identifying when those behaviors are harmful or don’t make sense. Bias, ethics, and fairness are big risk factors in ML. However, we creators have a lot of experience dealing with intelligent beings—one another. Jerry Overton uses this common sense to build a checklist for protecting against ethical violations with ML. Read more.

Scaling Impala: Common mistakes and best practices

11:15–11:55 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Manish Maheshwari (Cloudera)

Average rating:

(5.00, 1 rating)

Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala is able to handle hundreds of nodes and tens of thousands of queries hourly. Manish Maheshwari explains how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and anti-patterns for end users or BI applications. Read more.

Big data analytics in the public cloud: Challenges and opportunities

11:15–11:55 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: AI and Data technologies in the cloud

Jian Zhang (Intel), Chendi Xue (Intel), Yuan Zhou (Intel)

Average rating:

(4.50, 2 ratings)

Jian Zhang, Chendi Xue, and Yuan Zhou explore the challenges of migrating big data analytics workloads to the public cloud (e.g., performance loss and missing features) and demonstrate how to use a new in-memory data accelerator leveraging persistent memory and RDMA NICs to resolve these issues and enable new opportunities for big data workloads on the cloud. Read more.

Oracle's second-generation cloud: Optimized for the partner ecosystem (sponsored by Oracle Cloud Infrastructure)

11:15–11:55 Thursday, 2/05/2019

Session

Sponsored
Location: Capital Suite 2/3

Ben Lackey (Oracle)

Average rating:

(5.00, 1 rating)

Join Ben Lackey to learn how Oracle Cloud Infrastructure's architecture makes it the right place to run compute-intensive partner applications like H2O.ai, Cloudera, DataStax, and more. Read more.

Half-correct and half-wrong collective data wisdom: 3 patterns to sanity

11:15–11:55 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: Data preparation, data governance, and data lineage, Financial Services

Sandeep Uttamchandani (Intuit)

Average rating:

(4.67, 3 ratings)

Teams today rely on dictionaries of collective wisdom—a mixed bag with regard to correctness: some datasets have accurate attribute details, while others are incorrect and outdated. This significantly impacts productivity of analysts and scientists. Sandeep Uttamchandani outlines three patterns to better manage data dictionaries. Read more.

Transforming a financial services data infrastructure for the modern era by building a PCI DSS-compliant data platform from the ground up on AWS

11:15–11:55 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 10/11

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Financial Services, Security and Privacy

Eoin O'Flanagan (NewDay), Darragh McConville (Kainos)

Average rating:

(4.86, 7 ratings)

Eoin O'Flanagan and Darragh McConville explain how NewDay built a high-performance contemporary data processing platform from the ground up on AWS. Join in to explore the company's journey from a traditional legacy onsite data estate to an entirely cloud-based PCI DSS-compliant platform. Read more.

Executive Briefing: Analytics for executives

11:15–11:55 Thursday, 2/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: AI and machine learning in the enterprise, Transportation and Logistics

Brandy Freitas (Pitney Bowes)

Average rating:

(4.71, 7 ratings)

Data science is an approachable field given the right framing. Often, though, practitioners and executives are describing opportunities using completely different languages. Brandy Freitas walks you through developing context and vocabulary around data science topics to help build a culture of data within your organization. Read more.

Fraud detection at a financial institution using unsupervised learning and text mining

11:15–11:55 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: AI and machine learning in the enterprise, Financial Services, Security and Privacy, Text and Language processing and analysis

David Dogon (Van Lanschot Kempen)

Average rating:

(4.75, 8 ratings)

David Dogon dives into a best practice use case for detecting fraud at a financial institution and details a dynamic and robust monitoring system that successfully detects unwanted client behavior. Join in to learn how machine learning models can provide a solution in cases where traditional systems fall short. Read more.

Learning "learning to rank"

11:15–11:55 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Secondary topics: Media, Marketing, Advertising, Retail and e-commerce

Sophie Watson (Red Hat)

Average rating:

(4.10, 10 ratings)

Identifying relevant documents quickly and efficiently enhances both user experience and business revenue every day. Sophie Watson demonstrates how to implement learning-to-rank algorithms and provides you with the information you need to implement your own successful ranking system. Read more.

Insightful health: Amplifying intelligence in healthcare patient flow execution

11:15–11:55 Thursday, 2/05/2019

Session

Case studies, Strata Business Summit
Location: Capital Suite 12

Secondary topics: Health and Medicine

Fabio Ferraretto, Claudia Regina Laselva (Albert Einstein Jewish Hospital)

Average rating:

(3.33, 3 ratings)

Fabio Ferraretto and Claudia Regina Laselva explain how Hospital Albert Einstein and Accenture evolved patient flow experience and efficiency with the use of applied AI, statistics, and combinatorial math, allowing the hospital to anticipate E2E visibility within patient flow operations, from admission of emergency and elective demands to assignment and medical releases. Read more.

Streaming at Lyft

11:15–11:55 Thursday, 2/05/2019

Session

Data Engineering and Architecture, Expo Hall, Streaming and IoT
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: Data Platforms, Streaming and realtime analytics, Transportation and Logistics

Thomas Weise (Lyft)

Average rating:

(4.50, 14 ratings)

Fast data and stream processing are essential for making Lyft rides a good experience for passengers and drivers. Lyft's systems need to track and react to event streams in real time to update locations, compute routes and estimates, balance prices, and more. Thomas Weise offers an overview of the streaming platform that powers these use cases. Read more.

12:05

Inclusive design: Deep learning on audio in Azure, identifying sounds in real time

12:05–12:45 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 17

Swetha Machanavajhala (Microsoft), Xiaoyong Zhu (Microsoft)

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in the world who are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure. Read more.

Deep learning for recommender systems

12:05–12:45 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall (Capital Hall N24)

Secondary topics: Deep Learning, Media, Marketing, Advertising, Retail and e-commerce

Oliver Gindele (Datatonic)

Average rating:

(4.50, 6 ratings)

The success of deep learning has reached the realm of structured data in the past few years, where neural networks have been shown to improve the effectiveness and predictability of recommendation engines. Oliver Gindele offers a brief overview of such deep recommender systems and explains how they can be implemented in TensorFlow. Read more.

Schema on read and the new logging way

12:05–12:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Data Platforms, Streaming and realtime analytics

David Josephsen (Sparkpost)

Average rating:

(3.50, 2 ratings)

David Josephsen tells the story of how Sparkpost's reliability engineering team abandoned ELK for a DIY schema-on-read logging infrastructure. Join in to learn the architectural details, trials, and tribulations from the company's Internal Event Hose data ingestion pipeline project, which uses Fluentd, Kinesis, Parquet, and AWS Athena to make logging sane. Read more.

Herding elephants: Seamless data access in multicluster clouds

12:05–12:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: AI and Data technologies in the cloud, Data Platforms

Pradeep Bhadani (Hotels.com), Elliot West (Hotels.com)

Average rating:

(4.17, 6 ratings)

Travel platform Expedia Group likes to give its data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. Pradeep Bhadani and Elliot West explain how the company built a unified virtual data lake on top of its many heterogeneous and distributed data platforms. Read more.

Data science at Deutsche Telekom: Predicting global travel patterns and network demand

12:05–12:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: Data Platforms, Security and Privacy, Transportation and Logistics

Václav Surovec (Deutsche Telekom), Gabor Kotalik (Deutsche Telekom)

Average rating:

(4.00, 2 ratings)

Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management. Read more.

Application intelligence: Bridging the gap between human expertise and machine learning

12:05–12:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 10/11

Secondary topics: AI and machine learning in the enterprise

Rebecca Simmonds (Red Hat), Michael McCune (Red Hat)

Average rating:

(3.00, 6 ratings)

Artificial intelligence and machine learning are now popularly used terms, but how do you make use of these techniques without throwing away the valuable knowledge of experienced employees? Rebecca Simmonds and Michael McCune delve into this idea with examples of how distributed machine learning frameworks fit together naturally with business rules management systems. Read more.

Executive Briefing: The intelligent edge and the demise of big data?

12:05–12:45 Thursday, 2/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: IoT and its applications, Security and Privacy

Alasdair Allan (Babilim Light Industries)

Average rating:

(5.00, 4 ratings)

Alasdair Allan explains why the current age, where privacy is no longer "a social norm," may not long survive the coming of the internet of things, as new smart embedded hardware may cause the demise of large-scale data harvesting. Smart devices will process data at the edge, allowing us to extract insights from the data without storing potentially privacy- and GDPR-infringing data. Read more.

NLP Architect by Intel's AI Lab

12:05–12:45 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Deep Learning, Text and Language processing and analysis

Moshe Wasserblat (Intel)

Average rating:

(4.67, 3 ratings)

Moshe Wasserblat offers an overview of NLP Architect, an open source DL NLP library that provides SOTA NLP models, making it easy for researchers to implement NLP algorithms and for data scientists to build NLP-based solutions for extracting insight from textual data to improve business operations. Read more.

How to mitigate mobile fraud risk by data analytics

12:05–12:45 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Seonmin Kim (LINE)

Average rating:

(4.00, 6 ratings)

Seonmin Kim offers an introduction to activities that mitigate the risk of mobile payments through various data analytical skills, drawn from actual case studies of mobile frauds, along with tree-based machine learning, graph analytics, and statistical approaches. Read more.

Starting with the end in mind: Lessons learned from data strategies that work

12:05–12:45 Thursday, 2/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 12

Secondary topics: AI and machine learning in the enterprise

Vidya Raman (Cloudera)

Average rating:

(4.62, 8 ratings)

Not surprisingly, there's no single approach to embracing data-driven innovations within any industry vertical. However, some enterprises are doing a better job than others when it comes to establishing a culture, process, and infrastructure that lends itself to data-driven innovations. Vidya Raman explores some key foundational ingredients that span multiple industries. Read more.

Unleashing Apache Kafka and TensorFlow in hybrid architectures

12:05–12:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, Model lifecycle management

Kai Wähner (Confluent)

Average rating:

(4.75, 8 ratings)

How do you leverage the flexibility and extreme scale of the public cloud and the Apache Kafka ecosystem to build scalable, mission-critical machine learning infrastructures that span multiple public clouds—or bridge your on-premises data center to the cloud? Join Kai Wähner to learn how to use technologies such as TensorFlow with Kafka’s open source ecosystem for machine learning infrastructures. Read more.

12:45

Thursday Topic Tables at Lunch

12:45–14:05 Thursday, 2/05/2019

Event

Location: Expo Hall

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.

14:05

Deep learning for fonts

14:05–14:45 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 17

Secondary topics: Deep Learning

Raghotham Sripadraj (Ericsson), Nischal Harohalli Padmanabha (Omnius)

Average rating:

(5.00, 1 rating)

Deep learning has enabled massive breakthroughs in offbeat tracks and has enabled better understanding of how an artist paints, how an artist composes music, and so on. Nischal Harohalli Padmanabha and Raghotham Sripadraj discuss their project Deep Learning for Humans and their plans to build a font classifier. Read more.

AI for good at scale in real time: Challenges in machine learning and deep learning

14:05–14:45 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI, Expo Hall
Location: Expo Hall (Capital Hall N24)

Secondary topics: Data Integration and Data Pipelines, Deep Learning

Alex Jaimes (Dataminr)

Average rating:

(3.00, 2 ratings)

When emergency events occur, social signals and sensor data are generated. Alex Jaimes explains how to apply machine learning and deep learning to process large amounts of heterogeneous data from various sources in real time, with a particular focus on how such information can be used for emergencies and in critical events for first responders and for other social good use cases. Read more.

Mutant tests too: The SQL

14:05–14:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Elliot West (Hotels.com), Jaydene Green (Hotels.com)

Average rating:

(3.00, 3 ratings)

Elliot West and Jay Green share approaches for applying software engineering best practices to SQL-based data applications to improve maintainability and data quality. Using open source tools, Elliot and Jay show how to build effective test suites for Apache Hive code bases and offer an overview of Mutant Swarm, a tool to identify weaknesses in tests and to measure SQL code coverage. Read more.

The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit

14:05–14:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: Security and Privacy

Marcel Ruiz Forns (Wikimedia Foundation)

Average rating:

(4.75, 4 ratings)

Analysts and researchers studying Wikipedia are hungry for long-term data to build experiments and feed data-driven decisions. But Wikipedia has a strict privacy policy that prevents storing privacy-sensitive data over 90 days. Marcel Ruiz Forns explains how the Wikimedia Foundation's analytics team is working on a vegan data diet to satisfy both. Read more.

Unlocking insights in AI by building a feature store

14:05–14:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: AI and Data technologies in the cloud, AI and machine learning in the enterprise, Data Platforms, Transportation and Logistics

Willem Pienaar (GOJEK), Zhi Ling Chen (GOJEK)

Average rating:

(4.80, 5 ratings)

Features are key to driving impact with AI at all scales, allowing organizations to dramatically accelerate innovation and time to market. Willem Pienaar and Zhiling Chen explain how GOJEK, Indonesia's first billion-dollar startup, unlocked insights in AI by building a feature store called Feast, and the lessons they learned along the way. Read more.

Simplicity at scale: How Cloudflare analyses some of the world’s largest DDoS attacks

14:05–14:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 10/11

Secondary topics: Security and Privacy, Streaming and realtime analytics

Tom Walwyn (Cloudflare)

Average rating:

(4.00, 1 rating)

Cloudflare powers nearly 10 percent of all Internet requests worldwide, absorbing some of the largest DDoS attacks. Learn how we use ClickHouse and SQL to simplify our data pipelines on a global scale while experiencing over 10 million events per second. Read more.

Executive Briefing: The hidden data scientists lurking in your company

14:05–14:45 Thursday, 2/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: AI and machine learning in the enterprise

Jack Norris (MapR Technologies)

Average rating:

(4.75, 4 ratings)

Many companies delay addressing core improvements in increasing revenues, reducing costs and risk exposure by tying changes to a to-be-hired data scientist. Drawing on three customer examples, Jack Norris explains how to achieve excellent results faster by starting with domain experience and helping developers and analysts better leverage data with available and understandable analytics. Read more.

8 prerequisites of a graph query language

14:05–14:45 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: Graph technologies and analytics

Mingxi Wu (TigerGraph)

Average rating:

(2.75, 4 ratings)

A graph query language is the key to unleashing the value of connected data. Mingxi Wu outlines the eight prerequisites of a practical graph query language, drawn from six years' experience dealing with real-world graph analytical use cases. Along the way, Mingxi compares GSQL, Gremlin, Cypher, and SPARQL, pointing out their respective pros and cons. Read more.

Reinforcement learning: A gentle introduction and an industrial application

14:05–14:45 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Secondary topics: IoT and its applications, Temporal data and time-series

Christian Hidber (bSquare)

Average rating:

(4.86, 7 ratings)

Reinforcement learning (RL) autonomously learns complex processes such as walking, beating the world champion in Go, or flying a helicopter. No big datasets with the “right” answers are needed: the algorithms learn by experimenting. Christian Hidber shows how and why RL works and demonstrates how to apply it to an industrial hydraulics application with 7,000 clients in 42 countries. Read more.

Practicing data science: A collection of case studies

14:05–14:45 Thursday, 2/05/2019

Session

Case studies, Strata Business Summit
Location: Capital Suite 12

Secondary topics: AI and machine learning in the enterprise

Rosaria Silipo (KNIME)

Average rating:

(4.40, 10 ratings)

Rosaria Silipo shares a collection of past data science projects. While the structure is often similar—data collection, data transformation, model training, deployment—each required its own special trick, whether a change in perspective or a particular technique to deal with special cases and special business questions. Read more.

Autoscaling Spark on Kubernetes

14:05–14:45 Thursday, 2/05/2019

Session

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud

Holden Karau (Independent), Kris Nova (Independent)

Average rating:

(4.86, 7 ratings)

In the Kubernetes world, where declarative resources are a first-class citizen, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice, we can finally look at constructing a hybrid system of running Spark in a distributed cloud native way. Join respective experts Kris Nova and Holden Karau for a fun adventure. Read more.

14:55

A deep learning approach to automatic call routing

14:55–15:35 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 17

Secondary topics: AI and machine learning in the enterprise, Deep Learning

Tal Doron (GigaSpaces)

Average rating:

(3.50, 2 ratings)

Technological advancements are transforming customer experience, and businesses are beginning to benefit from deep learning innovations to automate call center routing to the most appropriate agent. Tal Doron explains how to run deep learning models with Intel BigDL and Spark frameworks colocated on an in-memory computing platform to enhance the customer experience without the need for GPUs. Read more.

14:55–15:35 Thursday, 2/05/2019

Location: Expo Hall (Capital Hall N24)

TBC

The future of cloud native data warehousing: Emerging trends and technologies

14:55–15:35 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Greg Rahn (Cloudera)

Average rating:

(3.00, 7 ratings)

Data warehouses have traditionally run in the data center, and in recent years, they've been adapted to be more cloud native. Greg Rahn discusses a number of emerging trends and technologies that will impact how data warehouses are run both in the cloud and on-premises and explains what that means for architects, administrators, and end users. Read more.

Mastering data with Spark and machine learning

14:55–15:35 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: Automation in data science and big data, Data preparation, data governance, and data lineage

Sonal Goyal (Nube)

Average rating:

(1.00, 4 ratings)

Enterprise data on customers, vendors, and products is often siloed and represented differently in diverse systems, hurting analytics, compliance, regulatory reporting, and 360 views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. Sonal Goyal offers an overview of a modern master data application using Spark, Cassandra, ML, and Elastic. Read more.

Architecting a data platform to support analytic workflows for scientific data

14:55–15:35 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, IoT and its applications

Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)

Average rating:

(3.67, 3 ratings)

In upstream oil and gas, a vast amount of the data requested for analytics projects is scientific data: physical measurements about the real world. Historically, this data has been managed library style, but a new system was needed to best provide this data. Sun Maria Lehmann and Jane McConnell share architectural best practices learned from their work with subsurface data. Read more.

Learning how to perform ETL data migrations with open source tool Embulk

14:55–15:35 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 10/11

Secondary topics: Data Integration and Data Pipelines

Jason Bell (Independent Speaker)

Average rating:

(5.00, 1 rating)

The Embulk data migration tool offers a convenient way to load data into a variety of systems with basic configuration. Jason Bell offers an overview of the Embulk tool and outlines some common data migration scenarios that a data engineer could employ using the tool. Read more.

Executive Briefing: AWS technology trends—Data lakes and analytics

14:55–15:35 Thursday, 2/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: AI and Data technologies in the cloud

Nikki Rouda (Amazon Web Services)

Average rating:

(4.14, 7 ratings)

Nikki Rouda shares key trends in data lakes and analytics and explains how they shape the services offered by AWS. Specific topics include the rise of machine-generated data and semistructured and unstructured data as dominant sources of new data, the move toward serverless, API-centric computing, and the growing need for local access to data from users around the world. Read more.

Learning with limited labeled data

14:55–15:35 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Shioulin Sam (Cloudera Fast Forward Labs)

Average rating:

(4.45, 11 ratings)

Supervised machine learning requires large labeled datasets—a prohibitive limitation in many real-world applications. What if machines could learn with fewer labeled examples? Shioulin Sam shares an algorithmic solution that relies on collaboration between humans and machines to label smartly and discusses product possibilities. Read more.

Early incident detection using fusion analytics of commuter-centric data sources

14:55–15:35 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Secondary topics: IoT and its applications, Temporal data and time-series, Transportation and Logistics

Christopher Hooi (Land Transport Authority of Singapore)

Average rating:

(5.00, 3 ratings)

Christopher Hooi offers an overview of the Fusion Analytics for Public Transport Event Response (FASTER) system, a real-time advanced analytics solution for early warning of potential train incidents. FASTER uses engineering and commuter-centric IoT data sources to activate contingency plans at the earliest possible time and reduce impact to commuters. Read more.

Data-driven digital transformation and jobs: The new software hierarchy and ML

14:55–15:35 Thursday, 2/05/2019

Session

Culture and organization, Strata Business Summit
Location: Capital Suite 12

Secondary topics: AI and machine learning in the enterprise

Robert Cohen (Economic Strategy Institute)

Average rating:

(2.00, 2 ratings)

Robert Cohen discusses the skills that employers are seeking from employees in digital jobs, linked to the new software hierarchy driving digital transformation. Robert describes this software hierarchy as one that ranges from DevOps, CI/CD, and microservices to Kubernetes and Istio. This hierarchy is used to define the jobs that are central to data-driven digital transformation. Read more.

Performant time series data management and analytics with PostgreSQL

14:55–15:35 Thursday, 2/05/2019

Session

Data Engineering and Architecture, Expo Hall
Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: Streaming and realtime analytics, Temporal data and time-series

Michael Freedman (TimescaleDB | Princeton University)

Average rating:

(4.75, 4 ratings)

Time series databases require ingesting high volumes of structured data, answering complex, performant queries for recent and historical time intervals, and performing specialized time-centric analysis and data management. Michael Freedman explains how to avoid these operational problems by reengineering Postgres to serve as a general data platform, including high-volume time series workloads. Read more.

15:35

15:35–16:35 Thursday, 2/05/2019

Location: Expo Hall

Afternoon break (1h)

16:35

Deep learning with TensorFlow and Spark using GPUs and Docker containers

16:35–17:15 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Platforms

Thomas Phelan (HPE BlueData)

Average rating:

(3.29, 7 ratings)

Organizations need to keep ahead of their competition by using the latest AI, ML, and DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. Thomas Phelan discusses the effective deployment of such applications in a container environment. Read more.

Migrating Apache Oozie workflows to Apache Airflow

16:35–17:15 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: S11 B

Secondary topics: Data Integration and Data Pipelines

Feng Lu (Google Cloud), James Malone (Google), Apurva Desai (Google Cloud), Cameron Moberg (Truman State University | Google Cloud)

Average rating:

(4.00, 3 ratings)

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution. Read more.

From legacy to cloud: An end-to-end data integration journey

16:35–17:15 Thursday, 2/05/2019

Session

Data Engineering and Architecture
Location: Capital Suite 8/9

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Retail and e-commerce

Max Schultze (Zalando SE)

Average rating:

(4.83, 12 ratings)

Max Schultze details Zalando's end-to-end data integration platform to serve analytical use cases and machine learning throughout the company, covering raw data collection, standardized data preparation (binary conversion, partitioning, etc.), user-driven analytics, and machine learning. Read more.

Executive Briefing: What it takes to use machine learning in fast data pipelines

16:35–17:15 Thursday, 2/05/2019

Session

Executive Briefing and best practices, Strata Business Summit
Location: Capital Suite 13

Secondary topics: Data Integration and Data Pipelines, Streaming and realtime analytics

Dean Wampler (Anyscale)

Average rating:

(5.00, 4 ratings)

Your team is building machine learning capabilities. Dean Wampler demonstrates how to integrate these capabilities in streaming data pipelines so you can leverage the results quickly and update them as needed. He also covers challenges such as how to build long-running services that are very reliable and scalable and how to combine a spectrum of very different tools, from data science to operations. Read more.

Evaluating cybersecurity defenses with a data science approach

16:35–17:15 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 14

Secondary topics: AI and machine learning in the enterprise, Financial Services, Security and Privacy

Brennan Lodge (Goldman Sachs), Jay Kesavan (Bowery Analytics LLC)

Average rating:

(3.00, 3 ratings)

Cybersecurity analysts are under siege to keep pace with the ever-changing threat landscape. The analysts are overworked as they are bombarded with and burned out by the sheer number of alerts that they must carefully investigate. Brennan Lodge and Jay Kesavan explain how to use a data science model for alert evaluations to empower your cybersecurity analysts. Read more.

Improving infrastructure efficiency with unsupervised algorithms

16:35–17:15 Thursday, 2/05/2019

Session

Data Science, Machine Learning & AI
Location: Capital Suite 15/16

Secondary topics: IoT and its applications, Transportation and Logistics

Alexandre Hubert (Dataiku)

Average rating:

(5.00, 1 rating)

GRDF helps bring natural gas to nearly 11 million customers every day. Alexandre Hubert explains how, in partnership with GRDF, Dataiku worked to optimize the manual process of qualifying addresses to visit and ultimately save GRDF time and money. This solution was the culmination of a yearlong adventure in the land of maintenance experts, legacy IT systems, and Agile development. Read more.

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of Strata Data Conference contacts

©2019, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com
