
Presented By
O’Reilly + Cloudera

Make Data Work

29 April–2 May 2019
London, UK

Expo Hall (Capital Hall N24)

11:15 Recommending and searching at Spotify Mounia Lalmas (Spotify)

12:05 Agile NLP workflows with spaCy and Prodigy Matthew Honnibal (Explosion AI)

14:05 Fair, privacy-preserving, and secure ML Mikio Braun (Zalando)

14:55 TensorFlow for everyone Wolff Dobson (Google, Inc.)

16:35 Opening the black box: Explainable AI (XAI) Maren Eckhoff (QuantumBlack)

Expo Hall 2 (Capital Hall N24)

11:15 Stream, stream, stream: Different streaming methods with Spark and Kafka Itai Yaffe (Nielsen)

12:05 Report card on streaming microservices Ted Dunning (MapR, now part of HPE)

14:05 Nielsen presents: Fun with Kafka, Spark, and offset management Simona Meriam (Nielsen)

14:55 Processing 10M samples a second to drive smart maintenance in complex IIoT systems Geir Engdahl (Cognite), Daniel Bergqvist (Google)

16:35 Deploying your real-time apps on thousands of servers and still being able to breathe Constantin Muraru (Adobe), Dan Popescu (Adobe)

17:25 Mastering streaming and pipelines: Designing and supporting the nervous system of your company Ted Malaska (Capital One)

S11 A

11:15 The Presto Cost-Based Optimizer for interactive SQL on anything Wojciech Biela (Starburst), Piotr Findeisen (Starburst)

12:05 Running SQL-based workloads in the cloud at 20x–200x lower cost using Apache Arrow Jacques Nadeau (Dremio)

14:05 Picking Parquet: Improved performance for selective queries in Impala, Hive, and Spark Anna Szonyi (Cloudera), Zoltán Borók-Nagy (Cloudera)

14:55 Improving Spark downscaling; Or, Not throwing away all of our work Holden Karau (Independent), Mikayla Konst (Google), Ben Sidhom (Google)

16:35 Scalability-aware autoscaling of a Spark application Anirudha Beria (Qubole), Rohit Karlupia (Qubole)

17:25 Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber Felix Cheung (Uber)

S11 B

11:15 Protecting sensitive data in huge datasets: Cloud tools you can use Felipe Hoffa (Google)

12:05 Leveraging metadata for automating delivery and operations of advanced data platforms Peter Billen (Accenture)

14:05 Disrupting data discovery Mark Grover (Lyft)

14:55 Model serving via Pulsar functions Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

16:35 Continuous intelligence: Keeping your AI application in production Arif Wider (ThoughtWorks), Emily Gorcenski (ThoughtWorks)

17:25 Information architecture for an enterprise data cloud Mark Samson (Cloudera), Phillip Radley (BT)

Capital Suite 8/9

11:15 Building a sales AI platform: Key principles and lessons learned Moty Fania (Intel)

12:05 The changing face of ETL: Event-driven architectures for data engineers Robin Moffatt (Confluent)

14:05 Building the data infrastructure for the internet of things at zettabyte scale Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)

14:55 The Lyft data platform: Now and in the future Mark Grover (Lyft), Deepak Tiwari (Lyft)

16:35 How do you evolve your data infrastructure? Neelesh Salian (Stitch Fix)

17:25 Mass production of AI solutions Nate Keating (Google)

Capital Suite 10/11

11:15 Serverless for data and AI Avner Braverman (Binaris)

12:05 Responsible AI innovation Laila Paszti (GTC Law Group PC & Affiliates)

14:05 Using data for evil V: The AI strikes back Duncan Ross (Times Higher Education), Francine Bennett (Mastodon C)

14:55 Integrated Business Intelligence Suite: How Uber built a platform to convert raw data into knowledge Shailesh Chauhan (Uber)

16:35 Empathy: The secret ingredient in the design of engaging data products and analytics tools Brian O'Neill (Designing for Analytics)

17:25 Science-fictional user interfaces Mars Geldard (University of Tasmania), Paris Buttfield-Addison (Secret Lab)

Capital Suite 12

11:15 Implementing enterprise data management in industrial and scientific organizations Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)

12:05 Insights from engineering Europe's largest marketing platform for fashion Dirk Petzoldt (Zalando SE)

14:05 An Innovation Architecture industrializes AI from PoCs to production Teresa Tung (Accenture), Jean-Luc Chatelain (Accenture)

14:55 Signal processing, machine learning, and video tell the truth David Maman (Binah)

16:35 The vindication of big data: How Santander UK uses Hadoop to defend privacy Maurício Lins (everis NTT DATA UK), Lidia Crespo (Santander UK)

17:25 Why is it so hard to do AI for good? Duncan Ross (Times Higher Education), Giselle Cory (DataKind UK)

Capital Suite 13

11:15 Executive Briefing: From the edge to AI—Taking control of your data for fun and profit Mick Hollison (Cloudera)

12:05 Executive Briefing: Why managing machines is harder than you think Pete Skomoroch (Workday)

14:05 Executive Briefing: Big data in the era of heavy worldwide privacy regulations Mark Donsky (Okera), Nikki Rouda (Amazon Web Services)

14:55 Executive Briefing: Overview of data governance Paco Nathan (derwen.ai)

16:35 Executive Briefing: 5 things every executive should NOT know Ellen Friedman (Independent)

17:25 Executive Briefing: Using a domain knowledge graph to manage AI at scale Teresa Tung (Accenture), Jean-Luc Chatelain (Accenture)

Capital Suite 14

11:15 Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale Alexander Thomas (John Snow Labs), Alexis Yelton (Indeed)

12:05 Dealing with data scarcity in natural language processing Yves Peirsman (NLP Town)

14:05 The evolution of data science skill sets: An analysis using exponential family embeddings Maryam Jahanshahi (TapRecruit)

14:55 Solving data cleaning and unification using human-guided machine learning Ihab Ilyas (University of Waterloo)

16:35 From BI to big data; Or, There and back again Francesco Mucio (Francescomuc.io)

17:25 Federated learning: Machine learning with privacy on the edge Chris Wallace (Cloudera)

Capital Suite 15/16

11:15 Building a secure and transparent ML pipeline using open source technologies Nick Pentreath (IBM)

12:05 Visually communicating statistical and machine learning methods Michael Freeman (University of Washington)

14:05 Using machine learning for stock picking Alun Biffin (Van Lanschot Kempen), David Dogon (Van Lanschot Kempen)

14:55 Explainable machine learning in fintech Eitan Anzenberg (Bill.com)

16:35 A Magic 8 Ball for optimal cost and resource allocation for the big data stack Shivnath Babu (Unravel Data Systems | Duke University), Alkis Simitsis (Micro Focus)

17:25 Reading China: Predicting policy change with machine learning Weifeng Zhong (Mercatus Center at George Mason University)

Capital Suite 17

11:15 Predicting real-time transaction fraud using supervised learning Sami Niemi (Barclays)

12:05 Sequence-to-sequence modeling for time series Arun Kejariwal (Independent), Ira Cohen (Anodot)

14:05 The unreasonable effectiveness of transfer learning on NLP David Low (Pand.ai)

14:55 Synthetic video generation: Why seeing should not always be believing Alexander Adam (Faculty)

16:35 LSTM-based time series anomaly detection using Analytics Zoo for Spark and BigDL Guoqiong Song (Intel)

17:25 Creating a data engineering culture Jesse Anderson (Big Data Institute)

Capital Suite 2/3

11:15 Model governance and model ops in the enterprise Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

12:05 How retailers can leverage data to stay competitive in an ever-changing digital landscape (sponsored by Data Reply) Luca Piccolo (Data Reply), Michele Miraglia (Data Reply)

14:05 Intelligent design patterns for cloud-based analytics and BI (sponsored by Arcadia Data) Shant Hovsepian (Arcadia Data)

14:55 Augment your recommender system with transfer learning on images (sponsored by Dataiku) Larry Orimoloye (Dataiku)

16:35 Engineering ML to improve the shopping experience (sponsored by Zara Tech) Julio Lopez (Inditex)

17:25 Augmented OLAP for big data from on-premises to multicloud (sponsored by Kyligence) Luke Han (Kyligence)

Capital Suite 4

11:15 India's data dilemma with India Stack Sundeep Reddy Mallu (Gramener)

12:05 Is it possible to regulate machine learning? Dream versus R&D (sponsored by AXA) Marcin Detyniecki (AXA)

14:05 How a LiveData strategy breaks down barriers to overcome data gravity (sponsored by WANdisco) Joel Horwitz (WANdisco)

14:55 Build your own data lake with AWS Glue and Amazon Athena (sponsored by Amazon Web Services) Damon Cortesi (Amazon Web Services)

16:35 Data catalogs are changing the nature of working with data (sponsored by Alation) Debora Seys

17:25 Infinite retention using storage offloading with Apache Pulsar Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)

Auditorium

9:00 Wednesday keynote welcome Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

9:10 The enterprise data cloud Mick Hollison (Cloudera)

9:25 Making data science useful Cassie Kozyrkov (Google)

9:45 Sustaining machine learning in the enterprise Ben Lorica (O'Reilly)

10:00 Finding your North Star Cait O'Riordan (Financial Times)

10:15 Making the future John Burke

10:45 Morning break | Room: Expo Hall

12:45 Wednesday Topic Tables at Lunch | Room: Expo Hall

15:35 Afternoon break | Room: Expo Hall

18:05 Expo Hall Reception | Room: Expo Hall

19:05 Dinner | Room: On Your Own

20:00 Data After Dark | Room: Madison London: One New Change, St Paul’s, London

8:00 Early morning coffee sponsored by Immuta | Room: Level 0 - Blvd

8:15 Speed Networking | Room: Level 0 - Blvd

11:15-11:55 (40m) Data Science, Machine Learning & AI, Expo Hall Media, Marketing, Advertising, Retail and e-commerce

Recommending and searching at Spotify

Mounia Lalmas (Spotify)

Spotify's mission is "to match fans and artists in a personal and relevant way." Mounia Lalmas shares some of the (research) work the company is doing to achieve this, from using machine learning to metric validation, illustrated through examples within the context of home and search.

12:05-12:45 (40m) Data Science, Machine Learning & AI, Expo Hall AI and machine learning in the enterprise, Text and Language processing and analysis

Agile NLP workflows with spaCy and Prodigy

Matthew Honnibal (Explosion AI)

Matthew Honnibal shares "one weird trick" that can give your NLP project a better chance of success: avoid a waterfall methodology where data definition, corpus construction, modeling, and deployment are performed as separate phases of work.

14:05-14:45 (40m) Data Science, Machine Learning & AI, Expo Hall Security and Privacy

Fair, privacy-preserving, and secure ML

Mikio Braun (Zalando)

Mikio Braun explores techniques and concepts around fairness, privacy, and security when it comes to machine learning models.

14:55-15:35 (40m) Data Science, Machine Learning & AI, Expo Hall Deep Learning

TensorFlow for everyone

Wolff Dobson (Google, Inc.)

Wolff Dobson covers the latest in TensorFlow. Whether you're a beginner or are migrating from 1.x to 2.0, you'll learn the best ways to set up your model, feed your data to it, and distribute it for fast training. You'll also discover how TensorFlow has been recently upgraded to be more intuitive.

16:35-17:15 (40m) Data Science, Machine Learning & AI, Expo Hall Ethics, Security and Privacy

Opening the black box: Explainable AI (XAI)

Maren Eckhoff (QuantumBlack)

The success of machine learning algorithms in a wide range of domains has led to a desire to leverage their power in ever more areas. Maren Eckhoff discusses modern explainability techniques that increase the transparency of black box algorithms, drive adoption, and help manage ethical, legal, and business risks. Many of these methods can be applied to any model without limiting performance.

11:15-11:55 (40m) Data Engineering and Architecture, Expo Hall AI and Data technologies in the cloud, Data Integration and Data Pipelines, Media, Marketing, Advertising, Streaming and realtime analytics

Stream, stream, stream: Different streaming methods with Spark and Kafka

Itai Yaffe (Nielsen)

NMC (Nielsen Marketing Cloud) provides customers (both marketers and publishers) with real-time analytics tools to profile their target audiences. To achieve that, the company needs to ingest billions of events per day into its big data stores in a scalable, cost-efficient way. Itai Yaffe explains how NMC continuously transforms its data infrastructure to support these goals.

12:05-12:45 (40m) Data Engineering and Architecture, Expo Hall, Streaming and IoT Streaming and realtime analytics

Report card on streaming microservices

Ted Dunning (MapR, now part of HPE)

As a community, we have been pushing streaming architectures, particularly microservices, for several years now. But what are the results in the field? Ted Dunning shares several (anonymized) case histories, describing the good, the bad, and the ugly. In particular, Ted covers how several teams who were new to big data fared by skipping MapReduce and jumping straight into streaming.

14:05-14:45 (40m) Data Engineering and Architecture, Expo Hall AI and Data technologies in the cloud, Media, Marketing, Advertising, Streaming and realtime analytics

Nielsen presents: Fun with Kafka, Spark, and offset management

Simona Meriam (Nielsen)

Simona Meriam explains how Nielsen Marketing Cloud (NMC) used to manage its Kafka consumer offsets with the Spark-Kafka 0.8 consumer and why the company decided to upgrade to the 0.10 consumer. Simona reviews the problems encountered during the upgrade and details the process that led to the solution.

14:55-15:35 (40m) Data Engineering and Architecture, Expo Hall, Streaming and IoT AI and Data technologies in the cloud, IoT and its applications, Streaming and realtime analytics

Processing 10M samples a second to drive smart maintenance in complex IIoT systems

Geir Engdahl (Cognite), Daniel Bergqvist (Google)

Geir Engdahl and Daniel Bergqvist explain how Cognite is developing IIoT smart maintenance systems that can process 10M samples a second from thousands of sensors. You'll explore an architecture designed for high performance, robust streaming sensor data ingest, and cost-effective storage of large volumes of time series data as well as best practices learned along the way.

16:35-17:15 (40m) Data Engineering and Architecture, Expo Hall AI and Data technologies in the cloud, Automation in data science and big data

Deploying your real-time apps on thousands of servers and still being able to breathe

Constantin Muraru (Adobe), Dan Popescu (Adobe)

With the current crop of cloud providers, obtaining servers to run your real-time application has never been easier. But what happens when you want to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast, reliable way, with minimal human intervention? Constantin Muraru and Dan Popescu explain how to tackle this challenge.

17:25-18:05 (40m) Data Engineering and Architecture, Expo Hall Data Integration and Data Pipelines, Financial Services, Streaming and realtime analytics

Mastering streaming and pipelines: Designing and supporting the nervous system of your company

Ted Malaska (Capital One)

The world of data is all about building the best path to support time and quality to value. Between 80% and 90% of the work is getting the data into the hands and tools that can create value. Ted Malaska takes you on a journey to investigate strategies and designs that can change the way your company looks at and approaches data.

11:15-11:55 (40m) Data Engineering and Architecture AI and Data technologies in the cloud

The Presto Cost-Based Optimizer for interactive SQL on anything

Wojciech Biela (Starburst), Piotr Findeisen (Starburst)

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3, Azure ADLS, RDBMS, NoSQL, etc.). Wojciech Biela and Piotr Findeisen offer an overview of the Cost-Based Optimizer (CBO) for Presto, which brings a great performance boost. Join in to learn about CBO internals, the motivating use cases, and observed improvements.

12:05-12:45 (40m) Data Engineering and Architecture AI and Data technologies in the cloud

Running SQL-based workloads in the cloud at 20x–200x lower cost using Apache Arrow

Jacques Nadeau (Dremio)

Performance and cost are two important considerations in determining optimized solutions for SQL workloads in the cloud. Jacques Nadeau explains how to accelerate TPC workloads in a way that is invisible to client apps and how to use Apache Arrow, Parquet, and Calcite to provide a scalable, high-performance solution optimized for cloud deployments while significantly reducing operational costs.

14:05-14:45 (40m) Data Engineering and Architecture

Picking Parquet: Improved performance for selective queries in Impala, Hive, and Spark

Anna Szonyi (Cloudera), Zoltán Borók-Nagy (Cloudera)

The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. Anna Szonyi and Zoltán Borók-Nagy share the technical details of the design and its implementation, practical tips to help data architects leverage these new capabilities in their schema design, and performance results for common workloads.

14:55-15:35 (40m) Data Engineering and Architecture AI and Data technologies in the cloud

Improving Spark downscaling; Or, Not throwing away all of our work

Holden Karau (Independent), Mikayla Konst (Google), Ben Sidhom (Google)

As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. Holden Karau, Mikayla Konst, and Ben Sidhom explore approaches for improving the scale-down experience on open source cluster managers, covering everything from how to schedule jobs to the location of blocks and their impact.

16:35-17:15 (40m) Data Engineering and Architecture AI and Data technologies in the cloud, Data Integration and Data Pipelines

Scalability-aware autoscaling of a Spark application

Anirudha Beria (Qubole), Rohit Karlupia (Qubole)

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. Anirudha Beria and Rohit Karlupia explain how to measure the efficiency of autoscaling policies and discuss more efficient autoscaling policies, in terms of latency and costs.

17:25-18:05 (40m) Data Engineering and Architecture Data Platforms, Transportation and Logistics

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber

Felix Cheung (Uber)

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.

11:15-11:55 (40m) Data Engineering and Architecture AI and Data technologies in the cloud, Open Data, Data Generation and Data Networks, Security and Privacy

Protecting sensitive data in huge datasets: Cloud tools you can use

Felipe Hoffa (Google)

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa explores how to handle massive public datasets, taking you from theory to real life as he showcases newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm.

12:05-12:45 (40m) Data Engineering and Architecture Automation in data science and big data, Data preparation, data governance, and data lineage

Leveraging metadata for automating delivery and operations of advanced data platforms

Peter Billen (Accenture)

Peter Billen explains how to use metadata to automate delivery and operations of a data platform. By injecting automation into the delivery processes, you shorten the time to market while improving the quality of the initial user experience. Typical examples include data profiling and prototyping, test automation, continuous delivery and deployment, and automated code creation.

14:05-14:45 (40m) Data Engineering and Architecture

Disrupting data discovery

Mark Grover (Lyft)

Mark Grover discusses how Lyft has reduced the time it takes to discover data tenfold by building its own data portal, Amundsen. Mark gives a demo of Amundsen, leads a deep dive into its architecture, and discusses how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. Mark closes with a future roadmap, unsolved problems, and the collaboration model.

14:55-15:35 (40m) Data Engineering and Architecture AI and Data technologies in the cloud, Model lifecycle management

Model serving via Pulsar functions

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Arun Kejariwal and Karthik Ramasamy walk you through an architecture in which models are served in real time and updated, using Apache Pulsar, without restarting the application at hand. They then describe how to apply Pulsar functions to support two example use cases, sampling and filtering, and explore a concrete case study of the same.

16:35-17:15 (40m) Data Engineering and Architecture Model lifecycle management

Continuous intelligence: Keeping your AI application in production

Arif Wider (ThoughtWorks), Emily Gorcenski (ThoughtWorks)

Machine learning can be challenging to deploy and maintain. Any delays in moving models from research to production mean leaving your data scientists' best work on the table. Arif Wider and Emily Gorcenski explore continuous delivery (CD) for AI/ML along with case studies for applying CD principles to data science workflows.

17:25-18:05 (40m) Data Engineering and Architecture AI and Data technologies in the cloud, Data Platforms

Information architecture for an enterprise data cloud

Mark Samson (Cloudera), Phillip Radley (BT)

It's now possible to build a modern data platform capable of storing, processing, and analyzing a wide variety of data across multiple public and private cloud platforms and on-premises data centers. Mark Samson and Phillip Radley outline an information architecture for such a platform, informed by working with multiple large organizations that have built such platforms over the last five years.

11:15-11:55 (40m) Data Engineering and Architecture AI and machine learning in the enterprise, Data Platforms, Deep Learning, Text and Language processing and analysis

Building a sales AI platform: Key principles and lessons learned

Moty Fania (Intel)

Moty Fania shares his experience implementing a sales AI platform that handles processing of millions of website pages and sifts through millions of tweets per day. The platform is based on unique open source technologies and was designed for real-time data extraction and actuation.

12:05-12:45 (40m) Data Engineering and Architecture Data Integration and Data Pipelines

The changing face of ETL: Event-driven architectures for data engineers

Robin Moffatt (Confluent)

Robin Moffatt discusses the concepts of events, their relevance to software and data engineers, and their ability to unify architectures in a powerful way. Join in to learn why analytics, data integration, and ETL fit naturally into a streaming world. Along the way, Robin leads a hands-on demonstration of these concepts in practice and comments on the design choices made.

14:05-14:45 (40m) Data Engineering and Architecture Data Platforms, IoT and its applications, Retail and e-commerce, Temporal data and time-series

Building the data infrastructure for the internet of things at zettabyte scale

Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)

Jian Chang and Sanjian Chen share the architecture design and many detailed technology innovations of Alibaba TSDB, a state-of-the-art database for IoT data management, and discuss lessons learned from years of development and continuous improvement.

14:55-15:35 (40m) Data Engineering and Architecture Data Integration and Data Pipelines, Data Platforms, Data preparation, data governance, and data lineage, Model lifecycle management, Security and Privacy, Transportation and Logistics

The Lyft data platform: Now and in the future

Mark Grover (Lyft), Deepak Tiwari (Lyft)

Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.

16:35-17:15 (40m) Data Engineering and Architecture Data Platforms, Data preparation, data governance, and data lineage, Retail and e-commerce

How do you evolve your data infrastructure?

Neelesh Salian (Stitch Fix)

Developing data infrastructure is not trivial; neither is changing it. It takes effort and discipline to make changes that can affect your team. Neelesh Salian discusses how Stitch Fix's data platform team maintains and innovates its infrastructure for the company's data scientists.

17:25-18:05 (40m) Data Engineering and Architecture Data Platforms

Mass production of AI solutions

Nate Keating (Google)

AI will change how we live in the next 30 years, but it's still currently limited to a small group of companies. In order to scale the impact of AI across the globe, we need to reduce the cost of building AI solutions, but how? Nate Keating explains how to apply lessons learned from other industries—specifically, the automobile industry, which went through a similar cycle.

11:15-11:55 (40m) Data Engineering and Architecture AI and Data technologies in the cloud

Serverless for data and AI

Avner Braverman (Binaris)

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.

12:05-12:45 (40m) Law and Ethics, Strata Business Summit AI and machine learning in the enterprise, Ethics

Responsible AI innovation

Laila Paszti (GTC Law Group PC & Affiliates)

As companies commercialize novel applications of AI in areas such as finance, hiring, and public policy, there's concern that these automated decision-making systems may unconsciously duplicate social biases, with unintended societal consequences. Laila Paszti shares practical advice for companies to counteract such prejudices through a legal- and ethics-based approach to innovation.

14:05-14:45 (40m) Law and Ethics, Strata Business Summit AI and machine learning in the enterprise, Ethics

Using data for evil V: The AI strikes back

Duncan Ross (Times Higher Education), Francine Bennett (Mastodon C)

Being good is hard. Being evil is fun and gets you paid more. Once more Duncan Ross and Francine Bennett explore how to do high-impact evil with data and analysis (and possibly AI). Make the maximum (negative) impact on your friends, your business, and the world—or use this talk to avoid ethical dilemmas, develop ways to deal responsibly with data, or even do good. But that would be perverse.

14:55-15:35 (40m) Law and Ethics, Strata Business Summit Data Platforms, Transportation and Logistics, Visualization, Design, and UX

Integrated Business Intelligence Suite: How Uber built a platform to convert raw data into knowledge

Shailesh Chauhan (Uber)

Shailesh Chauhan explains how Uber built its business intelligence platform, detailing why the company took a platform approach rather than adding features in a piecemeal fashion.

16:35-17:15 (40m) Strata Business Summit, Visualization and UX Visualization, Design, and UX

Empathy: The secret ingredient in the design of engaging data products and analytics tools

Brian O'Neill (Designing for Analytics)

Brian O'Neill explains how design is fundamentally improving the bottom line of business and can help data teams uncover the real problems and needs of customers and business stakeholders. Join in to learn and practice a key aspect of good design: how to properly interview stakeholders and users.

17:25-18:05 (40m) Strata Business Summit, Visualization and UX Visualization, Design, and UX

Science-fictional user interfaces

Mars Geldard (University of Tasmania), Paris Buttfield-Addison (Secret Lab)

Science fiction has been showcasing complex, AI-driven interfaces for decades. As TV, movies, and video games have become more capable of visualizing a possible future, the grandeur of these imagined science fictional interfaces has increased. Mars Geldard and Paris Buttfield-Addison investigate what we can learn from Hollywood UX. Is there a useful takeaway? Does sci-fi show the future of AI UX?

11:15-11:55 (40m) Executive Briefing and best practices, Strata Business Summit

Implementing enterprise data management in industrial and scientific organizations

Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)

To succeed in implementing enterprise data management in industrial and scientific organizations and realize business value, the worlds of business data, facilities data, and scientific data—which have long been managed separately—must be brought together. Sun Maria Lehmann and Jane McConnell explore the cultural and organizational differences and the data management requirements to succeed.

12:05-12:45 (40m) Case studies, Strata Business Summit Data Platforms, Retail and e-commerce

Insights from engineering Europe's largest marketing platform for fashion

Dirk Petzoldt (Zalando SE)

Dirk Petzoldt shares a case study from Europe’s leading online fashion platform Zalando illustrating its journey to a scalable, personalized machine learning–based marketing platform.

14:05-14:45 (40m) Culture and organization

An Innovation Architecture industrializes AI from PoCs to production

Teresa Tung (Accenture), Jean-Luc Chatelain (Accenture)

Innovation is abundant as companies reimagine themselves as data-driven and AI-powered businesses. How do enterprises organize to move beyond numerous, often similar proofs of concept (PoCs) into production-quality products and services? Teresa Tung and Jean-Luc Chatelain explore Accenture’s Innovation Architecture, which manages PoCs and pilots through embedding into scalable, saleable solutions.

14:55-15:35 (40m) Case studies, Strata Business Summit

Signal processing, machine learning, and video tell the truth

David Maman (Binah)

David Maman demonstrates how the combination of just a few minutes of video, signal processing, remote heart-rate monitoring, machine learning, and data science can identify a person’s emotions, health condition, and performance. Financial institutions and potential employers can now analyze whether you have good or bad intentions.

16:35-17:15 (40m) Case studies, Strata Business Summit Data preparation, data governance, and data lineage, Financial Services, Security and Privacy

The vindication of big data: How Santander UK uses Hadoop to defend privacy

Maurício Lins (everis NTT DATA UK), Lidia Crespo (Santander UK)

Big data is usually regarded as a menace to data privacy. But with data privacy principles and a customer-first mindset, it can be a game changer. Maurício Lins and Lidia Crespo explain how Santander UK applied this model to comply with GDPR, using graph technology, Hadoop, Spark, and Kudu to drive data obscuring, data portability, and machine learning exploration.

17:25-18:05 (40m) Law and Ethics, Strata Business Summit Ethics

Why is it so hard to do AI for good?

Duncan Ross (Times Higher Education), Giselle Cory (DataKind UK)

DataKind UK has been working in data for good since 2013, helping over 100 UK charities to do data science for the benefit of their users. Some of those projects have delivered above and beyond expectations; others haven't. Duncan Ross and Giselle Cory explain how to identify the right data-for-good projects and how this can act as a framework for avoiding the same problems across industry.

11:15-11:55 (40m) Executive Briefing and best practices, Strata Business Summit AI and Data technologies in the cloud, AI and machine learning in the enterprise, IoT and its applications

Executive Briefing: From the edge to AI—Taking control of your data for fun and profit

Mick Hollison (Cloudera)

Managing your data securely is difficult, as is choosing the right machine learning tools and managing models and applications in compliance with regulation and law. Mick Hollison covers the risks and the issues that matter most and explains how to address them with an enterprise data cloud and by embracing your data center and the public cloud in combination.

12:05-12:45 (40m) Executive Briefing and best practices, Strata Business Summit AI and machine learning in the enterprise, Open Data, Data Generation and Data Networks

Executive Briefing: Why managing machines is harder than you think

Pete Skomoroch (Workday)

In the next decade, companies that understand how to apply machine intelligence will scale and win their markets. Others will fail to ship successful AI products that matter to customers. Pete Skomoroch details how to combine product design, machine learning, and executive strategy to create a business where every product interaction benefits from your investment in machine intelligence.

14:05-14:45 (40m) Executive Briefing and best practices, Strata Business Summit Data preparation, data governance, and data lineage, Security and Privacy

Executive Briefing: Big data in the era of heavy worldwide privacy regulations

Mark Donsky (Okera), Nikki Rouda (Amazon Web Services)

The implications for data management and analytics of new privacy regulations, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance.

14:55-15:35 (40m) Executive Briefing and best practices, Strata Business Summit AI and machine learning in the enterprise, Data preparation, data governance, and data lineage

Executive Briefing: Overview of data governance

Paco Nathan (derwen.ai)

Effective data governance is foundational for AI adoption in the enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, process, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa.

16:35-17:15 (40m) Executive Briefing and best practices, Strata Business Summit

Executive Briefing: 5 things every executive should NOT know

Ellen Friedman (Independent)

A surprising fact of modern technology is that not knowing some things can make you better at what you do. This isn’t just lack of distraction or being too delicate to face reality. It’s about separation of concerns, with a techno flavor. Ellen Friedman outlines five things that best practices with emerging technologies and new architectures let us not know, and explains why that’s important.

17:25-18:05 (40m) Executive Briefing and best practices, Strata Business Summit AI and machine learning in the enterprise, Financial Services, Graph technologies and analytics

Executive Briefing: Using a domain knowledge graph to manage AI at scale

Teresa Tung (Accenture), Jean-Luc Chatelain (Accenture)

How do enterprises move beyond one-off AI projects to AI that is reusable at scale? Teresa Tung and Jean-Luc Chatelain explain how domain knowledge graphs, the technology behind today's internet search, can bring the same democratized experience to enterprise AI. They then explore other applications of knowledge graphs in oil and gas, financial services, and enterprise IT.

11:15-11:55 (40m) Data Science, Machine Learning & AI Deep Learning, Media, Marketing, Advertising, Text and Language processing and analysis

Spark NLP in action: How Indeed applies NLP to standardize résumé content at scale

Alexander Thomas (John Snow Labs), Alexis Yelton (Indeed)

Alexander Thomas and Alexis Yelton demonstrate how to use Spark NLP and Apache Spark to standardize semistructured text, illustrated by Indeed's standardization process for résumé content.

12:05-12:45 (40m) Data Science, Machine Learning & AI Text and Language processing and analysis

Dealing with data scarcity in natural language processing

Yves Peirsman (NLP Town)

In this age of big data, NLP professionals are all too often faced with a lack of data: written language is abundant, but labeled text is much harder to come by. Yves Peirsman outlines the most effective ways of addressing this challenge, from the semiautomatic construction of labeled training data to transfer learning approaches that reduce the need for labeled training examples.

14:05-14:45 (40m) Data Science, Machine Learning & AI Media, Marketing, Advertising, Text and Language processing and analysis

The evolution of data science skill sets: An analysis using exponential family embeddings

Maryam Jahanshahi (TapRecruit)

Maryam Jahanshahi explores exponential family embeddings: methods that extend the idea behind word embeddings to other data types. You'll learn how TapRecruit used dynamic embeddings to understand how data science skill sets have transformed over the last three years, using its large corpus of job descriptions, and more generally, how these models can enrich analysis of specialized datasets.

14:55-15:35 (40m) Data Science, Machine Learning & AI Data preparation, data governance, and data lineage

Solving data cleaning and unification using human-guided machine learning

Ihab Ilyas (University of Waterloo)

Last year, Ihab Ilyas covered two primary challenges in applying machine learning to data curation: entity consolidation and using probabilistic inference to suggest data repair for identified errors and anomalies. This year, he explores these limitations in greater detail and explains why data unification projects quickly require human-guided machine learning and a probabilistic model.

16:35-17:15 (40m) Data Engineering and Architecture

From BI to big data; Or, There and back again

Francesco Mucio (Francescomuc.io)

Francesco Mucio shares the basic tools he and his team had to learn (or relearn) moving from the coziness of their database to the big world of Spark, cloud, distributed systems, and continuous applications. It was an unexpected journey that ended exactly where it started: with an SQL query.

17:25-18:05 (40m) Data Science, Machine Learning & AI Security and Privacy

Federated learning: Machine learning with privacy on the edge

Chris Wallace (Cloudera)

Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. Chris Wallace discusses the algorithmic solutions and the product opportunities.

11:15-11:55 (40m) Data Science, Machine Learning & AI Ethics, Security and Privacy

Building a secure and transparent ML pipeline using open source technologies

Nick Pentreath (IBM)

The application of AI algorithms in domains such as criminal justice, credit scoring, and hiring holds unlimited promise. At the same time, it raises legitimate concerns about algorithmic fairness. There's a growing demand for fairness, accountability, and transparency from machine learning (ML) systems. Nick Pentreath explains how to build just such a pipeline leveraging open source tools.

12:05-12:45 (40m) Data Science, Machine Learning & AI, Visualization and UX Visualization, Design, and UX

Visually communicating statistical and machine learning methods

Michael Freeman (University of Washington)

Statistical and machine learning techniques are only useful when they're understood by decision makers. While implementing these techniques is easier than ever, communicating about their assumptions and mechanics is not. Michael Freeman details a design process for crafting visual explanations of analytical techniques and communicating them to stakeholders.

14:05-14:45 (40m) Data Science, Machine Learning & AI Financial Services, Temporal data and time-series

Using machine learning for stock picking

Alun Biffin (Van Lanschot Kempen), David Dogon (Van Lanschot Kempen)

Alun Biffin and David Dogon explain how machine learning revolutionized the stock-picking process for portfolio managers at Kempen Capital Management by filtering the vast small-cap investment universe down to a handful of optimal stocks.

14:55-15:35 (40m) Data Science, Machine Learning & AI Ethics, Financial Services, Health and Medicine

Explainable machine learning in fintech

Eitan Anzenberg (Bill.com)

Machine learning applications balance interpretability and performance. Linear models provide formulas to directly compare the influence of the input variables, while nonlinear algorithms produce more accurate models. Eitan Anzenberg explores a solution that utilizes what-if scenarios to calculate the marginal influence of features per prediction and compares it with standardized methods such as LIME.

16:35-17:15 (40m) Data Science, Machine Learning & AI Automation in data science and big data, Temporal data and time-series

A Magic 8 Ball for optimal cost and resource allocation for the big data stack

Shivnath Babu (Unravel Data Systems | Duke University), Alkis Simitsis (Micro Focus)

Cost and resource provisioning are critical components of the big data stack. Shivnath Babu and Alkis Simitsis detail how to build a Magic 8 Ball for the big data stack—a decomposable time series model for optimal cost and resource allocation that offers enterprises a glimpse into their future needs and enables effective and cost-efficient project and operational planning.

17:25-18:05 (40m) Data Science, Machine Learning & AI Text and Language processing and analysis

Reading China: Predicting policy change with machine learning

Weifeng Zhong (Mercatus Center at George Mason University)

Weifeng Zhong shares a machine learning algorithm built to “read” the People’s Daily (the official newspaper of the Communist Party of China) and predict changes in China’s policy priorities. The output of this algorithm, named the Policy Change Index (PCI) of China, turns out to be a leading indicator of the actual policy changes in China since 1951.

11:15-11:55 (40m) Data Science, Machine Learning & AI Deep Learning, Financial Services, Temporal data and time-series

Predicting real-time transaction fraud using supervised learning

Sami Niemi (Barclays)

Predicting transaction fraud of debit and credit card payments in real time is an important challenge, which state-of-the-art supervised machine learning models can help to solve. Sami Niemi offers an overview of the solutions Barclays has been developing and testing and details how well models perform in a variety of situations, such as card-present and card-not-present debit and credit card transactions.

12:05-12:45 (40m) Data Science, Machine Learning & AI Deep Learning, Temporal data and time-series

Sequence-to-sequence modeling for time series

Arun Kejariwal (Independent), Ira Cohen (Anodot)

Sequence-to-sequence modeling (seq2seq) is now being used for applications based on time series data. Arun Kejariwal and Ira Cohen offer an overview of seq2seq and explore its early use cases. They then walk you through leveraging seq2seq modeling for these use cases, particularly with regard to real-time anomaly detection and forecasting.

14:05-14:45 (40m) Data Science, Machine Learning & AI Deep Learning, Text and Language processing and analysis

The unreasonable effectiveness of transfer learning on NLP

David Low (Pand.ai)

Transfer learning has proven to be a tremendous success in computer vision, a result of the ImageNet competition. In the past few months, there have been several breakthroughs in natural language processing with transfer learning, namely ELMo, OpenAI Transformer, and ULMFiT. David Low demonstrates how to use transfer learning on an NLP application with SOTA accuracy.

14:55-15:35 (40m) Data Science, Machine Learning & AI Deep Learning, Media, Marketing, Advertising, Security and Privacy

Synthetic video generation: Why seeing should not always be believing

Alexander Adam (Faculty)

The advent of "fake news" has led us to doubt the truth of online media, and advances in machine learning give us an even greater reason to question what we are seeing. Despite the many beneficial applications of this technology, it's also potentially very dangerous. Alex Adam explains how synthetic videos are created and how they can be detected.

16:35-17:15 (40m) Data Science, Machine Learning & AI Deep Learning, Temporal data and time-series

LSTM-based time series anomaly detection using Analytics Zoo for Spark and BigDL

Guoqiong Song (Intel)

Collecting and processing massive time series data (e.g., logs and sensor readings) and detecting anomalies in real time is critical for many emerging smart systems, such as industrial, manufacturing, AIOps, and the IoT. Guoqiong Song explains how to detect anomalies in time series data using Analytics Zoo and BigDL at scale on a standard Spark cluster.

17:25-18:05 (40m) Culture and organization, Strata Business Summit

Creating a data engineering culture

Jesse Anderson (Big Data Institute)

Jesse Anderson covers the most common reasons why data engineering teams fail and how to correct them, including ways to get your management to understand that data engineering is really complex and time consuming. It is not data warehousing with new names. Management needs to understand that you can’t compare a data engineering team to the web development team, for example.

11:15-11:55 (40m) Data Engineering and Architecture Model lifecycle management

Model governance and model ops in the enterprise

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them.

12:05-12:45 (40m) Sponsored

How retailers can leverage data to stay competitive in an ever-changing digital landscape (sponsored by Data Reply)

Luca Piccolo (Data Reply), Michele Miraglia (Data Reply)

Retailers are facing a daunting challenge: remaining competitive in an ever-changing landscape that is becoming increasingly digital—which requires them to overcome rifts in internal systems and seamlessly leverage their data to generate business value. Luca Piccolo and Michele Miraglia outline Data Reply's approach, distilled while supporting retailers in successfully tackling these challenges.

14:05-14:45 (40m) Sponsored

Intelligent design patterns for cloud-based analytics and BI (sponsored by Arcadia Data)

Shant Hovsepian (Arcadia Data)

With cloud object storage, you may expect business intelligence (BI) applications to benefit from the scale of data and real-time analytics, but traditional BI in the cloud faces not-so-obvious challenges. Shant Hovsepian discusses considerations for service-oriented cloud design and shows how native cloud BI provides analytic depth, low cost, and high performance.

14:55-15:35 (40m) Sponsored

Augment your recommender system with transfer learning on images (sponsored by Dataiku)

Larry Orimoloye (Dataiku)

Recommender systems are tools that provide suggestions that best suit customers' needs, even when customers aren't aware of those needs. Larry Orimoloye explains how Dataiku helped one of the world's leading vacation retailers drive customers toward better recommendations.

16:35-17:15 (40m) Sponsored

Engineering ML to improve the shopping experience (sponsored by Zara Tech)

Julio López (Inditex)

Julio López explains how Zara Tech uses indirect observation and ML engineering to augment the understanding of Zara's processes in order to improve them. Join in to learn how Zara Tech built and uses a Spark ML pipeline to provide KPIs to improve the shopping experience.

17:25-18:05 (40m) Sponsored

Augmented OLAP for big data from on-premises to multicloud (sponsored by Kyligence)

Luke Han (Kyligence)

Augmenting data management and analytics platforms with artificial intelligence and machine learning is game changing for analysts, engineers, and other users. It enables companies to optimize their storage, speed, and spending. Luke Han explains how the Kyligence platform is evolving to the next level, with augmented capabilities such as intelligent modeling, smart pushdowns, and more.

11:15-11:55 (40m) Law and Ethics Ethics, Security and Privacy

India's data dilemma with India Stack

Sundeep Reddy Mallu (Gramener)

Answering the simple question of what rights Indian citizens have over their data is a nightmare. The rollout of India Stack technology-based solutions has added fuel to the fire. Sundeep Reddy Mallu explains, with on-the-ground examples, how businesses and citizens in India's booming digital economy are navigating the India Stack ecosystem while dealing with data privacy, security, and ethics.

12:05-12:45 (40m) Sponsored

Is it possible to regulate machine learning? Dream versus R&D (sponsored by AXA)

Marcin Detyniecki (AXA)

Marcin Detyniecki offers an overview of the machine learning backend and its possible applications for the insurance business and other businesses based on the power of research merged with business.

14:05-14:45 (40m) Sponsored

How a LiveData strategy breaks down barriers to overcome data gravity (sponsored by WANdisco)

Joel Horwitz (WANdisco)

Joel Horwitz shares best practices WANdisco clients have adopted to evolve their data architectures and become LiveData companies.

14:55-15:35 (40m) Sponsored

Build your own data lake with AWS Glue and Amazon Athena (sponsored by Amazon Web Services)

Damon Cortesi (Amazon Web Services)

Damon Cortesi demonstrates how to use AWS Glue and Amazon Athena to implement an end-to-end pipeline.

16:35-17:15 (40m) Sponsored

Data catalogs are changing the nature of working with data (sponsored by Alation)

Debora Seys

Deb Seys shares the results of a study that she oversaw at eBay in collaboration with the Kellogg School of Management at Northwestern University. Examining the work of 2,000 analysts and almost 80,000 queries, the study revealed that a data catalog can be used as a learning platform that increases analyst productivity and creates a more collaborative approach to discovery and innovation.

17:25-18:05 (40m) Data Engineering and Architecture Streaming and realtime analytics

Infinite retention using storage offloading with Apache Pulsar

Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)

Karthik Ramasamy and Ivan Kelly discuss how Apache Pulsar provides infinite retention of events in topics. They explain how its segment-oriented architecture allows unlimited topic growth, how you can keep costs down by using tiered storage, and how you can run ad hoc queries on a topic using SQL.

9:00-9:10 (10m)

Wednesday keynote welcome

Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

9:10-9:25 (15m)

The enterprise data cloud

Mick Hollison (Cloudera)

The last decade has seen incredible changes in our technology. The advent of big data and powerful new analytic techniques, including machine learning and AI, means that we understand the world in ways that were simply impossible before. The simultaneous explosion of public cloud services has fundamentally changed our expectations of technology: it should be fast, simple, and flexible to use.

9:25-9:45 (20m) Data Science, Machine Learning & AI

Making data science useful

Cassie Kozyrkov (Google)

Despite the rise of data engineering and data science functions in today's corporations, leaders report difficulty in extracting value from data. Many organizations aren’t aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. Join Cassie Kozyrkov to talk about how you can change that.

9:45-10:00 (15m)

Sustaining machine learning in the enterprise

Ben Lorica (O'Reilly)

Keynote with Ben Lorica

10:00-10:15 (15m)

Finding your North Star

Cait O'Riordan (Financial Times)

The Financial Times hit its target of 1 million paying subscribers a year ahead of schedule. Cait O'Riordan discusses the North Star metric the company uses to drive subscriber growth, detailing how it's embedded across the organization and within the engineering and product teams she's responsible for.

10:15-10:40 (25m)

Making the future

James Burke

James Burke asks whether we can use big data and predictive analytics at the social level to take the guesswork out of prediction and make the future what we all want it to be. If so, this would give us the tools to handle what looks like being the greatest change to the way we live since we left the caves.

10:45-11:15 (30m)

Break: Morning break

12:45-14:05 (1h 20m)

Wednesday Topic Tables at Lunch

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.

15:35-16:35 (1h)

Break: Afternoon break

18:05-19:05 (1h)

Expo Hall Reception

Unwind after a long day of sessions with small bites and drinks while networking with Strata attendees, exhibitors, and sponsors.

19:05-20:00 (55m)

Break: Dinner

20:00-22:00 (2h)

Data After Dark

Join us for Data After Dark, the official attendee party for Strata in London, which promises to be an unforgettable evening. Take in breathtaking views of London as you enjoy delicious food, drinks, and fun at the Madison London, near St. Paul's Cathedral. Don't miss it.

8:00-9:00 (1h)

Break: Early morning coffee sponsored by Immuta

8:15-8:45 (30m)

Speed Networking

Gather before keynotes on Wednesday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with fellow attendees.

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of Strata Data Conference contacts

©2019, O’Reilly UK Ltd • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com
