Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Mingxi Wu (TigerGraph)
A graph query language is the key to unleashing the value of connected data. Mingxi Wu outlines the eight prerequisites of a practical graph query language, drawn from six years' experience dealing with real-world graph analytical use cases. Along the way, Mingxi compares GSQL, Gremlin, Cypher, and SPARQL, pointing out their respective pros and cons.
Ganes Kesari (Gramener)
Global environmental challenges have pushed our planet to the brink of disaster. Rapid advances in deep learning are placing immense power in the hands of consumers and enterprises. Ganes Kesari explains how this power can be marshaled to support environmental groups and researchers who need immediate assistance to address the rapid depletion of our rich biodiversity.
Mark Madsen (Teradata), Todd Walter (Archimedata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)
In upstream oil and gas, a vast amount of the data requested for analytics projects is scientific data: physical measurements about the real world. Historically, this data has been managed library style, but a new system was needed to best provide this data. Sun Maria Lehmann and Jane McConnell share architectural best practices learned from their work with subsurface data.
Jian Zhang (Intel), Chendi Xue (Intel), Yuan Zhou (Intel)
Jian Zhang, Chendi Xue, and Yuan Zhou explore the challenges of migrating big data analytics workloads to the public cloud (e.g., performance loss and missing features) and demonstrate how to use a new in-memory data accelerator leveraging persistent memory and RDMA NICs to resolve these issues and enable new opportunities for big data workloads on the cloud.
Damon Cortesi (Amazon Web Services)
Damon Cortesi demonstrates how to use AWS Glue and Amazon Athena to implement an end-to-end pipeline.
Moty Fania (Intel)
Moty Fania shares his experience implementing a sales AI platform that handles processing of millions of website pages and sifts through millions of tweets per day. The platform is based on unique open source technologies and was designed for real-time data extraction and actuation.
Jorge Lopez (Amazon Web Services), Nikki Rouda (Amazon Web Services), Damon Cortesi (Amazon Web Services), Sven Hansen (Amazon Web Services), Manos Samatas (Amazon Web Services), Alket Memushaj (Amazon Web Services)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.
Shingai Manjengwa (Fireside Analytics Inc.)
Shingai Manjengwa shares insights from teaching data science to 300,000 online learners, second-career college graduates, and grade 12/6th form high school students, explaining how business leaders can increase data science skill sets across different levels and functions in an organization to create real and measurable value from data.
Arif Wider (ThoughtWorks), Emily Gorcenski (ThoughtWorks)
Machine learning can be challenging to deploy and maintain. Any delays in moving models from research to production mean leaving your data scientists' best work on the table. Arif Wider and Emily Gorcenski explore continuous delivery (CD) for AI/ML along with case studies for applying CD principles to data science workflows.
Danilo Sato (ThoughtWorks), Christoph Windheuser (ThoughtWorks)
Danilo Sato and Christoph Windheuser walk you through applying continuous delivery (CD), pioneered by ThoughtWorks, to data science and machine learning. Join in to learn how to make changes to your models while safely integrating and deploying them into production, using testing and automation techniques to release reliably at any time and with a high frequency.
Deb Seys shares the results of a study that she oversaw at eBay in collaboration with the Kellogg School of Management at Northwestern University. Examining the work of 2,000 analysts and almost 80,000 queries, the study revealed that a data catalog can be used as a learning platform that increases analyst productivity and creates a more collaborative approach to discovery and innovation.
Vaclav Surovec (Deutsche Telekom), Gabor Kotalik (Deutsche Telekom)
Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management.
Charlotte Werger (Van Lanschot Kempen)
Charlotte Werger outlines the components necessary to transform a traditional wealth manager into a data-driven business, paying special attention to devising and executing a transformation strategy by identifying key business subunits where automation and improved predictive modeling can result in significant gains and synergies.
Robert Cohen (Economic Strategy Institute)
Robert Cohen discusses the skills that employers are seeking from employees in digital jobs, linked to the new software hierarchy driving digital transformation. Robert describes this software hierarchy as one that ranges from DevOps, CI/CD, and microservices to Kubernetes and Istio. This hierarchy is used to define the jobs that are central to data-driven digital transformation.
Yves Peirsman (NLP Town)
In this age of big data, NLP professionals are all too often faced with a lack of data: written language is abundant, but labeled text is much harder to come by. Yves Peirsman outlines the most effective ways of addressing this challenge, from the semiautomatic construction of labeled training data to transfer learning approaches that reduce the need for labeled training examples.
Deep learning has enabled massive breakthroughs in offbeat tracks and has enabled better understanding of how an artist paints, how an artist composes music, and so on. Nischal Harohalli Padmanabha and Raghotham Sripadraj discuss their project Deep Learning for Humans and their plans to build a font classifier.
Oliver Gindele (Datatonic)
The success of deep learning has reached the realm of structured data in the past few years, where neural networks have been shown to improve the effectiveness and predictability of recommendation engines. Oliver Gindele offers a brief overview of such deep recommender systems and explains how they can be implemented in TensorFlow.
Scott Stevenson (Faculty)
Modern deep learning systems allow us to build speech synthesis systems with the naturalness of a human speaker. While there are myriad benevolent applications, this also ushers in a new era of fake news. Scott Stevenson explores the danger of such systems and details how deep learning can also be used to build countermeasures to protect against political disinformation.
Thomas Phelan (HPE BlueData)
Organizations need to keep ahead of their competition by using the latest AI, ML, and DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. Thomas Phelan discusses the effective deployment of such applications in a container environment.
Mark Grover (Lyft)
Mark Grover discusses how Lyft has reduced the time it takes to discover data by 10 times by building its own data portal, Amundsen. Mark gives a demo of Amundsen, leads a deep dive into its architecture, and discusses how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. Mark closes with a future roadmap, unsolved problems, and collaboration model.
Brennan Lodge (Goldman Sachs), Jay Kesavan (Bowery Analytics LLC)
Cybersecurity analysts are under siege to keep pace with the ever-changing threat landscape. The analysts are overworked as they are bombarded with and burned out by the sheer number of alerts that they must carefully investigate. Brennan Lodge and Jay Kesavan explain how to use a data science model for alert evaluations to empower your cybersecurity analysts.
Nikki Rouda (Amazon Web Services)
Nikki Rouda shares key trends in data lakes and analytics and explains how they shape the services offered by AWS. Specific topics include the rise of machine-generated data and semistructured and unstructured data as dominant sources of new data, the move toward serverless, API-centric computing, and the growing need for local access to data from users around the world.
Mark Donsky (Okera), Nikki Rouda (Amazon Web Services)
The implications of new privacy regulations for data management and analytics, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance.
Paco Nathan (derwen.ai)
Effective data governance is foundational for AI adoption in the enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, process, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa.
Jack Norris (MapR Technologies)
Many companies delay core improvements (increasing revenue, reducing costs, and lowering risk exposure) by tying changes to a to-be-hired data scientist. Drawing on three customer examples, Jack Norris explains how to achieve excellent results faster by starting with domain experience and helping developers and analysts better leverage data with available and understandable analytics.
Dean Wampler (Lightbend)
Your team is building machine learning capabilities. Dean Wampler demonstrates how to integrate these capabilities in streaming data pipelines so you can leverage the results quickly and update them as needed. He also covers challenges such as how to build long-running services that are very reliable and scalable and how to combine a spectrum of very different tools, from data science to operations.
Pete Skomoroch (Workday)
In the next decade, companies that understand how to apply machine intelligence will scale and win their markets. Others will fail to ship successful AI products that matter to customers. Pete Skomoroch details how to combine product design, machine learning, and executive strategy to create a business where every product interaction benefits from your investment in machine intelligence.
Eitan Anzenberg (Bill.com)
Machine learning applications balance interpretability and performance. Linear models provide formulas to directly compare the influence of the input variables, while nonlinear algorithms produce more accurate models. Eitan Anzenberg explores a solution that utilizes what-if scenarios to calculate the marginal influence of features per prediction and compare with standardized methods such as LIME.
Mikio Braun (Zalando)
Mikio Braun explores techniques and concepts around fairness, privacy, and security when it comes to machine learning models.
Cait O'Riordan (Financial Times)
The Financial Times hit its target of 1 million paying subscribers a year ahead of schedule. Cait O'Riordan discusses the North Star metric the company uses to drive subscriber growth, detailing how it's embedded across the organization and within the engineering and product teams she's responsible for.
Francesco Mucio (francescomuc.io)
Francesco Mucio shares the basic tools he and his team had to learn (or relearn) moving from the coziness of their database to the big world of Spark, cloud, distributed systems, and continuous applications. It was an unexpected journey that ended exactly where it started: with an SQL query.
Max Schultze (Zalando SE)
Max Schultze details Zalando's end-to-end data integration platform to serve analytical use cases and machine learning throughout the company, covering raw data collection, standardized data preparation (binary conversion, partitioning, etc.), user-driven analytics, and machine learning.
Mark Donsky (Okera), Ifigeneia Derekli (Cloudera), Lars George (Okera), Michael Ernest (Dataiku)
New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Ifigeneia Derekli, Lars George, and Michael Ernest share hands-on best practices for meeting these challenges, with special attention paid to CCPA.
Pradeep Bhadani (Hotels.com), Elliot West (Hotels.com)
Travel platform Expedia Group likes to give its data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. Pradeep Bhadani and Elliot West explain how the company built a unified virtual data lake on top of its many heterogeneous and distributed data platforms.
Neelesh Salian (Stitch Fix)
Developing data infrastructure is not trivial; neither is changing it. It takes effort and discipline to make changes that can affect your team. Neelesh Salian discusses how Stitch Fix's data platform team maintains and innovates its infrastructure for the company's data scientists.
Luca Piccolo (Data Reply), Michele Miraglia (Data Reply)
Retailers are facing a daunting challenge: remaining competitive in an ever-changing landscape that is becoming increasingly digital—which requires them to overcome rifts in internal systems and seamlessly leverage their data to generate business value. Luca Piccolo and Michele Miraglia outline Data Reply's approach, distilled while supporting retailers in successfully tackling these challenges.
Seonmin Kim (LINE)
Seonmin Kim offers an introduction to activities that mitigate the risk of mobile payments through various data analytical skills, drawn from actual case studies of mobile fraud, along with tree-based machine learning, graph analytics, and statistical approaches.
Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)
To succeed in implementing enterprise data management in industrial and scientific organizations and realize business value, the worlds of business data, facilities data, and scientific data—which have long been managed separately—must be brought together. Sun Maria Lehmann and Jane McConnell explore the cultural and organizational differences and the data management requirements to succeed.
Sundeep Reddy Mallu (Gramener)
Answering the simple question of what rights Indian citizens have over their data is a nightmare. The rollout of India Stack technology-based solutions has added fuel to the fire. Sundeep Reddy Mallu explains, with on-the-ground examples, how businesses and citizens in India's booming digital economy are navigating the India Stack ecosystem while dealing with data privacy, security, and ethics.
Fabio Ferraretto (Accenture), Claudia Regina Laselva (Albert Einstein Jewish Hospital)
Fabio Ferraretto and Claudia Regina Laselva explain how Hospital Albert Einstein and Accenture evolved patient flow experience and efficiency with the use of applied AI, statistics, and combinatorial math, allowing the hospital to anticipate E2E visibility within patient flow operations, from admission of emergency and elective demands to assignment and medical releases.
Sophie Watson (Red Hat)
Identifying relevant documents quickly and efficiently enhances both user experience and business revenue every day. Sophie Watson demonstrates how to implement learning-to-rank algorithms and provides you with the information you need to implement your own successful ranking system.
Jason Bell (Independent Speaker)
The Embulk data migration tool offers a convenient way to load data into a variety of systems with basic configuration. Jason Bell offers an overview of the Embulk tool and outlines some common data migration scenarios that a data engineer could employ using the tool.
Peter Billen (Accenture)
Peter Billen explains how to use metadata to automate delivery and operations of a data platform. By injecting automation into the delivery processes, you shorten the time to market while improving the quality of the initial user experience. Typical examples include data profiling and prototyping, test automation, continuous delivery and deployment, and automated code creation.
Guoqiong Song (Intel)
Collecting and processing massive time series data (e.g., logs and sensor readings) and detecting anomalies in real time is critical for many emerging smart systems, such as industrial, manufacturing, AIOps, and the IoT. Guoqiong Song explains how to detect anomalies in time series data using Analytics Zoo and BigDL at scale on a standard Spark cluster.
Cassie Kozyrkov (Google)
Despite the rise of data engineering and data science functions in today's corporations, leaders report difficulty in extracting value from data. Many organizations aren’t aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. Join Cassie Kozyrkov to talk about how you can change that.
James Burke asks whether we can use big data and predictive analytics at the social level to take the guesswork out of prediction and make the future what we all want it to be. If so, this would give us the tools to handle what looks like being the greatest change to the way we live since we left the caves.
Nate Keating (Google)
AI will change how we live in the next 30 years, but it's still currently limited to a small group of companies. In order to scale the impact of AI across the globe, we need to reduce the cost of building AI solutions, but how? Nate Keating explains how to apply lessons learned from other industries—specifically, the automobile industry, which went through a similar cycle.
Sonal Goyal (Nube)
Enterprise data on customers, vendors, and products is often siloed and represented differently in diverse systems, hurting analytics, compliance, regulatory reporting, and 360 views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. Sonal Goyal offers an overview of a modern master data application using Spark, Cassandra, ML, and Elastic.
Feng Lu (Google Cloud), James Malone (Google), Apurva Desai (Google Cloud), Cameron Moberg (Truman State University | Google Cloud)
Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution.
Elliot West (Hotels.com), Jaydene Green (Hotels.com)
Elliot West and Jay Green share approaches for applying software engineering best practices to SQL-based data applications to improve maintainability and data quality. Using open source tools, Elliot and Jay show how to build effective test suites for Apache Hive code bases and offer an overview of Mutant Swarm, a tool to identify weaknesses in tests and to measure SQL code coverage.
Yiannis Kanellopoulos (Code4Thought)
Black box algorithmic systems make decisions that have a great impact on our lives. Thus, the need for their accountability and transparency is growing. Code4Thought created an evaluation model reflecting the state of practice in several organizations. Yiannis Kanellopoulos explores this model and shares lessons learned from its application at a financial corporation.
Rosaria Silipo (KNIME)
Rosaria Silipo shares a collection of past data science projects. While the structure is often similar (data collection, data transformation, model training, deployment), each required its own special trick, whether a change in perspective or a particular technique to deal with special cases and special business questions.
Sami Niemi (Barclays)
Predicting transaction fraud of debit and credit card payments in real time is an important challenge, which state-of-the-art supervised machine learning models can help to solve. Sami Niemi offers an overview of the solutions Barclays has been developing and testing and details how well models perform in a variety of situations, such as card-present and card-not-present debit and credit card transactions.
Geir Engdahl (Cognite), Daniel Bergqvist (Google)
Geir Engdahl and Daniel Bergqvist explain how Cognite is developing IIoT smart maintenance systems that can process 10M samples a second from thousands of sensors. You'll explore an architecture designed for high performance, robust streaming sensor data ingest, and cost-effective storage of large volumes of time series data as well as best practices learned along the way.
Weifeng Zhong (Mercatus Center at George Mason University)
Weifeng Zhong shares a machine learning algorithm built to “read” the People’s Daily (the official newspaper of the Communist Party of China) and predict changes in China’s policy priorities. The output of this algorithm, named the Policy Change Index (PCI) of China, turns out to be a leading indicator of the actual policy changes in China since 1951.
Robin Moffatt (Confluent)
Robin Moffatt walks you through the architectural reasoning for Apache Kafka and the benefits of real-time integration. You'll then build a streaming data pipeline using nothing but your bare hands, Kafka Connect, and KSQL.
Christian Hidber (bSquare)
Reinforcement learning (RL) autonomously learns complex processes such as walking, beating the world champion in Go, or flying a helicopter. No big datasets with the “right” answers are needed: the algorithms learn by experimenting. Christian Hidber shows how and why RL works and demonstrates how to apply it to an industrial hydraulics application with 7,000 clients in 42 countries.
Manish Maheshwari (Cloudera)
Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala is able to handle hundreds of nodes and tens of thousands of queries hourly. Manish Maheshwari explains how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and anti-patterns for end users or BI applications.
Avner Braverman (Binaris)
What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.
Alexander Thomas (John Snow Labs), Alexis Yelton (Indeed)
Alexander Thomas and Alexis Yelton demonstrate how to use Spark NLP and Apache Spark to standardize semistructured text, illustrated by Indeed's standardization process for résumé content.
Itai Yaffe (Nielsen)
NMC (Nielsen Marketing Cloud) provides customers (both marketers and publishers) with real-time analytics tools to profile their target audiences. To achieve that, the company needs to ingest billions of events per day into its big data stores in a scalable, cost-efficient way. Itai Yaffe explains how NMC continuously transforms its data infrastructure to support these goals.
Alexander Adam (Faculty)
The advent of "fake news" has led us to doubt the truth of online media, and advances in machine learning give us an even greater reason to question what we are seeing. Despite the many beneficial applications of this technology, it's also potentially very dangerous. Alex Adam explains how synthetic videos are created and how they can be detected.
Robin Moffatt (Confluent)
Robin Moffatt discusses the concepts of events, their relevance to software and data engineers, and their ability to unify architectures in a powerful way. Join in to learn why analytics, data integration, and ETL fit naturally into a streaming world. Along the way, Robin will lead a hands-on demonstration of these concepts in practice and commentary on the design choices made.
Simon Moritz (Ericsson)
The truth is no longer what you see with your eyes; the truth is in the digital sphere, where it only sometimes needs a physical twin. After all, what's the need for a road sign along the street if the information is already in the car? Simon Moritz details how the Fourth Industrial Revolution is transforming companies and business models as we know them.
Mick Hollison (Cloudera)
The last decade has seen incredible changes in our technology. The advent of big data and powerful new analytic techniques, including machine learning and AI, means that we understand the world in ways that were simply impossible before. The simultaneous explosion of public cloud services has fundamentally changed our expectations of technology: it should be fast, simple, and flexible to use.
Mark Grover (Lyft), Deepak Tiwari (Lyft)
Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead in the future.
Wojciech Biela (Starburst), Piotr Findeisen (Starburst)
Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3, Azure ADLS, RDBMS, NoSQL, etc.). Wojciech Biela and Piotr Findeisen offer an overview of the Cost-Based Optimizer (CBO) for Presto, which brings a great performance boost. Join in to learn about CBO internals, the motivating use cases, and observed improvements.
David Low (Pand.ai)
Transfer learning has proven to be a tremendous success in computer vision, a result of the ImageNet competition. In the past few months, there have been several breakthroughs in natural language processing with transfer learning, namely ELMo, OpenAI Transformer, and ULMFiT. David Low demonstrates how to use transfer learning on an NLP application with SOTA accuracy.
Chris Taggart (OpenCorporates)
Chris Taggart explains the benefits of white box data and outlines the structural shifts that are moving the data world toward it.
Marcel Ruiz Forns (Wikimedia Foundation)
Analysts and researchers studying Wikipedia are hungry for long-term data to build experiments and feed data-driven decisions. But Wikipedia has a strict privacy policy that prevents storing privacy-sensitive data over 90 days. Marcel Ruiz Forns explains how the Wikimedia Foundation's analytics team is working on a vegan data diet to satisfy both.
Martin Leijen (Rabobank / Digital Transformation Office)
Martin Leijen discusses how Rabobank created a data and intelligence lab as an enabler for data and business domains to accelerate their use of AI and advanced analytics.
Eoin O'Flanagan (NewDay), Darragh McConville (Kainos)
Eoin O'Flanagan and Darragh McConville explain how NewDay built a high-performance contemporary data processing platform from the ground up on AWS. Join in to explore the company's journey from a traditional legacy onsite data estate to an entirely cloud-based PCI DSS-compliant platform.
Kai Wähner (Confluent)
How do you leverage the flexibility and extreme scale of the public cloud and the Apache Kafka ecosystem to build scalable, mission-critical machine learning infrastructures that span multiple public clouds—or bridge your on-premises data center to the cloud? Join Kai Wähner to learn how to use technologies such as TensorFlow with Kafka’s open source ecosystem for machine learning infrastructures.
Volker Schnecke (Novo Nordisk)
Today, more than 650 million people worldwide are obese, and most of them will develop additional health issues during their lifetime. However, not all are at equal risk. Volker Schnecke discusses how Novo Nordisk mines the electronic health records (EHRs) of millions of patients to understand the risk in people with obesity and to support the discovery of new medicines.
Duncan Ross (Times Higher Education), Giselle Cory (DataKind UK)
DataKind UK has been working in data for good since 2013, helping over 100 UK charities to do data science for the benefit of their users. Some of those projects have delivered above and beyond expectations; others haven't. Duncan Ross and Giselle Cory explain how to identify the right data for good projects and how this can act as a framework for avoiding the same problems across industry.
Peter Aiken (Data BluePrint | DAMA International | Virginia Commonwealth University)
Peter Aiken offers a more operational perspective on the use of data strategy, which is especially useful for organizations just getting started with data