Presented By
O’Reilly + Cloudera
Make Data Work
29 April–2 May 2019
London, UK

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools, and frameworks (open source or proprietary), platforms (on-premise, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Monday 29 April - Tuesday 30 April: 2-Day Training (Platinum & Training passes)
Tuesday 30 April: Tutorials (Gold & Silver passes)
Wednesday 1 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:45
Morning break
Thursday 2 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:45
Morning break
9:00 - 17:00 Monday, 29 April & Tuesday, 30 April
Location: London Suite 2
Secondary topics:  Data Integration and Data Pipelines, Streaming and realtime analytics
Jesse Anderson (Big Data Institute)
Jesse Anderson offers an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL.
9:00 - 17:00 Monday, 29 April & Tuesday, 30 April
Location: London Suite 3
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines
Jorge Lopez (Amazon Web Services), Nikki Rouda (Amazon Web Services), Damon Cortesi (AWS), Sven Hansen (Amazon Web Services), Manos Samatas (Amazon Web Services), Alket Memushaj (Amazon Web Services)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.
9:00–12:30 Tuesday, 30 April 2019
Location: Capital Suite 8
Secondary topics:  Financial Services
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation we’ll provide guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.
9:00–12:30 Tuesday, 30 April 2019
Location: Capital Suite 10
Secondary topics:  Security and Privacy
Mark Donsky (Okera), Ifigeneia Derekli (Cloudera), Lars George (Okera), Michael Ernest (Okera)
New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Ifigeneia Derekli, Lars George, and Michael Ernest share hands-on best practices for meeting these challenges, with special attention paid to CCPA.
9:00–12:30 Tuesday, 30 April 2019
Location: Capital Suite 11
Secondary topics:  Streaming and realtime analytics
Robin Moffatt (Confluent)
In this workshop you will learn the architectural reasoning for Apache Kafka and the benefits of real-time integration, and then build a streaming data pipeline using nothing but your bare hands, Kafka Connect, and KSQL.
9:00–12:30 Tuesday, 30 April 2019
Location: S11 A
Secondary topics:  AI and Data technologies in the cloud, Data Platforms
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
13:30–17:00 Tuesday, 30 April 2019
Location: Capital Suite 15
Secondary topics:  AI and Data technologies in the cloud
Matt Fuller (Starburst)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today.
13:30–17:00 Tuesday, 30 April 2019
Location: Capital Suite 4
Secondary topics:  AI and Data technologies in the cloud
Colm Moynihan (Cloudera), Jonathan Seidman (Cloudera), Michael Kohs (Cloudera)
Moving to the cloud poses challenges, from re-architecting to be cloud native to keeping data context consistent across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover cloud architecture and its challenges in depth; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.
13:30–17:00 Tuesday, 30 April 2019
Location: S11 A
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Streaming and realtime analytics, Temporal data and time-series
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Many industry segments have been grappling with fast data (high-volume, high-velocity data). In this tutorial, we lead the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline (messaging, compute, and storage) for real-time data, as well as algorithms for extracting insights (e.g., heavy hitters and quantiles) from data streams.
11:15–11:55 Wednesday, 1 May 2019
Location: Capital Suite 10/11
Secondary topics:  AI and Data technologies in the cloud
Avner Braverman (Binaris)
What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.
11:15–11:55 Wednesday, 1 May 2019
Location: S11 A
Secondary topics:  AI and Data technologies in the cloud
Wojciech Biela (Starburst), Piotr Findeisen (Starburst)
Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3/Azure ADLS, RDBMS, NoSQL, etc.). Starburst recently contributed a cost-based optimizer (CBO) to Presto, which brings a substantial performance boost. Learn about the CBO’s internals, the motivating use cases, and the observed improvements.
11:15–11:55 Wednesday, 1 May 2019
Location: S11 B
Secondary topics:  AI and Data technologies in the cloud, Open Data, Data Generation and Data Networks, Security and Privacy
Felipe Hoffa (Google)
Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa explores how to handle massive public datasets, taking you from theory to real life as he showcases newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm.
11:15–11:55 Wednesday, 1 May 2019
Location: Capital Suite 8/9
Secondary topics:  AI and machine learning in the enterprise, Data Platforms, Deep Learning, Text and Language processing and analysis
Moty Fania (Intel)
In this session, Moty Fania shares his experience of implementing a Sales AI platform. It processes millions of website pages and sifts through millions of tweets per day. The platform is based on unique open source technologies and was designed for real-time data extraction and actuation. This session highlights the key learnings with a thorough review of the architecture.
11:15–11:55 Wednesday, 1 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Media, Marketing, Advertising, Streaming and realtime analytics
Itai Yaffe (Nielsen)
At Nielsen Marketing Cloud, we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores and we need to do it in a scalable yet cost-efficient manner. In this talk, we will discuss how we continuously transform our data infrastructure to support these goals.
11:15–11:55 Wednesday, 1 May 2019
Location: Capital Suite 2/3
Secondary topics:  Model lifecycle management
Harish Doddi (Datatron Technologies), Jerry Xu (Datatron Technologies)
Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them.
12:05–12:45 Wednesday, 1 May 2019
Location: S11 A
Secondary topics:  AI and Data technologies in the cloud
Jacques Nadeau (Dremio)
Performance and cost are two important considerations in determining optimized solutions for SQL workloads in the cloud. Jacques Nadeau explains how to accelerate TPC workloads, invisible to client apps, and how to use Apache Arrow, Parquet, and Calcite to provide a scalable, high-performance solution optimized for cloud deployments while significantly reducing operational costs.
12:05–12:45 Wednesday, 1 May 2019
Location: S11 B
Secondary topics:  Automation in data science and big data, Data preparation, data governance, and data lineage
Peter Billen (Accenture)
In this session we will explain how to use metadata to automate delivery and operations of a data platform. By injecting automation into the delivery processes, we shorten the time-to-market while improving the quality of the initial user experience. Typical examples include data profiling and prototyping, test automation, continuous delivery and deployment, and automated code creation.
12:05–12:45 Wednesday, 1 May 2019
Location: Capital Suite 8/9
Secondary topics:  Data Integration and Data Pipelines
Robin Moffatt (Confluent)
This talk discusses the concepts of events, their relevance to software and data engineers, and their ability to unify architectures in a powerful way. It describes why analytics, data integration, and ETL fit naturally into a streaming world. There'll be a hands-on demonstration of these concepts in practice and commentary on the design choices made.
12:05–12:45 Wednesday, 1 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  Streaming and realtime analytics
Ted Dunning (MapR)
As a community, we have been pushing streaming architectures, particularly microservices, for several years now. But what are the results in the field? Ted Dunning shares several (anonymized) case histories, describing the good, the bad, and the ugly. In particular, Ted covers how several teams who were new to big data fared by skipping MapReduce and jumping straight into streaming.
14:05–14:45 Wednesday, 1 May 2019
Location: S11 A
Anna Szonyi (Cloudera)
The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. Anna Szonyi shares the technical details of the design and its implementation, practical tips to help data architects leverage these new capabilities in their schema design, and performance results for common workloads.
14:05–14:45 Wednesday, 1 May 2019
Location: S11 B
Secondary topics:  Data Platforms, Data preparation, data governance, and data lineage
Ananth Durai (Slack)
Logs are everywhere: every organization collects tons of data every day. The logs are only as good as the trust they earn to make business-critical decisions. Building trust in logs and ensuring their reliability is critical to creating a data-driven organization. Ananth Durai walks you through his experience building reliable logging infrastructure at Slack and explains how it helped build confidence in data.
14:05–14:45 Wednesday, 1 May 2019
Location: Capital Suite 8/9
Secondary topics:  Data Platforms, IoT and its applications, Retail and e-commerce, Temporal data and time-series
Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)
We share the architecture and many of the detailed technical innovations behind Alibaba TSDB, a state-of-the-art database for IoT data management, drawn from years of development and continuous improvement.
14:05–14:45 Wednesday, 1 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  AI and Data technologies in the cloud, Media, Marketing, Advertising, Streaming and realtime analytics
Simona Meriam (Nielsen)
Simona Meriam explains how NMC (Nielsen Marketing Cloud) used to manage its Kafka consumer offsets with the Spark-Kafka 0.8 consumer and why the company decided to upgrade to the Spark-Kafka 0.10 consumer. Simona reviews the problems encountered during the upgrade and details the process that led to the solution.
14:55–15:35 Wednesday, 1 May 2019
Location: S11 A
Secondary topics:  AI and Data technologies in the cloud
Holden Karau (Google), Mikayla Konst (Google), Ben Sidhom (Google)
As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. Holden Karau, Mikayla Konst, and Ben Sidhom explore approaches for improving the scale-down experience on open source cluster managers, covering everything from how to schedule jobs to location of blocks and their impact.
14:55–15:35 Wednesday, 1 May 2019
Location: S11 B
Secondary topics:  AI and Data technologies in the cloud, Model lifecycle management
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
In this talk, we shall walk the audience through an architecture whereby models are served in real time and the models are updated, using Apache Pulsar, without restarting the application at hand. Further, we will describe how Pulsar functions can be applied to support two example use cases, viz., sampling and filtering. We shall lead the audience through a concrete case study of the same.
14:55–15:35 Wednesday, 1 May 2019
Location: Capital Suite 8/9
Secondary topics:  Data Integration and Data Pipelines, Data Platforms, Data preparation, data governance, and data lineage, Model lifecycle management, Security and Privacy, Transportation and Logistics
Mark Grover (Lyft), Deepak Tiwari (Lyft)
Lyft’s data platform is at the heart of Lyft’s business. Decisions all the way from pricing, to ETA, to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. In this talk, Mark Grover walks through the choices Lyft has made in developing and sustaining the data platform, the reasons behind them, and what lies ahead.
14:55–15:35 Wednesday, 1 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  AI and Data technologies in the cloud, IoT and its applications, Streaming and realtime analytics
Geir Engdahl (Cognite), Daniel Bergqvist (Google)
Geir Engdahl and Daniel Bergqvist explain how Cognite is developing IIoT smart maintenance systems that can process 10M samples a second from thousands of sensors. You'll explore an architecture designed for high performance, robust streaming sensor data ingest, and cost-effective storage of large volumes of time series data as well as best practices learned along the way.
16:35–17:15 Wednesday, 1 May 2019
Location: S11 A
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines
Anirudha Beria (Qubole), Rohit Karlupia (Qubole)
Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. Anirudha Beria and Rohit Karlupia explain how to measure the efficiency of autoscaling policies and discuss more efficient autoscaling policies, in terms of latency and costs.
16:35–17:15 Wednesday, 1 May 2019
Location: S11 B
Secondary topics:  Model lifecycle management
Arif Wider (ThoughtWorks), Emily Gorcenski (ThoughtWorks)
Machine learning can be challenging to deploy and maintain. Any delays moving models from research to production mean leaving your data scientists' best work on the table. Arif Wider and Emily Gorcenski explore continuous delivery (CD) for AI/ML along with case studies for applying CD principles to data science workflows.
16:35–17:15 Wednesday, 1 May 2019
Location: Capital Suite 8/9
Secondary topics:  Data Platforms, Data preparation, data governance, and data lineage, Retail and e-commerce
Neelesh Salian (Stitch Fix)
Developing data infrastructure is not trivial; neither is changing it. It takes effort and discipline to make changes that can affect your team. Neelesh Salian discusses how Stitch Fix's data platform team maintains and innovates its infrastructure for the company's data scientists.
16:35–17:15 Wednesday, 1 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  AI and Data technologies in the cloud, Automation in data science and big data
Constantin Muraru (Adobe), Dan Popescu (Adobe)
With the current crop of cloud providers, obtaining servers to run your real-time application has never been easier. But what happens when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast, reliable way, with minimal human intervention? Constantin Muraru and Dan Popescu tell you how to tackle this challenge.
17:25–18:05 Wednesday, 1 May 2019
Location: S11 A
Secondary topics:  Data Platforms, Transportation and Logistics
Felix Cheung (Uber)
Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.
17:25–18:05 Wednesday, 1 May 2019
Location: S11 B
Secondary topics:  AI and Data technologies in the cloud, Data Platforms
Mark Samson (Cloudera), Phillip Radley (BT)
It is now possible to build a modern data platform capable of storing, processing and analysing a wide variety of data across multiple public and private Cloud platforms and on-premise data centres. This session will outline an information architecture for such a platform, informed by working with multiple large organisations who have built such platforms over the last 5 years.
17:25–18:05 Wednesday, 1 May 2019
Location: Capital Suite 8/9
Secondary topics:  Data Platforms
Hussein Mehanna (Google Cloud)
AI will change how we live in the next 30 years. However, AI is still limited to a small group of companies. Building AI systems is expensive and difficult. But in order to scale the impact of AI across the globe, we need to reduce the cost of building AI solutions. How can we do that? Can we learn from other industries? Yes, we can. The automobile industry went through a similar cycle.
17:25–18:05 Wednesday, 1 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  Data Integration and Data Pipelines, Financial Services, Streaming and realtime analytics
Ted Malaska (Capital One)
In the world of data it is all about building the best path to support time/quality to value. 80% to 90% of the work is getting the data into the hands and tools that can create value. This talk will take us on a journey of different patterns and solutions that can work at the largest of companies.
11:15–11:55 Thursday, 2 May 2019
Location: S11 A
Manish Maheshwari (Cloudera)
Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala is able to handle hundreds of nodes and tens of thousands of queries hourly. Manish Maheshwari explains how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and anti-patterns for end users or BI applications.
11:15–11:55 Thursday, 2 May 2019
Location: S11 B
Secondary topics:  AI and Data technologies in the cloud
Jian Zhang (Intel), Chendi Xue (Intel), Yuan Zhou (Intel)
This session introduces the challenges of migrating big data analytics workloads to the public cloud, such as lost performance and missing features. It showcases how the new in-memory data accelerator, which leverages persistent memory and RDMA NICs, can resolve these issues and enable new opportunities for big data workloads in the cloud.
11:15–11:55 Thursday, 2 May 2019
Location: Capital Suite 8/9
Secondary topics:  Data preparation, data governance, and data lineage, Financial Services
Sandeep U (Intuit)
Teams today rely on dictionaries of collective wisdom, which are a mixed bag with regard to correctness: some datasets have accurate attribute details, while others are incorrect and outdated. This significantly impacts productivity of analysts and scientists. Sandeep Uttamchandani outlines three patterns to better manage data dictionaries.
11:15–11:55 Thursday, 2 May 2019
Location: Capital Suite 10/11
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Financial Services, Security and Privacy
Eoin O'Flanagan (NewDay), Darragh McConville (Kainos)
Eoin O'Flanagan and Darragh McConville explain how NewDay built a high-performance contemporary data processing platform, from the ground up, on AWS. Join in to explore the company's journey from a traditional legacy onsite data estate to an entirely cloud-based PCI DSS-compliant platform.
11:15–11:55 Thursday, 2 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  Data Platforms, Streaming and realtime analytics, Transportation and Logistics
Thomas Weise (Lyft)
Fast data and stream processing are essential for making Lyft rides a good experience for passengers and drivers. Lyft's systems need to track and react to event streams in real time to update locations, compute routes and estimates, balance prices, and more. Thomas Weise offers an overview of the streaming platform that powers these use cases.
12:05–12:45 Thursday, 2 May 2019
Location: Capital Suite 10/11
Secondary topics:  AI and machine learning in the enterprise
Rebecca Simmonds (Red Hat), Michael McCune (Red Hat)
Artificial intelligence and machine learning are now popularly used terms, but how do you make use of these techniques without throwing away the valuable knowledge of experienced employees? Rebecca Simmonds and Michael McCune delve into this idea with examples of how distributed machine learning frameworks fit together naturally with business rules management systems.
12:05–12:45 Thursday, 2 May 2019
Location: S11 A
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Data Platforms, Streaming and realtime analytics
David Josephsen (Sparkpost)
This is the story of how Sparkpost Reliability Engineering abandoned ELK for a DIY Schema-On-Read logging infrastructure. We share architectural details and tribulations from our Internal Event Hose data ingestion pipeline project, which uses Fluentd, Kinesis, Parquet and AWS Athena to make logging sane.
12:05–12:45 Thursday, 2 May 2019
Location: S11 B
Secondary topics:  AI and Data technologies in the cloud, Data Platforms
Pradeep Bhadani (Hotels.com), Elliot West (Hotels.com)
Travel platform Expedia Group likes to give its data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. Pradeep Bhadani and Elliot West explain how the company built a unified virtual data lake on top of its many heterogeneous and distributed data platforms.
12:05–12:45 Thursday, 2 May 2019
Location: Capital Suite 8/9
Secondary topics:  Data Platforms, Security and Privacy, Transportation and Logistics
Vaclav Surovec (Deutsche Telekom), Gabor Kotalik (Deutsche Telekom)
Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management.
12:05–12:45 Thursday, 2 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  AI and Data technologies in the cloud, Model lifecycle management
Kai Wähner (Confluent)
How can you leverage the flexibility and extreme scale of the public cloud, combined with the Apache Kafka ecosystem, to build scalable, mission-critical machine learning infrastructures that span multiple public clouds or bridge your on-premise data centre to the cloud? Join this talk to learn how to apply technologies such as TensorFlow with Kafka’s open source ecosystem for machine learning infrastructures.
14:05–14:45 Thursday, 2 May 2019
Location: S11 A
Elliot West (Hotels.com), Jaydene Green (Hotels.com)
Hotels.com describe approaches for applying software engineering best practices to SQL-based data applications in order to improve maintainability and data quality. Using open source tools we show how to build effective test suites for Apache Hive code bases. We also present Mutant Swarm, a mutation testing tool we’ve developed to identify weaknesses in tests and to measure SQL code coverage.
14:05–14:45 Thursday, 2 May 2019
Location: S11 B
Secondary topics:  Security and Privacy
Marcel Ruiz Forns (Wikimedia Foundation)
Analysts and researchers studying Wikipedia are hungry for long-term data to build experiments and feed data-driven decisions. But Wikipedia has a strict privacy policy that prevents storing privacy-sensitive data over 90 days. Marcel Ruiz Forns explains how the Wikimedia Foundation's analytics team is working on a vegan data diet to satisfy both.
14:05–14:45 Thursday, 2 May 2019
Location: Capital Suite 8/9
Secondary topics:  AI and Data technologies in the cloud, AI and machine learning in the enterprise, Data Platforms, Transportation and Logistics
Willem Pienaar (GOJEK), Zhi Ling Chen (GOJEK)
Features are key to driving impact with AI at all scales, allowing organizations to dramatically accelerate innovation and time to market. Willem Pienaar and Zhiling Chen explain how GOJEK, Indonesia's first billion-dollar startup, unlocked insights in AI by building a feature store called Feast, and the lessons they learned along the way.
14:05–14:45 Thursday, 2 May 2019
Location: Capital Suite 10/11
Secondary topics:  Data Integration and Data Pipelines, Data Platforms, Transportation and Logistics, Visualization, Design, and UX
Ravi Suhag (GOJEK)
GOJEK builds products that help millions of Indonesians commute, shop, eat, and pay daily. The data team is responsible for creating resilient and scalable data infrastructure across all of GOJEK’s 18+ products. Ravi Suhag shares lessons learned while realizing this vision.
14:05–14:45 Thursday, 2 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  AI and Data technologies in the cloud
Holden Karau (Google), Kris Nova (VMware)
In the Kubernetes world, where declarative resources are a first-class citizen, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice, we can finally look at constructing a hybrid system of running Spark in a distributed cloud native way. Join respective experts Kris Nova and Holden Karau for a fun adventure.
14:55–15:35 Thursday, 2 May 2019
Location: S11 A
Secondary topics:  AI and Data technologies in the cloud
Greg Rahn (Cloudera)
Data warehouses have traditionally run in the data center, and in recent years, they've been adapted to be more cloud native. Greg Rahn discusses a number of emerging trends and technologies that will impact how data warehouses are run both in the cloud and on-premises and explains what that means for architects, administrators, and end users.
14:55–15:35 Thursday, 2 May 2019
Location: S11 B
Secondary topics:  Automation in data science and big data, Data preparation, data governance, and data lineage
Sonal Goyal (Nube)
Enterprise data on customers, vendors, products, etc. is siloed and represented differently in diverse systems, hurting analytics, compliance, regulatory reporting, and 360 views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. This talk covers a modern master data application using Spark, Cassandra, ML, and Elastic.
14:55–15:35 Thursday, 2 May 2019
Location: Capital Suite 8/9
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, IoT and its applications
Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)
In upstream oil and gas, a vast amount of the data requested for analytics projects is scientific data: physical measurements about the real world. Historically, this data has been managed library style, but a new system was needed to best provide this data. Sun Maria Lehmann and Jane McConnell share architectural best practices learned from their work with subsurface data.
14:55–15:35 Thursday, 2 May 2019
Location: Capital Suite 10/11
Secondary topics:  Data Integration and Data Pipelines
Jason Bell (DeskHoppa)
The Embulk data migration tool offers a convenient way to load data into a variety of systems with basic configuration. Jason Bell offers an overview of the Embulk tool and outlines some common data migration scenarios that a data engineer could employ using the tool.
14:55–15:35 Thursday, 2 May 2019
Location: Expo Hall 2 (Capital Hall N24)
Secondary topics:  Streaming and realtime analytics, Temporal data and time-series
Michael Freedman (TimescaleDB)
Time series databases require ingesting high volumes of structured data, answering complex, performant queries for recent and historical time intervals, and performing specialized time-centric analysis and data management. Michael Freedman explains how to avoid these operational problems by reengineering Postgres to serve as a general data platform, including high-volume time series workloads.
16:35–17:15 Thursday, 2 May 2019
Location: S11 A
Secondary topics:  AI and Data technologies in the cloud, Data Platforms
Thomas Phelan (BlueData)
Organizations need to keep ahead of their competition by using the latest AI, ML, and DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. Thomas Phelan discusses the effective deployment of such applications in a container environment.
16:35–17:15 Thursday, 2 May 2019
Location: S11 B
Secondary topics:  Data Integration and Data Pipelines
Feng Lu (Google Cloud), James Malone (Google), Apurva Desai (Google Cloud), Cameron Moberg (Truman State University | Google Cloud)
Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution.
16:35–17:15 Thursday, 2 May 2019
Location: Capital Suite 8/9
Secondary topics:  AI and Data technologies in the cloud, Data Integration and Data Pipelines, Retail and e-commerce
Max Schultze (Zalando SE)
Max Schultze details Zalando's end-to-end data integration platform to serve analytical use cases and machine learning throughout the company, covering raw data collection, standardized data preparation (binary conversion, partitioning, etc.), user-driven analytics, and machine learning.