Data Engineering & Architecture: Big data conference & machine learning training

Wednesday 1 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium | Strata Data Conference Keynotes
10:45 Morning break

Thursday 2 May: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium | Strata Data Conference Keynotes
10:45 Morning break

9:00–17:00 Monday, 29 April & Tuesday, 30 April 2019

Professional Kafka development

Location: London Suite 2

Secondary topics: Data Integration and Data Pipelines, Streaming and realtime analytics

Jesse Anderson (Big Data Institute)

Average rating:

(5.00, 1 rating)

Jesse Anderson offers an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL.

9:00–17:00 Monday, 29 April & Tuesday, 30 April 2019

Building a serverless big data application on AWS

Location: London Suite 3

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Jorge Lopez (Amazon Web Services), Nikki Rouda (Amazon Web Services), Damon Cortesi (Amazon Web Services), Sven Hansen (Amazon Web Services), Manos Samatas (Amazon Web Services), Alket Memushaj (Amazon Web Services)

Average rating:

(3.50, 2 ratings)

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.

9:00–12:30 Tuesday, 30 April 2019

Foundations for successful data projects

Location: Capital Suite 8

Secondary topics: Financial Services

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Average rating:

(3.50, 12 ratings)

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

9:00–12:30 Tuesday, 30 April 2019

Getting ready for GDPR and CCPA: Securing and governing hybrid, cloud, and on-premises big data deployments

Location: Capital Suite 10

Secondary topics: Security and Privacy

Mark Donsky (Okera), Ifigeneia Derekli (Cloudera), Lars George (Okera), Michael Ernest (Dataiku)

Average rating:

(4.00, 2 ratings)

New regulations such as CCPA and GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Ifigeneia Derekli, Lars George, and Michael Ernest share hands-on best practices for meeting these challenges, with special attention paid to CCPA.

9:00–12:30 Tuesday, 30 April 2019

Real-time SQL stream processing at scale with Apache Kafka and KSQL

Location: Capital Suite 11

Secondary topics: Streaming and realtime analytics

Robin Moffatt (Confluent)

Average rating:

(5.00, 5 ratings)

Robin Moffatt walks you through the architectural reasoning for Apache Kafka and the benefits of real-time integration. You'll then build a streaming data pipeline using nothing but your bare hands, Kafka Connect, and KSQL.

9:00–12:30 Tuesday, 30 April 2019

Architecting a data platform for enterprise use

Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Platforms

Mark Madsen (Teradata), Todd Walter (Archimedata)

Average rating:

(3.71, 7 ratings)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

9:00–17:00 Tuesday, 30 April 2019

Data Case Studies

Location: Capital Suite 12

Paco Nathan (derwen.ai), Ganes Kesari (Gramener), Alicia Williams (Google), Semih Kumluk (Turkcell), Simon Moritz (Ericsson), Samuel Cristóbal (Innaxis), Volker Schnecke (Novo Nordisk), Julia Butter (Scout24), Cecilia Marchi (Jakala), Caroline Goulard (Dataveyes), Marc Rind (ADP), Juan Bengochea (Royal Caribbean Cruise Lines), Aaronpal Dhanda (EasyJet)

Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits (and drawbacks) of their solutions.

13:30–17:00 Tuesday, 30 April 2019

Learning Presto: SQL on anything

Location: Capital Suite 15

Secondary topics: AI and Data technologies in the cloud

Matt Fuller (Starburst)

Average rating:

(5.00, 2 ratings)

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today.

13:30–17:00 Tuesday, 30 April 2019

Running multidisciplinary big data workloads in the cloud

Location: Capital Suite 4

Secondary topics: AI and Data technologies in the cloud

Colm Moynihan (Cloudera), Jonathan Seidman (Cloudera), Michael Kohs (Cloudera)

Average rating:

(4.00, 2 ratings)

Moving to the cloud poses a number of challenges. Join Colm Moynihan, Jonathan Seidman, and Michael Kohs to explore cloud architecture and challenges and learn how to use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.

13:30–17:00 Tuesday, 30 April 2019

Architecture and algorithms for end-to-end streaming data processing

Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Streaming and realtime analytics, Temporal data and time-series

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)

Average rating:

(3.00, 10 ratings)

Many industry segments have been grappling with fast data (high-volume, high-velocity data). Arun Kejariwal and Karthik Ramasamy walk you through the state-of-the-art systems for each stage of an end-to-end data processing pipeline (messaging, compute, and storage) for real-time data, along with algorithms to extract insights (e.g., heavy hitters and quantiles) from data streams.
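As an illustration of the "heavy hitters" problem named in this abstract, here is a minimal Python sketch of the Misra-Gries summary, one well-known streaming algorithm for it. The session listing does not say which algorithms the speakers cover, so this is only an assumed example; the function name `misra_gries` and the sample stream are hypothetical.

```python
def misra_gries(stream, k):
    """Track candidate heavy hitters using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to be among the returned keys. The stored counts are lower
    bounds on the true frequencies, not exact values.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical event stream: "a" occurs 5 times out of 9.
events = ["a", "b", "a", "c", "a", "d", "a", "b", "a"]
candidates = misra_gries(events, k=3)
print(candidates)
```

Because the counts are only lower bounds, a second pass over the data (when one is possible) is the usual way to confirm the exact frequencies of the candidates.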

11:15–11:55 Wednesday, 1 May 2019

Serverless for data and AI

Location: Capital Suite 10/11

Secondary topics: AI and Data technologies in the cloud

Avner Braverman (Binaris)

Average rating:

(2.71, 7 ratings)

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.

11:15–11:55 Wednesday, 1 May 2019

The Presto Cost-Based Optimizer for interactive SQL on anything

Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Wojciech Biela (Starburst), Piotr Findeisen (Starburst)

Average rating:

(3.12, 8 ratings)

Presto is a popular open source distributed SQL engine for interactive queries over heterogeneous data sources (Hadoop/HDFS, Amazon S3, Azure ADLS, RDBMS, NoSQL, etc.). Wojciech Biela and Piotr Findeisen offer an overview of the Cost-Based Optimizer (CBO) for Presto, which brings a great performance boost. Join in to learn about CBO internals, the motivating use cases, and observed improvements.

11:15–11:55 Wednesday, 1 May 2019

Protecting sensitive data in huge datasets: Cloud tools you can use

Location: S11 B

Secondary topics: AI and Data technologies in the cloud, Open Data, Data Generation and Data Networks, Security and Privacy

Felipe Hoffa (Google)

Average rating:

(3.50, 4 ratings)

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa explores how to handle massive public datasets, taking you from theory to real life as he showcases newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm.
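To make the two privacy concepts in this abstract concrete, the following Python sketch measures k-anonymity and (distinct) l-diversity on a tiny in-memory table. It is an assumed illustration, not the cloud tooling the talk showcases; the column names, sample records, and helper functions are all hypothetical.

```python
from collections import Counter, defaultdict


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share the
    same values for every quasi-identifier column."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Return l: the smallest number of distinct sensitive values found
    within any quasi-identifier group (distinct l-diversity)."""
    groups = defaultdict(set)
    for row in rows:
        key = tuple(row[col] for col in quasi_identifiers)
        groups[key].add(row[sensitive])
    return min(len(values) for values in groups.values())


# Hypothetical, already-generalized records.
records = [
    {"age_band": "30-39", "postcode": "N1", "diagnosis": "flu"},
    {"age_band": "30-39", "postcode": "N1", "diagnosis": "asthma"},
    {"age_band": "40-49", "postcode": "E2", "diagnosis": "flu"},
    {"age_band": "40-49", "postcode": "E2", "diagnosis": "flu"},
]
quasi = ["age_band", "postcode"]
k = k_anonymity(records, quasi)
l = l_diversity(records, quasi, "diagnosis")
print(k, l)
```

In this sample every quasi-identifier group has two rows, so the table is 2-anonymous, yet the second group contains a single diagnosis, so l is 1 and the sensitive value of anyone in that group is still exposed.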

11:15–11:55 Wednesday, 1 May 2019

Building a sales AI platform: Key principles and lessons learned

Location: Capital Suite 8/9

Secondary topics: AI and machine learning in the enterprise, Data Platforms, Deep Learning, Text and Language processing and analysis

Moty Fania (Intel)

Average rating:

(3.83, 6 ratings)

Moty Fania shares his experience implementing a sales AI platform that handles processing of millions of website pages and sifts through millions of tweets per day. The platform is based on unique open source technologies and was designed for real-time data extraction and actuation.

11:15–11:55 Wednesday, 1 May 2019

Stream, stream, stream: Different streaming methods with Spark and Kafka

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Media, Marketing, Advertising, Streaming and realtime analytics

Itai Yaffe (Nielsen)

Average rating:

(4.45, 11 ratings)

NMC (Nielsen Marketing Cloud) provides customers (both marketers and publishers) with real-time analytics tools to profile their target audiences. To achieve that, the company needs to ingest billions of events per day into its big data stores in a scalable, cost-efficient way. Itai Yaffe explains how NMC continuously transforms its data infrastructure to support these goals.

11:15–11:55 Wednesday, 1 May 2019

Model governance and model ops in the enterprise

Location: Capital Suite 2/3

Secondary topics: Model lifecycle management

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Average rating:

(5.00, 1 rating)

Harish Doddi and Jerry Xu share the challenges they faced scaling machine learning models and detail the solutions they're building to conquer them.

12:05–12:45 Wednesday, 1 May 2019

Running SQL-based workloads in the cloud at 20x–200x lower cost using Apache Arrow

Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Jacques Nadeau (Dremio)

Average rating:

(4.75, 4 ratings)

Performance and cost are two important considerations in determining optimized solutions for SQL workloads in the cloud. Jacques Nadeau explains how to accelerate TPC workloads, invisible to client apps, and how to use Apache Arrow, Parquet, and Calcite to provide a scalable, high-performance solution optimized for cloud deployments while significantly reducing operational costs.

12:05–12:45 Wednesday, 1 May 2019

Leveraging metadata for automating delivery and operations of advanced data platforms

Location: S11 B

Secondary topics: Automation in data science and big data, Data preparation, data governance, and data lineage

Peter Billen (Accenture)

Average rating:

(4.50, 6 ratings)

Peter Billen explains how to use metadata to automate delivery and operations of a data platform. By injecting automation into the delivery processes, you shorten the time to market while improving the quality of the initial user experience. Typical examples include data profiling and prototyping, test automation, continuous delivery and deployment, and automated code creation.

12:05–12:45 Wednesday, 1 May 2019

The changing face of ETL: Event-driven architectures for data engineers

Location: Capital Suite 8/9

Secondary topics: Data Integration and Data Pipelines

Robin Moffatt (Confluent)

Average rating:

(4.21, 14 ratings)

Robin Moffatt discusses the concepts of events, their relevance to software and data engineers, and their ability to unify architectures in a powerful way. Join in to learn why analytics, data integration, and ETL fit naturally into a streaming world. Along the way, Robin will lead a hands-on demonstration of these concepts in practice and offer commentary on the design choices made.

12:05–12:45 Wednesday, 1 May 2019

Report card on streaming microservices

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: Streaming and realtime analytics

Ted Dunning (MapR, now part of HPE)

Average rating:

(4.67, 6 ratings)

As a community, we have been pushing streaming architectures, particularly microservices, for several years now. But what are the results in the field? Ted Dunning shares several (anonymized) case histories, describing the good, the bad, and the ugly. In particular, Ted covers how several teams who were new to big data fared by skipping MapReduce and jumping straight into streaming.

14:05–14:45 Wednesday, 1 May 2019

Picking Parquet: Improved performance for selective queries in Impala, Hive, and Spark

Location: S11 A

Anna Szonyi (Cloudera), Zoltán Borók-Nagy (Cloudera)

Average rating:

(4.20, 10 ratings)

The Parquet format recently added column indexes, which improve the performance of query engines like Impala, Hive, and Spark on selective queries. Anna Szonyi and Zoltán Borók-Nagy share the technical details of the design and its implementation, practical tips to help data architects leverage these new capabilities in their schema design, and performance results for common workloads.

14:05–14:45 Wednesday, 1 May 2019

Disrupting data discovery

Location: S11 B

Mark Grover (Lyft)

Average rating:

(4.64, 11 ratings)

Mark Grover discusses how Lyft has reduced the time it takes to discover data by 10 times by building its own data portal, Amundsen. Mark gives a demo of Amundsen, leads a deep dive into its architecture, and discusses how it leverages centralized metadata, PageRank, and a comprehensive data graph to achieve its goal. Mark closes with a future roadmap, unsolved problems, and collaboration model.

14:05–14:45 Wednesday, 1 May 2019

Building the data infrastructure for the internet of things at zettabyte scale

Location: Capital Suite 8/9

Secondary topics: Data Platforms, IoT and its applications, Retail and e-commerce, Temporal data and time-series

Jian Chang (Alibaba Group), Sanjian Chen (Alibaba Group)

Average rating:

(3.33, 3 ratings)

Jian Chang and Sanjian Chen share the architecture design and many detailed technology innovations of Alibaba TSDB, a state-of-the-art database for IoT data management, and discuss lessons learned from years of development and continuous improvement.

14:05–14:45 Wednesday, 1 May 2019

Nielsen presents: Fun with Kafka, Spark, and offset management

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, Media, Marketing, Advertising, Streaming and realtime analytics

Simona Meriam (Nielsen)

Average rating:

(4.57, 7 ratings)

Simona Meriam explains how Nielsen Marketing Cloud (NMC) used to manage its Kafka consumer offsets with the Spark-Kafka 0.8 consumer and why the company decided to upgrade from the Spark-Kafka 0.8 consumer to the 0.10 consumer. Simona reviews the problems encountered during the upgrade and details the process that led to the solution.

14:55–15:35 Wednesday, 1 May 2019

Improving Spark downscaling; Or, Not throwing away all of our work

Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Holden Karau (Independent), Mikayla Konst (Google), Ben Sidhom (Google)

Average rating:

(3.75, 4 ratings)

As more workloads move to serverless-like environments, the importance of properly handling downscaling increases. Holden Karau, Mikayla Konst, and Ben Sidhom explore approaches for improving the scale-down experience on open source cluster managers, covering everything from how to schedule jobs to the location of blocks and their impact.

14:55–15:35 Wednesday, 1 May 2019

Model serving via Pulsar functions

Location: S11 B

Secondary topics: AI and Data technologies in the cloud, Model lifecycle management

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(3.00, 1 rating)

Arun Kejariwal and Karthik Ramasamy walk you through an architecture in which models are served in real time and the models are updated, using Apache Pulsar, without restarting the application at hand. They then describe how to apply Pulsar functions to support two example use cases (sampling and filtering) and explore a concrete case study of the same.

14:55–15:35 Wednesday, 1 May 2019

The Lyft data platform: Now and in the future

Location: Capital Suite 8/9

Secondary topics: Data Integration and Data Pipelines, Data Platforms, Data preparation, data governance, and data lineage, Model lifecycle management, Security and Privacy, Transportation and Logistics

Mark Grover (Lyft), Deepak Tiwari (Lyft)

Average rating:

(4.69, 13 ratings)

Lyft’s data platform is at the heart of the company's business. Decisions from pricing to ETA to business operations rely on Lyft’s data platform. Moreover, it powers the enormous scale and speed at which Lyft operates. Mark Grover and Deepak Tiwari walk you through the choices Lyft made in the development and sustenance of the data platform, along with what lies ahead.

14:55–15:35 Wednesday, 1 May 2019

Processing 10M samples a second to drive smart maintenance in complex IIoT systems

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, IoT and its applications, Streaming and realtime analytics

Geir Engdahl (Cognite), Daniel Bergqvist (Google)

Average rating:

(4.00, 2 ratings)

Geir Engdahl and Daniel Bergqvist explain how Cognite is developing IIoT smart maintenance systems that can process 10M samples a second from thousands of sensors. You'll explore an architecture designed for high performance, robust streaming sensor data ingest, and cost-effective storage of large volumes of time series data as well as best practices learned along the way.

16:35–17:15 Wednesday, 1 May 2019

Scalability-aware autoscaling of a Spark application

Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines

Anirudha Beria (Qubole), Rohit Karlupia (Qubole)

Average rating:

(3.67, 3 ratings)

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Scalability-aware autoscaling uses historical information to make better scaling decisions. Anirudha Beria and Rohit Karlupia explain how to measure the efficiency of autoscaling policies and discuss more efficient autoscaling policies, in terms of latency and costs.

16:35–17:15 Wednesday, 1 May 2019

Continuous intelligence: Keeping your AI application in production

Location: S11 B

Secondary topics: Model lifecycle management

Arif Wider (ThoughtWorks), Emily Gorcenski (ThoughtWorks)

Average rating:

(3.90, 10 ratings)

Machine learning can be challenging to deploy and maintain. Any delays in moving models from research to production mean leaving your data scientists' best work on the table. Arif Wider and Emily Gorcenski explore continuous delivery (CD) for AI/ML along with case studies for applying CD principles to data science workflows.

16:35–17:15 Wednesday, 1 May 2019

How do you evolve your data infrastructure?

Location: Capital Suite 8/9

Secondary topics: Data Platforms, Data preparation, data governance, and data lineage, Retail and e-commerce

Neelesh Salian (Stitch Fix)

Average rating:

(4.25, 4 ratings)

Developing data infrastructure is not trivial; neither is changing it. It takes effort and discipline to make changes that can affect your team. Neelesh Salian discusses how Stitch Fix's data platform team maintains and innovates its infrastructure for the company's data scientists.

16:35–17:15 Wednesday, 1 May 2019

From BI to big data; Or, There and back again

Location: Capital Suite 14

Francesco Mucio (Francescomuc.io)

Average rating:

(4.43, 7 ratings)

Francesco Mucio shares the basic tools he and his team had to learn (or relearn) moving from the coziness of their database to the big world of Spark, cloud, distributed systems, and continuous applications. It was an unexpected journey that ended exactly where it started: with an SQL query.

16:35–17:15 Wednesday, 1 May 2019

Deploying your real-time apps on thousands of servers and still being able to breathe

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, Automation in data science and big data

Constantin Muraru (Adobe), Dan Popescu (Adobe)

Average rating:

(5.00, 2 ratings)

With the current crop of cloud providers, obtaining servers to run your real-time application has never been easier. But what happens when you wish to deploy your (web) applications frequently, on hundreds or even thousands of servers, in a fast, reliable way, with minimal human intervention? Constantin Muraru and Dan Popescu tell you how to tackle this challenge.

17:25–18:05 Wednesday, 1 May 2019

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber

Location: S11 A

Secondary topics: Data Platforms, Transportation and Logistics

Felix Cheung (Uber)

Average rating:

(4.42, 12 ratings)

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.

17:25–18:05 Wednesday, 1 May 2019

Information architecture for an enterprise data cloud

Location: S11 B

Secondary topics: AI and Data technologies in the cloud, Data Platforms

Mark Samson (Cloudera), Phillip Radley (BT)

Average rating:

(5.00, 2 ratings)

It's now possible to build a modern data platform capable of storing, processing, and analyzing a wide variety of data across multiple public and private cloud platforms and on-premises data centers. Mark Samson and Phillip Radley outline an information architecture for such a platform, informed by working with multiple large organizations that have built such platforms over the last five years.

17:25–18:05 Wednesday, 1 May 2019

Mass production of AI solutions

Location: Capital Suite 8/9

Secondary topics: Data Platforms

Nate Keating (Google)

Average rating:

(4.00, 5 ratings)

AI will change how we live in the next 30 years, but it's currently limited to a small group of companies. In order to scale the impact of AI across the globe, we need to reduce the cost of building AI solutions, but how? Nate Keating explains how to apply lessons learned from other industries, specifically the automobile industry, which went through a similar cycle.

17:25–18:05 Wednesday, 1 May 2019

Mastering streaming and pipelines: Designing and supporting the nervous system of your company

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: Data Integration and Data Pipelines, Financial Services, Streaming and realtime analytics

Ted Malaska (Capital One)

Average rating:

(4.12, 8 ratings)

The world of data is all about building the best path to support time and quality to value. 80% to 90% of the work is getting the data into the hands and tools that can create value. Ted Malaska takes you on a journey to investigate strategies and designs that can change the way your company looks at and approaches data.

17:25–18:05 Wednesday, 1 May 2019

Infinite retention using storage offloading with Apache Pulsar

Location: Capital Suite 4

Secondary topics: Streaming and realtime analytics

Karthik Ramasamy (Streamlio), Ivan Kelly (Streamlio)

This talk discusses how Apache Pulsar provides infinite retention of events in topics. We will discuss how the segment-oriented architecture allows unlimited topic growth, how you can keep costs down by using tiered storage, and how you can run ad hoc queries on the topic using SQL.

11:15–11:55 Thursday, 2 May 2019

Scaling Impala: Common mistakes and best practices

Location: S11 A

Manish Maheshwari (Cloudera)

Average rating:

(5.00, 1 rating)

Apache Impala is an MPP SQL query engine for planet-scale queries. When set up and used properly, Impala is able to handle hundreds of nodes and tens of thousands of queries hourly. Manish Maheshwari explains how to avoid pitfalls in Impala configuration (memory limits, admission pools, metadata management, statistics), along with best practices and anti-patterns for end users or BI applications.

11:15–11:55 Thursday, 2 May 2019

Big data analytics in the public cloud: Challenges and opportunities

Location: S11 B

Secondary topics: AI and Data technologies in the cloud

Jian Zhang (Intel), Chendi Xue (Intel), Yuan Zhou (Intel)

Average rating:

(4.50, 2 ratings)

Jian Zhang, Chendi Xue, and Yuan Zhou explore the challenges of migrating big data analytics workloads to the public cloud (e.g., performance loss and missing features) and demonstrate how to use a new in-memory data accelerator leveraging persistent memory and RDMA NICs to resolve these issues and enable new opportunities for big data workloads on the cloud.

11:15–11:55 Thursday, 2 May 2019

Half-correct and half-wrong collective data wisdom: 3 patterns to sanity

Location: Capital Suite 8/9

Secondary topics: Data preparation, data governance, and data lineage, Financial Services

Sandeep Uttamchandani (Intuit)

Average rating:

(4.67, 3 ratings)

Teams today rely on dictionaries of collective wisdom, which are a mixed bag with regard to correctness: some datasets have accurate attribute details, while others are incorrect and outdated. This significantly impacts the productivity of analysts and scientists. Sandeep Uttamchandani outlines three patterns to better manage data dictionaries.

11:15–11:55 Thursday, 2 May 2019

Transforming a financial services data infrastructure for the modern era by building a PCI DSS-compliant data platform from the ground up on AWS

Location: Capital Suite 10/11

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Financial Services, Security and Privacy

Eoin O'Flanagan (NewDay), Darragh McConville (Kainos)

Average rating:

(4.86, 7 ratings)

Eoin O'Flanagan and Darragh McConville explain how NewDay built a high-performance contemporary data processing platform from the ground up on AWS. Join in to explore the company's journey from a traditional legacy onsite data estate to an entirely cloud-based PCI DSS-compliant platform.

11:15–11:55 Thursday, 2 May 2019

Streaming at Lyft

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: Data Platforms, Streaming and realtime analytics, Transportation and Logistics

Thomas Weise (Lyft)

Average rating:

(4.50, 14 ratings)

Fast data and stream processing are essential for making Lyft rides a good experience for passengers and drivers. Lyft's systems need to track and react to event streams in real time to update locations, compute routes and estimates, balance prices, and more. Thomas Weise offers an overview of the streaming platform that powers these use cases.

12:05–12:45 Thursday, 2 May 2019

Application intelligence: Bridging the gap between human expertise and machine learning

Location: Capital Suite 10/11

Secondary topics: AI and machine learning in the enterprise

Rebecca Simmonds (Red Hat), Michael McCune (Red Hat)

Average rating:

(3.00, 6 ratings)

Artificial intelligence and machine learning are now popularly used terms, but how do you make use of these techniques without throwing away the valuable knowledge of experienced employees? Rebecca Simmonds and Michael McCune delve into this idea with examples of how distributed machine learning frameworks fit together naturally with business rules management systems.

12:05–12:45 Thursday, 2 May 2019

Schema on read and the new logging way

Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Data Platforms, Streaming and realtime analytics

David Josephsen (Sparkpost)

Average rating:

(3.50, 2 ratings)

David Josephsen tells the story of how Sparkpost's reliability engineering team abandoned ELK for a DIY schema-on-read logging infrastructure. Join in to learn the architectural details, trials, and tribulations from the company's Internal Event Hose data ingestion pipeline project, which uses Fluentd, Kinesis, Parquet, and AWS Athena to make logging sane.

12:05–12:45 Thursday, 2 May 2019

Herding elephants: Seamless data access in multicluster clouds

Location: S11 B

Secondary topics: AI and Data technologies in the cloud, Data Platforms

Pradeep Bhadani (Hotels.com), Elliot West (Hotels.com)

Average rating:

(4.17, 6 ratings)

Travel platform Expedia Group likes to give its data teams flexibility and autonomy to work with different technologies. However, this approach generates challenges that cannot be solved by existing tools. Pradeep Bhadani and Elliot West explain how the company built a unified virtual data lake on top of its many heterogeneous and distributed data platforms.

12:05–12:45 Thursday, 2 May 2019

Data science at Deutsche Telekom: Predicting global travel patterns and network demand

Location: Capital Suite 8/9

Secondary topics: Data Platforms, Security and Privacy, Transportation and Logistics

Václav Surovec (Deutsche Telekom), Gabor Kotalik (Deutsche Telekom)

Average rating:

(4.00, 2 ratings)

Knowledge of customers' location and travel patterns is important for many companies, including German telco service operator Deutsche Telekom. Václav Surovec and Gabor Kotalik explain how a commercial roaming project using Cloudera Hadoop helped the company better analyze the behavior of its customers from 10 countries and provide better predictions and visualizations for management.

12:05–12:45 Thursday, 2 May 2019

Unleashing Apache Kafka and TensorFlow in hybrid architectures

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud, Model lifecycle management

Kai Wähner (Confluent)

Average rating:

(4.75, 8 ratings)

How do you leverage the flexibility and extreme scale of the public cloud and the Apache Kafka ecosystem to build scalable, mission-critical machine learning infrastructures that span multiple public clouds or bridge your on-premises data center to the cloud? Join Kai Wähner to learn how to use technologies such as TensorFlow with Kafka’s open source ecosystem for machine learning infrastructures.

14:05–14:45 Thursday, 2 May 2019

Mutant tests too: The SQL

Location: S11 A

Elliot West (Hotels.com), Jaydene Green (Hotels.com)

Average rating:

(3.00, 3 ratings)

Elliot West and Jay Green share approaches for applying software engineering best practices to SQL-based data applications to improve maintainability and data quality. Using open source tools, Elliot and Jay show how to build effective test suites for Apache Hive code bases and offer an overview of Mutant Swarm, a tool to identify weaknesses in tests and to measure SQL code coverage.

14:05–14:45 Thursday, 2 May 2019

The vegan data diet: How Wikipedia cuts down privacy issues while keeping data fit

Location: S11 B

Secondary topics: Security and Privacy

Marcel Ruiz Forns (Wikimedia Foundation)

Average rating:

(4.75, 4 ratings)

Analysts and researchers studying Wikipedia are hungry for long-term data to build experiments and feed data-driven decisions. But Wikipedia has a strict privacy policy that prevents storing privacy-sensitive data over 90 days. Marcel Ruiz Forns explains how the Wikimedia Foundation's analytics team is working on a vegan data diet to satisfy both. Read more.

14:05–14:45 Thursday, 2 May 2019

Unlocking insights in AI by building a feature store

Location: Capital Suite 8/9

Secondary topics: AI and Data technologies in the cloud, AI and machine learning in the enterprise, Data Platforms, Transportation and Logistics

Willem Pienaar (GOJEK), Zhi Ling Chen (GOJEK)

Average rating:

(4.80, 5 ratings)

Features are key to driving impact with AI at all scales, allowing organizations to dramatically accelerate innovation and time to market. Willem Pienaar and Zhiling Chen explain how GOJEK, Indonesia's first billion-dollar startup, unlocked insights in AI by building a feature store called Feast, and the lessons they learned along the way. Read more.

14:05–14:45 Thursday, 2 May 2019

Simplicity at scale: How Cloudflare analyses some of the world’s largest DDoS attacks

Location: Capital Suite 10/11

Secondary topics: Security and Privacy, Streaming and realtime analytics

Tom Walwyn (Cloudflare)

Average rating:

(4.00, 1 rating)

Cloudflare powers nearly 10 percent of all Internet requests worldwide, absorbing some of the largest DDoS attacks. Learn how we use ClickHouse and SQL to simplify our data pipelines on a global scale while handling over 10 million events per second. Read more.

14:05–14:45 Thursday, 2 May 2019

Autoscaling Spark on Kubernetes

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: AI and Data technologies in the cloud

Holden Karau (Independent), Kris Nova (Independent)

Average rating:

(4.86, 7 ratings)

In the Kubernetes world, where declarative resources are a first-class citizen, running complicated workloads across distributed infrastructure is easy, and processing big data workloads using Spark is common practice, we can finally look at constructing a hybrid system of running Spark in a distributed cloud native way. Join respective experts Kris Nova and Holden Karau for a fun adventure. Read more.

14:55–15:35 Thursday, 2 May 2019

The future of cloud native data warehousing: Emerging trends and technologies

Location: S11 A

Secondary topics: AI and Data technologies in the cloud

Greg Rahn (Cloudera)

Average rating:

(3.00, 7 ratings)

Data warehouses have traditionally run in the data center, and in recent years, they've been adapted to be more cloud native. Greg Rahn discusses a number of emerging trends and technologies that will impact how data warehouses are run both in the cloud and on-premises and explains what that means for architects, administrators, and end users. Read more.

14:55–15:35 Thursday, 2 May 2019

Mastering data with Spark and machine learning

Location: S11 B

Secondary topics: Automation in data science and big data, Data preparation, data governance, and data lineage

Sonal Goyal (Nube)

Average rating:

(1.00, 4 ratings)

Enterprise data on customers, vendors, and products is often siloed and represented differently in diverse systems, hurting analytics, compliance, regulatory reporting, and 360 views. Traditional rule-based MDM systems with legacy architectures struggle to unify this growing data. Sonal Goyal offers an overview of a modern master data application using Spark, Cassandra, ML, and Elastic. Read more.

14:55–15:35 Thursday, 2 May 2019

Architecting a data platform to support analytic workflows for scientific data

Location: Capital Suite 8/9

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, IoT and its applications

Jane McConnell (Teradata), Sun Maria Lehmann (Equinor)

Average rating:

(3.67, 3 ratings)

In upstream oil and gas, a vast amount of the data requested for analytics projects is scientific data: physical measurements about the real world. Historically, this data has been managed library style, but a new system was needed to best provide this data. Sun Maria Lehmann and Jane McConnell share architectural best practices learned from their work with subsurface data. Read more.

14:55–15:35 Thursday, 2 May 2019

Learning how to perform ETL data migrations with open source tool Embulk

Location: Capital Suite 10/11

Secondary topics: Data Integration and Data Pipelines

Jason Bell (Independent Speaker)

Average rating:

(5.00, 1 rating)

The Embulk data migration tool offers a convenient way to load data into a variety of systems with basic configuration. Jason Bell offers an overview of the Embulk tool and outlines some common data migration scenarios that a data engineer could employ using the tool. Read more.

14:55–15:35 Thursday, 2 May 2019

Performant time series data management and analytics with PostgreSQL

Location: Expo Hall 2 (Capital Hall N24)

Secondary topics: Streaming and realtime analytics, Temporal data and time-series

Michael Freedman (TimescaleDB | Princeton University)

Average rating:

(4.75, 4 ratings)

Time series databases require ingesting high volumes of structured data, answering complex, performant queries for recent and historical time intervals, and performing specialized time-centric analysis and data management. Michael Freedman explains how to avoid these operational problems by reengineering Postgres to serve as a general data platform, including high-volume time series workloads. Read more.

16:35–17:15 Thursday, 2 May 2019

Deep learning with TensorFlow and Spark using GPUs and Docker containers

Location: S11 A

Secondary topics: AI and Data technologies in the cloud, Data Platforms

Thomas Phelan (HPE BlueData)

Average rating:

(3.29, 7 ratings)

Organizations need to keep ahead of their competition by using the latest AI, ML, and DL technologies such as Spark, TensorFlow, and H2O. The challenge is in how to deploy these tools and keep them running in a consistent manner while maximizing the use of scarce hardware resources, such as GPUs. Thomas Phelan discusses the effective deployment of such applications in a container environment. Read more.

16:35–17:15 Thursday, 2 May 2019

Migrating Apache Oozie workflows to Apache Airflow

Location: S11 B

Secondary topics: Data Integration and Data Pipelines

Feng Lu (Google Cloud), James Malone (Google), Apurva Desai (Google Cloud), Cameron Moberg (Truman State University | Google Cloud)

Average rating:

(4.00, 3 ratings)

Apache Oozie and Apache Airflow (incubating) are both widely used workflow orchestration systems, the former focusing on Apache Hadoop jobs. Feng Lu, James Malone, Apurva Desai, and Cameron Moberg explore an open source Oozie-to-Airflow migration tool developed at Google as a part of creating an effective cross-cloud and cross-system solution. Read more.

16:35–17:15 Thursday, 2 May 2019

From legacy to cloud: An end-to-end data integration journey

Location: Capital Suite 8/9

Secondary topics: AI and Data technologies in the cloud, Data Integration and Data Pipelines, Retail and e-commerce

Max Schultze (Zalando SE)

Average rating:

(4.83, 12 ratings)

Max Schultze details Zalando's end-to-end data integration platform to serve analytical use cases and machine learning throughout the company, covering raw data collection, standardized data preparation (binary conversion, partitioning, etc.), user-driven analytics, and machine learning. Read more.

Data Engineering & Architecture

Learn to build an analytics infrastructure that unlocks the value of your data
