Data Engineering and Architecture: Data science + business analytics training: Strata Data Conference

Wednesday, September 25: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am \| Location: 3E Strata Data Conference Keynotes
10:50 Morning break

Thursday, September 26: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am \| Location: 3E Strata Data Conference Keynotes
10:50 Morning break

9:00am - 5:00pm Monday, September 23 & Tuesday, September 24

Location: 1A 17

SOLD OUT: Building a serverless big data application on AWS

Secondary topics: Cloud Platforms and SaaS, Data Integration and Data Processing, Data, Analytics, and AI Architecture, Deep dive into specific tools, platforms, or frameworks

Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Nikki Rouda (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services)

Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.

9:00am - 5:00pm Monday, September 23 & Tuesday, September 24

Location: 1E 06

Professional Kafka development

Secondary topics: Data Integration and Data Processing, Deep dive into specific tools, platforms, or frameworks

Jesse Anderson (Big Data Institute)

Jesse Anderson offers you an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it, as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL. Read more.

9:00am–12:30pm Tuesday, September 24, 2019

Location: 1E 08

Learning Presto: SQL on anything

Secondary topics: BI, Interactive Analytics and Visualization, Data Management and Storage, Deep dive into specific tools, platforms, or frameworks

Matt Fuller (Starburst)

Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today. Read more.

9:00am–12:30pm Tuesday, September 24, 2019

Location: 1E 09

Serverless streaming architectures and algorithms for the enterprise

Secondary topics: Cloud Platforms and SaaS, Data, Analytics, and AI Architecture, Streaming and IoT, Temporal data and time-series analytics

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Anurag Khandelwal (Yale University)

Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems and examine the inception and growth of the serverless paradigm. You'll take a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and get a bird’s-eye view of the application domains where you can leverage Pulsar functions. Read more.

9:00am–12:30pm Tuesday, September 24, 2019

Location: 1E 10

Real-time SQL stream processing at scale with Apache Kafka and KSQL

Secondary topics: Data Integration and Data Processing, Deep dive into specific tools, platforms, or frameworks, Streaming and IoT

Viktor Gamov (Confluent)

Building stream processing applications is certainly one of the hot topics in the IT community. But if you've ever thought you needed to be a programmer to do stream processing and build stream processing data pipelines, think again. Viktor Gamov explores KSQL, the stream processing query engine built on top of Apache Kafka. Read more.

9:00am–12:30pm Tuesday, September 24, 2019

Location: 1E 11

Cloudera Edge Management in the IoT

Secondary topics: Deep dive into specific tools, platforms, or frameworks, Streaming and IoT

Purnima Reddy Kuchikulla (Cloudera), Timothy Spann (Cloudera), Abdelkrim Hadjidj (Cloudera), Andre Araujo (Cloudera), Hemanth Yamijala (Cloudera)

There are too many edge devices and agents, and you need to control and manage them. Purnima Reddy Kuchikulla, Timothy Spann, Abdelkrim Hadjidj, and Andre Araujo walk you through handling the difficulty in collecting real-time data and the trouble with updating a specific set of agents with edge applications. Get your hands dirty with CEM, which addresses these challenges with ease. Read more.

9:00am–12:30pm Tuesday, September 24, 2019

Location: 1E 14

Running multidisciplinary big data workloads in the cloud with CDP

Secondary topics: Cloud Platforms and SaaS, Data Management and Storage

James Morantus (Cloudera), Tony Huinker (Cloudera), Naren Koneru (Cloudera), Ramachandran Venkatesh (Cloudera), Gunther Hagleitner (Cloudera), Olli Draese (Cloudera)

Organizations now run diverse, multidisciplinary, big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature. There are many challenges with moving these workloads to the cloud. In this talk we start off with a technical deep... Read more.

9:00am–5:00pm Tuesday, September 24, 2019

Location: 1A 06

Data Case Studies

David Boyle (Audience Strategies), Richard Evans (Statistics Canada), Rosaria Silipo (KNIME), Leah Xu (Spotify), Arup Nanda (Capital One), Victoriya Kalmanovich (Navy), Tusharadri Mukherjee (Lenovo), Moise Convolbo (Rakuten), Martin Mendez-Costabel (Bayer Crop Science), Gloria Macia (F. Hoffmann-La Roche AG), Gwen Campbell (Revibe Technologies), Muhammed Idris (Capria VC | TeraCrunch)

From banking to biotech, retail to government, every business sector is changing in the face of abundant data. Get better at defining business problems and applying data solutions at Strata. Read more.

1:30pm–5:00pm Tuesday, September 24, 2019

Location: 1E 09

From relational databases to cloud databases: Using the right tool for the right job

Secondary topics: BI, Interactive Analytics and Visualization, Cloud Platforms and SaaS, Data Management and Storage, Data, Analytics, and AI Architecture

Gowrishankar Balasubramanian (Amazon Web Services), Rajeev Srinivasan (Amazon Web Services)

Enterprises adopt cloud platforms such as AWS for agility, elasticity, and cost savings. Database design and management require a different mindset in AWS than traditional RDBMS design does. Gowrishankar Balasubramanian and Rajeev Srinivasan explore considerations in choosing the right database for your use case and access pattern while migrating or building a new application on the cloud. Read more.

1:30pm–5:00pm Tuesday, September 24, 2019

Location: 1E 10

Foundations for successful data projects

Secondary topics: Culture and Organization

Ted Malaska (Capital One), Jonathan Seidman (Cloudera), Matthew Schumpert (Cloudera), Raman Rajasekhar (Cloudera), Krishna Maheshwari (Cloudera)

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects. Read more.

1:30pm–5:00pm Tuesday, September 24, 2019

Location: 1E 14

Kafka and Streams Messaging Manager (SMM) crash course

Secondary topics: Deep dive into specific tools, platforms, or frameworks, Streaming and IoT

Purnima Reddy Kuchikulla (Cloudera), Dan Chaffelson (Cloudera), Attila Kanto (Cloudera), Tony Wu (Cloudera)

Kafka is omnipresent and the backbone of streaming analytics applications and data lakes. The challenge is understanding what's going on overall in the Kafka cluster, including performance, issues, and message flows. Purnima Reddy Kuchikulla and Dan Chaffelson walk you through a hands-on experience to visualize the entire Kafka environment end-to-end and simplify Kafka operations via SMM. Read more.

1:30pm–5:00pm Tuesday, September 24, 2019

Location: 1E 15/16

Hands-on machine learning with Kafka-based streaming pipelines

Secondary topics: Model Development, Governance, Operations

Boris Lublinsky (Lightbend), Dean Wampler (Anyscale)

Boris Lublinsky and Dean Wampler examine ML use in streaming data pipelines, how to do periodic model retraining, and low-latency scoring in live streams. Learn about Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, metadata tracking, and more. Read more.

1:30pm–5:00pm Tuesday, September 24, 2019

Location: 1E 12/13

Architecting a data platform for enterprise use

Secondary topics: BI, Interactive Analytics and Visualization, Cloud Platforms and SaaS, Data, Analytics, and AI Architecture

Mark Madsen (Teradata), Todd Walter (Archimedata)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.

9:05am–9:15am Wednesday, September 25, 2019

Location: 3E

Recent trends in data and machine learning technologies

Ben Lorica (O'Reilly)

Ben Lorica dives into emerging technologies for building data infrastructures and machine learning platforms. Read more.

11:20am–12:00pm Wednesday, September 25, 2019

Location: 1A 15/16

Building a multitenant data processing and model inferencing platform with Kafka Streams

Secondary topics: Data Integration and Data Processing, Data, Analytics, and AI Architecture, Retail and e-commerce, Streaming and IoT

Navinder Pal Singh Brar (Walmart Labs)

Each week 275 million people shop at Walmart, generating interaction and transaction data. Navinder Pal Singh Brar explains how the customer backbone team enables extraction, transformation, and storage of customer data to be served to other teams. At 5 billion events per day, the Kafka Streams cluster processes events from various channels and maintains a uniform identity of a customer. Read more.

11:20am–12:00pm Wednesday, September 25, 2019

Location: 1A 21/22

Scaling data engineers

Secondary topics: Culture and Organization, Financial Services, Model Development, Governance, Operations

Evgeny Vinogradov (Yandex.Money)

With a microservice architecture, a data warehouse is the first place where all the data meets. It's supplied by many different data sources and used for many purposes—from near-online transactional processing (OLTP) to model fitting and real-time classifying. Evgeny Vinogradov details his experience in managing and scaling data for support of 20+ product teams. Read more.

11:20am–12:00pm Wednesday, September 25, 2019

Location: 1A 23/24

Building an AI platform: Key principles and lessons learned

Secondary topics: Data, Analytics, and AI Architecture

Moty Fania (Intel)

Moty Fania details Intel’s IT experience of implementing a sales AI platform. This platform is based on a streaming, microservices architecture with a message bus backbone. Designed for real-time data extraction and reasoning, it processes millions of website pages and can sift through millions of tweets per day. Read more.

11:20am–12:00pm Wednesday, September 25, 2019

Location: 1E 07/08

Kubernetes for stateful MPP systems

Secondary topics: Cloud Platforms and SaaS, Data Management and Storage, Data, Analytics, and AI Architecture

Paige Roberts (Vertica), Deepak Majeti (Vertica)

GoodData needed to autorecover from node failures and scale rapidly when workloads spiked on their MPP database in the cloud. Kubernetes could solve it, but it's for stateless microservices, not a stateful MPP database that needs hundreds of containers. Paige Roberts and Deepak Majeti detail the hurdles GoodData needed to overcome in order to merge the power of the database with Kubernetes. Read more.

11:20am–12:00pm Wednesday, September 25, 2019

Location: 1E 09

Data security and privacy anti-patterns

Secondary topics: Data Management and Storage, Privacy and Security

Steven Touw (Immuta)

Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data security and privacy, they’re everywhere. Over the past four years, data security and privacy anti-patterns have emerged across hundreds of customers and industry verticals—there's been an obvious trend. Steven Touw details five anti-patterns and, more importantly, the solutions for them. Read more.

1:15pm–1:55pm Wednesday, September 25, 2019

Location: 1A 15/16

Now you see me; now you compute: Building event-driven architectures with Apache Kafka

Secondary topics: Data, Analytics, and AI Architecture, Deep dive into specific tools, platforms, or frameworks

Michael Noll (Confluent)

Would you cross the street with traffic information that's a minute old? Certainly not. Modern businesses have the same needs. Michael Noll explores why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, you'll look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer. Read more.

1:15pm–1:55pm Wednesday, September 25, 2019

Location: 1A 21/22

A productive data science platform: Beyond a hosted-notebooks solution at LinkedIn

Secondary topics: Data, Analytics, and AI Architecture, Media and Advertising

Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)

Join Swasti Kakker, Manu Ram Pandit, and Vidya Ravivarma to explore what's offered by a flexible and scalable hosted data science platform at LinkedIn. It provides features to develop seamlessly in multiple languages, enforce developer best practices and governance policies, execute and visualize solutions, manage knowledge efficiently, and collaborate, all to improve developer productivity. Read more.

1:15pm–1:55pm Wednesday, September 25, 2019

Location: 1A 23/24

Sharing is caring: Using Egeria to establish true enterprise metadata governance

Secondary topics: Data quality, data governance and data lineage, Deep dive into specific tools, platforms, or frameworks

Wim Stoop (Cloudera), Srikanth Venkat (Cloudera)

Establishing enterprise-wide security and governance remains a challenge for most organizations. Integrations and exchanges across the landscape are costly to manage and maintain, and typically work in one direction only. Wim Stoop and Srikanth Venkat explore how ODPi's Egeria standard and framework removes the challenges and is leveraged by Cloudera and partners alike to deliver value. Read more.

1:15pm–1:55pm Wednesday, September 25, 2019

Location: 1E 07/08

Your easy move to serverless computing and radically simplified data processing

Secondary topics: Cloud Platforms and SaaS, Data Integration and Data Processing

Gil Vernik (IBM)

Most analytic flows can benefit from serverless, from simple cases to complex data preparations for AI frameworks like TensorFlow. To address the challenge of how to easily integrate serverless without major disruptions to your system, Gil Vernik explores the “push to the cloud” experience, which dramatically simplifies serverless for big data processing frameworks. Read more.

1:15pm–1:55pm Wednesday, September 25, 2019

Location: 1E 09

Parquet modular encryption: Confidentiality and integrity of sensitive column data

Secondary topics: Deep dive into specific tools, platforms, or frameworks, Health and Medicine, Privacy and Security

Gidon Gershinsky (IBM)

The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off on by the community management committee. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases. Read more.

2:05pm–2:45pm Wednesday, September 25, 2019

Location: 1A 21/22

From raw data to informed intelligence: Democratizing data science and ML at Uber

Secondary topics: Data, Analytics, and AI Architecture, Transportation and Logistics

Atul Gupte (Uber)

Uber is changing the way people think about transportation. As an integral part of the logistical fabric in 65+ countries around the world, it uses ML and advanced data science to power every aspect of the Uber experience—from dispatch to customer support. Atul Gupte and Nikhil Joshi explore how Uber enables teams to transform insights into intelligence and facilitate critical workflows. Read more.

2:05pm–2:45pm Wednesday, September 25, 2019

Location: 1A 23/24

The evolution of metadata: LinkedIn’s story

Secondary topics: Data quality, data governance and data lineage, Media and Advertising

Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)

Imagine scaling metadata to an organization of 10,000 employees, 1M+ data assets, and an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn’s metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars. Read more.

2:05pm–2:45pm Wednesday, September 25, 2019

Location: 1E 07/08

Orchestrating data workflows using a fully serverless architecture

Secondary topics: Cloud Platforms and SaaS, Data, Analytics, and AI Architecture

Tomer Levi (Fundbox)

Use of data workflows is a fundamental functionality of any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. Tomer Levi walks you through how the data engineering team at Fundbox uses AWS serverless technologies to address this problem and how it enables data scientists, BI devs, and engineers to move faster. Read more.

2:05pm–2:45pm Wednesday, September 25, 2019

Location: 1E 09

Building a best-in-class data lake on AWS and Azure

Secondary topics: BI, Interactive Analytics and Visualization, Cloud Platforms and SaaS, Data Management and Storage

Tomer Shiran (Dremio), Jacques Nadeau (Dremio)

Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. Tomer Shiran and Jacques Nadeau explain how you can build best-in-class data lakes in the cloud, leveraging open source technologies and the cloud's elasticity to run and optimize workloads simultaneously. Read more.

2:55pm–3:35pm Wednesday, September 25, 2019

Location: 1A 15/16

How Orange Financial combats financial fraud over 50M transactions a day using Apache Pulsar

Secondary topics: Data, Analytics, and AI Architecture, Financial Services, Streaming and IoT, Telecom

Weisheng Xie (Orange Financial), Jia Zhai (StreamNative)

Orange Financial is a fintech company of China Telecom with half a billion registered users and 41 million monthly active users, and risk control decision deployment has been critical to its success. Weisheng Xie and Jia Zhai explore how the company leverages Apache Pulsar to boost the efficiency of its risk control decision development for combating financial fraud across more than 50 million transactions a day. Read more.

2:55pm–3:35pm Wednesday, September 25, 2019

Location: 1A 23/24

Turning big data into knowledge: Managing metadata and data relationships at Uber's scale

Secondary topics: Data quality, data governance and data lineage, Transportation and Logistics

Kaan Onuk (Uber), Luyao Li (Uber), Atul Gupte (Uber)

Uber takes being data driven to the next level. A robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata isn't just nice to have; it's absolutely integral to making data useful. Kaan Onuk, Luyao Li, and Atul Gupte explore the current state of metadata management, end-to-end data flow solutions at Uber, and what’s coming next. Read more.

2:55pm–3:35pm Wednesday, September 25, 2019

Location: 1E 07/08

Time travel for data pipelines: Solving the mystery of what changed

Secondary topics: Data Integration and Data Processing, Data quality, data governance and data lineage

Shradha Ambekar (Intuit), Sunil Goplani (Intuit), Sandeep Uttamchandani (Intuit)

A business insight shows a sudden spike. It can take hours, or days, to debug data pipelines to find the root cause. Shradha Ambekar, Sunil Goplani, and Sandeep Uttamchandani outline how Intuit built a self-service tool that automatically discovers data pipeline lineage and tracks every change, helping debug the issues in minutes—establishing trust in data while improving developer productivity. Read more.

2:55pm–3:35pm Wednesday, September 25, 2019

Location: 1E 09

When machines fight machines: Cyberbattles and the new frontier of artificial intelligence

Secondary topics: Privacy and Security

Marcus Fowler (Darktrace)

Cybersecurity must find what it doesn’t know to look for. AI technologies have led to the emergence of self-learning, self-defending networks that achieve this, detecting and autonomously responding to in-progress attacks in real time. Marcus Fowler examines how these cyber-immune systems enable the security team to focus on high-value tasks, counter even machine-speed threats, and work in all environments. Read more.

4:35pm–5:15pm Wednesday, September 25, 2019

Location: 1A 15/16

Trill: The crown jewel of Microsoft’s streaming pipeline explained

Secondary topics: Cloud Platforms and SaaS, Data Integration and Data Processing, Media and Advertising, Streaming and IoT

James Terwilliger (Microsoft Corporation), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)

Trill has been open-sourced, making the streaming engine behind services like the Bing Ads platform available for all to use and extend. James Terwilliger, Badrish Chandramouli, and Jonathan Goldstein dive into the history of and insights from streaming data at Microsoft. They demonstrate how its API can power complex application logic and the performance that gives the engine its name. Read more.

4:35pm–5:15pm Wednesday, September 25, 2019

Location: 1A 21/22

Downscaling: The Achilles heel of autoscaling Spark clusters

Secondary topics: Deep dive into specific tools, platforms, or frameworks

Prakhar Jain (Microsoft), Sourabh Goyal (Qubole)

Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs. Upscaling a cluster in the cloud is fairly easy compared with downscaling nodes, and so the overall total cost of ownership (TCO) goes up. Prakhar Jain and Sourabh Goyal examine a new design to get efficient downscaling, which helps achieve better resource utilization and lower TCO. Read more.

4:35pm–5:15pm Wednesday, September 25, 2019

Location: 1A 23/24

The case for a common metadata layer for machine learning platforms

Secondary topics: Data quality, data governance and data lineage

Max Neunhöffer (ArangoDB), Joerg Schad (ArangoDB)

Machine learning platforms are becoming more complex, with different components each producing its own metadata and using its own way of storing it. Max Neunhöffer and Joerg Schad propose a first draft of a common metadata API and demonstrate a first implementation of this API in Kubeflow using ArangoDB, a native multimodel database. Read more.

4:35pm–5:15pm Wednesday, September 25, 2019

Location: 1E 07/08

Apache Hadoop 3.x state of the union and upgrade guidance

Secondary topics: Deep dive into specific tools, platforms, or frameworks

Wangda Tan (Cloudera), Wei-Chiu Chuang (Cloudera)

Wangda Tan and Wei-Chiu Chuang outline the current status of the Apache Hadoop community and dive into the present and future of Hadoop 3.x. You'll get a peek at new features like erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, data node disk balancing, and more. They also walk you through upgrade guidance from 2.x to 3.x. Read more.

4:35pm–5:15pm Wednesday, September 25, 2019

Location: 1E 09

Protecting the healthcare enterprise from PHI breaches using streaming and NLP

Secondary topics: Health and Medicine, Privacy and Security

Jeff Zemerick (Mountain Fog)

Hospitals small and large are adopting cloud technologies, and many are in hybrid environments. These distributed environments pose challenges, none more critical than safeguarding protected health information (PHI). Jeff Zemerick explores how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment. Read more.