Sep 23–26, 2019

Data Engineering and Architecture

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools, and frameworks (open source or proprietary), platforms (on-premise, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Monday-Tuesday, September 23-24: 2-Day Training (Platinum & Training passes)
Tuesday, September 24: Tutorials (Gold & Silver passes)
Wednesday, September 25: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:50
Morning break
Thursday, September 26: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:50
Morning break
9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1A 17
Secondary topics:  Cloud Platforms and SaaS, Data Integration and Data Processing, Data, Analytics, and AI Architecture, Deep dive into specific tools, platforms, or frameworks
Jorge Lopez (Amazon Web Services)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join in to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more.
9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1E 06
Secondary topics:  Data Integration and Data Processing, Deep dive into specific tools, platforms, or frameworks
Jesse Anderson (Big Data Institute)
Jesse Anderson offers an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 09
Secondary topics:  Cloud Platforms and SaaS, Data, Analytics, and AI Architecture, Streaming and IoT, Temporal data and time-series analytics
Arun Kejariwal (Facebook), Karthik Ramasamy (Streamlio)
In this tutorial, we walk you through the landscape of streaming systems and give an overview of the inception and growth of the serverless paradigm. Next, we present a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and paint a bird’s-eye view of the application domains where Pulsar functions can be leveraged.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 10
Secondary topics:  Data Integration and Data Processing, Deep dive into specific tools, platforms, or frameworks, Streaming and IoT
Ricardo Ferreira (Confluent)
Building stream processing applications is certainly one of the hot topics in the IT community. Though much has been said on the subject, far more people talk about building stream processing applications than actually know how to do it. This tutorial aims to change this by introducing KSQL, the stream processing query engine built on top of Apache Kafka.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 11
Secondary topics:  Deep dive into specific tools, platforms, or frameworks, Streaming and IoT
Purnima Reddy Kuchikulla (Cloudera), Timothy Spann (Cloudera), Abdelkrim Hadjidj (Cloudera)
Too many edge devices and agents: how does one control and manage them? How do we handle the difficulty of collecting real-time data and, most importantly, the trouble of updating a specific set of agents with edge applications? Get your hands dirty with Cloudera Edge Management, which addresses these challenges with ease.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 14
Secondary topics:  Cloud Platforms and SaaS, Data Management and Storage
Jason Wang (Cloudera), Tony Wu (Cloudera), Vinithra Varadharajan (Cloudera)
Moving to the cloud poses challenges ranging from re-architecting to be cloud-native to keeping data context consistent across workloads that span multiple clusters on-prem and in the cloud. First, we’ll cover cloud architecture and its challenges in depth; second, you’ll use Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 12/13
Secondary topics:  Data Management and Storage, Deep dive into specific tools, platforms, or frameworks
Matt Fuller (Starburst)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today.
9:00am - 5:00pm Tuesday, September 24, 2019
Location: 1A 06
Richard Evans (Statistics Canada), Rosaria Silipo (KNIME), Leah Xu (Spotify), Arup Nanda (Priceline), Victoriya Kalmanovich (Navy), Shreya Sharma (Expedia Inc.), Martin Mendez-Costabel (Bayer Crop Science), Gloria Macia (Roche AG), Gwen Campbell (Revibe Technologies, Inc), Moise Convolbo (Rakuten)
From banking to biotech, retail to government, every business sector is changing in the face of abundant data. Get better at defining business problems and applying data solutions at Strata.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 09
Secondary topics:  Cloud Platforms and SaaS, Data, Analytics, and AI Architecture
Mark Madsen (Teradata), Todd Walter (Teradata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 10
Secondary topics:  Model Development, Governance, Operations
Boris Lublinsky (Lightbend), Dean Wampler (Lightbend)
This hands-on tutorial examines production use of ML in streaming data pipelines: how to do periodic model retraining and low-latency scoring in live streams. We'll discuss Kafka as the data backplane, pros and cons of microservices vs. systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, model metadata tracking, and other techniques.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 11
Secondary topics:  Culture and Organization
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. In this presentation, we’ll provide guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 14
Secondary topics:  Deep dive into specific tools, platforms, or frameworks, Streaming and IoT
Purnima Reddy Kuchikulla (Cloudera), Dan Chaffelson (Cloudera)
Kafka is omnipresent and is the backbone of not only streaming analytics applications but data lakes as well. The challenge is understanding what is going on overall in the Kafka cluster, including performance, issues, and message flows. This session gives you hands-on experience visualizing your entire Kafka environment end to end and simplifying Kafka operations via SMM.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 12/13
Secondary topics:  Cloud Platforms and SaaS, Data Management and Storage, Data, Analytics, and AI Architecture
Gowrishankar Balasubramanian (Amazon Web Services), Rajeev Srinivasan (Amazon Web Services)
Enterprises adopt cloud platforms such as AWS for agility, elasticity, and cost savings. Database design and management require a different mindset in AWS compared with traditional RDBMS design. In this session, you will learn important considerations in choosing the right database based on your use cases and access patterns while migrating an application or building a new application in the cloud.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 15/16
Secondary topics:  Data Integration and Data Processing, Data, Analytics, and AI Architecture, Retail and e-commerce, Streaming and IoT
Navinder Pal Singh Brar (Walmart Labs)
Each week, 275 million people shop at Walmart, generating multiple terabytes of interaction and transaction data. On the Customer Backbone team, we enable the extraction, transformation, and storage of customer data to be served to teams such as Ads and Personalisation. At 5 billion events per day, our Kafka Streams cluster processes events from various channels and maintains a uniform identity of a customer.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 21/22
Secondary topics:  Data, Analytics, and AI Architecture
Julien Le Dem (WeWork)
Big data is crucial to organizations. It is big not only in the volume of data but also in the multitude of data sources and teams using them. Having central data teams do all the work is outdated, as the entire organization becomes an ecosystem and central teams become enablers. We will discuss the principles of a data platform that enables the entire organization to build data-centric products.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 23/24
Secondary topics:  Data, Analytics, and AI Architecture
Moty Fania (Intel)
In this session, Moty Fania will share Intel’s IT experience of implementing a Sales AI platform. This platform is based on a streaming, microservices architecture with a message bus backbone. It was designed for real-time data extraction and reasoning. The platform processes millions of website pages and is capable of sifting through millions of tweets per day.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1E 07/08
Secondary topics:  Cloud Platforms and SaaS, Data Management and Storage, Data, Analytics, and AI Architecture
Paige Roberts (Vertica), Deepak Majeti (Vertica)
Analytics experts GoodData needed to auto-recover from node failures and scale rapidly when workloads spike on their MPP database in the cloud. Kubernetes could solve that, but K8s is for stateless microservices, not a stateful MPP database that needs hundreds of containers. To merge the power of an MPP database with the flexibility of Kubernetes, a lot of hurdles had to be overcome.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1E 09
Secondary topics:  Data Management and Storage, Privacy and Security
Steven Touw (Immuta)
Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data security and privacy, they’re everywhere. Over the past four years, we’ve seen data security and privacy anti-patterns consistently emerge across hundreds of customers and industry verticals; there has been an obvious trend. We’ll cover five anti-patterns and, more importantly, the solutions for them.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 15/16
Secondary topics:  Data, Analytics, and AI Architecture, Deep dive into specific tools, platforms, or frameworks
Michael Noll (Confluent)
Would you cross the street with traffic information that is a minute old? Certainly not! Modern businesses have the same needs. In this talk we cover why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, we look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 21/22
Secondary topics:  Data, Analytics, and AI Architecture, Media and Advertising
Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)
Come hear about the infrastructure and features offered by the flexible and scalable hosted data science platform at LinkedIn. The platform lets you develop seamlessly in multiple languages, enforce developer best practices and governance policies, execute and visualize solutions, and manage knowledge and collaborate efficiently, all of which improve developer productivity.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 23/24
Secondary topics:  Data quality, data governance and data lineage, Deep dive into specific tools, platforms, or frameworks
Wim Stoop (Cloudera)
Establishing enterprise-wide security and governance remains a challenge for most organisations. Integrations and exchanges across their landscape are costly to manage and maintain, and typically work in one direction only. In this session, we'll discuss how ODPi's Egeria standard and framework removes these challenges and is leveraged by Cloudera and partners alike to deliver value for customers.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1E 07/08
Secondary topics:  Cloud Platforms and SaaS, Data Integration and Data Processing
Gil Vernik (IBM)
Most analytic flows can benefit from serverless, from simple cases to complex data preparation for AI frameworks like TensorFlow. To address the challenge of how to easily integrate serverless without major disruptions to your system, we present a “push to the cloud” experience. This ability dramatically simplifies using serverless for different big data processing frameworks.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1E 09
Secondary topics:  Deep dive into specific tools, platforms, or frameworks, Health and Medicine, Privacy and Security
The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off by the community management committee. I will present the basics of Parquet encryption technology, its usage model, and a number of use cases.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1A 15/16
Secondary topics:  Deep dive into specific tools, platforms, or frameworks, Streaming and IoT
Stephan Ewen (Ververica), Aljoscha Krettek (data Artisans)
The talk discusses how stream processing is becoming a "grand unifying paradigm" for data processing and the newest developments in Apache Flink to support this trend: new cross-batch-streaming machine learning algorithms, state-of-the-art batch performance, and new building blocks for data-driven applications and application consistency.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1A 21/22
Secondary topics:  Data, Analytics, and AI Architecture, Transportation and Logistics
Atul Gupte (Uber Technologies Inc.), Nikhil Joshi (Uber)
At Uber, we’re changing the way people think about transportation. As an integral part of the logistical fabric in 65+ countries around the world, we’re using ML and advanced data science to power every aspect of the Uber experience, from dispatch to customer support. In this talk, we’ll explore how we enable teams at Uber to transform insights into intelligence and facilitate critical workflows.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1A 23/24
Secondary topics:  Data quality, data governance and data lineage, Media and Advertising
Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
How do you scale metadata to an organization of 10,000 employees and 1M+ data assets, in an AI-enabled company that ships code to the site three times a day? We describe the journey of LinkedIn’s metadata from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. Different metadata strategies and our battle scars will be revealed!
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1E 07/08
Secondary topics:  Cloud Platforms and SaaS, Data, Analytics, and AI Architecture
Tomer Levi (Fundbox)
Use of data workflows is a fundamental functionality of any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. In this talk, attendees will learn how the data engineering team at Fundbox uses AWS serverless technologies to address this problem, and how it enables data scientists, BI devs, and engineers to move faster.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1E 09
Secondary topics:  Cloud Platforms and SaaS, Privacy and Security
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
With cheap and infinitely scalable storage services such as S3 and ADLS, it has never been easier to dump data into a cloud data lake. But how do you secure that data and make sure it doesn't leak? In this talk we explore numerous capabilities for securing a cloud data lake, including authentication, access control, encryption (in motion and at rest) and auditing, as well as network protections.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1A 15/16
Secondary topics:  Data, Analytics, and AI Architecture, Financial Services, Streaming and IoT, Telecom
Weisheng Xie (China Telecom BestPay Co., Ltd), Sijie Guo (ASF)
As a fintech company of China Telecom with half a billion registered users and 41 million monthly active users, we have found risk control decision deployment critical to the success of the business. In this talk, we share how we leverage Apache Pulsar to boost the efficiency of our risk control decision development for combating financial fraud across 50 million transactions a day.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1A 21/22
Secondary topics:  Data, Analytics, and AI Architecture, Deep Learning
Kai Liu (Microsoft (BING))
Facilitating deep learning projects at large scale and in parallel requires some effort and innovation. Bing is now running a deployment of thousands of servers to address this challenge. We provide training services, offline data processing, vector hosting, and offline inferencing services to help data scientists through all steps in the project life cycle.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1A 23/24
Secondary topics:  Data quality, data governance and data lineage, Transportation and Logistics
Kaan Onuk (Uber), Luyao Li (Uber), Atul Gupte (Uber)
At Uber’s scale and pace of growth, a robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata is not just nice to have: it is absolutely integral to making data useful at Uber. In this talk, we will explore the current state of metadata management and end-to-end data flow solutions at Uber and what’s coming next.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1E 07/08
Secondary topics:  Data Integration and Data Processing, Data quality, data governance and data lineage
Shradha Ambekar (Intuit), Sunil Goplani (Intuit), Sandeep Uttamchandani (Intuit)
Imagine a business insight showing a sudden spike. Debugging data pipelines is non-trivial, and finding the root cause can take hours or even days! We’ll share how Intuit built a self-serve tool that automatically discovers data pipeline lineage and tracks every change that impacts a pipeline. This helps debug pipeline issues in minutes, establishing trust in data while improving developer productivity.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1E 09
Secondary topics:  Privacy and Security
Justin Fier (Darktrace)
Cyber security must find what it doesn’t know to look for. AI technologies have led to the emergence of self-learning, self-defending networks that achieve this – detecting and autonomously responding to in-progress attacks in real time. These cyber immune systems enable the security team to focus on high-value tasks, can counter even machine-speed threats, and work in all environments.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 15/16
Secondary topics:  Cloud Platforms and SaaS, Data Integration and Data Processing, Media and Advertising, Streaming and IoT
James Terwilliger (Microsoft Corporation), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)
Trill has been open-sourced, making the streaming engine behind services like the multi-billion-dollar Bing Ads platform available for all to use and extend. We give a brief history of streaming data at Microsoft and lessons learned. We then demonstrate how its API can power complex application logic, and the performance that gives the engine its name: a trillion events per day per node.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 21/22
Secondary topics:  Deep dive into specific tools, platforms, or frameworks
Prakhar Jain (Qubole), Sourabh Goyal (Qubole)
Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Upscaling a cluster in the cloud is fairly easy compared with downscaling nodes, so overall total cost of ownership (TCO) goes up. We will talk about a new design for efficient downscaling, which helps achieve better resource utilization and thus lower TCO.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 23/24
Secondary topics:  Data quality, data governance and data lineage
Max Neunhöffer (ArangoDB), Joerg Schad (Suki)
Machine learning platforms are becoming more complex, with different components each producing their own metadata. Currently, most components provide their own way of storing metadata. In this talk, we propose a first draft of a common Metadata API and demo a first implementation of this API in Kubeflow using ArangoDB, which is a native multi-model database.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1E 07/08
Secondary topics:  Deep dive into specific tools, platforms, or frameworks
Wangda Tan (Cloudera), Jitendra Pandey (Hortonworks)
In this talk, we’ll start with the current status of the Apache Hadoop community, then move on to the exciting present and future of Hadoop 3.x. We will cover new features such as erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, and data node disk balancing. We will also offer upgrade guidance from 2.x to 3.x.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1E 09
Secondary topics:  Health and Medicine, Privacy and Security
Jeff Zemerick (Mountain Fog)
This talk describes how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 15/16
Secondary topics:  Data, Analytics, and AI Architecture, Streaming and IoT
Bas Geerdink (ING)
Streaming analytics (or fast data processing) is the field of making predictions on real-time data. In this talk, I'll present a fast data architecture that covers many use cases and follows a 'pipes and filters' pattern. This architecture can be used to create enterprise-grade solutions with a diversity of technology options. The stack is Kafka, Impala, and Spark Structured Streaming (KISSS).
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 21/22
Secondary topics:  Data, Analytics, and AI Architecture, Deep dive into specific tools, platforms, or frameworks
Chenzhao Guo (Intel Asia-Pacific Research & Development Ltd.), Carson Wang (Intel)
Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumptions of collocated storage do not always hold in today’s data centers. We implemented a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends. This makes life easier for customers who want to leverage the latest storage hardware and for HPC customers.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 23/24
Secondary topics:  Data quality, data governance and data lineage, Data, Analytics, and AI Architecture
Naghman Waheed (Bayer Crop Science), John Cooper (Bayer)
As the complexity of data systems has grown at Bayer, so has the difficulty of locating and understanding which data sets are available for consumption. To address this challenge, a custom metadata management tool was recently deployed as a new capability at Bayer. The system is cloud-enabled and uses multiple open source components, including machine learning and natural language processing, to aid search.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1E 09
Secondary topics:  Media and Advertising, Privacy and Security
Matt Carothers (Cox Communications), Jignesh Patel (Cox Communications)
Organizations often work with sensitive information such as social security numbers and credit card information. Although this data is stored in encrypted form, most analytical operations, ranging from data analysis to advanced machine learning algorithms, require data decryption for computation. This creates unwanted exposure to theft or unauthorized access.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1A 15/16
Secondary topics:  Data Management and Storage, Streaming and IoT
Michael Freedman (TimescaleDB)
Leveraging polyglot solutions for your time-series data can lead to a variety of issues including engineering complexity, operational challenges, and even referential integrity concerns. By re-engineering Postgres to serve as a general data platform, your high-volume time-series workloads will be better streamlined, resulting in more actionable data and greater ease of use.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1A 21/22
Secondary topics:  Streaming and IoT, Temporal data and time-series analytics
Stavros Kontopoulos (Lightbend), Debasish Ghosh (Lightbend)
In this talk, we discuss online machine learning algorithm choices for streaming applications. We motivate the discussion with resource-constrained use cases like IoT and personalization. We cover Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms, all the way from implementation to production deployment, describing the pros and cons of using each of them.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1A 23/24
Secondary topics:  Cloud Platforms and SaaS, Data, Analytics, and AI Architecture, Media and Advertising
Jing Huang (SurveyMonkey), Jessica Mong (SurveyMonkey)
You are a SaaS company that operates on cloud infrastructure built before the ML era. How do you successfully extend your existing infrastructure to leverage the power of ML? In this case study, you will learn critical lessons from SurveyMonkey’s journey of expanding its ML capabilities with its rich data repository and hybrid cloud infrastructure.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1E 07/08
Secondary topics:  Data Integration and Data Processing
Petar Zecevic (SV Group d.o.o.)
The Large Synoptic Survey Telescope, or LSST, is one of the most important future surveys. Its unique design will allow it to cover large regions of the sky and obtain images of the faintest objects. In its 10 years of operation, it will produce about 80 PB of data, both images and catalog data. I will present AXS, a system we built for fast processing and cross-matching of survey catalog data.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1E 09
Secondary topics:  Cloud Platforms and SaaS, Data Management and Storage
Rick Houlihan (Amazon Web Services)
Data has always been relational, and it always will be. NoSQL databases are gaining in popularity, but that does not change the fact that the data they manage is still relational; it just changes how we have to model the data. This session dives deep into how real Entity Relationship Models can be efficiently modeled in a denormalized manner using schema examples from real application services.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 15/16
Secondary topics:  Data Management and Storage, Deep dive into specific tools, platforms, or frameworks
Alon Gavra (AppsFlyer)
Kafka is often just a piece of the production stack that no one wants to touch, because it just works. At AppsFlyer, Kafka sits at the core of our infrastructure, which processes billions of events daily.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 21/22
Secondary topics:  Model Development, Governance, Operations
Jim Scott (MapR Technologies)
Data scientists are creating and testing hundreds or thousands more models than in the past. Models require support from both real-time and static data sources. As data becomes enriched and parameters are tuned and explored, there is a need for versioning everything, including the data. We will discuss the very specific problems and approaches to fix them.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 23/24
Secondary topics:  Deep dive into specific tools, platforms, or frameworks, Transportation and Logistics
Omkar Joshi (Uber Technologies), Bo Yang (uber inc)
Omkar Joshi and Bo Yang offer an overview of how Uber’s ingestion (Marmaray) and observability team improved the performance of Apache Spark applications running on thousands of cluster machines and across 100,000+ applications, and how they methodically tackled these issues. They also cover how they used Uber’s open source jvm-profiler to debug issues at scale.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1E 07/08
Secondary topics:  Cloud Platforms and SaaS, Data, Analytics, and AI Architecture
Jason Wang (Cloudera), Sushant Rao (Cloudera)
We’ll give you an actionable understanding of cloud architecture and the different approaches customers took in their journey to the cloud. We start with the different ways we’ve seen customers be successful in the cloud. We then dive deep into the decisions they made and how those decisions drove their cloud architecture. Along the way, we review problems they overcame, lessons learned, and core cloud paradigms.
Add to your personal schedule
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1E 09
Secondary topics:  BI, Interactive Analytics and Visualization
Shant Hovsepian (Arcadia Data)
With cloud object storage (e.g., S3, ADLS), one expects business intelligence (BI) applications to benefit from the scale of data and real-time analytics. However, traditional BI in the cloud surfaces non-obvious challenges. This talk reviews service-oriented cloud design (storage, compute, catalog, security, SQL) and shows how native cloud BI provides analytic depth, low cost, and performance. Read more.
Add to your personal schedule
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 15/16
Secondary topics:  Data Integration and Data Processing, Data, Analytics, and AI Architecture, Retail and e-commerce, Streaming and IoT
Karthik Ramasamy (Streamlio), Anand Madhavan (Narvar)
Narvar provides a next-generation post-transaction experience for more than 500 retailers. This talk explores how Narvar moved away from a slew of technologies for its platform and consolidated its use cases on Apache Pulsar. Read more.
Add to your personal schedule
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 21/22
Secondary topics:  Model Development, Governance, Operations
Diego Oppenheimer (Algorithmia)
Machine Learning (ML) will fundamentally change the way we build and maintain applications. How can we adapt our infrastructure, operations, staffing, and training to meet the challenges of the new Software Development Life Cycle (SDLC) without throwing away everything that already works? Read more.
Add to your personal schedule
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 23/24
Secondary topics:  Data Integration and Data Processing, Data Management and Storage, Data, Analytics, and AI Architecture, Transportation and Logistics
Reza Shiftehfar (Uber Technologies)
Building a reliable big data platform is extremely challenging when it has to store and serve hundreds of petabytes of data in real time. This talk reflects on the challenges faced and proposes architectural solutions for scaling a big data platform to ingest, store, and serve 100+ PB of data with minute-level latency while using hardware efficiently and meeting security needs. Read more.
Add to your personal schedule
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1E 07/08
Secondary topics:  Data Integration and Data Processing, Data quality, data governance and data lineage
Nikki Rouda (Amazon Web Services), Roy Hasson (Amazon Web Services)
Learn how to deduplicate or link records in a dataset, even when the records don’t have a common unique identifier and no fields match exactly. Link customer records across different databases (e.g., with different name spellings or addresses). Match external product lists, such as lists of hazardous goods, against your own catalog. Solve tough challenges in preparing and cleansing data for analysis. Read more.
Add to your personal schedule
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1E 09
Secondary topics:  BI, Interactive Analytics and Visualization, Cloud Platforms and SaaS, Data Management and Storage
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. In this talk we describe how companies can build best-in-class data lakes in the cloud, leveraging open source technologies and the cloud's elasticity to run and optimize various workloads simultaneously. Read more.
Add to your personal schedule
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 23/24
Secondary topics:  Cloud Platforms and SaaS, Data Management and Storage, Data, Analytics, and AI Architecture, Financial Services
Vitaliy Baklikov (Development Bank of Singapore), Dipti Borkar (Alluxio)
In this presentation, Vitaliy Baklikov from DBS Bank and Dipti Borkar from Alluxio will share how DBS Bank has built a modern big data analytics stack leveraging an object store even for data-intensive workloads like ATM forecasting and how it uses Alluxio to orchestrate data locality and data access for Spark workloads. Read more.
Add to your personal schedule
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1E 07/08
Secondary topics:  Data, Analytics, and AI Architecture
Tom O'Neill (Periscope Data)
In this session, CTO Tom O’Neill will discuss lessons learned from scaling up Periscope Data to support incredibly large volumes of data and queries from its 1,000+ data teams. He’ll highlight the process of migrating from Heroku to Kubernetes and discovering new ways to leverage its power, plus other developments that have allowed users to delve deeper into new data science and ML analysis. Read more.
Add to your personal schedule
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1E 09
Secondary topics:  Deep dive into specific tools, platforms, or frameworks, Privacy and Security
Owen O'Malley (Cloudera)
Fine-grained data protection at the column level in data lake environments has become a mandatory requirement for demonstrating compliance with multiple local and international regulations across many industries. This talk describes how column encryption in ORC files enables both fine-grained protection and audits of who accessed the private data. Read more.
Add to your personal schedule
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 15/16
Secondary topics:  Data Integration and Data Processing, Data, Analytics, and AI Architecture, Streaming and IoT, Telecom
Jonghyok Lee (SK Telecom), Chon Yong Lee (SK Telecom)
This session covers the architecture of T-CORE, SK Telecom’s monitoring and service analytics platform, and the lessons learned from developing it. T-CORE collects system and application data from several thousand servers and applications, gives operators a 3D-visualized real-time status of the whole network and its services, and provides an analytics platform for data scientists, engineers, and developers. Read more.
Add to your personal schedule
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 21/22
Secondary topics:  Cloud Platforms and SaaS, Deep dive into specific tools, platforms, or frameworks, Model Development, Governance, Operations
Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Randall DeFauw (Amazon Web Services)
As an increasing level of automation is becoming available to data science, there is a balance between automation and quality that needs to be maintained. Applying DevOps practices to machine learning workloads not only brings models to the market faster but also maintains the quality and integrity of those models. This presentation will focus on applying DevOps practices to ML workloads. Read more.
Add to your personal schedule
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1A 23/24
Secondary topics:  Data, Analytics, and AI Architecture
Supun Kamburugamuve (Indiana University)
Big data computing and high-performance computing (HPC) have evolved over the years as separate paradigms. With the explosion of data and the demand for machine learning algorithms, these two paradigms are increasingly embracing each other for data management and algorithms. Supun Kamburugamuve explores the possibilities and tools available for getting the best of HPC and big data. Read more.
Add to your personal schedule
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1E 07/08
Secondary topics:  Culture and Organization, Financial Services, Model Development, Governance, Operations
Evgeny Vinogradov (Yandex.Money)
In a microservice architecture, the data warehouse (DWH) is the first place where all the data comes together. It is fed by many different data sources and used for many purposes, from near-OLTP workloads to model fitting and real-time classification. This talk covers our experience managing and scaling a data engineering team and its infrastructure to support more than 20 product teams. Read more.
Add to your personal schedule
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1E 09
Ruixin Xu (Microsoft)
Microsoft’s big data team ran an experiment using Spark and Jupyter notebooks as a replacement for existing IDE-based diagnostic tools for internal DevOps. The results indicate that the Spark-based solution significantly improved diagnosis performance, especially for complex jobs with large profiles, and that Jupyter notebooks also bring the benefits of fast iteration and easy knowledge sharing. Read more.
Add to your personal schedule
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1A 15/16
Secondary topics:  Data quality, data governance and data lineage, Retail and e-commerce
Neelesh Salian (Stitch Fix)
It is important to understand why an organization needs data lineage. Once that purpose is defined, we can talk about how to build such a system. Read more.

Contact us

confreg@oreilly.com

For conference registration information and customer service

partners@oreilly.com

For more information on community discounts and trade opportunities with O’Reilly conferences

strataconf@oreilly.com

For information on exhibiting or sponsoring a conference
