Sep 23–26, 2019

Data Engineering and Architecture

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools, and frameworks (open source or proprietary), platforms (on-premise, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Monday-Tuesday, September 23-24: 2-Day Training (Platinum & Training passes)
Tuesday, September 24: Tutorials (Gold & Silver passes)
Wednesday, September 25: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:50
Morning break
Thursday, September 26: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
9:00 | Location: Auditorium
Strata Data Conference Keynotes
10:50
Morning break
9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1A 17
Jorge Lopez (Amazon Web Services)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join Jorge Lopez to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.
9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1E 06
Jesse Anderson (Big Data Institute)
Jesse Anderson offers you an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it, as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 08
Matt Fuller (Starburst)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 09
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Anurag Khandelwal (RISELab, UC Berkeley)
Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems and examine the inception and growth of the serverless paradigm. You'll take a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and get a bird’s-eye view of the application domains where you can leverage Pulsar functions. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 10
Viktor Gamov (Confluent)
Building stream processing applications is certainly one of the hot topics in the IT community. But if you've ever thought you needed to be a programmer to do stream processing and build stream processing data pipelines, think again. Viktor Gamov explores KSQL, the stream processing query engine built on top of Apache Kafka. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 11
Purnima Reddy Kuchikulla (Cloudera), Timothy Spann (Cloudera), Abdelkrim Hadjidj (Cloudera)
There are too many edge devices and agents, and you need to control and manage them. Purnima Reddy Kuchikulla, Timothy Spann, and Abdelkrim Hadjidj walk you through handling the difficulty in collecting real-time data and the trouble with updating a specific set of agents with edge applications. Get your hands dirty with Cloudera Edge Management (CEM), which addresses these challenges with ease. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 14
Jason Wang (Cloudera), Tony Wu (Cloudera), Vinithra Varadharajan (Cloudera)
Moving to the cloud poses challenges from rearchitecting to data context consistency across workloads that span multiple clusters. Jason Wang, Tony Wu, and Vinithra Varadharajan explore cloud architecture and its challenges, as well as using Cloudera Altus to build data warehousing and data engineering clusters and run workloads that share metadata between them using Cloudera SDX. Read more.
9:00am - 5:00pm Tuesday, September 24, 2019
Location: 1A 06
Richard Evans (Statistics Canada), Rosaria Silipo (KNIME), Leah Xu (Spotify), Arup Nanda (Capital One), Victoriya Kalmanovich (Navy), Shreya Sharma (Expedia), Martin Mendez-Costabel (Bayer Crop Science), Gloria Macia (Roche AG), Gwen Campbell (Revibe Technologies), Moise Convolbo (Rakuten), Muhammed Idris (Capria VC | TeraCrunch)
From banking to biotech, retail to government, every business sector is changing in the face of abundant data. Get better at defining business problems and applying data solutions at Strata. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 09
Gowrishankar Balasubramanian (Amazon Web Services), Rajeev Srinivasan (Amazon Web Services)
Enterprises adopt cloud platforms such as AWS for agility, elasticity, and cost savings. Database design and management requires a different mindset in AWS when compared to traditional RDBMS design. Gowrishankar Balasubramanian and Rajeev Srinivasan explore considerations in choosing the right database for your use case and access pattern while migrating or building a new application on the cloud. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 10
Secondary topics: Culture and Organization
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 14
Purnima Reddy Kuchikulla (Cloudera), Dan Chaffelson (Cloudera)
Kafka is omnipresent and the backbone of streaming analytics applications and data lakes. The challenge is understanding what's going on overall in the Kafka cluster, including performance, issues, and message flows. Purnima Reddy Kuchikulla and Dan Chaffelson walk you through a hands-on experience to visualize the entire Kafka environment end-to-end and simplify Kafka operations via SMM. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 15/16
Boris Lublinsky (Lightbend), Dean Wampler (Lightbend)
Boris Lublinsky and Dean Wampler examine ML use in streaming data pipelines, how to do periodic model retraining, and low-latency scoring in live streams. Learn about Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, metadata tracking, and more. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 12/13
Mark Madsen (Teradata), Todd Walter (Teradata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 15/16
Navinder Pal Singh Brar (Walmart Labs)
Each week 275 million people shop at Walmart, generating interaction and transaction data. Navinder Pal Singh Brar explains how the customer backbone team enables extraction, transformation, and storage of customer data to be served to other teams. At 5 billion events per day, the Kafka Streams cluster processes events from various channels and maintains a uniform identity of a customer. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 21
Julien Le Dem (WeWork)
Big data is crucial to organizations—and it's big not only by volume but also by the multitude of data sources and teams using them. Central data teams doing all the work is outdated as the entire organization becomes an ecosystem and central teams become enablers. Julien Le Dem outlines the principles of a data platform that enables the entire organization to build data-centric products. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 23
Moty Fania (Intel)
Moty Fania details Intel’s IT experience of implementing a sales AI platform. This platform is based on a streaming, microservices architecture with a message bus backbone. It was designed for real-time data extraction and reasoning, handles the processing of millions of website pages, and is capable of sifting through millions of tweets per day. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1E 07/08
Paige Roberts (Vertica), Deepak Majeti (Vertica)
GoodData needed to autorecover from node failures and scale rapidly when workloads spiked on their MPP database in the cloud. Kubernetes could solve it, but it's for stateless microservices, not a stateful MPP database that needs hundreds of containers. Paige Roberts and Deepak Majeti detail the hurdles GoodData needed to overcome in order to merge the power of the database with Kubernetes. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1E 09
Steven Touw (Immuta)
Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data security and privacy, they’re everywhere. Over the past four years, data security and privacy anti-patterns have emerged across hundreds of customers and industry verticals—there's been an obvious trend. Steven Touw details five anti-patterns and, more importantly, the solutions for them. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 15/16
Michael Noll (Confluent)
Would you cross the street with traffic information that's a minute old? Certainly not. Modern businesses have the same needs. Michael Noll explores why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, you'll look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 21
Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)
Join Swasti Kakker, Manu Ram Pandit, and Vidya Ravivarma to explore what's offered by a flexible and scalable hosted data science platform at LinkedIn. It provides features to seamlessly develop in multiple languages, enforce developer best practices and governance policies, execute and visualize solutions, manage knowledge efficiently, and collaborate to improve developer productivity. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 23
Wim Stoop (Cloudera), Srikanth Venkat (Cloudera)
Establishing enterprise-wide security and governance remains a challenge for most organizations. Integrations and exchanges across the landscape are costly to manage and maintain, and typically work in one direction only. Wim Stoop and Srikanth Venkat explore how ODPi's Egeria standard and framework removes the challenges and is leveraged by Cloudera and partners alike to deliver value. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1E 07/08
Gil Vernik (IBM)
Most analytic flows can benefit from serverless, starting with simple cases and moving to complex data preparations for AI frameworks like TensorFlow. To address the challenge of how to easily integrate serverless without major disruptions to your system, Gil Vernik explores the “push to the cloud” experience, which dramatically simplifies serverless for big data processing frameworks. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1E 09
The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off on by the community management committee. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases. Read more.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1A 15/16
Stephan Ewen (Ververica), Aljoscha Krettek (Ververica)
Stephan Ewen and Aljoscha Krettek detail how stream processing is becoming a "grand unifying paradigm" for data processing and the newest developments in Apache Flink to support this trend: new cross-batch-streaming machine learning algorithms, state-of-the-art batch performance, and new building blocks for data-driven applications and application consistency. Read more.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1A 21
Atul Gupte (Uber), Nikhil Joshi (Uber)
Uber is changing the way people think about transportation. As an integral part of the logistical fabric in 65+ countries around the world, it uses ML and advanced data science to power every aspect of the Uber experience—from dispatch to customer support. Atul Gupte and Nikhil Joshi explore how Uber enables teams to transform insights into intelligence and facilitate critical workflows. Read more.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1A 23
Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
Imagine scaling metadata to an organization of 10,000 employees, 1M+ data assets, and an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn’s metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars. Read more.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1E 07/08
Tomer Levi (Fundbox)
Use of data workflows is a fundamental functionality of any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. Tomer Levi walks you through how the data engineering team at Fundbox uses AWS serverless technologies to address this problem and how it enables data scientists, BI devs, and engineers to move faster. Read more.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1E 09
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
With cheap and scalable storage services such as S3 and ADLS, it's never been easier to dump data into a cloud data lake. But you still need to secure that data and be sure it doesn't leak. Tomer Shiran and Jacques Nadeau explore capabilities for securing a cloud data lake, including authentication, access control, encryption (in motion and at rest), and auditing, as well as network protections. Read more.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1A 15/16
Weisheng Xie (China Telecom BestPay Co., Ltd), Sijie Guo (ASF)
As a fintech company of China Telecom with half a billion registered users and 41 million monthly active users, we have found risk control decision deployment critical to the success of the business. In this talk we share how we leverage Apache Pulsar to boost the efficiency of our risk control decision development for combating financial fraud over 50 million transactions a day. Read more.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1A 21
Kai Liu (BING) (Microsoft), Jack Zhang (Microsoft), Jing Zhao (Microsoft)
Facilitating large-scale deep learning projects in parallel requires effort and innovation. Bing now runs a deployment of thousands of servers to address this challenge. Kai Liu, Jack Zhang, and Jing Zhao detail how Bing provides training services, offline data processing, vector hosting, and inferencing service offline to help data scientists through all steps in the project lifecycle. Read more.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1A 23
Kaan Onuk (Uber), Luyao Li (Uber), Atul Gupte (Uber)
At Uber’s scale and pace of growth, a robust system for discovering and managing various entities, from datasets to services to pipelines, and their relevant metadata is not just nice to have: it is absolutely integral to making data useful at Uber. In this talk, we will explore the current state of metadata management and end-to-end data flow solutions at Uber and what’s coming next. Read more.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1E 07/08
Shradha Ambekar (Intuit), Sunil Goplani (Intuit), Sandeep Uttamchandani (Intuit)
Imagine a business insight showing a sudden spike. Debugging data pipelines is nontrivial, and finding the root cause can take hours or even days. We’ll share how Intuit built a self-serve tool that automatically discovers data pipeline lineage and tracks every change that impacts a pipeline. This helps debug pipeline issues in minutes, establishing trust in data while improving developer productivity. Read more.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1E 09
Secondary topics: Privacy and Security
Marcus Fowler (Darktrace)
Cyber security must find what it doesn’t know to look for. AI technologies have led to the emergence of self-learning, self-defending networks that achieve this – detecting and autonomously responding to in-progress attacks in real time. These cyber immune systems enable the security team to focus on high-value tasks, can counter even machine-speed threats, and work in all environments. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 15/16
James Terwilliger (Microsoft Corporation), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)
Trill has been open-sourced, making the streaming engine behind services like the multi-billion-dollar Bing Ads platform available for all to use and extend. We give a brief history of streaming data at Microsoft and lessons learned. We then demonstrate how its API can power complex application logic, and the performance that gives the engine its name: a trillion events per day per node. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 21
Prakhar Jain (Qubole), Sourabh Goyal (Qubole)
Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs at the same time. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, so the overall total cost of ownership (TCO) goes up. We will talk about a new design for efficient downscaling, which helps achieve better resource utilization and thus lower TCO. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 23
Max Neunhöffer (ArangoDB), Joerg Schad (Suki)
Machine learning platforms are becoming more complex, with different components each producing their own metadata. Currently, most components provide their own way of storing metadata. In this talk, we propose a first draft of a common Metadata API and demo a first implementation of this API in Kubeflow using ArangoDB, which is a native multi-model database. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1E 07/08
Wangda Tan (Cloudera), Arpit Agarwal (Hortonworks Inc.)
In this talk, we’ll start with the current status of the Apache Hadoop community and then move on to the exciting present and future of Hadoop 3.x. We will cover new features like erasure coding, GPU support, namenode federation, Docker, long-running services support, powerful container placement constraints, data node disk balancing, and more. We will also talk about upgrade guidance from 2.x to 3.x. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1E 09
Jeff Zemerick (Mountain Fog)
This talk describes how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 15/16
Bas Geerdink (ING)
Streaming analytics (or fast data processing) is the field of making predictions on real-time data. In this talk, I'll present a fast data architecture that covers many use cases and follows a 'pipes and filters' pattern. This architecture can be used to create enterprise-grade solutions with a diversity of technology options. The stack is Kafka, Ignite, and Spark Structured Streaming (KISSS). Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 21
Chenzhao Guo (Intel Asia-Pacific Research & Development Ltd.), Carson Wang (Intel)
Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumptions of collocated storage do not always hold in today’s data centers. We implemented a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends. This makes life easier for those customers who want to leverage the latest storage hardware, and HPC customers Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 23
Naghman Waheed (Bayer Crop Science), John Cooper (Bayer)
As the complexity of data systems has grown at Bayer, so has the difficulty of locating and understanding which data sets are available for consumption. To address this challenge, a custom metadata management tool was recently deployed as a new capability at Bayer. The system is cloud enabled and uses multiple open source components, including machine learning and natural language processing, to aid search. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1E 09
Matt Carothers (Cox Communications), Jignesh Patel (Cox Communications), Harry Tang (Cox Communication Inc)
Organizations often work with sensitive information such as social security numbers and credit card information. Although this data is stored in encrypted form, most analytical operations, ranging from data analysis to advanced machine learning algorithms, require data decryption for computation. This creates unwanted exposure to theft or unauthorized reads. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1A 15/16
Michael Freedman (TimescaleDB)
Leveraging polyglot solutions for your time-series data can lead to a variety of issues including engineering complexity, operational challenges, and even referential integrity concerns. By re-engineering Postgres to serve as a general data platform, your high-volume time-series workloads will be better streamlined, resulting in more actionable data and greater ease of use. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1A 21
Stavros Kontopoulos (Lightbend), Debasish Ghosh (Lightbend)
In this talk, we discuss online machine learning algorithm choices for streaming applications. We motivate the discussion with resource-constrained use cases like IoT and personalization. We cover Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms, all the way from implementation to production deployment, describing the pros and cons of using each of them. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1A 23
Jing Huang (SurveyMonkey), Jessica Mong (SurveyMonkey)
You are a SaaS company that operates on a cloud infra prior to the ML era. How do you successfully extend your existing infrastructure to leverage the power of ML? In this case study, you will learn critical lessons from SurveyMonkey’s journey of expanding its ML capabilities with its rich data repo and hybrid cloud infrastructure. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1E 07/08
Petar Zecevic (SV Group d.o.o.)
The Large Synoptic Survey Telescope (LSST) is one of the most important future surveys. Its unique design will allow it to cover large regions of the sky and obtain images of the faintest objects. Over its 10 years of operation, it will produce about 80 PB of data, both in images and catalog data. I will present AXS, a system we built for fast processing and cross-matching of survey catalog data. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1E 09
Rick Houlihan (Amazon Web Services)
Data has always been relational, and it always will be. NoSQL databases are gaining in popularity, but that does not change the fact that the data they manage is still relational; it just changes how we have to model the data. This session dives deep into how real entity relationship models can be efficiently modeled in a denormalized manner using schema examples from real application services. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 15/16
Alon Gavra (AppsFlyer)
Kafka is often just a piece of the production stack that no one wants to touch, because it just works. At AppsFlyer, Kafka sits at the core of our infrastructure that processes billions of events daily. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 21
Jim Scott (NVIDIA)
Data scientists are creating and testing hundreds or thousands more models than in the past. Models require support from both real-time and static data sources. As data becomes enriched, and parameters tuned and explored, there is a need for versioning everything, including the data. We will discuss the very specific problems and approaches to fix them. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 23
Omkar Joshi (Uber Technologies), Bo Yang (Uber Inc)
Omkar Joshi and Bo Yang offer an overview of how Uber’s ingestion (Marmaray) and observability team improved the performance of Apache Spark applications running on thousands of cluster machines and across 100,000+ applications, and how they methodically tackled these issues. They will also cover how they used Uber’s open source jvm-profiler to debug issues at scale. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1E 07/08
Jason Wang (Cloudera), Sushant Rao (Cloudera)
We’ll give you an actionable understanding of cloud architecture and the different approaches customers took in their journey to the cloud. We start with the different ways we’ve seen customers be successful in the cloud. Then we dive deep into the decisions they made and how those drove their cloud architecture. Along the way we review problems they overcame, lessons learned, and core cloud paradigms. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1E 09
Shant Hovsepian (Arcadia Data)
With cloud object storage (e.g., S3, ADLS), one expects business intelligence (BI) applications to benefit from the scale of data and real-time analytics. However, traditional BI in the cloud surfaces non-obvious challenges. This talk reviews service-oriented cloud design (storage, compute, catalog, security, SQL) and shows how native cloud BI provides analytic depth, low cost and performance Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 15/16
Karthik Ramasamy (Streamlio), Anand Madhavan (Narvar)
Narvar provides next-generation post-transaction experiences for 500+ retailers. This talk explores how Narvar moved away from using a slew of technologies for its platform and consolidated its use cases on Apache Pulsar. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 21
Diego Oppenheimer (Algorithmia)
Machine Learning (ML) will fundamentally change the way we build and maintain applications. How can we adapt our infrastructure, operations, staffing, and training to meet the challenges of the new Software Development Life Cycle (SDLC) without throwing away everything that already works? Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 23
Reza Shiftehfar (Uber Technologies)
Building a reliable big data platform is extremely challenging when it has to store and serve hundreds of petabytes of data in real time. This talk reflects on the challenges faced and proposes architectural solutions to scale a big data platform to ingest, store, and serve 100+ PB of data with minute-level latency while efficiently utilizing the hardware and meeting security needs. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1E 07/08
Nikki Rouda (Amazon Web Services), Roy Hasson (Amazon Web Services)
Learn how to deduplicate or link records in a dataset, even when the records don’t have a common unique identifier and no fields match exactly. Link customer records across different databases (e.g., different name spellings or addresses). Match external product lists against your own catalog, such as lists of hazardous goods. Solve tough challenges to prepare and cleanse data for analysis. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1E 09
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. In this talk we describe how companies can build best-in-class data lakes in the cloud, leveraging open source technologies and the cloud's elasticity to run and optimize various workloads simultaneously. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 23
Vitaliy Baklikov (Development Bank of Singapore), Dipti Borkar (Alluxio)
In this presentation, Vitaliy Baklikov from DBS Bank and Dipti Borkar from Alluxio will share how DBS Bank has built a modern big data analytics stack leveraging an object store even for data-intensive workloads like ATM forecasting and how it uses Alluxio to orchestrate data locality and data access for Spark workloads. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1E 07/08
Tom O'Neill (Sisense)
CCO Tom O’Neill will discuss lessons learned from scaling up Periscope Data to support incredibly large volumes of data and queries from its 2,000+ teams as part of Sisense. He’ll highlight the process of migrating from Heroku to Kubernetes and discovering new ways to leverage its power, plus other developments that have allowed users to delve deeper into new data science and ML analysis. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1E 09
Owen O'Malley (Cloudera)
Fine-grained, column-level data protection in data lake environments has become a requirement for demonstrating compliance with multiple local and international regulations across many industries. This talk describes how column encryption in ORC files enables both fine-grained protection and audits of who accessed the private data. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 15/16
Jonghyok Lee (SK Telecom), Chon Yong Lee (SK Telecom)
This session covers the architecture of T-CORE, SK Telecom’s monitoring and service analytics platform, and the lessons learned from developing it. T-CORE collects system and application data from several thousand servers and applications, gives operators a 3D visualization of the real-time status of the whole network and its services, and provides an analytics platform for data scientists, engineers, and developers. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 21
Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Randall DeFauw (Amazon Web Services)
As more automation becomes available to data science, a balance between automation and quality needs to be maintained. Applying DevOps practices to machine learning workloads not only brings models to market faster but also maintains the quality and integrity of those models. This presentation focuses on applying DevOps practices to ML workloads. Read more.
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1A 23
Supun Kamburugamuve (Indiana University)
Big data computing and high-performance computing (HPC) have evolved over the years as separate paradigms. With the explosion of data and the demand for machine learning algorithms, these two paradigms are increasingly embracing each other for data management and algorithms. Supun Kamburugamuve explores the possibilities and tools available for getting the best of HPC and big data. Read more.
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1E 07/08
Evgeny Vinogradov (Yandex.Money)
In a microservice architecture, the data warehouse (DWH) is the first place where all the data comes together. It is fed by many different data sources and used for many purposes, from near-OLTP workloads to model fitting and real-time classification. This talk covers our experience managing and scaling a data engineering team and infrastructure to support 20+ product teams. Read more.
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1E 09
Ruixin Xu (Microsoft), Long Tian (Microsoft), Yu Zhou (Microsoft)
Microsoft's big data team ran an experiment using Spark and Jupyter notebooks as a replacement for its existing IDE-based diagnostic tools for internal DevOps. The results indicate that the Spark-based solution significantly improved diagnosis performance, especially for complex jobs with large profiles, and that Jupyter notebooks also bring the benefits of fast iteration and easy knowledge sharing. Read more.
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1A 15/16
Neelesh Salian (Stitch Fix)
Before building a data lineage system, an organization needs to understand why it needs data lineage. Once that purpose is defined, we can talk about how to go about building such a system. Read more.
