Sep 23–26, 2019

Data Engineering and Architecture

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools, and frameworks (open source or proprietary), platforms (on-premise, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Monday-Tuesday, September 23-24: 2-Day Training (Platinum & Training passes)
Tuesday, September 24: Tutorials (Gold & Silver passes)
Wednesday, September 25: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: 3E
Strata Data Conference Keynotes
10:50
Morning break
Thursday, September 26: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45am | Location: 3E
Strata Data Conference Keynotes
10:50
Morning break
9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1A 17
Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Nikki Rouda (Amazon Web Services), Jesse Gebhardt (Amazon Web Services), Rajeev Chakrabarti (Amazon Web Services)
Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. Join the AWS team to learn how to incorporate serverless concepts into your big data architectures. You'll explore design patterns to ingest, store, and analyze your data as you build a big data application using AWS technologies such as S3, Athena, Kinesis, and more. Read more.
9:00am - 5:00pm Monday, September 23 & Tuesday, September 24
Location: 1E 06
Jesse Anderson (Big Data Institute)
Jesse Anderson offers you an in-depth look at Apache Kafka. You'll learn how Kafka works and how to create real-time systems with it, as well as how to create consumers and publishers. Jesse then walks you through Kafka’s ecosystem, demonstrating how to use tools like Kafka Streams, Kafka Connect, and KSQL. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 08
Matt Fuller (Starburst)
Used by Facebook, Netflix, Airbnb, LinkedIn, Twitter, Uber, and others, Presto has become the ubiquitous open source software for SQL on anything. Presto was built from the ground up for fast interactive SQL analytics against disparate data sources ranging in size from GBs to PBs. Join Matt Fuller to learn how to use Presto and explore use cases and best practices you can implement today. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 09
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio), Anurag Khandelwal (RISELab, UC Berkeley)
Arun Kejariwal, Karthik Ramasamy, and Anurag Khandelwal walk you through the landscape of streaming systems and examine the inception and growth of the serverless paradigm. You'll take a deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and get a bird’s-eye view of the application domains where you can leverage Pulsar functions. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 10
Viktor Gamov (Confluent)
Building stream processing applications is certainly one of the hot topics in the IT community. But if you've ever thought you needed to be a programmer to do stream processing and build stream processing data pipelines, think again. Viktor Gamov explores KSQL, the stream processing query engine built on top of Apache Kafka. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 11
Purnima Reddy Kuchikulla (Cloudera), Timothy Spann (Cloudera), Abdelkrim Hadjidj (Cloudera), Andre Araujo (Cloudera), Hemanth Yamijala (Cloudera)
There are too many edge devices and agents, and you need to control and manage them. Purnima Reddy Kuchikulla, Timothy Spann, Abdelkrim Hadjidj, and Andre Araujo walk you through handling the difficulty in collecting real-time data and the trouble with updating a specific set of agents with edge applications. Get your hands dirty with CEM, which addresses these challenges with ease. Read more.
9:00am - 12:30pm Tuesday, September 24, 2019
Location: 1E 14
James Morantus (Cloudera), Tony Huinker (Cloudera), Naren Koneru (Cloudera), Ramachandran Venkatesh (Cloudera), Gunther Hagleitner (Cloudera), Olli Draese (Cloudera)
Organizations now run diverse, multidisciplinary, big data workloads that span data engineering, data warehousing, and data science applications. Many of these workloads operate on the same underlying data, and the workloads themselves can be transient or long running in nature. There are many challenges with moving these workloads to the cloud. In this talk we start off with a technical deep... Read more.
9:00am - 5:00pm Tuesday, September 24, 2019
Location: 1A 06
David Boyle (Audience Strategies), Richard Evans (Statistics Canada), Rosaria Silipo (KNIME), Leah Xu (Spotify), Arup Nanda (Capital One), Victoriya Kalmanovich (Navy), Tusharadri Mukherjee (Lenovo), Moise Convolbo (Rakuten), Martin Mendez-Costabel (Bayer Crop Science), Gloria Macia (Roche AG), Gwen Campbell (Revibe Technologies), Muhammed Idris (Capria VC | TeraCrunch)
From banking to biotech, retail to government, every business sector is changing in the face of abundant data. Get better at defining business problems and applying data solutions at Strata. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 09
Gowrishankar Balasubramanian (Amazon Web Services), Rajeev Srinivasan (Amazon Web Services)
Enterprises adopt cloud platforms such as AWS for agility, elasticity, and cost savings. Database design and management require a different mindset in AWS than in traditional RDBMS design. Gowrishankar Balasubramanian and Rajeev Srinivasan explore considerations in choosing the right database for your use case and access pattern while migrating or building a new application on the cloud. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 10
Secondary topics: Culture and Organization
Ted Malaska (Capital One), Jonathan Seidman (Cloudera), Matthew Schumpert (Cloudera), Raman Rajasekhar (Cloudera), Krishna Maheshwari (Cloudera)
The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Ted Malaska and Jonathan Seidman detail guidelines and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 14
Purnima Reddy Kuchikulla (Cloudera), Dan Chaffelson (Cloudera), Attila Kanto (Cloudera), Tony Wu (Cloudera)
Kafka is omnipresent and the backbone of streaming analytics applications and data lakes. The challenge is understanding what's going on overall in the Kafka cluster, including performance, issues, and message flows. Purnima Reddy Kuchikulla and Dan Chaffelson walk you through a hands-on experience to visualize the entire Kafka environment end-to-end and simplify Kafka operations via SMM. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 15/16
Boris Lublinsky (Lightbend), Dean Wampler (Lightbend)
Boris Lublinsky and Dean Wampler examine ML use in streaming data pipelines, how to do periodic model retraining, and low-latency scoring in live streams. Learn about Kafka as the data backplane, the pros and cons of microservices versus systems like Spark and Flink, tips for TensorFlow and SparkML, performance considerations, metadata tracking, and more. Read more.
1:30pm - 5:00pm Tuesday, September 24, 2019
Location: 1E 12/13
Mark Madsen (Teradata), Todd Walter (Archimedata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.
9:05am - 9:15am Wednesday, September 25, 2019
Location: 3E
Ben Lorica (O'Reilly)
Ben Lorica dives into emerging technologies for building data infrastructures and machine learning platforms. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 15/16
Navinder Pal Singh Brar (Walmart Labs)
Each week 275 million people shop at Walmart, generating interaction and transaction data. Navinder Pal Singh Brar explains how the customer backbone team enables extraction, transformation, and storage of customer data to be served to other teams. At 5 billion events per day, the Kafka Streams cluster processes events from various channels and maintains a uniform identity of a customer. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 21/22
Evgeny Vinogradov (Yandex.Money)
With a microservice architecture, a data warehouse is the first place where all the data meets. It's supplied by many different data sources and used for many purposes—from near-online transactional processing (OLTP) to model fitting and real-time classifying. Evgeny Vinogradov details his experience in managing and scaling data for support of 20+ product teams. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1A 23/24
Moty Fania (Intel)
Moty Fania details Intel’s IT experience of implementing a sales AI platform. This platform is based on a streaming, microservices architecture with a message bus backbone. Designed for real-time data extraction and reasoning, it processes millions of website pages and can sift through millions of tweets per day. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1E 07/08
Paige Roberts (Vertica), Deepak Majeti (Vertica)
GoodData needed to autorecover from node failures and scale rapidly when workloads spiked on their MPP database in the cloud. Kubernetes could solve it, but it's for stateless microservices, not a stateful MPP database that needs hundreds of containers. Paige Roberts and Deepak Majeti detail the hurdles GoodData needed to overcome in order to merge the power of the database with Kubernetes. Read more.
11:20am - 12:00pm Wednesday, September 25, 2019
Location: 1E 09
Steven Touw (Immuta)
Anti-patterns are behaviors that take bad problems and lead to even worse solutions. In the world of data security and privacy, they’re everywhere. Over the past four years, data security and privacy anti-patterns have emerged across hundreds of customers and industry verticals—there's been an obvious trend. Steven Touw details five anti-patterns and, more importantly, the solutions for them. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 15/16
Michael Noll (Confluent)
Would you cross the street with traffic information that's a minute old? Certainly not. Modern businesses have the same needs. Michael Noll explores why and how you can use Kafka and its growing ecosystem to build elastic event-driven architectures. Specifically, you'll look at Kafka as the storage layer, at Kafka Connect for data integration, and at Kafka Streams and KSQL as the compute layer. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 21/22
Swasti Kakker (LinkedIn), Manu Ram Pandit (LinkedIn), Vidya Ravivarma (LinkedIn)
Join Swasti Kakker, Manu Ram Pandit, and Vidya Ravivarma to explore what's offered by a flexible and scalable hosted data science platform at LinkedIn. It provides features to develop seamlessly in multiple languages, enforce developer best practices and governance policies, execute and visualize solutions, manage knowledge efficiently, and collaborate, all to improve developer productivity. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1A 23/24
Wim Stoop (Cloudera), Srikanth Venkat (Cloudera)
Establishing enterprise-wide security and governance remains a challenge for most organizations. Integrations and exchanges across the landscape are costly to manage and maintain, and typically work in one direction only. Wim Stoop and Srikanth Venkat explore how ODPi's Egeria standard and framework remove these challenges and are leveraged by Cloudera and partners alike to deliver value. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1E 07/08
Gil Vernik (IBM)
Most analytic flows can benefit from serverless, starting with simple cases and moving to complex data preparations for AI frameworks like TensorFlow. To address the challenge of how to easily integrate serverless without major disruptions to your system, Gil Vernik explores the “push to the cloud” experience, which dramatically simplifies serverless for big data processing frameworks. Read more.
1:15pm - 1:55pm Wednesday, September 25, 2019
Location: 1E 09
The Apache Parquet community is working on a column encryption mechanism that protects sensitive data and enables access control for table columns. Many companies are involved, and the mechanism specification has recently been signed off on by the community management committee. Gidon Gershinsky explores the basics of Parquet encryption technology, its usage model, and a number of use cases. Read more.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1A 21/22
Atul Gupte (Uber)
Uber is changing the way people think about transportation. As an integral part of the logistical fabric in 65+ countries around the world, it uses ML and advanced data science to power every aspect of the Uber experience—from dispatch to customer support. Atul Gupte and Nikhil Joshi explore how Uber enables teams to transform insights into intelligence and facilitate critical workflows. Read more.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1A 23/24
Shirshanka Das (LinkedIn), Mars Lan (LinkedIn)
Imagine scaling metadata to an organization of 10,000 employees, 1M+ data assets, and an AI-enabled company that ships code to the site three times a day. Shirshanka Das and Mars Lan dive into LinkedIn’s metadata journey from a two-person back-office team to a central hub powering data discovery, AI productivity, and automatic data privacy. They reveal metadata strategies and the battle scars. Read more.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1E 07/08
Tomer Levi (Fundbox)
Use of data workflows is a fundamental functionality of any data engineering team. Nonetheless, designing an easy-to-use, scalable, and flexible data workflow platform is a complex undertaking. Tomer Levi walks you through how the data engineering team at Fundbox uses AWS serverless technologies to address this problem and how it enables data scientists, BI devs, and engineers to move faster. Read more.
2:05pm - 2:45pm Wednesday, September 25, 2019
Location: 1E 09
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Data lakes have become a key ingredient in the data architecture of most companies. In the cloud, object storage systems such as S3 and ADLS make it easier than ever to operate a data lake. Tomer Shiran and Jacques Nadeau explain how you can build best-in-class data lakes in the cloud, leveraging open source technologies and the cloud's elasticity to run and optimize workloads simultaneously. Read more.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1A 15/16
Weisheng Xie (Orange Financial), Jia Zhai (StreamNative)
Orange Financial is a fintech company of China Telecom with half a billion registered users and 41 million monthly active users, and risk control decision deployment has been critical to its success. Weisheng Xie and Jia Zhai explore how the company leverages Apache Pulsar to boost the efficiency of its risk control decision development for combating financial fraud across more than 50 million transactions a day. Read more.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1A 23/24
Kaan Onuk (Uber), Luyao Li (Uber), Atul Gupte (Uber)
Uber takes being data driven to the next level. A robust system for discovering and managing various entities, from datasets to services to pipelines, along with their relevant metadata, isn't just nice to have; it's absolutely integral to making data useful. Kaan Onuk, Luyao Li, and Atul Gupte explore the current state of metadata management, end-to-end data flow solutions at Uber, and what’s coming next. Read more.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1E 07/08
Shradha Ambekar (Intuit), Sunil Goplani (Intuit), Sandeep Uttamchandani (Intuit)
A business insight shows a sudden spike. It can take hours, or days, to debug data pipelines to find the root cause. Shradha Ambekar, Sunil Goplani, and Sandeep Uttamchandani outline how Intuit built a self-service tool that automatically discovers data pipeline lineage and tracks every change, helping debug the issues in minutes—establishing trust in data while improving developer productivity. Read more.
2:55pm - 3:35pm Wednesday, September 25, 2019
Location: 1E 09
Secondary topics: Privacy and Security
Marcus Fowler (Darktrace)
Cybersecurity must find what it doesn’t know to look for. AI technologies led to the emergence of self-learning, self-defending networks that achieve this by detecting and autonomously responding to in-progress attacks in real time. Marcus Fowler examines how these cyber-immune systems enable the security team to focus on high-value tasks, counter even machine-speed threats, and work in all environments. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 15/16
James Terwilliger (Microsoft Corporation), Badrish Chandramouli (Microsoft Research), Jonathan Goldstein (Microsoft Research)
Trill has been open-sourced, making the streaming engine behind services like the Bing Ads platform available for all to use and extend. James Terwilliger, Badrish Chandramouli, and Jonathan Goldstein dive into the history of and insights from streaming data at Microsoft. They demonstrate how its API can power complex application logic and the performance that gives the engine its name. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 21/22
Prakhar Jain (Qubole), Sourabh Goyal (Qubole)
Autoscaling of resources aims to achieve low latency for a big data application while reducing resource costs. Upscaling a cluster in the cloud is fairly easy compared to downscaling nodes, and so the overall total cost of ownership (TCO) goes up. Prakhar Jain and Sourabh Goyal examine a new design to get efficient downscaling, which helps achieve better resource utilization and lower TCO. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1A 23/24
Max Neunhöffer (ArangoDB), Joerg Schad (ArangoDB)
Machine learning platforms are becoming more complex, with different components each producing their own metadata and their own way of storing metadata. Max Neunhöffer and Joerg Schad propose a first draft of a common metadata API and demonstrate a first implementation of this API in Kubeflow using ArangoDB, a native multimodel database. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1E 07/08
Wangda Tan (Cloudera), Wei-Chiu Chuang (Cloudera)
Wangda Tan and Wei-Chiu Chuang outline the current status of the Apache Hadoop community and dive into the present and future of Hadoop 3.x. You'll get a peek at new features like erasure coding, GPU support, NameNode federation, Docker, long-running services support, powerful container placement constraints, data node disk balancing, etc. They also walk you through upgrade guidance from 2.x to 3.x. Read more.
4:35pm - 5:15pm Wednesday, September 25, 2019
Location: 1E 09
Jeff Zemerick (Mountain Fog)
Hospitals small and large are adopting cloud technologies, and many are in hybrid environments. These distributed environments pose challenges, none of which are more critical than the protection of protected health information (PHI). Jeff Zemerick explores how open source technologies can be used to identify and remove PHI from streaming text in an enterprise healthcare environment. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 15/16
Bas Geerdink (Aizonic)
Streaming analytics (or fast data processing) is the field of making predictions based on real-time data. Bas Geerdink presents a fast data architecture that covers many use cases that follow a "pipes and filters" pattern. This architecture can be used to create enterprise-grade solutions with a diversity of technology options. The stack is Kafka, Ignite, and Spark Structured Streaming (KISSS). Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 21/22
Chenzhao Guo (Intel), Carson Wang (Intel)
Shuffle in Spark requires the shuffle data to be persisted on local disks. However, the assumptions of collocated storage do not always hold in today’s data centers. Chenzhao Guo and Carson Wang outline the implementation of a new Spark shuffle manager, which writes shuffle data to a remote cluster with different storage backends, making life easier for customers. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 23/24
Naghman Waheed (Bayer Crop Science), John Cooper (Bayer)
As the complexity of data systems has grown at Bayer, so has the difficulty of locating and understanding which datasets are available for consumption. Naghman Waheed and John Cooper outline a custom metadata management tool recently deployed at Bayer. The system is cloud-enabled and uses multiple open source components, including machine learning and natural language processing to aid searches. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1E 07/08
Krishna Maheshwari (Cloudera)
Krishna Maheshwari provides an overview of the major features and enhancements in the HBase 2.0 release, upcoming releases, and the future of HBase. You'll be able to ask questions at the end. Apache HBase 2.0 comes packed with a lot of new functionality: off-heap read paths, multitier bucket cache, a new finite state machine-based assignment manager, etc. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1E 09
Matt Carothers (Cox Communications), Jignesh Patel (Cox Communications), Harry Tang (Cox Communications)
Organizations often work with sensitive information such as social security and credit card numbers. Although this data is stored in encrypted form, most analytical operations require data decryption for computation. This creates unwanted exposure to theft or unauthorized reads. Matt Carothers, Jignesh Patel, and Harry Tang explain how homomorphic encryption prevents fraud. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1A 03
Neelesh Salian (Stitch Fix)
Every data team has to build an ecosystem that sustains the data, the users, and the use of the data itself. This data ecosystem comes with its own challenges during the building phase, maintenance, and enhancement. Neelesh Salian dives into the importance of data lineage for an organization. You'll explore how to go about building such a system. Read more.
5:25pm - 6:05pm Wednesday, September 25, 2019
Location: 1E 06
Venkata Gunnu (Comcast), Harish Doddi (Datatron)
Machine learning infrastructure is key to the success of AI at scale in enterprises, with many challenges when you want to bring machine learning models to a production environment, given the legacy of the enterprise environment. Venkata Gunnu and Harish Doddi explore some key insights, what worked, what didn't work, and best practices that helped the data engineering and data science teams. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1A 15/16
Jing Huang (SurveyMonkey), Jessica Mong (SurveyMonkey)
You're a SaaS company operating on a cloud infrastructure prior to the machine learning (ML) era and you need to successfully extend your existing infrastructure to leverage the power of ML. Jing Huang and Jessica Mong detail a case study with critical lessons from SurveyMonkey’s journey of expanding its ML capabilities with its rich data repo and hybrid cloud infrastructure. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1A 21/22
Stavros Kontopoulos (Lightbend), Debasish Ghosh (Lightbend)
Stavros Kontopoulos and Debasish Ghosh explore online machine learning algorithm choices for streaming applications, especially those with resource-constrained use cases like IoT and personalization. They dive into Hoeffding Adaptive Trees, classic sketch data structures, and drift detection algorithms from implementation to production deployment, describing the pros and cons of each of them. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1A 23/24
Michael Freedman (TimescaleDB | Princeton University)
Leveraging polyglot solutions for your time series data can lead to issues including engineering complexity, operational challenges, and even referential integrity concerns. Michael Freedman explains why, by re-engineering PostgreSQL to serve as a general data platform, your high-volume time series workloads will be better streamlined, resulting in more actionable data and greater ease of use. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1E 07/08
Petar Zecevic (SV Group)
The Large Synoptic Survey Telescope (LSST) is one of the most important future surveys. Its unique design allows it to cover large regions of the sky and obtain images of the faintest objects. After 10 years of operation, it will produce about 80 PB of data in images and catalog data. Petar Zecevic explains AXS, a system built for fast processing and cross-matching of survey catalog data. Read more.
11:20am - 12:00pm Thursday, September 26, 2019
Location: 1E 09
Rick Houlihan (Amazon Web Services)
Data has always been and will always be relational. NoSQL databases are gaining in popularity, but that doesn't change the fact that the data is still relational; it just changes how we have to model the data. Rick Houlihan dives deep into how real entity relationship models can be efficiently modeled in a denormalized manner using schema examples from real application services. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 15/16
Alon Gavra (AppsFlyer)
Frequently, Kafka is just a piece of the production stack that no one wants to touch, because it just works. Alon Gavra outlines how Kafka sits at the core of AppsFlyer's infrastructure that processes billions of events daily. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 21/22
Jim Scott (NVIDIA)
Data scientists create and test hundreds or thousands more models than in the past. Models require support from both real-time and static data sources. As data becomes enriched, and parameters tuned and explored, there's a need for versioning everything, including the data. Jim Scott examines the very specific problems and approaches to fix them. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1A 23/24
Omkar Joshi (Uber), Bo Yang (Uber)
Omkar Joshi and Bo Yang offer an overview of how Uber’s ingestion (Marmaray) and observability team improved performance of Apache Spark applications running on thousands of cluster machines and across hundreds of thousands of applications and how the team methodically tackled these issues. They also cover how they used Uber’s open-sourced jvm-profiler for debugging issues at scale. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1E 07/08
Sushant Rao (Cloudera)
Jason Wang and Sushant Rao offer an overview of cloud architecture, then go into detail on core cloud paradigms like compute (virtual machines), cloud storage, authentication and authorization, and encryption and security. They conclude by bringing these concepts together through customer stories to demonstrate how real-world companies have leveraged the cloud for their big data platforms. Read more.
1:15pm - 1:55pm Thursday, September 26, 2019
Location: 1E 09
Shant Hovsepian (Arcadia Data)
With cloud object storage (e.g., S3, ADLS) one expects business intelligence (BI) applications to benefit from the scale of data and real-time analytics. However, traditional BI in the cloud surfaces nonobvious challenges. Shant Hovsepian examines service-oriented cloud design (storage, compute, catalog, security, SQL) and how native cloud BI provides analytic depth, low cost, and performance. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 15/16
Davor Bonaci (Kaskada), Anand Madhavan (Narvar)
Narvar provides a next-generation post-transaction experience for over 500 retailers. Karthik Ramasamy and Anand Madhavan take you on the journey of how Narvar moved away from using a slew of technologies for its platform and consolidated its use cases using Apache Pulsar. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 21/22
Diego Oppenheimer (Algorithmia)
Machine learning (ML) will fundamentally change the way we build and maintain applications. Diego Oppenheimer dives into how you can adapt your infrastructure, operations, staffing, and training to meet the challenges of the new software development life cycle (SDLC) without throwing away everything that already works. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 23/24
Building a reliable big data platform is extremely challenging when it has to store and serve hundreds of petabytes of data in real time. Reza Shiftehfar reflects on the challenges faced and proposes architectural solutions to scale a big data platform to ingest, store, and serve 100+ PB of data with minute-level latency while efficiently utilizing the hardware and meeting security needs. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1E 07/08
Nikki Rouda (Amazon Web Services), Janisha Anand (Amazon Web Services)
Nikki Rouda and Janisha Anand demonstrate how to deduplicate or link records in a dataset, even when the records don’t have a common unique identifier and no fields match exactly. You'll also learn how to link customer records across different databases, match external product lists against your own catalog, and solve tough challenges to prepare and cleanse data for analysis. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1E 09
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
With cheap and scalable storage services such as S3 and ADLS, it's never been easier to dump data into a cloud data lake. But you still need to secure that data and be sure it doesn't leak. Tomer Shiran and Jacques Nadeau explore capabilities for securing a cloud data lake, including authentication, access control, encryption (in motion and at rest), and auditing, as well as network protections. Read more.
2:05pm - 2:45pm Thursday, September 26, 2019
Location: 1A 03
Stephan Ewen (Ververica)
Stephan Ewen details how stream processing is becoming a "grand unifying paradigm" for data processing and the newest developments in Apache Flink to support this trend: new cross-batch-streaming machine learning algorithms, state-of-the-art batch performance, and new building blocks for data-driven applications and application consistency. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 23/24
Vitaliy Baklikov (DBS Bank), Dipti Borkar (Alluxio)
Vitaliy Baklikov and Dipti Borkar explore how DBS Bank built a modern big data analytics stack leveraging an object store even for data-intensive workloads like ATM forecasting and how it uses Alluxio to orchestrate data locality and data access for Spark workloads. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1E 07/08
Scott Castle (Sisense)
In this session, Scott Castle, General Manager at Sisense and former VP of Product at Periscope Data, will discuss lessons learned from scaling up Periscope Data to support incredibly large volumes of data and queries from its data teams. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1E 09
Owen O'Malley (Cloudera)
Fine-grained data protection at a column level in data lake environments has become a mandatory requirement to demonstrate compliance with multiple local and international regulations across many industries today. Owen O'Malley dives into how column encryption in ORC files enables both fine-grained protection and audits of who accessed the private data. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 15/16
Jonghyok Lee (SK Telecom), Chon Yong Lee (SK Telecom)
Jonghyok Lee and Chon Yong Lee discuss T-CORE, SK Telecom’s monitoring and service analytics platform, which collects system and application data from several thousand servers and applications and provides a 3D visualization of the real-time status of the whole network. Join in to hear lessons learned during development. Read more.
3:45pm - 4:25pm Thursday, September 26, 2019
Location: 1A 21/22
Sireesha Muppala (Amazon Web Services), Shelbee Eigenbrode (Amazon Web Services), Randall DeFauw (Amazon Web Services)
As an increasing level of automation becomes available to data science, the balance between automation and quality needs to be maintained. Applying DevOps practices to machine learning workloads brings models to the market faster and maintains the quality and integrity of those models. Sireesha Muppala, Shelbee Eigenbrode, and Randall DeFauw explore applying DevOps practices to ML workloads. Read more.
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1A 23/24
Supun Kamburugamuve (Indiana University)
Big data computing and high-performance computing (HPC) evolved over the years as separate paradigms. With the explosion of the data and the demand for machine learning algorithms, these two paradigms increasingly embrace each other for data management and algorithms. Supun Kamburugamuve explores the possibilities and tools available for getting the best of HPC and big data. Read more.
4:35pm - 5:15pm Thursday, September 26, 2019
Location: 1E 09
Ruixin Xu (Microsoft), Long Tian (Microsoft), Yu Zhou (Microsoft)
Ruixin Xu, Long Tian, and Yu Zhou explore an experiment run using Spark and Jupyter notebooks as a replacement for existing IDE-based tools for internal DevOps. The Spark-based solution improved the diagnosis performance significantly, especially for a complex job with a large profile, and leveraging Jupyter notebooks brings the benefit of fast iteration and easy knowledge sharing. Read more.

    Contact us

    confreg@oreilly.com

    For conference registration information and customer service

    partners@oreilly.com

    For more information on community discounts and trade opportunities with O’Reilly conferences

    strataconf@oreilly.com

    For information on exhibiting or sponsoring a conference

    pr@oreilly.com

    For media/analyst press inquiries