Presented By O'Reilly and Cloudera
Make Data Work
September 25–26, 2017: Training
September 26–28, 2017: Tutorials & Conference
New York, NY

Data Engineering & Architecture

Ben Lorica, Strata Conference Chair

How to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

It’s not easy. Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools and frameworks (open source or proprietary), platforms (on-premise, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

All Strata Data Conference Gold and Silver passes have access to Data Engineering and Architecture sessions Tuesday-Thursday. Platinum and Bronze passes have access to Data Engineering and Architecture sessions Wednesday-Thursday.

Tuesday, September 26: Tutorials (Gold & Silver passes)
Locations: 1A 18, 1A 23/24, 1E 12/13, 1E 14, 1E 10, 1E 15/16
12:30pm | Lunch | Location: TBD
5:00pm | Opening Reception | Location: Expo Hall

Wednesday, September 27: Keynotes & Sessions (Gold, Silver & Bronze passes)
Locations: 1A 15/16/17, 1A 21/22, 1A 23/24, 1E 07/08, 1E 09
8:45am | Strata Data Conference Keynotes | Location: 3E
10:50am | Morning break
12:00pm | Lunch
3:35pm | Afternoon break
6:05pm | Booth Crawl | Location: Expo Hall
7:30pm | Data After Dark | Location: 230 Fifth Penthouse

Thursday, September 28: Keynotes & Sessions (Gold, Silver & Bronze passes)
Locations: 1A 15/16/17, 1A 21/22, 1A 23/24, 1E 07/08, 1E 09
8:45am | Strata Data Conference Keynotes | Location: 3E
10:50am | Morning break
12:00pm | Lunch
3:35pm | Afternoon break

9:00am–12:30pm Tuesday, September 26, 2017
Location: 1A 23/24 | Level: Beginner
Secondary topics: Cloud
Pranav Rastogi (Microsoft)
Average rating: 2.50 (2 ratings)
As big data solutions are rapidly moving to the cloud, it's becoming increasingly important to know how to use Apache Hadoop, Spark, R Server, and other open source technologies in the cloud. Pranav Rastogi walks you through building big data applications on Azure HDInsight and other Azure services.

9:00am–12:30pm Tuesday, September 26, 2017
Location: 1E 12/13 | Level: Intermediate
Secondary topics: Architecture
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Silicon Valley Data Science)
Average rating: 3.27 (11 ratings)
What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

9:00am–12:30pm Tuesday, September 26, 2017
Location: 1E 14 | Level: Intermediate
Secondary topics: Streaming
Ian Wrigley (StreamSets)
Average rating: 4.50 (4 ratings)
Ian Wrigley demonstrates how Kafka Connect and Kafka Streams can be used together to build real-world, real-time streaming data pipelines. Using Kafka Connect, you'll ingest data from a relational database into Kafka topics as the data is being generated and then process and enrich the data in real time using Kafka Streams before writing it out for further analysis.

9:00am–12:30pm Tuesday, September 26, 2017
Location: 1E 10 | Level: Intermediate
Secondary topics: Architecture, Cloud
Jennifer Wu (Cloudera), Fahd Siddiqui (Cloudera), Paul George (Cloudera), Eugene Fratkin (Cloudera)
Average rating: 1.50 (2 ratings)
Jennifer Wu, Paul George, Fahd Siddiqui, and Eugene Fratkin lead a deep dive into running data engineering workloads in a managed service capacity in the public cloud. Along the way, they share AWS infrastructure best practices and explain how data engineering workloads interoperate with data analytic workloads.

1:30pm–5:00pm Tuesday, September 26, 2017
Location: 1A 18 | Level: Intermediate
Secondary topics: Cloud
Mark Donsky (Cloudera), Manish Ahluwalia (NerdWallet), André Araujo (Cloudera), Syed Rafice (Cloudera)
Average rating: 5.00 (1 rating)
Mark Donsky, André Araujo, Syed Rafice, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.

1:30pm–5:00pm Tuesday, September 26, 2017
Location: 1E 12/13 | Level: Advanced
Secondary topics: Architecture
Jonathan Seidman (Cloudera), Gwen Shapira (Confluent), Mark Grover (Lyft)
Average rating: 4.11 (9 ratings)
Using Customer 360 and the IoT as examples, Jonathan Seidman, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.

1:30pm–5:00pm Tuesday, September 26, 2017
Location: 1E 14 | Level: Beginner
Secondary topics: Architecture, Streaming
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Arun Kejariwal (MZ), Neng Lu (Twitter), Sijie Guo (Streamlio)
Average rating: 3.00 (3 ratings)
Karthik Ramasamy, Sanjeev Kulkarni, Avrilia Floratou, Ashvin Agrawal, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.

1:30pm–5:00pm Tuesday, September 26, 2017
Location: 1E 15/16 | Level: Intermediate
Secondary topics: Architecture, Cloud
Ryan Nienhuis (Amazon Web Services (AWS)), Radhika Ravirala (Amazon Web Services (AWS)), Allan MacInnis (Amazon Web Services), Ben Snively (Amazon Web Services (AWS))
Average rating: 4.00 (2 ratings)
Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Ryan Nienhuis, Radhika Ravirala, Allan MacInnis, and Ben Snively walk you through building a big data application using a combination of open source technologies and AWS managed services.

11:20am–12:00pm Wednesday, September 27, 2017
Location: 1A 15/16/17 | Level: Intermediate
Secondary topics: ecommerce
Average rating: 3.00 (1 rating)
Neelesh Srinivas Salian offers an overview of the data platform used by data scientists at Stitch Fix, based on the Spark ecosystem. Neelesh explains the development process and shares some lessons learned along the way.

11:20am–12:00pm Wednesday, September 27, 2017
Location: 1A 21/22 | Level: Intermediate
Michelle Ufford (Netflix)
Average rating: 4.78 (9 ratings)
What if we used the wealth of data and experience at our disposal to drive improvements in data engineering? Michelle Ufford explains how Netflix is using data to find common patterns among the chaos that enable the company to automate repetitive and time-consuming tasks and discover ways to improve data quality, reduce costs, and quickly identify and respond to issues.

11:20am–12:00pm Wednesday, September 27, 2017
Location: 1A 23/24 | Level: Intermediate
Secondary topics: Geospatial, Logistics, Platform
Zhenxiao Luo (Uber), Wei Yan (Uber)
Average rating: 4.43 (7 ratings)
Uber's geospatial data is increasing exponentially as the company grows. As a result, its big data systems must also grow in scalability, reliability, and performance to support business decisions, user recommendations, and experiments for geospatial data. Zhenxiao Luo and Wei Yan explain how Uber runs geospatial analysis efficiently in its big data systems, including Hadoop, Hive, and Presto.

11:20am–12:00pm Wednesday, September 27, 2017
Location: 1E 07/08 | Level: Intermediate
Secondary topics: Streaming
Dean Wampler (Lightbend)
Average rating: 3.00 (3 ratings)
While stream processing is now popular, streaming architectures must be more reliable and scalable than ever before; in fact, they must be more like microservice architectures. Dean Wampler defines "stream" based on characteristics for such systems, using specific tools like Kafka, Spark, Flink, and Akka as examples, and argues that big data and microservices architectures are converging.

11:20am–12:00pm Wednesday, September 27, 2017
Location: 1E 09 | Level: Intermediate
Secondary topics: IoT
Mateusz Dymczyk (H2O.ai), Mathieu Dumoulin (MapR Technologies)
Average rating: 4.00 (2 ratings)
Mateusz Dymczyk and Mathieu Dumoulin showcase a working, practical, predictive maintenance pipeline in action and explain how they built a state-of-the-art anomaly detection system using big data frameworks like Spark, H2O, TensorFlow, and Kafka on the MapR Converged Data Platform.

1:15pm–1:55pm Wednesday, September 27, 2017
Location: 1A 15/16/17 | Level: Intermediate
Cheng Chang (Alluxio), Haoyuan Li (Alluxio)
Alluxio (formerly Tachyon) is a memory-speed virtual distributed storage system that leverages memory for managing data across different storage systems. Many deployments use Alluxio with Spark because Alluxio helps Spark further accelerate applications. Haoyuan Li and Cheng Chang explain how Alluxio makes Spark more effective and share production deployments of Alluxio and Spark working together.

1:15pm–1:55pm Wednesday, September 27, 2017
Location: 1A 21/22 | Level: Advanced
Secondary topics: Financial services, Platform
Average rating: 4.57 (7 ratings)
John Hitchingham shares insights into the design and operation of FINRA's data lake in the AWS cloud, where FINRA extracts, transforms, and loads over 75B transactions per day. Users can query across petabytes of data in seconds on AWS S3 using Presto and Spark, all while maintaining security and data lineage.

1:15pm–1:55pm Wednesday, September 27, 2017
Location: 1A 23/24 | Level: Beginner
Secondary topics: Platform, Telecom
Travis Bakeman (T-Mobile)
Average rating: 2.00 (1 rating)
Travis Bakeman shares how T-Mobile ported its large-scale network performance management platform, T-PIM, from a legacy database to a big data platform with Impala as the main reporting interface. He covers the migration journey, including the challenges the team faced, how the team evaluated new technologies, lessons learned along the way, and the efficiencies gained as a result.

1:15pm–1:55pm Wednesday, September 27, 2017
Location: 1E 07/08 | Level: Intermediate
Secondary topics: Architecture, IoT, Streaming
Michael Freedman (TimescaleDB | Princeton)
Average rating: 4.50 (4 ratings)
Michael Freedman offers an overview of TimescaleDB, a new scale-out database designed for time series workloads, open sourced and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.

1:15pm–1:55pm Wednesday, September 27, 2017
Location: 1E 09 | Level: Intermediate
Secondary topics: Financial services, Logistics
Riccardo Corbella (Data Reply IT), Beniamino Del Pizzo (Data Reply IT)
Average rating: 4.00 (2 ratings)
With more than 4.5 million black boxes, the Italian car insurance market has the most telematics clients in the world. Riccardo Corbella and Beniamino Del Pizzo explore the data management challenges that occur in a streaming context when the amount of data to process is gigantic and share a data management model capable of providing the scalability and performance needed to support massive growth.

2:05pm–2:45pm Wednesday, September 27, 2017
Location: 1A 15/16/17 | Level: Intermediate
Secondary topics: Architecture, Cloud
Henry Robinson (Cloudera), Greg Rahn (Cloudera)
Cloud environments will likely play a key role in your business’s future. Henry Robinson and Greg Rahn explore the workload considerations when evaluating the cloud for analytics and discuss common architectural patterns to optimize price and performance.

2:05pm–2:45pm Wednesday, September 27, 2017
Location: 1A 21/22 | Level: Intermediate
Lucy Yu (MemSQL)
Average rating: 2.50 (6 ratings)
Lucy Yu demonstrates how to extend the Spark SQL abstraction to support more complex pushdown, such as group by, subqueries, and joins.

2:05pm–2:45pm Wednesday, September 27, 2017
Location: 1A 23/24 | Level: Intermediate
Secondary topics: Streaming
Todd Lipcon (Cloudera)
Average rating: 5.00 (3 ratings)
To date, mutable big data storage has primarily been the domain of nonrelational (NoSQL) systems such as Apache HBase. However, demand for real-time analytic architectures has led big data back to a familiar friend: relationally structured data storage systems. Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu.

2:05pm–2:45pm Wednesday, September 27, 2017
Location: 1E 07/08 | Level: Intermediate
Secondary topics: Streaming
Dustin Cote (Confluent)
Average rating: 4.00 (2 ratings)
Dustin Cote shares his experience troubleshooting Apache Kafka in production environments and explains how to avoid pitfalls like message loss or performance degradation in your environment.

2:05pm–2:45pm Wednesday, September 27, 2017
Location: 1E 09 | Level: Non-technical
Secondary topics: Data for good, Healthcare, IoT
Julie Lockner (17 Minds Corporation)
Average rating: 4.00 (1 rating)
How can we empower individuals with special needs to reach their full potential? Julie Lockner offers an overview of a project to develop collaboration applications that use wearable device data to help caregivers develop the best possible care and education plans. Join in to learn how real-time IoT data analytics are making this possible.

2:55pm–3:35pm Wednesday, September 27, 2017
Location: 1A 15/16/17 | Level: Intermediate
Roy Ben-Alta (Amazon Web Services), Allan MacInnis (Amazon Web Services)
Average rating: 4.33 (3 ratings)
Speed matters. Today, decisions are made based on real-time insights, but in order to support the substantial growth of streaming data, companies are required to innovate. Roy Ben-Alta and Allan MacInnis explore AWS solutions powered by machine learning and artificial intelligence.

2:55pm–3:35pm Wednesday, September 27, 2017
Location: 1A 21/22 | Level: Intermediate
Holden Karau (IBM), Seth Hendrickson (Cloudera)
Average rating: 5.00 (1 rating)
Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. Holden Karau and Seth Hendrickson introduce Spark’s ML pipelines and explain how to extend them with your own custom algorithms. Even if you don't have your own algorithm to add, you'll leave with a deeper understanding of Spark's ML pipelines.

2:55pm–3:35pm Wednesday, September 27, 2017
Location: 1A 23/24 | Level: Beginner
Secondary topics: Platform, Sales
Simon Chan (Salesforce)
Average rating: 5.00 (1 rating)
Salesforce recently released Einstein, which brings AI into its core platform to power every business. The secret behind Einstein is an underlying platform that accelerates AI development at scale for both internal and external data scientists. Simon Chan shares his experience building this unified platform for a multitenancy, multibusiness cloud enterprise.

2:55pm–3:35pm Wednesday, September 27, 2017
Location: 1E 07/08 | Level: Intermediate
Jun Rao (Confluent)
Average rating: 5.00 (3 ratings)
Over the last few years, streaming platform Apache Kafka has been used extensively for real-time data collecting, delivering, and processing, particularly in the enterprise. Jun Rao leads a deep dive into some of the key internals that help make Kafka popular and provide strong reliability guarantees.

2:55pm–3:35pm Wednesday, September 27, 2017
Location: 1E 09 | Level: Beginner
Marc Carlson (Seattle Children's Research Institute), Sean Taylor (Seattle Children's Research Institute)
Average rating: 5.00 (1 rating)
Marc Carlson and Sean Taylor offer an overview of Project Rainier, which leverages the power of HDFS and the Hadoop and Spark ecosystem to help scientists at Seattle Children’s Research Institute quickly find new patterns and generate predictions that they can test later, accelerating important pediatric research and increasing scientific collaboration by highlighting where it is needed most.

4:35pm–5:15pm Wednesday, September 27, 2017
Location: 1A 15/16/17 | Level: Intermediate
Secondary topics: Architecture, Streaming
Paul Curtis (MapR Technologies)
Average rating: 4.67 (3 ratings)
A microservices architecture benefits from the agility of containers for convenient, predictable deployment of applications, while persistent, performant message streaming makes both work better. Paul Curtis explores these infrastructure components and discusses the design of highly scalable real-world systems that take advantage of this powerful triad.

4:35pm–5:15pm Wednesday, September 27, 2017
Location: 1A 21/22 | Level: Beginner
Average rating: 4.00 (2 ratings)
Apache Kudu is a new, innovative distributed storage system that combines low-latency data ingestion, scalable analytics, and fast data lookups. But what does it deliver in practice? Zbigniew Baranowski explains how to use Apache Kudu for scale-out database-like systems, such as those used at CERN, covering the advantages and limitations and measuring performance.

4:35pm–5:15pm Wednesday, September 27, 2017
Location: 1A 23/24 | Level: Advanced
Secondary topics: Architecture, Media, Platform
Barbara Eckman (Comcast)
Average rating: 3.00 (2 ratings)
Barbara Eckman offers an overview of Comcast’s streaming data platform, which is composed of a variety of ingest, transformation, and storage services. The platform uses Apache Avro schemas to support end-to-end data governance, Apache Atlas for data discovery and lineage, and custom asynchronous messaging libraries to notify Atlas of new data and schema entities and lineage links as they are created.

4:35pm–5:15pm Wednesday, September 27, 2017
Location: 1E 07/08 | Level: Intermediate
Secondary topics: Streaming
Fabian Hueske (data Artisans)
Average rating: 4.00 (1 rating)
Although the most widely used language for data analysis, SQL is only slowly being adopted by open source stream processors. One reason is that SQL's semantics and syntax were not designed with streaming data in mind. Fabian Hueske explores Apache Flink's two relational APIs for streaming analytics (standard SQL and the LINQ-style Table API), discussing their semantics and showcasing their usage.

4:35pm–5:15pm Wednesday, September 27, 2017
Location: 1E 09 | Level: Intermediate
Secondary topics: Architecture, Platform, Streaming
Stephen Devine (Big Fish Games), Kalah Brown (Big Fish Games)
Companies are increasingly interested in processing and analyzing live-streaming data. The Hadoop ecosystem includes platforms and software library frameworks to support this work, but these components require correct architecture, performance tuning, and customization. Stephen Devine and Kalah Brown explain how they used Spark, Flume, and Kafka to build a live-streaming data pipeline.

5:25pm–6:05pm Wednesday, September 27, 2017
Location: 1A 15/16/17 | Level: Intermediate
Secondary topics: Cloud, Media, Platform
Josh Baer (Spotify), Alison Gilles (Spotify)
Average rating: 4.00 (1 rating)
In early 2016, Spotify decided that it didn’t want to be in the data center business. The future was the cloud. Josh Baer and Alison Gilles explain what it took to move Spotify to the cloud, covering Spotify's technology choices, challenges faced, and the lessons Spotify learned along the way.

5:25pm–6:05pm Wednesday, September 27, 2017
Location: 1A 21/22 | Level: Advanced
Adrian Popescu (Unravel Data Systems), Shivnath Babu (Unravel Data Systems)
One roadblock to the agility that comes with Spark is that application developers can get stuck on application failures and have a tough time finding and resolving the underlying issue. Adrian Popescu and Shivnath Babu explain how to use the root cause diagnosis algorithm and methodology to solve failure problems with ML and AI apps in Spark.

5:25pm–6:05pm Wednesday, September 27, 2017
Location: 1A 23/24 | Level: Intermediate
Ihab Ilyas (University of Waterloo | Tamr)
Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas provides insight into various techniques and discusses how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.

5:25pm–6:05pm Wednesday, September 27, 2017
Location: 1E 07/08 | Level: Beginner
Secondary topics: Financial services, Media, Streaming
Karthik Ramasamy (Streamlio), Supun Kamburugamuve (Indiana University)
Modern enterprises are data driven and want to move at light speed. To achieve real-time performance, financial applications use streaming infrastructures for low latency and high throughput. Twitter Heron is an open source streaming engine with low latency around 14 ms. Karthik Ramasamy and Supun Kamburugamuve explain how they ported Heron to InfiniBand to achieve latencies as low as 7 ms.

5:25pm–6:05pm Wednesday, September 27, 2017
Location: 1E 09 | Level: Intermediate
Secondary topics: Architecture, IoT
Dave Shuman (Cloudera), James Kirkland (Red Hat)
Eclipse IoT is an ecosystem of organizations that are working together to establish an IoT architecture based on open source technologies and standards. Dave Shuman and James Kirkland showcase an end-to-end architecture for the IoT based on open source standards, highlighting Eclipse Kura, an open source stack for gateways and the edge, and Eclipse Kapua, an open source IoT cloud platform.

11:20am–12:00pm Thursday, September 28, 2017
Location: 1A 15/16/17 | Level: Intermediate
Secondary topics: Cloud
Chris Mills (The Meet Group)
if(we)'s batch event processing pipeline is different from yours, but the process of migrating it from running in a data center to running in AWS is likely pretty similar. Chris Mills explains what was easier than expected, what was harder, and what the company wished it had known before starting the migration.

11:20am–12:00pm Thursday, September 28, 2017
Location: 1A 18 | Level: Intermediate
Secondary topics: Cloud
Stephen Wu (Microsoft)
Average rating: 4.00 (1 rating)
Remote storage in the cloud provides an infinitely scalable, cost-effective, and performant solution for big data customers. Adoption is rapid due to the flexibility and cost savings associated with unlimited storage capacity when separating compute and storage. Stephen Wu demonstrates how to tune the performance of your workloads when your data is stored in remote storage in the cloud.

11:20am–12:00pm Thursday, September 28, 2017
Location: 1A 23/24 | Level: Beginner
Secondary topics: Architecture, Cloud, Streaming
Gwen Shapira (Confluent)
Average rating: 4.50 (2 ratings)
Gwen Shapira explains how the three realities of modern programming (the explosion of data and data systems, building business processes as microservices instead of monolithic applications, and the rise of the public cloud) affect how developers and companies operate today and why companies across all industries are turning to streaming data and Apache Kafka for mission-critical applications.

11:20am–12:00pm Thursday, September 28, 2017
Location: 1E 07/08 | Level: Beginner
Secondary topics: Streaming
Reuven Lax (Google)
Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. Reuven Lax offers an overview of Beam's basic concepts and demonstrates that portability in action.

11:20am–12:00pm Thursday, September 28, 2017
Location: 1E 09 | Level: Intermediate
Secondary topics: Architecture, IoT, Streaming
Michael Crutcher (Cloudera), Ryan Lippert (Cloudera)
A long time ago in a data center far, far away, we deployed complex lambda architectures as the backbone of our IoT solutions. Though hard, they enabled collection of real-time sensor data and slightly delayed analytics. Michael Crutcher and Ryan Lippert explain why Apache Kudu, a relational storage layer for fast analytics on fast data, is the key to unlocking the value in IoT data.

1:15pm–1:55pm Thursday, September 28, 2017
Location: 1A 15/16/17 | Level: Intermediate
Secondary topics: Cloud
Bill Havanki (Cloudera)
Speed and reliability in deploying big data clusters are key for effectiveness in the cloud. Drawing on ideas from his book Moving Hadoop to the Cloud, which covers essential practices like baking images and automating cluster configuration, Bill Havanki explains how you can automate the creation of new clusters from scratch and use metrics gathered using the cloud provider to scale up.

1:15pm–1:55pm Thursday, September 28, 2017
Location: 1A 21/22 | Level: Intermediate
Secondary topics: Data for good, Media, Platform
Andrew Otto (Wikimedia Foundation), Fangjin Yang (Imply)
The Wikimedia Foundation (WMF) is a nonprofit charitable organization. As the parent company of Wikipedia, one of the most visited websites in the world, WMF faces many unique challenges around its ecosystem of editors, readers, and content. Andrew Otto and Fangjin Yang explain how the WMF does analytics and offer an overview of the technology it uses to do so.

1:15pm–1:55pm Thursday, September 28, 2017
Location: 1A 23/24 | Level: Intermediate
Tony McAllister (Be the Match (National Marrow Donor Program))
The National Marrow Donor Program (Be the Match) recently moved its core transplant matching platform onto Cloudera Hadoop. Tony McAllister explains why the program chose Cloudera Hadoop and shares its big data goals: to increase the number of donors and matches, make the process more efficient, and make transplants more effective.

1:15pm–1:55pm Thursday, September 28, 2017
Location: 1E 07/08 | Level: Intermediate
Secondary topics: Streaming
Tyler Akidau (Google)
Average rating: 4.40 (5 ratings)
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.

1:15pm–1:55pm Thursday, September 28, 2017
Location: 1E 09 | Level: Beginner
Secondary topics: Architecture, Streaming
Matteo Merli (Streamlio), Sijie Guo (Streamlio)
Average rating: 5.00 (2 ratings)
Modern enterprises produce data at increasingly high volume and velocity. To process data in real time, new types of storage systems have been designed, implemented, and deployed. Matteo Merli and Sijie Guo offer an overview of Apache DistributedLog and Pulsar, real-time storage systems built using Apache BookKeeper and used heavily in production.

2:05pm–2:45pm Thursday, September 28, 2017
Location: 1A 15/16/17 | Level: Beginner
Secondary topics: Cloud
Michael McCune (Red Hat)
Average rating: 5.00 (2 ratings)
Notebook interfaces like Apache Zeppelin and Project Jupyter are excellent starting points for sketching out ideas and exploring data-driven algorithms, but where does the process lead after the notebook work has been completed? Michael McCune offers some answers as they relate to cloud-native platforms.

2:05pm–2:45pm Thursday, September 28, 2017
Location: 1A 21/22 | Level: Intermediate
Sneha Rao (Spotify), Joel Östlund (Spotify)
Spotify makes data-driven product decisions. As the company grows, the magnitude and complexity of the data it cares about most are rapidly increasing. Sneha Rao and Joel Östlund walk you through how Spotify stores and exposes audience data created from multiple internal producers within Spotify.

2:05pm–2:45pm Thursday, September 28, 2017
Location: 1A 23/24 | Level: Advanced
Julien Le Dem (Apache Parquet)
Average rating: 4.75 (4 ratings)
Julien Le Dem explains how Parquet is improving at the storage level, with metadata and statistics that will facilitate more optimizations in query engines in the future, how the new vectorized reader from Parquet to Arrow enables much faster reads by removing abstractions, and how standard Arrow-based APIs are paving the way to breaking the silos of big data.

2:05pm–2:45pm Thursday, September 28, 2017
Location: 1E 07/08 | Level: Intermediate
Gwen Shapira (Confluent)
Average rating: 3.33 (3 ratings)
There are many good reasons to run more than one Kafka cluster…and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions, so you can better choose the right architecture for your needs.

2:05pm–2:45pm Thursday, September 28, 2017
Location: 1E 09 | Level: Non-technical
Secondary topics: ecommerce, Geospatial, IoT, Logistics, Platform, Retail
Javier Esplugas (DHL Supply Chain), Kevin Parent (Conduce)
DHL has created an IoT initiative for its supply chain warehouse operations. Javier Esplugas and Kevin Parent explain how DHL has gained unprecedented insight (from the most comprehensive global view across all locations to a unique data feed from a single sensor) to see, understand, and act on everything that occurs in its warehouses with immersive operational data visualization.
2:55pm–3:35pm Thursday, September 28, 2017
Location: 1A 15/16/17 Level: Intermediate
Secondary topics:  Architecture
Jennifer Wu (Cloudera), Philip Langdale (Cloudera), Kostas Sakellis (Cloudera)
With its scalable data store, elastic compute, and pay-as-you-go cost model, cloud infrastructure is well-suited for large-scale data engineering workloads. Jennifer Wu, Philip Langdale, and Kostas Sakellis explore the latest cloud technologies, focusing on data engineering workloads, cost, security, and ease-of-use implications for data engineers.
2:55pm–3:35pm Thursday, September 28, 2017
Location: 1A 21/22 Level: Advanced
Kimoon Kim (Pepperdata)
There is growing interest in running Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. Kimoon Kim demonstrates how to run HDFS inside Kubernetes to speed up Spark.
2:55pm–3:35pm Thursday, September 28, 2017
Location: 1A 23/24 Level: Intermediate
Secondary topics:  Architecture, Media, Platform
Felix GV (LinkedIn), Yan Yan (LinkedIn)
Average rating: 2.00 (1 rating)
Companies with batch and stream processing pipelines need to serve the insights they glean back to their users, an often-overlooked problem that can be hard to solve reliably and at scale. Felix GV and Yan Yan offer an overview of Venice, a new data store capable of ingesting data from Hadoop and Kafka, merging it together, replicating it globally, and serving it online at low latency.
2:55pm–3:35pm Thursday, September 28, 2017
Location: 1E 07/08 Level: Intermediate
Secondary topics:  Streaming
Tim Berglund (Confluent)
Average rating: 2.50 (2 ratings)
Tim Berglund offers a thorough introduction to the Streams API, an important recent addition to Kafka that lets us build sophisticated stream processing systems that are as scalable and fault tolerant as Kafka itself and that also happen to align quite well with the microservices sensibilities that are so common in contemporary architectural thinking.
2:55pm–3:35pm Thursday, September 28, 2017
Location: 1E 09 Level: Beginner
Secondary topics:  IoT
Alexandra Gunderson (Arundo Analytics)
One of the main challenges when working with industrial data is linking large amounts of data and extracting value from them. Alexandra Gunderson shares a comprehensive preprocessing methodology that structures and links data from different sources, converting the IIoT analytics process from an unorganized mammoth to one more likely to generate insight.
2:55pm–3:35pm Thursday, September 28, 2017
Location: 1A 01/02 Level: Intermediate
Secondary topics:  Media
Shirshanka Das (LinkedIn), Tushar Shanbhag (LinkedIn)
Shirshanka Das and Tushar Shanbhag explore the big data ecosystem at LinkedIn and share its journey to preserve member privacy while providing data democracy. Shirshanka and Tushar focus on three foundational building blocks for scalable data management that can meet data compliance regulations: a central metadata system, an integrated data movement platform, and a unified data access layer.
4:35pm–5:15pm Thursday, September 28, 2017
Location: 1A 15/16/17 Level: Intermediate
Secondary topics:  Cloud
Felipe Hoffa (Google)
Average rating: 5.00 (1 rating)
With Google BigQuery, anyone can easily analyze more than five years of GitHub metadata and 42+ terabytes of open source code. Felipe Hoffa explains how to leverage this data to understand the community and code related to any language or project. Relevant for open source creators, users, and choosers, this is data that you can leverage to make better choices.
4:35pm–5:15pm Thursday, September 28, 2017
Location: 1A 21/22 Level: Intermediate
Average rating: 5.00 (1 rating)
Common ETL jobs used for importing log data into Hadoop clusters require a considerable amount of resources, which varies based on the input size. Thiruvalluvan M G shares a set of techniques (involving an innovative use of Spark processing and exploiting features of Hadoop file formats) that not only make these jobs much more efficient but also work well with fixed amounts of resources.
4:35pm–5:15pm Thursday, September 28, 2017
Location: 1A 23/24 Level: Non-technical
Bob Eilbacher (Caserta)
Building an efficient analytics environment requires a strong infrastructure. Bob Eilbacher explains how to implement a strong DevOps practice for data analysis, starting with the necessary cultural changes that must be made at the executive level and ending with an overview of potential DevOps toolchains.
4:35pm–5:15pm Thursday, September 28, 2017
Location: 1E 09 Level: Intermediate
Secondary topics:  IoT
Lloyd Palum (Vnomics)
Average rating: 5.00 (2 ratings)
A digital twin models a real-world physical asset using mobile data, cloud computing, and machine learning to track chosen characteristics. Lloyd Palum walks you through building a tractor trailer digital twin using Python and TensorFlow. You can then use the example model to track and optimize performance.