Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Data Engineering & Architecture

How to build an analytics infrastructure that unlocks the value of your data

The right data infrastructure and architecture streamline your workflows, reduce costs, and scale your data analysis. The wrong architecture costs time and money that may never be recovered.

It’s not easy. Selecting and building the tools and architecture you need is complex. You need to consider scalability, tools and frameworks (open source or proprietary), platforms (on-premises, cloud, or hybrid), integration, adoption and migration, and security. That’s why good data engineers are in such demand.

These sessions will help you navigate the pitfalls, select the right tools and technologies, and design a robust data pipeline.

Tuesday March 6: Tutorials (Gold & Silver passes)
Wednesday March 7: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45 | Location: San Jose Ballroom (salon 1&2)
Strata Data Conference Keynotes
10:30am
Morning break
Thursday March 8: Keynotes & Sessions (Platinum, Gold, Silver & Bronze passes)
8:45 | Location: San Jose Ballroom (salon 1&2)
Strata Data Conference Keynotes
10:30am
Morning break
9:00am–12:30pm Tuesday, March 6, 2018
Location: LL21 A Level: Intermediate
Philip Langdale (Cloudera), Eugene Fratkin (Cloudera), Vinithra Varadharajan (Cloudera), Mala Ramakrishnan (Cloudera)
Vinithra Varadharajan, Philip Langdale, and Eugene Fratkin lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.
9:00am–12:30pm Tuesday, March 6, 2018
Location: LL21 B Level: Intermediate
Jorge A. Lopez (Amazon Web Services)
Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.
9:00am–12:30pm Tuesday, March 6, 2018
Location: 210 B/F Level: Beginner
Secondary topics:  Graphs and Time-series
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (Streamlio)
Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.
9:00am–12:30pm Tuesday, March 6, 2018
Location: 210 C/G Level: Beginner
Tim Berglund (Confluent)
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.
9:00am–5:00pm Tuesday, March 6, 2018
Location: LL20 A
Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Meagan O'Leary (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Rajiv Synghal (Kaiser Permanente), Rishi Ranjan (Freddie Mac), Wayde Fleener (General Mills), Jules Malin (GoPro), Taylor Martin (O'Reilly Media, Inc.)
Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits (and drawbacks) of their solutions.
1:30pm–5:00pm Tuesday, March 6, 2018
Location: LL21 A Level: Intermediate
Juan Yu (Cloudera)
Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu explores the cost model the Impala planner uses, how Impala optimizes queries, how to identify performance bottlenecks through query plans and profiles, and how to drive Impala to its full potential.
1:30pm–5:00pm Tuesday, March 6, 2018
Location: LL21 B Level: Intermediate
Ron Bodkin (Google)
TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
1:30pm–5:00pm Tuesday, March 6, 2018
Location: 210 B/F Level: Intermediate
Secondary topics:  Graphs and Time-series
Ted Malaska (Blizzard Entertainment)
If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data.
1:30pm–5:00pm Tuesday, March 6, 2018
Location: 210 C/G Level: Intermediate
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.
11:00am–11:40am Wednesday, March 7, 2018
Location: LL21 C/D Level: Intermediate
Tom Fisher (MapR Technologies)
The monolithic cloud is dying. Delivering capabilities across multiple clouds while simultaneously transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations.
11:00am–11:40am Wednesday, March 7, 2018
Location: LL21 E/F Level: Intermediate
Kinnary Jangla (Pinterest)
Having trouble coordinating development of your production ML system among a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.
11:00am–11:40am Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Shivnath Babu (Duke University | Unravel Data Systems), Sumit Jindal (Unravel Data Systems)
Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Sumit Jindal explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.
11:00am–11:40am Wednesday, March 7, 2018
Location: 230 C Level: Intermediate
Daniel Templeton (Cloudera), Andrew Wang (Cloudera)
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
11:00am–11:40am Wednesday, March 7, 2018
Location: 212 A-B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Gwen Shapira (Confluent)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.
11:50am–12:30pm Wednesday, March 7, 2018
Location: LL21 C/D Level: Intermediate
Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)
Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.
11:50am–12:30pm Wednesday, March 7, 2018
Location: LL21 E/F Level: Non-technical
Crystal Valentine (MapR Technologies)
DataOps, a methodology for developing and deploying data-intensive applications (especially those involving data science and machine learning pipelines), supports cross-functional collaboration and fast time to value with an Agile, self-service workflow. Crystal Valentine offers an overview of this emerging field and explains how to implement a DataOps process.
11:50am–12:30pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Secondary topics:  Graphs and Time-series
William Chambers (Databricks), Michael Armbrust (Databricks)
William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.
11:50am–12:30pm Wednesday, March 7, 2018
Location: 230 C Level: Intermediate
Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)
Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.
11:50am–12:30pm Wednesday, March 7, 2018
Location: 212 A-B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Eugene Kirpichov (Google)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive SplittableDoFn.
1:50pm–2:30pm Wednesday, March 7, 2018
Location: LL21 C/D Level: Intermediate
Kurt Brown (Netflix)
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.
1:50pm–2:30pm Wednesday, March 7, 2018
Location: LL21 E/F Level: Beginner
Zhen Fan (JD.com), Wei Ting Chen (Intel)
Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.
1:50pm–2:30pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Debasish Ghosh (Lightbend)
Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and how they can be used to implement solutions for fast and streaming architectures.
1:50pm–2:30pm Wednesday, March 7, 2018
Location: 230 C Level: Beginner
Siddharth Teotia (Dremio)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.
1:50pm–2:30pm Wednesday, March 7, 2018
Location: 212 A-B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Jordan Hambleton (Cloudera), Guru Medasani (Cloudera)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management gives users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
2:40pm–3:20pm Wednesday, March 7, 2018
Location: LL21 C/D Level: Intermediate
Mark Grover (Lyft), Arup Malakar (Lyft)
Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving.
2:40pm–3:20pm Wednesday, March 7, 2018
Location: LL21 E/F Level: Intermediate
Manu Mukerji (Criteo)
Criteo is a global leader in commerce marketing. Manu Mukerji walks you through Criteo's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; the production issues that arise when applying ML at scale; lessons learned; and more.
2:40pm–3:20pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Henry Cai (Pinterest), Yi Yin (Pinterest)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.
2:40pm–3:20pm Wednesday, March 7, 2018
Location: 230 C Level: Intermediate
Ritesh Agrawal (Uber), Anirban Deb (Uber)
Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resources and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money.
2:40pm–3:20pm Wednesday, March 7, 2018
Location: 212 A-B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Sean Kandel (Trifacta)
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Kandel discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.
4:20pm–5:00pm Wednesday, March 7, 2018
Location: LL21 C/D Level: Beginner
Carlo Torniai (Pirelli Tyre)
Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of different contributions across cross-functional teams.
4:20pm–5:00pm Wednesday, March 7, 2018
Location: LL21 E/F Level: Beginner
Rahim Daya (Pinterest)
Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user.
4:20pm–5:00pm Wednesday, March 7, 2018
Location: 230 A Level: Beginner
Secondary topics:  Graphs and Time-series
Sijie Guo (Streamlio)
Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage.
4:20pm–5:00pm Wednesday, March 7, 2018
Location: 230 C Level: Intermediate
Gian Merlino (Imply)
Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects.
4:20pm–5:00pm Wednesday, March 7, 2018
Location: 212 A-B
Secondary topics:  Data Integration and Data Pipelines
Dorna Bandari (Jetlore)
Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.
5:10pm–5:50pm Wednesday, March 7, 2018
Location: LL21 C/D Level: Beginner
Thomas Phelan (BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with transparent data encryption (TDE). However, TDE can be difficult to configure and manage, issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them.
5:10pm–5:50pm Wednesday, March 7, 2018
Location: LL21 E/F Level: Beginner
Ted Dunning (MapR Technologies)
Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot, zero-latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime.
5:10pm–5:50pm Wednesday, March 7, 2018
Location: 230 A Level: Intermediate
Fabian Hueske (data Artisans), Shuyi Chen (Uber)
Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.
5:10pm–5:50pm Wednesday, March 7, 2018
Location: 212 A-B Level: Non-technical
Secondary topics:  Data Integration and Data Pipelines
Abe Gong (Superconductive Health), James Campbell (USG)
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.
5:10pm–5:50pm Wednesday, March 7, 2018
Location: 230 C Level: Advanced
Ash Munshi (Pepperdata)
Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series.
11:00am–11:40am Thursday, March 8, 2018
Location: LL21 C/D Level: Intermediate
Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)
Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.
11:00am–11:40am Thursday, March 8, 2018
Location: LL21 E/F Level: Intermediate
Greg Rahn (Cloudera)
For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud.
11:00am–11:40am Thursday, March 8, 2018
Location: 230 A Level: Intermediate
Tyler Akidau (Google)
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.
11:00am–11:40am Thursday, March 8, 2018
Location: 230 C Level: Intermediate
Jiangjie Qin (LinkedIn)
LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.
11:50am–12:30pm Thursday, March 8, 2018
Location: LL21 C/D Level: Beginner
Dong Meng (MapR)
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters.
11:50am–12:30pm Thursday, March 8, 2018
Location: LL21 E/F Level: Beginner
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.
11:50am–12:30pm Thursday, March 8, 2018
Location: 230 A Level: Beginner
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)
Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and discuss how applications can benefit.
11:50am–12:30pm Thursday, March 8, 2018
Location: 230 C Level: Intermediate
Secondary topics:  Graphs and Time-series
Alexis Roos (Salesforce), Noah Burbank (Salesforce)
In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time to surface insights and recommendations using a graph that models contextual data.
1:50pm–2:30pm Thursday, March 8, 2018
Location: LL21 C/D Level: Intermediate
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications.
1:50pm–2:30pm Thursday, March 8, 2018
Location: LL21 E/F Level: Intermediate
Chris Harland (Textio)
The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product.
1:50pm–2:30pm Thursday, March 8, 2018
Location: 230 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Michael Freedman (TimescaleDB | Princeton)
Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.
1:50pm–2:30pm Thursday, March 8, 2018
Location: 230 C Level: Intermediate
Holden Karau (Google), Rachel Warren (Independent)
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
2:40pm–3:20pm Thursday, March 8, 2018
Location: LL21 C/D Level: Intermediate
Michelle Casbon (Qordoba)
Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime.
2:40pm–3:20pm Thursday, March 8, 2018
Location: LL21 E/F Level: Intermediate
Ajay Mothukuri (Sapient), Arunkumar Ramanatha (Sapient), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
Ajay Mothukuri, Arunkumar Ramanatha, and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR).
2:40pm–3:20pm Thursday, March 8, 2018
Location: 230 A Level: Beginner
Secondary topics:  Graphs and Time-series
Stephan Ewen (data Artisans), Flavio Junqueira (Dell EMC)
Stephan Ewen and Flavio Junqueira detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams) that offers an unprecedented way of handling “everything as a stream” that includes unbounded streaming storage and unified batch and streaming abstraction and dynamically accommodates workload variations in a novel way.
2:40pm–3:20pm Thursday, March 8, 2018
Location: 230 C Level: Intermediate
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies.
4:20pm–5:00pm Thursday, March 8, 2018
Location: LL21 C/D Level: Advanced
Secondary topics:  Graphs and Time-series
Yu Xu (TigerGraph)
Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.
4:20pm–5:00pm Thursday, March 8, 2018
Location: LL21 E/F Level: Intermediate
Secondary topics:  Graphs and Time-series
Roy Ben-Alta (Amazon Web Services), Ira Cohen (Anodot)
Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data to metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution.
4:20pm–5:00pm Thursday, March 8, 2018
Location: 230 A Level: Intermediate
Dean Wampler (Lightbend)
Dean Wampler explores two microservice streaming applications based on Kafka to compare and contrast using Akka Streams and Kafka Streams for data processing. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.
4:20pm–5:00pm Thursday, March 8, 2018
Location: 230 C Level: Intermediate
Tomer Kaftan (University of Washington)
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.