Presented By O’Reilly and Cloudera

San Jose • London • New York

Make Data Work

March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Schedule: Graphs and Time-series sessions

These two fundamental data types were part of the rise of big data. Many common and important use cases lend themselves to graph analytics or time-series analysis. We want to showcase the latest generation of tools and methods for cleaning, preparing, storing, and analyzing graphs and time-series. Improvements in both software and hardware are leading to new solutions for analysts, data scientists, and engineers.

9:00am–12:30pm Tuesday, March 6, 2018

Modern real-time streaming architectures

Data engineering and architecture, Streaming systems and real-time applications
Location: 210 B/F

Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (StreamNative), Arun Kejariwal (Independent)

Average rating:

(5.00, 2 ratings)

Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.

9:00am–12:30pm Tuesday, March 6, 2018

Learning PyTorch by building a recommender system

Big data and data science in the cloud, Data science and machine learning
Location: LL21 A

Mo Patel (Independent), Neejole Patel (Virginia Tech)

Average rating:

(2.50, 4 ratings)

Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model. Read more.

9:00am–5:00pm Tuesday, March 6, 2018

Media and Ad Tech Day

Location: LL20 B

David Boyle (Audience Strategies), Violeta Hennessey (Warner Bros.), April Chen (Civis Analytics), Sridhar Alla (BlueWhale), Noah Gift (UC Davis), Blake Irvine (Netflix), Kevin Lyons (Nielsen Marketing Cloud), Jennifer Webb (SuprFanz), Rizwan Patel (Caesars Entertainment), Anthony Accardo (Disney), Amanda Gerdes (Blizzard Entertainment), Aneesh Karve (Quilt), Pete Skomoroch (Workday)

Hear from innovators in ad tech, measurement, automation, and audience engagement about where the media industry is today—and where it's likely to go next. Read more.

1:30pm–5:00pm Tuesday, March 6, 2018

Time series data: Architecture and use cases

Data engineering and architecture
Location: 210 B/F

Ted Malaska (Capital One)

Average rating:

(2.80, 5 ratings)

If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data. Read more.
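The two windowing ideas named in this abstract can be illustrated with a minimal Python sketch. It is not taken from the session: the function names, the sample events, and the `width` and `gap` parameters are assumptions chosen for illustration, and it runs in batch on a small in-memory list rather than on a stream.

```python
from collections import defaultdict


def tumbling_windows(events, width):
    """Group (timestamp, value) pairs into fixed, non-overlapping windows.

    Returns a dict mapping each window's start time to the values in it.
    """
    windows = defaultdict(list)
    for ts, value in events:
        window_start = (ts // width) * width  # align to the window boundary
        windows[window_start].append(value)
    return dict(windows)


def sessionize(timestamps, gap):
    """Split timestamps into sessions.

    A new session starts whenever the time since the previous event
    is greater than `gap`.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)  # close enough: extend current session
        else:
            sessions.append([ts])  # too far apart: open a new session
    return sessions


# Hypothetical sample data: (timestamp in seconds, payload).
events = [(1, "a"), (4, "b"), (12, "c"), (15, "d"), (31, "e")]
print(tumbling_windows(events, width=10))
print(sessionize([ts for ts, _ in events], gap=5))
```

Tumbling windows depend only on the clock, while sessions depend on the spacing between events, so the same data can be grouped differently by each.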

11:00am–11:40am Wednesday, March 7, 2018

Using machine learning to simplify Kafka operations

Data engineering and architecture, Streaming systems and real-time applications
Location: 230 A

Shivnath Babu (Duke University | Unravel Data Systems), Dhruv Goel (Microsoft)

Average rating:

(4.50, 2 ratings)

Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka. Read more.

11:50am–12:30pm Wednesday, March 7, 2018

Streaming big data in the cloud: What to consider and why

Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications
Location: 230 A

Bill Chambers (Databricks), Michael Armbrust (Databricks)

Average rating:

(4.60, 5 ratings)

William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Machine learning to tackle industrial data fusion

Big data and data science in the cloud, Data science and machine learning
Location: LL20 A

Alexandra Gunderson (Arundo Analytics)

Average rating:

(5.00, 1 rating)

Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources. Read more.

1:50pm–2:30pm Wednesday, March 7, 2018

Writing distributed graph algorithms

Data science and machine learning
Location: LL20 C

Andrew Ray (Sam’s Club Technology)

Average rating:

(3.00, 3 ratings)

Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions. Read more.
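As a rough illustration of the vertex-centric "think like a vertex" style that Pregel introduced, the sketch below simulates supersteps in a single Python process to label connected components. It is an assumption-laden toy, not the GraphX or PowerGraph API and not material from the session: the function name and the sample edge list are hypothetical, and vertex ids are assumed to be mutually comparable.

```python
def pregel_connected_components(edges):
    """Label each vertex with the smallest vertex id in its component.

    Single-process simulation of Pregel-style supersteps: in each step,
    a vertex reads its incoming messages, updates its own state, and
    sends messages to neighbors only if its state changed.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Initial state: every vertex holds its own id as its label.
    labels = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its label to all of its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(labels[v])

    # Later supersteps: run until no messages are in flight.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if not messages:
                continue  # no messages: the vertex stays halted
            best = min(messages)
            if best < labels[v]:
                labels[v] = best
                for n in neighbors[v]:
                    outbox[n].append(best)  # propagate the improvement
        inbox = outbox
    return labels


# Hypothetical graph with two components: {1, 2, 3} and {4, 5}.
print(pregel_connected_components([(1, 2), (2, 3), (4, 5)]))
```

In a real distributed system the per-vertex loop body would run in parallel across partitions, with the message exchange between supersteps acting as the synchronization barrier.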

1:50pm–2:30pm Wednesday, March 7, 2018

Deep credit risk ranking with LSTM

Data science and machine learning
Location: LL21 B

Kyle Grove (Teradata)

Average rating:

(5.00, 5 ratings)

Kyle Grove explains how Teradata and some of the world’s largest financial institutions are innovating credit risk ranking with deep learning techniques and AnalyticOps. With the AnalyticOps framework, these organizations have built models with increased accuracy to drive more profitable lending decisions while being explainable to regulators. Read more.

2:40pm–3:20pm Wednesday, March 7, 2018

Real-time deep link analytics: The next stage of graph analytics

Big data and data science in the cloud, Data engineering and architecture, Data-driven business management, Platform security and cybersecurity, Streaming systems and real-time applications
Location: Expo Hall 1

Yu Xu (TigerGraph)

Average rating:

(5.00, 2 ratings)

Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups. Read more.
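To make "hops" concrete, here is a minimal breadth-first sketch in Python that collects every vertex within `k` hops of a starting vertex. It only illustrates the idea of hop-limited traversal on an in-memory adjacency map; it is not TigerGraph's engine or query language, and the function name and the sample account chain are hypothetical.

```python
from collections import deque


def within_k_hops(adjacency, start, k):
    """Return {vertex: hop_count} for vertices reachable in at most k hops."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if hops[vertex] == k:
            continue  # hop budget used up: do not expand further
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in hops:  # first visit gives the shortest hop count
                hops[neighbor] = hops[vertex] + 1
                queue.append(neighbor)
    return hops


# Hypothetical chain of accounts linked by transfers: a -> b -> c -> d -> e.
transfers = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"]}
print(within_k_hops(transfers, "a", 2))
print(within_k_hops(transfers, "a", 4))
```

With a two-hop limit the traversal from `a` never reaches `d` or `e`, which is the kind of connection a deeper traversal is meant to surface.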

2:40pm–3:20pm Wednesday, March 7, 2018

Why nobody cares about your anomaly detection

Big data and data science in the cloud, Data science and machine learning, Visualization and user experience
Location: LL20 A

Baron Schwartz (VividCortex)

Average rating:

(4.80, 5 ratings)

Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Code Property Graph: A modern, queryable data storage for source code

Big data and data science in the cloud, Data science and machine learning
Location: LL20 C

Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)

Average rating:

(4.00, 1 rating)

Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing it to be rapidly accessed. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Stream storage with Apache BookKeeper

Data engineering and architecture, Streaming systems and real-time applications
Location: 230 A

Sijie Guo (StreamNative)

Average rating:

(3.67, 3 ratings)

Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Machine learning applications for the industrial internet

Data science and machine learning
Location: LL20 A

Joseph Richards (GE Digital)

Average rating:

(5.00, 1 rating)

Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas. Read more.

4:20pm–5:00pm Wednesday, March 7, 2018

Detecting time series anomalies at Uber scale with recurrent neural networks

Data science and machine learning
Location: LL21 B

Andrea Pasqua (Uber), Anny Chen (Uber)

Average rating:

(4.60, 5 ratings)

Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge. Read more.

5:10pm–5:50pm Wednesday, March 7, 2018

Continuous machine learning over streaming data

Data science and machine learning, Streaming systems and real-time applications
Location: LL20 A

Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)

Average rating:

(5.00, 8 ratings)

Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs. Read more.

11:00am–11:40am Thursday, March 8, 2018

Understanding metadata

Big data and data science in the cloud, Data-driven business management, Strata Business Summit
Location: 210 C/G

Michael Schrenk (Self-Employed)

Average rating:

(4.00, 5 ratings)

Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers. Read more.

11:00am–11:40am Thursday, March 8, 2018

Graph analysis of 200,000 tweets from Russian Twitter trolls

Data science and machine learning
Location: LL20 B

Ryan Boyd (Neo4j)

Average rating:

(5.00, 1 rating)

Ryan Boyd explains how he and his team reconstructed a subset of the Twitter network of Russian troll accounts and applied graph analytics to the data using the Neo4j graph database to uncover how these accounts were spreading fake news. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Failed experiments in infrastructure security analytics and lessons learned from fixing them

Data science and machine learning, Platform security and cybersecurity
Location: LL20 A

Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))

Average rating:

(4.00, 1 rating)

How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process. Read more.

11:50am–12:30pm Thursday, March 8, 2018

Building a contacts graph from activity data

Data engineering and architecture
Location: 230 C

Alexis Roos (Salesforce), Noah Burbank (Salesforce)

Average rating:

(3.00, 1 rating)

In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time and surface insights and recommendations using a graph that models contextual data. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

TimescaleDB: Reengineering PostgreSQL as a time series database

Data engineering and architecture
Location: 230 A

Michael Freedman (TimescaleDB)

Average rating:

(4.50, 4 ratings)

Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries. Read more.

1:50pm–2:30pm Thursday, March 8, 2018

The real-time journey from raw streaming data to AI-based analytics

Data engineering and architecture, Data science and machine learning, Streaming systems and real-time applications
Location: Expo Hall 1

Roy Ben Alta (Amazon Web Services), Ira Cohen (Anodot)

Average rating:

(5.00, 1 rating)

Many domains, such as mobile, web, the IoT, and ecommerce, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution. Read more.

2:40pm–3:20pm Thursday, March 8, 2018

Unified and elastic batch and stream processing with Pravega and Apache Flink

Data engineering and architecture, Streaming systems and real-time applications
Location: 230 A

Fabian Hueske (data Artisans), Flavio Junqueira (Dell EMC)

Average rating:

(3.33, 3 ratings)

Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack treats “everything as a stream,” offering unbounded streaming storage and a unified batch and streaming abstraction, and it dynamically accommodates workload variations. Read more.

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com