Presented By O'Reilly and Cloudera
Make Data Work
March 13–14, 2017: Training
March 14–16, 2017: Tutorials & Conference
San Jose, CA

Monday, 03/13/2017

7:30am

7:30am–9:00am Monday, 03/13/2017
Location: Executive Concourse
Coffee break (1h 30m)

9:00am

9:00am–5:00pm Monday, 03/13/2017
Bruce Martin (Cloudera)
Average rating: ****.
(4.00, 1 rating)
Bruce Martin walks you through applying data science methods to real-world challenges in different industries, offering preparation for data scientist roles in the field. Join in to learn how Spark and Hadoop enable data scientists to help companies reduce costs, increase profits, improve products, retain customers, and identify new opportunities. Read more.
9:00am–5:00pm Monday, 03/13/2017
Secondary topics:  Architecture, Cloud
Jesse Anderson (Big Data Institute)
Average rating: ****.
(4.00, 1 rating)
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.
9:00am–5:00pm Monday, 03/13/2017
Secondary topics:  Deep learning
Robert Schroll (The Data Incubator)
Average rating: ***..
(3.20, 5 ratings)
Robert Schroll demonstrates TensorFlow's capabilities through its Python interface and explores TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine-learning models on real-world data. Read more.
9:00am–5:00pm Monday, 03/13/2017
Spark & beyond
Location: 212 C
Secondary topics:  Streaming
Jacob D Parr (JParr Productions)
The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. Jacob Parr employs hands-on exercises using various Wikipedia datasets to illustrate the variety of ideal programming paradigms Spark makes possible. Read more.

10:30am

10:30am–11:00am Monday, 03/13/2017
Location: Executive Concourse
Morning break (30m)

12:30pm

12:30pm–1:30pm Monday, 03/13/2017
Location: East Lobby
Break (1h)

3:00pm

3:00pm–3:30pm Monday, 03/13/2017
Location: Executive Concourse
Afternoon break (30m)

Tuesday, 03/14/2017

7:30am

7:30am–8:15am Tuesday, 03/14/2017
Location: LL Foyer and Executive Concourse
Coffee break (7:30am - 9am) (45m)

8:15am

8:15am–8:45am Tuesday, 03/14/2017
Location: East Lobby
Average rating: *****
(5.00, 2 ratings)
Gather before tutorials on Tuesday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. Read more.

9:00am

9:00am–5:00pm Tuesday, 03/14/2017
Location: LL20 A
Barbara Eckman (Comcast), Dirk Jungnickel (Emirates Integrated Telecommunications Company (du)), Kishore Papineni (Astellas Pharma), Paul Barth (Podium Data), Carlo Torniai (Pirelli Tyre), Bryan Harrison (American Express), Chris Murphy (Zurich Insurance Group), Martin Lidl (Deloitte), Maura Lynch (Pinterest), Nixon Patel (Kovid Group), Bas Geerdink (ING), Robin Li (Tapjoy), Yohan Chin (Tapjoy), Jim Harrold (NationBuilder), Lana Novikova (Heartbeat AI Technologies)
In a series of 12 half-hour talks aimed at a business audience, you’ll hear data-themed case studies from household brands and global companies, explaining the challenges they wanted to tackle, the approaches they took, and the benefits—and drawbacks—of their solutions. If you want practical insights about applied data, look no further. Read more.
9:00am–5:00pm Tuesday, 03/14/2017
Location: LL20 B
Michael Abbott (Kleiner Perkins Caufield & Byers), Christopher Pouliot (Nio), Jennifer Anderson, Renee DiResta (Haven), Coco Krumme (Haven | UC Berkeley), Ryan Baumann (Mapbox), Jay White Bear (IBM), Andre Luckow (BMW Group), Rajiv Paul (Yakit), Evangelos Simoudis (Synapse Partners), Roland Major (Transport for London), Rodrigo Fontecilla (Unisys), Lloyd Palum (Vnomics), Andreas Ribbrock (#zeroG, A Lufthansa Systems Company)
Data, Transportation, and Logistics Day offers a daylong deep-dive into how data science is changing transportation and logistics. We’ll investigate the latest advances in and applications of self-driving vehicles, automated drones, and embedded sensors and explore how new uses of data are challenging the industry to evolve infrastructure for the future. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Location: LL20 C
Edd Wilder-James (Silicon Valley Data Science), Ellen Friedman (Independent), Jim Scott (MapR Technologies), Gabriela de Queiroz (R-Ladies), Melanie Warrick (Google), Aneesh Karve (Quilt Data, Inc)
Data 101 introduces you to core principles of data architecture, teaches you how to build and manage successful data teams, and inspires you to do more with your data through real-world applications. Setting the foundation for deeper dives on the following days of Strata + Hadoop World, Data 101 reinforces data fundamentals and helps you focus on how data can solve your business problems. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Big data and the Cloud
Location: LL20 D
Secondary topics:  Cloud
Radhika Ravirala (Amazon Web Services (AWS)), Ryan Nienhuis (Amazon Web Services (AWS)), Ben Snively (Amazon Web Services (AWS)), Dario Rivera (Amazon Web Services (AWS))
Average rating: ***..
(3.50, 2 ratings)
Want to ramp up your knowledge of Amazon's big data web services and launch your first big data application on the cloud? Ben Snively, Radhika Ravirala, Ryan Nienhuis, and Dario Rivera walk you through building a big data application using open source technologies, such as Apache Hadoop, Spark, and Zeppelin, and AWS managed services, such as Amazon EMR, Amazon Kinesis, and more. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Big data and the Cloud
Location: LL21 A Level: Intermediate
Secondary topics:  Architecture, Cloud
Jennifer Wu (Cloudera), Eugene Fratkin (Cloudera), Andrei Savu (Cloudera), Tony Wu (Cloudera)
Average rating: ****.
(4.50, 2 ratings)
Jennifer Wu, Eugene Fratkin, Andrei Savu, and Tony Wu explore best practices for Hadoop deployments in the public cloud and provide detailed guidance for deploying, configuring, and managing Hive, Spark, and Impala in the public cloud. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Spark & beyond
Location: LL21 B Level: Intermediate
Dean Wampler (Lightbend)
Average rating: *****
(5.00, 4 ratings)
Apache Spark is written in Scala. Hence, many if not most data engineers adopting Spark are also adopting Scala, while most data scientists continue to use Python and R. Dean Wampler offers an overview of the core features of Scala you need to use Spark effectively, using hands-on exercises with the Spark APIs. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Data science & advanced analytics
Location: LL21 C/D Level: Intermediate
Secondary topics:  R
Vanja Paunic (Microsoft), Robert Horton (Microsoft), Hang Zhang (Microsoft), Srini Kumar (LevaData, Inc.), Mengyue Zhao (Microsoft), John-Mark Agosta (Microsoft), Mario Inchiosa (Microsoft), Debraj GuhaThakurta (Microsoft Corporation)
Average rating: **...
(2.50, 4 ratings)
Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Data science & advanced analytics
Location: LL21 E/F Level: Intermediate
Secondary topics:  Deep learning
Amy Unruh (Google), Yufeng Guo (Google)
Average rating: ***..
(3.69, 16 ratings)
Amy Unruh and Yufeng Guo walk you through training and deploying a machine-learning system using TensorFlow, a popular open source library. Amy and Yufeng begin by giving an overview of TensorFlow and demonstrating some fun, already-trained TensorFlow models. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Stream processing and analytics
Location: 210 A/E Level: Beginner
Secondary topics:  Streaming
Frances Perry (Google), Tyler Akidau (Google)
Average rating: ***..
(3.00, 2 ratings)
Come learn the basics of stream processing via a guided walkthrough of the most sophisticated and portable stream processing model on the planet—Apache Beam (incubating). Tyler Akidau and Frances Perry cover the basics of robust stream processing with the option to execute exercises on top of the runner of your choice—Flink, Spark, or Google Cloud Dataflow. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Spark & beyond
Location: 210 B/F Level: Intermediate
Vartika Singh (Cloudera), Jayant Shekhar (Sparkflows Inc.), Jeffrey Shmain (Cloudera)
Average rating: ***..
(3.83, 6 ratings)
Vartika Singh, Jayant Shekhar, and Jeffrey Shmain walk you through the machine-learning algorithms available in the Spark framework (and more), showing how to use them to find meaningful patterns in real-world data and derive value from them. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Secondary topics:  R
Stephen Elston (Quantia Analytics, LLC), Ryan Hafen (Hafen Consulting)
Average rating: ****.
(4.12, 8 ratings)
Divide and recombine techniques provide scalable methods for exploration and visualization of otherwise intractable datasets. Stephen Elston and Ryan Hafen lead a series of hands-on exercises to help you develop skills in exploration and visualization of large, complex datasets using R, Hadoop, and Spark. Read more.
9:00am–12:30pm Tuesday, 03/14/2017
Spark & beyond
Location: 210 D/H Level: Intermediate
Secondary topics:  Architecture
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Silicon Valley Data Science)
Average rating: ****.
(4.60, 10 ratings)
What are the essential components of a data platform? John Akred and Stephen O'Sullivan explain how the various parts of the Hadoop, Spark, and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads. Read more.
9:00am–5:00pm Tuesday, 03/14/2017
Spark & beyond
Location: San Jose Ballroom, Marriott
Secondary topics:  Streaming, Text
Andy Konwinski (Databricks)
Average rating: ****.
(4.43, 7 ratings)
Andy Konwinski introduces you to Apache Spark 2.0 core concepts with a focus on Spark's machine-learning library, using text mining on real-world data as the primary end-to-end use case. Read more.

10:30am

10:30am–11:00am Tuesday, 03/14/2017
Location: Executive Concourse
Morning break sponsored by Google (30m)

12:30pm

12:30pm–1:30pm Tuesday, 03/14/2017
Location: 230 A-C
Lunch (1h)

1:30pm

1:30pm–5:00pm Tuesday, 03/14/2017
William Schmarzo (Dell EMC)
Average rating: *****
(5.00, 4 ratings)
Organizations need a model to measure how effectively they are using data and analytics. Once they know where they are and where they need to go, they then need a framework to determine the economic value of their data. William Schmarzo explores techniques for getting business users to “think like a data scientist” so they can assist in identifying data that makes the best performance predictors. Read more.
1:30pm–5:00pm Tuesday, 03/14/2017
Data science & advanced analytics
Location: LL20 D Level: Intermediate
Secondary topics:  Deep learning
Dave Kale (Skymind), Susan Eraly (Skymind), Josh Patterson (Skymind)
Average rating: ***..
(3.33, 3 ratings)
Dave Kale, Susan Eraly, and Josh Patterson explain how to build, train, and deploy neural networks using Deeplearning4j. Topics include the fundamentals of deep learning, ND4J and DL4J, and scalable training using GPUs and Apache Spark. You'll gain hands-on experience with several models, including convolutional and recurrent neural nets. Read more.
1:30pm–5:00pm Tuesday, 03/14/2017
Platform Security and Cybersecurity
Location: LL21 A Level: Intermediate
Mark Donsky (Cloudera), Andre Araujo (Cloudera), Michael Yoder (Cloudera), Manish Ahluwalia (Cloudera)
Average rating: ****.
(4.50, 2 ratings)
Mark Donsky, André Araujo, Michael Yoder, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance. Read more.
1:30pm–5:00pm Tuesday, 03/14/2017
Data science & advanced analytics
Location: LL21 B Level: Intermediate
Secondary topics:  Pydata
Juliet Hougland (Cloudera)
Average rating: ****.
(4.00, 2 ratings)
Using an interactive demo format with accompanying online materials and data, data scientist Juliet Hougland offers a practical overview of the basics of using Python data tools with a Hadoop cluster. Read more.
1:30pm–5:00pm Tuesday, 03/14/2017
Data science & advanced analytics
Location: LL21 C/D Level: Intermediate
Secondary topics:  R
John Mount (Win-Vector LLC)
Average rating: ****.
(4.83, 6 ratings)
Sparklyr provides an R interface to Spark. With sparklyr, you can manipulate Spark datasets and bring them into R for analysis and visualization, and you can orchestrate distributed machine learning in Spark from R with the Spark MLlib and H2O Sparkling Water libraries. John Mount demonstrates how to use sparklyr to analyze big data in Spark. Read more.
1:30pm–5:00pm Tuesday, 03/14/2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Secondary topics:  Architecture
Jonathan Seidman (Cloudera), Ted Malaska (Blizzard Entertainment), Mark Grover (Cloudera), Gwen Shapira (Confluent)
Average rating: ****.
(4.17, 6 ratings)
Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics. Read more.
1:30pm–5:00pm Tuesday, 03/14/2017
Stream processing and analytics
Location: 210 A/E Level: Intermediate
Secondary topics:  Streaming
Ian Wrigley (Confluent)
Average rating: ****.
(4.83, 6 ratings)
Ian Wrigley demonstrates how Kafka Connect and Kafka Streams can be used together to build real-world, real-time streaming data pipelines. Using Kafka Connect, you'll ingest data from a relational database into Kafka topics as the data is being generated and then process and enrich the data in real time using Kafka Streams before writing it out for further analysis. Read more.
1:30pm–5:00pm Tuesday, 03/14/2017
Data-driven business management, Strata Business Summit
Location: 210 B/F Level: Intermediate
Edd Wilder-James (Silicon Valley Data Science), Scott Kurth (Silicon Valley Data Science)
Average rating: ****.
(4.17, 6 ratings)
Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business. Read more.
1:30pm–5:00pm Tuesday, 03/14/2017
Visualization & user experience
Location: 210 C/G Level: Beginner
Brian Suda (optional.is)
Average rating: ***..
(3.00, 5 ratings)
Visualizations are a key part of conveying any dataset. D3 is the most popular, easiest, and most extensible way to get your data online in an interactive way. Brian Suda outlines best practices for good data visualizations and explains how you can build them using D3. Read more.
1:30pm–5:00pm Tuesday, 03/14/2017
Secondary topics:  Architecture, Cloud
James Malone (Google), John Mikula (Google Cloud)
Average rating: **...
(2.00, 6 ratings)
James Malone explores using managed Spark and Hadoop solutions in public clouds alongside cloud products for storage, analysis, and message queues to meet enterprise requirements via the Spark and Hadoop ecosystem. Read more.

3:00pm

3:00pm–3:30pm Tuesday, 03/14/2017
Location: Executive Concourse
Afternoon break (30m)

5:00pm

5:00pm–6:30pm Tuesday, 03/14/2017
Location: Hall 1, 2, 3
Grab a drink, mingle with fellow Strata + Hadoop World attendees, and see the latest technologies and products from leading companies in the data space. Read more.

6:30pm

6:30pm–8:00pm Tuesday, 03/14/2017
Location: Grand Ballroom
Average rating: ***..
(3.00, 1 rating)
What new companies are at the leading edge of the data space? Meet some of the best, most innovative founders as they demonstrate their game-changing ideas at the Startup Showcase. Read more.

Wednesday, 03/15/2017

6:30am

6:30am–7:30am Wednesday, 03/15/2017
Location: Guadalupe River Park
Average rating: ****.
(4.00, 2 ratings)
Please join Cloudera and O'Reilly Media for the Data Dash run/walk, held in conjunction with Strata + Hadoop World in San Jose. Read more.

7:30am

7:30am–8:00am Wednesday, 03/15/2017
Location: On your own
Break (30m)

8:00am

8:00am–8:15am Wednesday, 03/15/2017
Location: Grand Ballroom 220 Foyer
Coffee break (8am - 9am) (15m)

8:15am

8:15am–8:45am Wednesday, 03/15/2017
Location: Grand Ballroom 220 Foyer
Average rating: ****.
(4.33, 3 ratings)
Gather before keynotes on Wednesday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. Read more.

8:45am

8:45am–8:55am Wednesday, 03/15/2017
Location: Grand Ballroom
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Average rating: ****.
(4.00, 13 ratings)
Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes. Read more.

8:55am

8:55am–9:10am Wednesday, 03/15/2017
Location: Grand Ballroom
Mike Olson (Cloudera)
Average rating: ****.
(4.24, 34 ratings)
Data is powering a machine-learning renaissance. Understanding our data helps save lives, secure our personal and business information, and engage our customers with better relevance. However, as Mike Olson explains, without big data and a platform to manage big data, machine learning and artificial intelligence just don’t work. Read more.

9:10am

9:10am–9:30am Wednesday, 03/15/2017
Location: Grand Ballroom
Daphne Koller (Calico Labs | Coursera)
Average rating: ****.
(4.20, 35 ratings)
Daphne Koller explains how Coursera is using large-scale data processing and machine learning in online education. Building on Coursera's wealth of online learning data, Daphne discusses the role of automation in scaling access to education that is personalized and efficient at connecting people with skills and knowledge throughout their lives. Read more.

9:30am

9:30am–9:40am Wednesday, 03/15/2017
Location: Grand Ballroom
Ted Dunning (MapR Technologies)
Average rating: ***..
(3.26, 38 ratings)
The internet of things is turning the internet upside down, and the effects are causing all kinds of problems. We have to answer questions about how to have data where we want it and computation where we need it—and we have to coordinate and control all of this while maintaining visibility and security. Ted Dunning shares solutions for this problem from across multiple industries and businesses. Read more.

9:40am

9:40am–10:00am Wednesday, 03/15/2017
Location: Grand Ballroom
Phil Keslin (Niantic, Inc.), Beau Cronin (Embedding.js)
Average rating: ***..
(3.69, 51 ratings)
Pokémon GO was one of the fastest-growing games of all time, becoming a worldwide phenomenon in a matter of days. In conversation with Beau Cronin, Phil Keslin, CTO of Niantic, explains how the engineering team prepared for—and just barely survived—the experience. Read more.

10:00am

10:00am–10:05am Wednesday, 03/15/2017
Location: Grand Ballroom
Jason Waxman (Intel Corporation)
Average rating: ***..
(3.50, 22 ratings)
Artificial intelligence will accelerate both cancer research and the development of autonomous vehicles. Jason Waxman explains why the ultimate potential of AI will be realized through its societal benefits and positive impact on our world. Collaboration between industry, government, and academia is required to drive this societal innovation and deliver the scale and promise of AI to everyone. Read more.

10:05am

10:05am–10:10am Wednesday, 03/15/2017
Location: Grand Ballroom
Eric Frenkiel (MemSQL)
Average rating: ***..
(3.88, 26 ratings)
Eric Frenkiel explains how to use real-time data as a vehicle for operationalizing machine-learning models with MemSQL. He explores advanced tools, including TensorFlow, Apache Spark, and Apache Kafka, and shares compelling use cases that demonstrate the power of machine learning to effect positive change. Read more.

10:10am

10:10am–10:25am Wednesday, 03/15/2017
Location: Grand Ballroom
Secondary topics:  Geospatial, Sports
Rajiv Maheswaran (Second Spectrum)
Average rating: ****.
(4.91, 35 ratings)
What happens when machines understand sports? As Rajiv Maheswaran demonstrates, everything changes, from how coaches coach and how players play to how storytellers tell stories and how fans experience the game. Read more.

10:30am

10:30am–11:00am Wednesday, 03/15/2017
Location: Exhibit Hall
Morning break sponsored by Intel (30m)

11:00am

11:00am–11:40am Wednesday, 03/15/2017
Secondary topics:  Data Platform, Logistics
Peng Du (Uber Inc.), Randy Wei (Uber Inc.)
Average rating: ***..
(3.11, 9 ratings)
Peng Du and Randy Wei offer an overview of Uber’s data science workbench, a central platform where data scientists perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards. The workbench is integrated with other Uber services, providing convenient features such as task scheduling, model publishing, and job monitoring. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Sponsored
Location: LL20 B
Average rating: ***..
(3.67, 3 ratings)
James Burkhart explains how Uber supports millions of analytical queries daily across real-time data with Apollo. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Stream processing and analytics
Location: LL20 C Level: Beginner
Secondary topics:  Streaming
Jay Kreps (Confluent)
Average rating: ***..
(3.70, 10 ratings)
The move to streaming architectures from batch processing is a revolution in how companies use data. But what is the state of the union for stream processing, and what gaps remain in the technology we have? How will this technology impact the architectures and applications of the future? Jay Kreps explores the future of Apache Kafka and the stream processing ecosystem. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Stream processing and analytics
Location: LL20 D Level: Advanced
Secondary topics:  Streaming
Kenneth Knowles (Google)
Average rating: ****.
(4.80, 5 ratings)
Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Sponsored
Location: LL21 A
Jason Slepicka (DataScience)
Average rating: ****.
(4.33, 3 ratings)
Apache Spark has become the go-to system for servicing ad hoc queries, but the Catalyst optimizer still lacks many of the pushdown optimizations necessary to take advantage of native database features. Jason Slepicka explains how DataScience replaced Catalyst with Apache Calcite to achieve performance improvements of two orders of magnitude when querying SQL and NoSQL databases with Spark. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Platform Security and Cybersecurity
Location: LL21 B Level: Intermediate
Secondary topics:  Cloud
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science)), Andrew Wicker (Microsoft (Azure Security Data Science))
Average rating: ****.
(4.50, 4 ratings)
Ram Shankar Siva Kumar and Andrew Wicker explain how to operationalize security analytics for production in the cloud, covering a framework for assessing the impact of compliance on model design, six strategies and their trade-offs to generate labeled attack data for model evaluation, key metrics for measuring security analytics efficacy, and tips to scale anomaly detection systems in the cloud. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Spark & beyond
Location: LL21 C/D
Reynold Xin (Databricks)
Average rating: ****.
(4.11, 9 ratings)
Reynold Xin looks back at the history of data systems, from filesystems, databases, and big data systems (e.g., MapReduce) to "small data" systems (e.g., R and Python), covering the pros and cons of each, the abstractions they provide, and the engines underneath. Reynold then shares lessons learned from this evolution, explains how Spark is developed, and offers a peek into the future of Spark. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Secondary topics:  Architecture
Todd Lipcon (Cloudera)
Average rating: ****.
(4.75, 4 ratings)
Todd Lipcon offers a very brief refresher on the goals and feature set of the Kudu storage engine, covering the development that has taken place over the last year, including new features such as improved support for time series workloads, performance improvements, Spark integration, and highly available replicated masters. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Big data and the Cloud
Location: 210 A/E Level: Intermediate
Secondary topics:  Architecture, Cloud
Sriram Ganesan (Qubole), Prakhar Jain (Qubole)
Average rating: ***..
(3.00, 2 ratings)
Qubole started out by offering Hadoop as a service in AWS. Over time, it extended its big data capabilities beyond Hadoop and its cloud infrastructure support beyond AWS. Sriram Ganesan and Prakhar Jain explain how and why Qubole built Cloudman, a simple, cloud-agnostic, multipurpose provisioning tool that can be extended for further engines and further cloud support. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Sponsored
Location: 210 B/F
Erin Banks (Dell EMC)
Average rating: **...
(2.50, 2 ratings)
A recent study suggests that 44% of businesses are unsure what to do about big data. Erin Banks explains how big data analytics can help transform your business and ensure your data provides the greatest value to you, covering best business practices to help you achieve insights from your analytics, extract value from your data, and drive business change. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Data science & advanced analytics
Location: 210 C/G Level: Intermediate
Secondary topics:  Deep learning, ecommerce, Retail
Feng Zhu (Microsoft), Valentine Fontama (Microsoft)
Average rating: ****.
(4.71, 7 ratings)
Although deep learning has proved to be very powerful, few results are reported on its application to business-focused problems. Feng Zhu and Val Fontama explore how Microsoft built a deep learning-based churn predictive model and demonstrate how to explain the predictions using LIME—a novel algorithm published in KDD 2016—to make the black box models more transparent and accessible. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Data-driven business management, Strata Business Summit
Location: 210 D/H Level: Intermediate
Jack Norris (MapR Technologies)
Average rating: ****.
(4.33, 3 ratings)
Leading companies are integrating operations and analytics to make real-time adjustments to improve revenues, reduce costs, and mitigate risks. There are many aspects to digital transformation, but the timely delivery of actionable data is both a key enabler and an obstacle. Jack Norris explores how companies from TransUnion to Uber use event-driven processing to transform their businesses. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Data engineering and architecture, Enterprise adoption
Location: 230 A Level: Beginner
Secondary topics:  Architecture, Data Platform, Streaming
Felix Gorodishter (GoDaddy)
Average rating: ****.
(4.25, 4 ratings)
GoDaddy ingests and analyzes logs, metrics, and events at a rate of 100,000 events per second (EPS). Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of data growing by more than 10 TB per day to gain operational insights into its cloud, leveraging Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Sponsored
Location: 230 B
Sasi Kuppannagari (Intel Corporation)
Sasi Kuppannagari explores the innovative sports analytics solutions Intel is creating, such as using computer vision and big data analytics for athlete performance optimization. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Data science & advanced analytics
Location: 230 C Level: Intermediate
Sean Kandel (Trifacta), Karthik Sethuraman (Trifacta)
Average rating: ***..
(3.60, 5 ratings)
It's well known that data analysts spend 80% of their time preparing data and only 20% analyzing it. In order to change that ratio, organizations must build tools specifically designed for working with ad hoc (semistructured) data. Sean Kandel and Karthik Sethuraman explore a new technique leveraging machine learning to discover and profile the inherent structure in ad hoc datasets. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Visualization & user experience
Location: 212 A-B Level: Non-technical
Secondary topics:  Geospatial, R
Rumman Chowdhury (Accenture)
Average rating: ***..
(3.00, 2 ratings)
In collaboration with the Gray Area Foundation for the Arts and Metis Data Science, Rumman Chowdhury created an interactive data art installation with the purpose of educating San Franciscans about their own city. Rumman discusses the challenges of using historical, predigital-era data with D3 and R to craft a compelling and educational story residing at the intersection of art and technology. Read more.
11:00am–11:40am Wednesday, 03/15/2017
Strata Business Summit
Location: 212 C
Satya Ramaswamy (Tata Consultancy Services), Sunil Karkera (Digital Enterprise Unit at Tata Consultancy Services)
Average rating: ***..
(3.00, 3 ratings)
Satya Ramaswamy and Sunil Karkera offer an overview of the recent technical advances that have made the current AI revolution possible, convincingly answering the "why now?" question. Read more.

11:50am

11:50am–12:30pm Wednesday, 03/15/2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Data Platform, Media
Christopher Colburn (Netflix), Monal Daxini (Netflix)
Average rating: ****.
(4.00, 3 ratings)
In the past, typical real-time data processing was reserved for answering operational questions and very basic analytical questions, but with better processing frameworks and more-capable hardware, the streaming context can now enable personalization applications. Christopher Colburn and Monal Daxini explore the challenges faced when building a streaming application at scale at Netflix. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Sponsored
Location: LL20 B
Darren Chinen (Malwarebytes), Sujay Kulkarni (Malwarebytes), Manjunath Vasishta (Malwarebytes)
Darren Chinen, Sujay Kulkarni, and Manjunath Vasishta demonstrate how to use a Lambda architecture to provide real-time views into big data by combining batch and stream processing, leveraging BMC’s Control-M as a critical component of both batch processing and ecosystem management. Read more.
11:50am–12:30pm Wednesday, 03/15/2017 Secondary topics:  Cloud
Roger Barga (Amazon Web Services)
Average rating: ****.
(4.00, 2 ratings)
Roger Barga offers an overview of Kinesis, Amazon’s data streaming platform, which includes Kinesis Firehose, Kinesis Analytics, and Kinesis Streams, and explains how customers have architected their applications using Kinesis services for low latency and extreme scale. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Sensors, IOT & Industrial Internet
Location: LL20 D Level: Advanced
Secondary topics:  Architecture, IoT, Streaming
Tim Gasper (Bitfusion)
Average rating: *****
(5.00, 1 rating)
Food production and preparation have always been labor and capital intensive, but with the internet of things, low-cost sensors, cloud-computing ubiquity, and big data analysis, farmers and chefs are being replaced with connected, big data robots—not just in the field but also in your kitchen. Tim Gasper explores the tech stack, data science techniques, and use cases driving this revolution. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Sponsored
Location: LL21 A
Ben Sharma (Zaloni)
Average rating: ****.
(4.50, 8 ratings)
When building your data stack, architecture could be your biggest challenge—yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin when assembling a scalable data architecture? Ben Sharma shares real-world lessons and best practices to get you started. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Platform Security and Cybersecurity
Location: LL21 B Level: Intermediate
Cesar Berho (Intel), Alan Ross (Intel)
Average rating: **...
(2.00, 3 ratings)
Cesar Berho and Alan Ross offer an overview of open source project Apache Spot (incubating), which delivers next-generation cybersecurity analytics architecture through unsupervised learning using machine-learning techniques at cloud scale for anomaly detection. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Spark & beyond
Location: LL21 C/D
Secondary topics:  Streaming
Michael Armbrust (Databricks), Tathagata Das (Databricks)
Average rating: ****.
(4.29, 7 ratings)
Apache Spark 2.0 introduced the core APIs for Structured Streaming, a new stream processing engine built on Spark SQL. Since then, the Spark team has focused its efforts on making the engine ready for production use. Michael Armbrust and Tathagata Das outline the major features of Structured Streaming, recipes for using them in production, and plans for new features in future releases. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Hadoop platform and applications
Location: LL21 E/F Level: Beginner
Sean Suchter (Pepperdata), Shekhar Gupta (Pepperdata)
Sean Suchter and Shekhar Gupta describe the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events. Read more.
11:50am–12:30pm Wednesday, 03/15/2017 Secondary topics:  Architecture, Cloud
Andrei Savu (Cloudera), Jennifer Wu (Cloudera)
Average rating: ***..
(3.00, 3 ratings)
Cloud infrastructure, with a scalable data store and elastic compute, is particularly well suited for large-scale data engineering workloads. Andrei Savu and Jennifer Wu explore the latest cloud technologies and outline cost, security, and ease-of-use considerations for data engineers. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Sponsored
Location: 210 B/F
Nitin Bandugula (MapR Technologies)
Average rating: **...
(2.67, 3 ratings)
Machine-learning algorithms can improve predictions and optimize business operations across industry verticals, but building and scoring models still presents a significant computational challenge requiring massive training data and complex pipelines. Nitin Bandugula outlines the benefits of implementing a microservices-based architecture to support a machine-learning model-scoring workflow. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Data science & advanced analytics
Location: 210 C/G Level: Intermediate
Secondary topics:  Deep learning, Hardcore Data Science, Mobile
Anirudh Koul (Microsoft)
Average rating: ****.
(4.20, 5 ratings)
Over the last few years, convolutional neural networks (CNN) have risen in popularity, especially in computer vision. Anirudh Koul explains how to bring the power of deep learning to memory- and power-constrained devices like smartphones and drones. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Data-driven business management, Strata Business Summit
Location: 210 D/H Level: Non-technical
Yael Garten (LinkedIn)
Average rating: ****.
(4.91, 11 ratings)
Data science is a rewarding career. It's also really hard—not just the technical work itself but also "how to do the work well" in an organization. Yael Garten explores what data scientists do, how they fit into the broader company organization, and how they can excel at their trade, and shares the hard and soft skills required, tips and tricks for success, and challenges to watch out for. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Spark & beyond
Location: 230 A Level: Intermediate
Secondary topics:  Data Platform, Financial services, Geospatial
Jasjeet Thind (Zillow)
Average rating: ****.
(4.50, 2 ratings)
Zillow pioneered providing access to unprecedented information about the housing market. Long gone are the days when you needed an agent to get comparables and prior sale and listing data. And with more data, data science has enabled more use cases. Jasjeet Thind explains how Zillow uses Spark and machine learning to transform real estate. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Sponsored
Location: 230 B
Jonathan Gray (Cask)
Average rating: *****
(5.00, 1 rating)
Hadoop and Spark provide scale and flexibility at a low cost compared to data warehouses, but the messy and diverse nature of big data results in undesirable complexities and inefficiencies. Jonathan Gray explores the standardization, automation, and deep integration technologies that allow users to focus on application logic and insights rather than infrastructure and integration. Read more.
11:50am–12:30pm Wednesday, 03/15/2017 Secondary topics:  Data Platform, ecommerce, Hardcore Data Science, Media
Jure Leskovec (Pinterest)
Average rating: ****.
(4.82, 11 ratings)
Pinterest built a flexible, graph-based system for making recommendations to users in real time. The system uses random walks on a user-and-object graph in order to make personalized recommendations to 100+ million Pinterest users out of a catalog of over a billion items. Jure Leskovec explains how Pinterest built its modern recommendation engine and the lessons learned along the way. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Visualization & user experience
Location: 212 A-B Level: Beginner
Joe Hellerstein (UC Berkeley), Giorgio Caviglia (Trifacta), Alon Bartur (Trifacta)
Joe Hellerstein, Giorgio Caviglia, and Alon Bartur share their design philosophy for users and their experience designing UIs, illustrating their design principles with core elements from Trifacta, including the founding technology of predictive interaction, recent innovations like transform builder, and other developments in their core transformation experience. Read more.
11:50am–12:30pm Wednesday, 03/15/2017
Strata Business Summit
Location: 212 C
John Matchette (Accenture Analytics), Leonard Hinds (Accenture)
Average rating: ****.
(4.00, 4 ratings)
Join John Matchette and Leonard Hinds as they offer insights into how leading enterprises are unlocking new economic possibilities by embedding intelligence into the core of their business and explore five key actions businesses are taking today to realize the promise of big data and analytics. Read more.

12:30pm

12:30pm–1:50pm Wednesday, 03/15/2017
Location: Hall 1, 2, 3
Average rating: ****.
(4.00, 2 ratings)
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.

1:50pm

1:50pm–2:30pm Wednesday, 03/15/2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Data Platform, Financial services
Average rating: *****
(5.00, 2 ratings)
Data warehouses are critical in driving business decisions—with SQL dominantly used to build ETL pipelines. While the technology has shifted from using RDBMS-centric data warehouses to data pipelines based on Hadoop and MPP databases, engineering and quality processes have not kept pace. Avinash Padmanabhan highlights the changes that Intuit's team made to improve processes and data quality. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Sponsored
Location: LL20 B
Murthy Mathiprakasam (Informatica)
Average rating: ***..
(3.00, 1 rating)
Stuck with manual, siloed, inflexible, laborious practices for big data projects? Successful teams use machine-learning-based approaches to power self-service preparation, enterprise-wide data catalogs, and real-time stream processing with role-specific tools. Murthy Mathiprakasam explains how using Informatica atop Hadoop, Spark, and Spark Streaming maximizes teamwork, trust, and timeliness. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Data engineering and architecture
Location: LL20 C Level: Intermediate
Secondary topics:  Streaming
Ryan Pridgeon (Confluent), Dustin Cote (Confluent)
Average rating: ****.
(4.67, 3 ratings)
Dustin Cote and Ryan Pridgeon share their experience troubleshooting Apache Kafka in production environments and discuss how to avoid pitfalls like message loss or performance degradation in your environment. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Sensors, IOT & Industrial Internet
Location: LL20 D Level: Non-technical
Secondary topics:  Healthcare, IoT
Julie Lockner (17 Minds Corporation)
Average rating: *****
(5.00, 1 rating)
How can we empower individuals with special needs to reach their potential? Julie Lockner offers an overview of a project to develop collaboration applications that use wearable device data to improve the ability to develop the best possible care and education plans. Join in to learn how real-time IoT data analytics are making this possible. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Sponsored
Location: LL21 A
Ken Tsai (SAP), Michael Eacrett (SAP)
Ken Tsai and Michael Eacrett explore critical components of enterprise production environments that support day-to-day business processes while ensuring security, governance, and operational administration and share best practices to ensure business value. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Ting-Fang Yen (DataVisor)
Average rating: ****.
(4.33, 3 ratings)
When it comes to visibility into account takeover, spam, and fake accounts, the cloud is making things hazy. Cloud-hosted attacks skirt IP blacklists and make fraudulent users seem like they are located somewhere they are not. Drawing on data from 500 billion events and 400 million user accounts, Ting-Fang Yen examines cloud-based attack trends across verticals and regions. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Spark & beyond
Location: LL21 C/D Level: Intermediate
Secondary topics:  Streaming
Holden Karau (IBM), Seth Hendrickson (Cloudera)
Average rating: ****.
(4.00, 8 ratings)
Structured Streaming is new in Apache Spark 2.0, and work is being done to integrate the machine-learning interfaces with this new streaming system. Holden Karau and Seth Hendrickson demonstrate how to do streaming machine learning using Structured Streaming and walk you through creating your own streaming model. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Secondary topics:  Architecture
Daniel Templeton (Cloudera)
Average rating: ****.
(4.00, 4 ratings)
Docker makes it easy to bundle an application with its dependencies and provide full isolation, and YARN now supports Docker as an execution engine for submitted applications. Daniel Templeton explains how YARN's Docker support works, why you'd want to use it, and when you shouldn't. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Big data and the Cloud
Location: 210 A/E Level: Intermediate
Secondary topics:  Architecture, Cloud
Henry Robinson (Cloudera), Alex Gutow (Cloudera)
Henry Robinson and Alex Gutow explain how to best take advantage of the flexibility and cost-effectiveness of the cloud with your BI and SQL analytic workloads using Apache Hadoop and Apache Impala (incubating) to provide the same great functionality, partner ecosystem, and flexibility of on-premises deployments. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Sponsored
Location: 210 B/F
Mark Burnette (Pentaho, a Hitachi Group Company)
Average rating: **...
(2.67, 3 ratings)
Mark Burnette outlines five keys to success with data lakes and explores several real-world data lake implementations that are changing the world. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Data science & advanced analytics
Location: 210 C/G Level: Advanced
Secondary topics:  Deep learning, Healthcare
Michael Dusenberry (IBM Spark Technology Center), Frederick Reiss (IBM Spark Technology Center)
Average rating: *****
(5.00, 2 ratings)
Estimating the growth rate of tumors is a very important but very expensive and time-consuming part of diagnosing and treating breast cancer. Michael Dusenberry and Frederick Reiss describe how to use deep learning with Apache Spark and Apache SystemML to automate this critical image classification task. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Secondary topics:  Data Platform, ecommerce, Streaming
Chandan Joarder (Macys.com)
Average rating: ***..
(3.56, 9 ratings)
Chandan Joarder shares a guide to building real-time dashboards in-house using tools such as Kafka, web frameworks, and an in-memory database, utilizing JavaScript and Scala. Along the way, Chandan also discusses the architectural principles used in these dashboards to provide up-to-the-hour business performance metrics and alerts. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Enterprise adoption
Location: 230 A Level: Intermediate
Secondary topics:  Architecture, Data Platform
Eric Richardson (American Chemical Society)
Average rating: **...
(2.50, 2 ratings)
Eric Richardson explains how ACS used Hadoop, HBase, Spark, Kafka, and Solr to create a hybrid cloud enterprise data hub that scales without drama and drives adoption by ease of use, covering the architecture, technologies used, the challenges faced and defeated, and problems yet to solve. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Sponsored
Location: 230 B
Ethan Zhang (VoltDB)
Continuous queries on streaming data play a vital role in fast data applications, providing always up-to-date results based on the most recent data. Ethan Zhang offers an overview of VoltDB, a NewSQL distributed database that supports continuous queries three orders of magnitude faster with materialized views, highlighting a transparent, automatic, and incremental-view maintenance approach. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Data science & advanced analytics
Location: 230 C Level: Intermediate
Secondary topics:  Deep learning, Healthcare, Text
David Talby (Atigeo), Claudiu Branzan (G2 Web Services)
Average rating: ****.
(4.14, 7 ratings)
David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, and Elasticsearch; data science components include spaCy, custom annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Visualization & user experience
Location: 212 A-B Level: Beginner
Average rating: **...
(2.67, 3 ratings)
From personalized newsfeeds to curated playlists, users want tailored experiences when they interact with their devices. Ricky Hennessy and Charlie Burgoyne explain how frog’s interdisciplinary teams of designers, technologists, and data scientists create data-driven, personalized, and adaptive user experiences. Read more.
1:50pm–2:30pm Wednesday, 03/15/2017
Strata Business Summit
Location: 212 C
Average rating: ***..
(3.80, 5 ratings)
Jerry Overton provides an executive's guide to understanding advanced analytics in the cloud—offering a comprehensive survey of cloud technologies, patterns of cloud-based architectures, and patterns of enterprise cloud adoption, describing paths to achieving a cognitive enterprise, and outlining the realistic next steps for executives. Read more.

2:40pm

2:40pm–3:20pm Wednesday, 03/15/2017
Data engineering and architecture, Real-time applications
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Media, Streaming
Kartik Paramasivam (LinkedIn)
Average rating: *****
(5.00, 2 ratings)
LinkedIn has one of the largest Kafka installations in the world, ingesting more than a trillion messages per day. Apache Samza-based stream processing applications process this deluge of data. Kartik Paramasivam discusses key improvements and architectural patterns that LinkedIn has adopted in its data systems in order to process millions of requests per second while keeping costs under control. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Sponsored
Location: LL20 B
The massive shift of data to the cloud is exacerbating data preparation and transport complexities that slow data analytics to a crawl. Bill Dentinger explains how the deployment of FPGA/x86-based heterogeneous compute architectures by cloud vendors is giving all organizations the opportunity to speed their data analytics to unprecedented levels. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017 Secondary topics:  Media, Streaming
Michael Edwards shares experiences from operating several Kafka clusters in a real-time streaming event ingestion pathway. He'll discuss the lessons learned from working with hundreds of terabytes flowing through every day, petabytes of retention, and gigabytes of historical data streaming to and from storage. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Real-time applications
Location: LL20 D Level: Intermediate
Secondary topics:  Healthcare
Joseph Blue (MapR), Carol McDonald (MapR Technologies)
Average rating: ****.
(4.50, 2 ratings)
Joseph Blue and Carol McDonald walk you through a reference application that processes ECG data encoded using HL7 with a modern anomaly detector, demonstrating how combining visualization and alerting enables healthcare professionals to improve outcomes and reduce costs and sharing lessons learned from their experience dealing with real data in real medical situations. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Sponsored
Location: LL21 A
Reflecting the old horror gimmick "the call that comes from inside the house," an increasing number of data breaches are carried out by insiders. Charlotte Crain and Tyler Freckman share a unique, hybrid approach to insider threat deterrence that combines traditional detection methods and investigative methodologies with behavioral analysis to enable complete, continuous monitoring of activity. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Platform Security and Cybersecurity
Location: LL21 B Level: Intermediate
Secondary topics:  Architecture, Data Platform, Financial services, Streaming
Ajit Gaddam (VISA), Jiphun Satapathy (VISA)
Average rating: ***..
(3.83, 6 ratings)
Apache Kafka is used by over 35% of Fortune 500 companies to store and process some of their most sensitive datasets. Ajit Gaddam and Jiphun Satapathy provide a security reference architecture to secure your Kafka cluster while leveraging it to support your organization's cybersecurity requirements. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Spark & beyond
Location: LL21 C/D Level: Beginner
Secondary topics:  R
Edgar Ruiz (RStudio)
Average rating: ****.
(4.80, 5 ratings)
Sparklyr makes it easy and practical to analyze big data with R—you can filter and aggregate Spark DataFrames to bring data into R for analysis and visualization and use R to orchestrate distributed machine learning in Spark using Spark ML and H2O Sparkling Water. Edgar Ruiz walks you through these features and demonstrates how to use sparklyr to create R functions that access the full Spark API. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Big data and the Cloud, Data engineering and architecture
Location: LL21 E/F Level: Beginner
Secondary topics:  Architecture, Cloud
Paige Liu (Microsoft), John Zhuge (Cloudera)
Paige Liu and John Zhuge explore the options and trade-offs to consider when building a Cloudera cluster on Microsoft Azure Cloud and explain how to deploy and scale a Cloudera cluster on Azure and how to connect a Cloudera cluster with other Azure services to build enterprise-grade end-to-end big data solutions. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Big data and the Cloud, Data engineering and architecture
Location: 210 A/E Level: Intermediate
Secondary topics:  Cloud
Shubham Tagra (Qubole)
Shubham Tagra offers an introduction to RubiX, a lightweight, cross-engine caching solution that works well with optimized columnar formats by caching only the required amount of data. RubiX can be used with any data analytics engine that reads data from remote sources via the Hadoop FileSystem interface without any changes to the source code of those engines. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Sponsored
Location: 210 B/F
Greg Michaelson (DataRobot)
Average rating: ****.
(4.50, 2 ratings)
Companies store tons of data in Hadoop in hopes of turning the data into actionable insights, but maximizing the value of this resource with artificial intelligence and machine learning eludes most organizations. Greg Michaelson defines analytic trends around Hadoop, separates fact from hype, and sets out a roadmap for fully optimizing the value of the data stored in Hadoop. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017 Secondary topics:  AI, Deep learning
James Bradbury (Salesforce Research)
Average rating: ****.
(4.00, 8 ratings)
James Bradbury offers an overview of PyTorch, a brand-new deep learning framework from developers at Facebook AI Research that's intended to be faster, easier, and more flexible than alternatives like TensorFlow. James makes the case for PyTorch, focusing on the library's advantages for natural language processing and reinforcement learning. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Vishal Bamba (Transamerica), Rocky Tiwari (Transamerica)
Vishal Bamba and Rocky Tiwari offer an overview of Transamerica's Customer 360 platform and the work done afterward to utilize this technology, including graph databases and machine learning to help create targeted segments for products and campaigns. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Big data and the Cloud, Enterprise adoption
Location: 230 A Level: Intermediate
Secondary topics:  Architecture, Data Platform
Gwen Shapira (Confluent), Bob Lehmann (Monsanto)
Average rating: ****.
(4.50, 2 ratings)
Gwen Shapira and Bob Lehmann share their experience and patterns building a cross-data-center streaming data platform for Monsanto. Learn how to facilitate your move to the cloud while "keeping the lights on" for legacy applications. In addition to integrating private and cloud data centers, you'll discover how to establish a solid foundation for a transition from batch to stream processing. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Sponsored
Location: 230 B
Average rating: ****.
(4.50, 2 ratings)
Thousands of companies have made their initial investments into next-generation data lake architecture, and they are on the verge of generating quality business returns. Chandhu Yalla and Neshad Bardoliwalla explain how enterprises have unlocked tangible value from their data lakes with adaptive information management and how their organizations are providing self-service to business units. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Data science & advanced analytics
Location: 230 C Level: Intermediate
Secondary topics:  Hardcore Data Science, Healthcare
Robert Grossman (University of Chicago)
Average rating: ***..
(3.73, 11 ratings)
When there is a strong signal in a large dataset, many machine-learning algorithms will find it. On the other hand, when the effect is weak and the data is large, there are many ways to discover an effect that is in fact nothing more than noise. Robert Grossman shares best practices so that you will not be accused of p-hacking. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Emerging Technologies
Location: 212 A-B Level: Non-technical
Michael Dauber (Amplify Partners), Jacob Flomenberg (Accel), Sarah Catanzaro (Canvas Ventures), Rama Sekhar (Norwest Venture Partners), Cack Wilhelm (Scale Venture Partners)
Average rating: *****
(5.00, 3 ratings)
In a panel discussion, top-tier VCs look over the horizon and consider the big trends in big data, explaining what they think the field will look like a few years (or more) down the road. Read more.
2:40pm–3:20pm Wednesday, 03/15/2017
Law, ethics, governance, Strata Business Summit
Location: 212 C Level: Beginner
Alysa Z. Hutnik (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)
Average rating: ****.
(4.00, 3 ratings)
Big data promises enormous benefits for companies, and new innovations in this space only mean more data collection is required. Having a solid understanding of legal obligations will help you avoid the legal snafus that can come with collecting big data. Alysa Hutnik and Crystal Skelton outline legal best practices and practical tips to avoid becoming a big data “don’t.” Read more.

3:20pm

3:20pm–4:20pm Wednesday, 03/15/2017
Location: Exhibit Hall
Afternoon break sponsored by MemSQL (1h)

4:20pm

4:20pm–5:00pm Wednesday, 03/15/2017
Real-time applications, Stream processing and analytics
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Media, Platform
Sridhar Alla (Comcast), Shekhar Agrawal (Comcast)
Average rating: *****
(5.00, 2 ratings)
Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand. Read more.
4:20pm–5:00pm Wednesday, 03/15/2017
Sponsored
Location: LL20 B
Justin Murray (VMware)
Justin Murray outlines the benefits of virtualizing Hadoop and Spark, covering the main architectural approaches at a technical level and demonstrating how the core Hadoop architecture maps into virtual machines and how those relate to physical servers. You'll gain a set of design approaches and best practices to make your application infrastructure fit well with the virtualization layer. Read more.
4:20pm–5:00pm Wednesday, 03/15/2017
Data engineering and architecture
Location: LL20 C Level: Intermediate
Secondary topics:  Data Platform, Financial services, Streaming
Kevin Mao (Capital One)
Average rating: ****.
(4.67, 3 ratings)
Kevin Mao explores the value of and challenges associated with collecting raw security event data from disparate corners of enterprise infrastructure and transforming them into high-quality intelligence that can be used to forecast, detect, and mitigate cybersecurity threats. Read more.
4:20pm–5:00pm Wednesday, 03/15/2017 Secondary topics:  Architecture, IoT, Manufacturing, Platform, Streaming
Kishore R (GE)
Average rating: ***..
(3.00, 1 rating)
Kishore Reddipalli explores how to stream data at large scale from the edge to the cloud to the client, detect anomalies, analyze machine data both in stream and at rest in an industrial setting, and optimize industrial operations by providing real-time insights and recommendations using big data technologies. Read more.
4:20pm–5:00pm Wednesday, 03/15/2017
Sponsored
Location: LL21 A
Scott Gnau (Hortonworks)
Average rating: ****.
(4.00, 5 ratings)
Big data is moving from science projects to mainstream, mission-critical deployments. Drawing on his interactions and conversations with business and IT leaders across the world, Scott Gnau outlines adoption trends and popular use cases. Read more.
4:20pm–5:00pm Wednesday, 03/15/2017
Platform Security and Cybersecurity
Location: LL21 B Level: Intermediate
Yuliya Feldman (Dremio Corporation), Bill ODonnell (MapR)
Average rating: **...
(2.50, 2 ratings)
Security will always be very important in the world of big data, but the choices today mostly start with Kerberos. Does that mean setting up security is always going to be painful? What if your company standardizes on other security alternatives? What if you want to have the freedom to decide what security type to support? Yuliya Feldman and Bill ODonnell discuss your options. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Spark & beyond
Location: LL21 C/D Level: Beginner
Secondary topics:  Architecture, Data Platform, Media
Average rating: ***..
(3.00, 3 ratings)
Spark powers various services in Bing, but the Bing team had to customize and extend Spark to cover its use cases and scale the implementation of Spark-based data pipelines to handle internet-scale data volume. Kaarthik Sivashanmugam explores these use cases, covering the architecture of Spark-based data platforms, challenges faced, and the customization done to Spark to address the challenges. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Secondary topics:  Architecture, Cloud
Dwai Lahiri (Cloudera)
Average rating: ****.
(4.50, 2 ratings)
Dwai Lahiri explains how to leverage private cloud infrastructure to successfully build Hadoop clusters and outlines dos, don'ts, and gotchas for running Hadoop on private clouds. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017 Secondary topics:  Cloud
Mark Donsky (Cloudera), Sudhanshu Arora (Cloudera)
Average rating: ****.
(4.00, 1 rating)
Big data needs governance. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start—especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Sudhanshu Arora share a step-by-step approach to kick-start your big data governance initiatives. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Sponsored
Location: 210 B/F
Serdar Sahin (Peak Games)
Peak Games, a leading online and mobile company, unites 30 million monthly unique players with free, culturally relevant, community-driven games. Serdar Sahin shares the company's journey evaluating MPP columnar databases against Hadoop to find the right data infrastructure to enable the company to handle the unpredictable popularity of newly launched games. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Data science & advanced analytics, Real-time applications
Location: 210 C/G Level: Intermediate
Secondary topics:  Deep learning, Streaming
Shivnath Babu (Duke University | Unravel Data Systems)
Average rating: ***..
(3.33, 3 ratings)
Shivnath Babu offers an introduction to using deep learning to solve complex problems in IT operations analytics. Shivnath focuses on how deep learning can derive operations insights automatically for the complex big data application stack composed of systems such as Hadoop, Spark, Cassandra, Elasticsearch, and Impala, using examples of open source tools for deep learning. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Secondary topics:  Data Platform, Financial services, Media, Text
Alan Chaney (Bitvore Corp)
Average rating: ***..
(3.50, 2 ratings)
Bitvore Corp’s Bitvore for Munis personalized news surveillance system is rapidly becoming a must-have for all major fixed-income securities analysts, investors, and brokers working in the three-trillion-dollar municipal bond market in the USA. Alan Chaney explains how Bitvore delivers the few important and relevant articles out of thousands each day, saving users many hours daily. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Enterprise adoption
Location: 230 A Level: Beginner
Ganesh Prabhu (FireEye), Vivek Agate (FireEye), Alex Rivlin (FireEye)
Ganesh Prabhu, Alex Rivlin, and Vivek Agate share an approach that enabled a small team at FireEye to migrate 20 TB of RDBMS data comprising 250+ tables and nearly 2,000 partitions to Hadoop and an adaptive platform that allows migration of a rapidly changing dataset to Hive. Along the way, they explore some of the challenges typical for a company implementing Hadoop. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Sponsored
Location: 230 B
Eric Anderson (Beachbody), Shyam Konda (Beachbody)
Average rating: ***..
(3.50, 2 ratings)
Eric Anderson and Shyam Konda explain how the IT team at Beachbody—the makers of P90X and CIZE—successfully ingested all their enterprise data into Amazon S3 and delivered self-service access in less than six months with Talend. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Data science & advanced analytics
Location: 230 C Level: Advanced
Secondary topics:  Hardcore Data Science
Ted Dunning (MapR Technologies)
Average rating: ****.
(4.50, 6 ratings)
Ted Dunning offers an overview of tensor computing—covering, in practical terms, the high-level principles behind tensor computing systems—and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case). Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Visualization & user experience
Location: 212 A-B Level: Intermediate
Sean Kandel (Trifacta), Wei Zheng (Trifacta)
Average rating: ****.
(4.33, 3 ratings)
Sean Kandel and Wei Zheng offer an overview of an entirely new approach to visualizing metadata and data lineage, demonstrating automated methods for detecting, visualizing, and interacting with potential anomalies in reporting pipelines. Join in to learn what’s required to efficiently apply these techniques to large-scale data. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/15/2017
Strata Business Summit
Location: 212 C
Teresa Tung (Accenture Labs)
Average rating: ****.
(4.25, 4 ratings)
The IoT is driven by outcomes delivered by applications, but to gain operational efficiency, many organizations are looking toward a horizontal platform for delivering and supporting a number of applications. Teresa Tung explores how to choose and implement a platform—and deal with the fact that the platform is horizontal and application outcomes are vertical. Read more.

5:10pm

Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Data engineering and architecture
Location: LL20 A Level: Advanced
Secondary topics:  Architecture, Media, Platform, Streaming
Monal Daxini (Netflix)
Average rating: ****.
(4.50, 2 ratings)
Netflix Keystone processes over a trillion events per day with at-least-once processing semantics in the cloud. Monal Daxini explores what it means to offer stream processing as a service (SPaaS), how Netflix implemented a scalable, fault-tolerant multitenant SPaaS internal offering, and how it evolved the system in flight with no downtime. Read more.
5:10pm–5:50pm Wednesday, 03/15/2017
Location: LL20 B
TBC
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017 Secondary topics:  Media, Streaming
Sijie Guo (Streamlio)
Average rating: **...
(2.00, 2 ratings)
Apache DistributedLog (incubating) is a low-latency, high-throughput replicated log service. Sijie Guo shares how Twitter has used DistributedLog as the real-time data foundation in production for years, supporting services like distributed databases, pub-sub messaging, and real-time stream computing and delivering more than 1.5 trillion events (17 PB) per day. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Real-time applications
Location: LL20 D Level: Intermediate
Secondary topics:  IoT, Streaming
Michael Freedman (Timescale | Princeton University)
Average rating: *****
(5.00, 3 ratings)
IoT applications often need more-complex queries than those supported by traditional time series databases. Michael Freedman outlines a new distributed time series database for such workloads, supporting efficient queries, including complex predicates across many metrics, while scaling out to support IoT ingest rates. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Parvez Ahammad (Instart Logic)
Average rating: ****.
(4.80, 5 ratings)
Recently, research on applying and designing ML algorithms and systems for security has grown quickly as information and communications have become more ubiquitous and more data has become available. Parvez Ahammad explores generalized system designs, underlying assumptions, and use cases for applying ML in security. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017 Secondary topics:  Cloud
Anand Iyer (Cloudera), Eugene Fratkin (Cloudera)
Average rating: *****
(5.00, 1 rating)
Both Spark workloads and use of the public cloud have been rapidly gaining adoption in mainstream enterprises. Anand Iyer and Eugene Fratkin discuss new developments in Spark and provide an in-depth discussion on the intersection between the latest Spark and cloud technologies. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Big data and the Cloud
Location: LL21 E/F Level: Intermediate
Secondary topics:  Architecture, Cloud, Geospatial
Naghman Waheed (Monsanto), Martin Mendez-Costabel (Monsanto)
Average rating: ****.
(4.00, 1 rating)
Recently, the volume of data collected from farmers' fields via sensors, rovers, drones, in-cabin technologies, and other sources has forced Monsanto to rethink its geospatial processing capabilities. Naghman Waheed and Martin Mendez-Costabel explain how Monsanto built a scalable geospatial platform using cloud and open source technologies. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Big data and the Cloud
Location: 210 A/E
Secondary topics:  Architecture, Cloud
Dale Kim (Arcadia Data)
Big data applications in the cloud are becoming more about the global distribution and access of data than about easier deployments. Dale Kim shares insights on architecting big data applications for the cloud, using an example reference application his team built and published as context for describing several key requirements for cloud-based environments. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Sponsored
Location: 210 B/F
Siva Raghupathy (Amazon Web Services), Ben Snively (Amazon Web Services (AWS))
Average rating: ****.
(4.00, 3 ratings)
Siva Raghupathy and Ben Snively explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, they explain when and how you can use serverless technologies to streamline data processing and share a reference architecture using a combination of cloud and open source technologies. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Data science & advanced analytics
Location: 210 C/G Level: Intermediate
Secondary topics:  Deep learning, Hardcore Data Science
Stephen Merity (Salesforce Research)
Average rating: ****.
(4.67, 3 ratings)
While attention and memory have become important components in many state-of-the-art deep learning architectures, it's not always obvious where they may be most useful. Even more challenging, such models can be very computationally intensive for production. Stephen Merity discusses the most recent techniques, what tasks they show the most promise in, and when they make sense in production systems. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Beginner
Secondary topics:  Media
Viral Bajaria (6Sense)
Average rating: ****.
(4.00, 1 rating)
What if companies could predict what products people will buy, how much they will buy, and when? It would be a game changer—and it’s already possible with the power of predictive intelligence. Viral Bajaria explores how BlueJeans Network was able to leverage predictive analytics to uncover buyers earlier, convert them at a 20x higher rate, and build a $33M pipeline. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Enterprise adoption
Location: 230 A Level: Intermediate
Marcel Kornacker (Cloudera), Mostafa Mokhtar (Cloudera)
Average rating: ****.
(4.50, 2 ratings)
Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Sponsored
Location: 230 B
Xiatian Zhang (TalkingData Ltd.)
Large-scale machine learning is a big challenge in industry due to the huge computing resources required and the difficulty of parameter tuning. Xiatian Zhang offers an overview of Fregata, TalkingData's open source machine-learning library based on Spark, which provides a lightweight, fast, memory-efficient, and parameter-free solution for large-scale machine learning. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017 Secondary topics:  Hardcore Data Science
Alice Zheng (Amazon)
Average rating: ****.
(4.50, 6 ratings)
In the machine-learning pipeline, feature engineering takes up the majority of the time yet is seldom discussed. Alice Zheng leads a tour of popular feature engineering methods for text, logs, and images, giving you an intuitive and actionable understanding of tricks of the trade. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Data science & advanced analytics
Location: 212 A-B Level: Intermediate
Secondary topics:  Financial services
Matar Haller (Winton Capital)
Average rating: *****
(5.00, 2 ratings)
With the exploding growth of video and audio content online, there's an increasing need for indexable and searchable audio. Matar Haller demonstrates how to automatically identify who is speaking when in a recorded conversation using machine learning applied to a corpus of audio recordings. Matar shares how she approached the problem, the algorithms used, and steps taken to validate the results. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/15/2017
Strata Business Summit
Location: 212 C
Ashish Verma (Deloitte)
Average rating: ****.
(4.50, 8 ratings)
Ashish Verma explores the challenges organizations face after investing in hardware and software to power their analytics projects and the missteps that lead to inadequate data practices. Ashish explains how to course-correct and implement an insight-driven organization (IDO) framework that enables you to derive tangible value from your data faster. Read more.

5:50pm

Add to your personal schedule
5:50pm–6:50pm Wednesday, 03/15/2017
Location: Hall 1, 2, 3
Quench your thirst with vendor-hosted libations (plus snacks) while you check out all the exhibitors in the Expo Hall. Read more.

7:00pm

Add to your personal schedule
7:00pm–9:00pm Wednesday, 03/15/2017
Location: The Tech Museum of Innovation, 201 S Market St, San Jose, CA 95113
Average rating: ****.
(4.50, 4 ratings)
Don't miss the social highlight of Strata + Hadoop World hosted at The Tech Museum of Innovation. Read more.

Thursday, 03/16/2017

8:00am

8:00am–8:15am Thursday, 03/16/2017
Location: Grand Ballroom 220 Foyer
Coffee break (8am - 9am) (15m)

8:15am

Add to your personal schedule
8:15am–8:45am Thursday, 03/16/2017
Location: Grand Ballroom 220 Foyer
Gather before keynotes on Thursday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. Read more.

8:45am

Add to your personal schedule
8:45am–8:50am Thursday, 03/16/2017
Location: Grand Ballroom
Roger Magoulas (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Average rating: ***..
(3.00, 3 ratings)
Program chairs Roger Magoulas, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes. Read more.

8:50am

Add to your personal schedule
8:50am–9:00am Thursday, 03/16/2017
Location: Grand Ballroom
Tom Reilly (Cloudera), Khalid Al-Kofahi (Thomson Reuters)
Average rating: ****.
(4.07, 29 ratings)
Data helps us understand our market in new and novel ways. In today's world, sifting through the noise in modern journalism means navigating enormous amounts of data, news, and tweets. Tom Reilly and Khalid Al-Kofahi explain how Thomson Reuters is leveraging big data and machine learning to chase down leads, verify sources, and determine what's newsworthy. Read more.

9:00am

Add to your personal schedule
9:00am–9:15am Thursday, 03/16/2017
Location: Grand Ballroom
Secondary topics:  AI
Andra Keay (Silicon Valley Robotics)
Average rating: ****.
(4.09, 46 ratings)
Let’s stop talking about bad robots and start talking about what makes a robot good. A good or ethical robot must be carefully designed. Andra Keay outlines five principles of good robot design and discusses the implications of implicit bias in our robots. Read more.

9:15am

Add to your personal schedule
9:15am–9:25am Thursday, 03/16/2017
Location: Grand Ballroom
Vijay Narayanan (Microsoft)
Average rating: ****.
(4.04, 28 ratings)
Vijay Narayanan takes you on an inspiring journey exploring how the cloud, data, and artificial intelligence are powering and accelerating the genomic revolution—saving and changing lives in the process. Read more.

9:25am

Add to your personal schedule
9:25am–9:40am Thursday, 03/16/2017
Location: Grand Ballroom
Michael Jordan (UC Berkeley)
Average rating: ****.
(4.29, 31 ratings)
Keynote with Michael I. Jordan. Read more.

9:40am

Add to your personal schedule
9:40am–9:45am Thursday, 03/16/2017
Location: Grand Ballroom
Ron Bodkin (Teradata)
Average rating: **...
(2.52, 25 ratings)
It is no surprise that reducing operational IT expenditures and increasing software capabilities are top priorities for large enterprises. Given its advantages, open source software has proliferated across the globe. Ron Bodkin explains how Teradata drives open source adoption inside enterprises using open source data management and AI techniques leveraged across the analytical ecosystem. Read more.

9:45am

Add to your personal schedule
9:45am–10:00am Thursday, 03/16/2017
Location: Grand Ballroom
Secondary topics:  Data for good
Desiree Matel-Anderson (The Field Innovation Team)
Average rating: ***..
(3.26, 27 ratings)
Data to the rescue. Desi Matel-Anderson offers an immersive deep dive into the world of the Field Innovation Team, who routinely find themselves on the frontier of disasters working closely with data to save lives, at times while risking their own. Read more.

10:00am

Add to your personal schedule
10:00am–10:05am Thursday, 03/16/2017
Location: Grand Ballroom
Average rating: ***..
(3.11, 18 ratings)
Which is more important: the model or the data? Dinesh Nirmal explains how your data can help you build the right cognitive systems to learn about, reason with, and engage with your customers. Read more.

10:05am

Add to your personal schedule
10:05am–10:10am Thursday, 03/16/2017
Location: Grand Ballroom
Rob Craft (Google)
Average rating: ***..
(3.81, 27 ratings)
Rob Craft shares some of the ways machine learning is being used inside of Google, explores cloud-based neural networks, and discusses some customer use cases. Read more.

10:10am

Add to your personal schedule
10:10am–10:30am Thursday, 03/16/2017
Location: Grand Ballroom
Maya Shankar (White House Office of Science & Technology Policy)
Average rating: ***..
(3.36, 25 ratings)
Maya Shankar discusses the motivation for and impact of the White House Social and Behavioral Sciences Team and shares lessons learned building a startup within the federal government. Read more.

10:30am

10:30am–11:00am Thursday, 03/16/2017
Location: Exhibit Hall
Morning break sponsored by Teradata (30m)

11:00am

Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Data engineering and architecture, Real-time applications
Location: LL20 A Level: Intermediate
Secondary topics:  Data Platform
Tony Xing (Microsoft)
Average rating: ***..
(3.00, 2 ratings)
Tony Xing offers an overview of Microsoft's common anomaly detection platform, an API service built internally to provide product teams the flexibility to plug in any anomaly detection algorithms to fit their own signal types. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Sponsored
Location: LL20 B
Roger Rea (IBM Information Management), Jorge Castanon (IBM)
Roger Rea and Jorge Castanon outline the top enterprise use cases for streaming and machine learning. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Stream processing and analytics
Location: LL20 C Level: Beginner
Secondary topics:  Streaming
Tyler Akidau (Google)
Average rating: ***..
(3.00, 2 ratings)
Join Tyler Akidau for a whirlwind tour of the conceptual building blocks of massive-scale data processing systems over the last decade, as Tyler compares and contrasts systems at Google with popular open source systems in use today. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Stream processing and analytics
Location: LL20 D Level: Intermediate
Secondary topics:  Data Platform, Media, Streaming
Bill Graham (Twitter), Avrilia Floratau (Microsoft), Ashvin Agrawal (Microsoft)
Twitter processes billions of events per day the instant the data is generated using Heron, an open source streaming engine tailored for large-scale environments. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Data science & advanced analytics
Location: LL21 A Level: Beginner
June Andrews (Pinterest), Frances Haugen (Pinterest)
Average rating: *****
(5.00, 5 ratings)
An experiment at Pinterest revealed somewhat shocking results. When nine data scientists and ML engineers were asked the same constrained question, they gave nine spectacularly different answers. The implications for business are astronomical. June Andrews and Frances Haugen explore the aspects of analysis that cause differences in conclusions and offer some solutions. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Platform Security and Cybersecurity
Location: LL21 B Level: Beginner
Secondary topics:  ecommerce, Media
Yinglian Xie (DataVisor)
How many of your users are really fraudsters waiting to strike? These sleeper cells exist in all online communities. Using data from more than 400M users and 500B events from online services across the world, Yinglian Xie explores sleeper cells, explains sophisticated attack techniques being used to evade detection, and shows how Spark's in-memory big data security analytics can help. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Spark & beyond
Location: LL21 C/D
Yin Huai (Databricks)
Average rating: ***..
(3.86, 7 ratings)
Just like any six-year-old, Apache Spark does not always do its job and can be hard to understand. Yin Huai looks at the top causes of job failures customers encountered in production and examines ways to mitigate such problems by modifying Spark. He also shares a methodology for improving resilience: a combination of monitoring and debugging techniques for users. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Secondary topics:  Architecture, Streaming
Todd Lipcon (Cloudera), Marcel Kornacker (Cloudera)
Average rating: ****.
(4.00, 1 rating)
Todd Lipcon and Marcel Kornacker offer an introduction to using Impala and Kudu to power your real-time data-centric applications for use cases like time series analysis (fraud detection, stream market data), machine data analytics, and online reporting. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017 Secondary topics:  AI, Deep learning
Rajat Monga (Google)
Average rating: ***..
(3.86, 7 ratings)
Rajat Monga offers an overview of TensorFlow progress and adoption in 2016 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Sponsored
Location: 210 B/F
Wee Hyong Tok (Microsoft), Danielle Dean (Microsoft)
Average rating: *****
(5.00, 1 rating)
Wee Hyong Tok and Danielle Dean explain how the global, trusted, and hybrid Microsoft platform can enable you to do intelligence at scale, describing real-life applications where big data, the cloud, and AI are making a difference and how this is accelerating the digital transformation for these organizations at a lightning pace. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Non-technical
Mehmet Irmak Sirer (Datascope Analytics)
Average rating: ***..
(3.83, 12 ratings)
In a data-driven organization, vice presidents, directors, and managers play a crucial role as translators between senior leadership and data science teams. They don’t need to be full-fledged data scientists, but they do need data science “street smarts” in order to succeed in this critical task. Mehmet Irmak Sirer outlines the skills they need and gives practical ways to improve them. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Secondary topics:  Data Platform, Streaming, Telecom
Todd Mostak (MapD), Abdul Subhan (Verizon Wireless)
Average rating: ****.
(4.00, 2 ratings)
With more than 91M customers, Verizon produces oceans of data. The challenge this onslaught presents isn’t one of storage—it’s one of speed. The solution? Harnessing the power of GPUs to access insights in less than a millisecond. Todd Mostak and Abdul Subhan explain how Verizon solved its data challenge by implementing GPU-tuned analytics and visualization. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017 Secondary topics:  Hardcore Data Science, Media, Text
Dorna Bandari (Pinterest Inc.)
Average rating: ****.
(4.00, 2 ratings)
Most internet companies record a constant stream of logs as a user interacts with their application. Depending on the complexity of the application, the logs can be extremely difficult to decipher. Dorna Bandari presents a novel NLP-based method for clustering user sessions in consumer internet applications, which has proved to be extremely effective in both driving strategy and personalization. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Sponsored
Location: 230 B
Average rating: ***..
(3.25, 4 ratings)
Teradata joined the Presto community in 2015 and is now a leading contributor to this open source SQL engine, originally created by Facebook. Join Kamil Bajda-Pawlikowski to learn about Presto, Teradata's recent enhancements in query performance, security integrations, and ANSI SQL coverage, and its roadmap for 2017 and beyond. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017 Secondary topics:  Cloud, Deep learning, Hardcore Data Science
Anima Anandkumar (UC Irvine)
Average rating: ****.
(4.67, 3 ratings)
Anima Anandkumar demonstrates how to use preconfigured Deep Learning AMIs and CloudFormation templates on AWS to help speed up deep learning development and shares use cases in computer vision and natural language processing. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/16/2017
Ask Me Anything
Location: 212 A-B
John Akred (Silicon Valley Data Science), Scott Kurth (Silicon Valley Data Science), Julie Steele (Silicon Valley Data Science), Stephen O'Sullivan (Silicon Valley Data Science)
John Akred, Julie Steele, Stephen O'Sullivan, and Scott Kurth field a wide range of detailed questions about developing a modern data strategy, architecting a data platform, and best practices for and the evolving role of the CDO. Even if you don’t have a specific question, join in to hear what others are asking. Read more.

11:50am

Add to your personal schedule
11:50am–12:30pm Thursday, 03/16/2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture, Media, Platform
Kurt Brown (Netflix)
Average rating: ****.
(4.90, 10 ratings)
The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/16/2017
Sponsored
Location: LL20 B
Victoria Livschitz (Grid Dynamics)
Average rating: *****
(5.00, 1 rating)
Victoria Livschitz outlines key business drivers for real-time analytics applications in retail and describes the emerging architectures based on in-stream processing (ISP) technologies. Victoria shares a complete open blueprint for an ISP platform—including a demo application for real-time Twitter sentiment analytics—designed with 100% open source components and deployable to any cloud. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/16/2017
Stream processing and analytics
Location: LL20 C Level: Intermediate
Secondary topics:  Streaming
Slava Chernyak (Google)
Average rating: *****
(5.00, 2 ratings)
Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/16/2017
Stream processing and analytics
Location: LL20 D Level: Intermediate
Secondary topics:  Media, Streaming
Arun Kejariwal (Machine Zone), Karthik Ramasamy (Twitter)
Average rating: ***..
(3.00, 1 rating)
Anomaly detection plays a key role in the analysis of real-time streams. One example is detecting real-life incidents from tweet storms. Arun Kejariwal and Karthik Ramasamy walk you through how anomaly detection is supported in real-time data streams in Heron, the streaming system built in-house at Twitter (and open sourced) for real-time computation. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/16/2017
Shoumik Palkar (Stanford University)
Modern data applications combine functions from many libraries and frameworks and cannot achieve peak hardware performance due to data movement across functions. Shoumik Palkar offers an overview of Weld, an optimizing runtime that enables optimizations across disjoint libraries, and explains how to integrate it into frameworks such as Spark SQL for performance gains with no changes to user code. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/16/2017
Law, ethics, governance
Location: LL21 B Level: Non-technical
Secondary topics:  Data for good
Craig Hibbeler (MasterCard Advisors), David Goodman (Nethope), Mike Olson (Cloudera), Laura Eisenhardt (iKnow Solutions), Steven Totman (Cloudera)
In a panel moderated by Steve Totman, Mike Olson, Laura Eisenhardt, Craig Hibbeler, and David Goodman discuss real-world projects using big data as a force for good to address problems ranging from Zika to child trafficking. If you’re interested in how big data can benefit humankind, join in to learn how to get involved. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/16/2017
Big data and the Cloud
Location: LL21 C/D Level: Beginner
Secondary topics:  Architecture
Haoyuan Li (Alluxio), Gene Pang (Alluxio)
Average rating: ****.
(4.00, 1 rating)
Alluxio (formerly Tachyon) is an open source memory-speed virtual distributed storage system. The project has experienced a tremendous improvement in performance and scalability and was extended with key new features. Haoyuan Li and Gene Pang explore Alluxio's goal of making its product accessible to an even wider set of users through a focus on security, new language bindings, and APIs. Read more.
11:50am–12:30pm Thursday, 03/16/2017
Hadoop platform and applications
Location: LL21 E/F Level: Intermediate
Secondary topics:  Architecture
Yang Li (Kyligence)
Average rating: *****
(5.00, 3 ratings)
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, Spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse. Read more.
11:50am–12:30pm Thursday, 03/16/2017
Spark & beyond
Location: 210 A/E
Secondary topics:  Deep learning, Hardcore Data Science
Joseph Bradley (Databricks), Tim Hunter (Databricks, Inc.)
Average rating: ***..
(3.75, 4 ratings)
Joseph Bradley and Tim Hunter share best practices for building deep learning pipelines with Apache Spark, covering cluster setup, data ingest, tuning clusters, and monitoring jobs—all demonstrated using Google’s TensorFlow library. Read more.
11:50am–12:30pm Thursday, 03/16/2017
Sponsored
Location: 210 B/F
Vahid Fereydouny (VMware), Anjaneya Chagam (Intel Corporation)
Vahid Fereydouny and Anjaneya Chagam share the results of running Hadoop workloads on a standard all-flash vSAN cluster, unleashing the simplicity and power of big data in a hyperconverged environment. Read more.
11:50am–12:30pm Thursday, 03/16/2017 Secondary topics:  Media
Brian Lange (Datascope)
Average rating: ****.
(4.00, 2 ratings)
The goal of RCSA's Scialog conferences is to foster collaboration between scientists with different specialties and approaches, and, working with Datascope, the company has been doing so in a quantitative way for the last six years. Brian Lange discusses how Datascope and RCSA arrived at the problem, the design choices made in the survey and optimization, and how the results were visualized. Read more.
11:50am–12:30pm Thursday, 03/16/2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Secondary topics:  Data for good, Healthcare
Emily Spahn (ProKarma)
Average rating: ****.
(4.25, 4 ratings)
Many hospitals combine early warning systems with rapid response teams (RRT) to detect patient decline and respond with elevated care. Predictive models can minimize RRT events by identifying at-risk patients, but modeling is difficult because events are rare and features are varied. Emily Spahn explores the creation of one such patient-risk model and shares lessons learned along the way. Read more.
11:50am–12:30pm Thursday, 03/16/2017
Data science & advanced analytics
Location: 230 A Level: Beginner
Secondary topics:  AI, Hardcore Data Science
Michael Lee Williams (Fast Forward Labs)
Average rating: ***..
(3.80, 5 ratings)
Real-world data is incomplete and imperfect. The right way to handle it is with Bayesian inference. Michael Williams demonstrates how probabilistic programming languages hide the gory details of this elegant but potentially tricky approach, making a powerful statistical method easy and enabling rapid iteration and new kinds of data-driven products. Read more.
11:50am–12:30pm Thursday, 03/16/2017
Sponsored
Location: 230 B
Luhui Hu (Futurewei Technologies)
Average rating: ****.
(4.00, 3 ratings)
With Huawei's big data cloud ecosystem, you can define and set up your data pipelines quickly and easily, whether you’re looking for batch processing or stream analytics. Luhui Hu shares best practices for designing a big data pipeline in the cloud and explains how to implement serverless big data solutions and intelligent data clouds. Read more.
11:50am–12:30pm Thursday, 03/16/2017 Secondary topics:  Hardcore Data Science
Carlos Guestrin (University of Washington | Apple)
Average rating: *****
(5.00, 4 ratings)
Carlos Guestrin offers an overview of anchors and aLIME, a novel, high-precision technique that explains the predictions of any classifier in an interpretable and faithful manner. He demonstrates the flexibility of these methods by explaining different models for text, image classification, and visual question answering and explores the usefulness of explanations via novel experiments. Read more.
11:50am–12:30pm Thursday, 03/16/2017
Ask Me Anything
Location: 212 A-B
Mark Grover (Cloudera), Jonathan Seidman (Cloudera), Ted Malaska (Blizzard Entertainment)
Mark Grover and Jonathan Seidman, the authors of Hadoop Application Architectures, share considerations and recommendations for the architecture and design of applications using Hadoop. Come with questions about your use case and its big data architecture or just listen in on the conversation. Read more.

12:30pm

12:30pm–1:50pm Thursday, 03/16/2017
Location: Hall 1, 2, 3
Birds of a Feather (BoF) discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.

1:50pm

1:50pm–2:30pm Thursday, 03/16/2017
Spark & beyond
Location: LL20 A
Secondary topics:  Platform, Streaming
Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Vinoth Chandar and Prasanna Rajaperumal introduce Hoodie, a newly open sourced system at Uber that adds new incremental processing primitives to existing Hadoop technologies to provide near-real-time data at 10x reduced cost. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Stream processing and analytics
Location: LL20 C Level: Intermediate
Secondary topics:  Streaming
Jamie Grier (data Artisans)
Average rating: *****
(5.00, 4 ratings)
Jamie Grier outlines the latest important features in Apache Flink and walks you through building a working demo to show these features off. Topics include queryable state, dynamic scaling, streaming SQL, very large state support, and whatever is the latest and greatest in March 2017. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Data engineering and architecture, Emerging Technologies
Location: LL20 D Level: Advanced
Secondary topics:  Architecture
Julien Le Dem (Apache Parquet), Jacques Nadeau (Dremio)
Average rating: ****.
(4.00, 2 ratings)
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, such as RDMA, SSDs, and nonvolatile memory. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Ira Cohen (Anodot)
Average rating: ***..
(3.50, 2 ratings)
Apps have so many moving parts that a simple change to one element can cause havoc somewhere else. The resulting issues annoy users and cause revenue leaks. Ira Cohen outlines ways to use anomaly detection to monitor all areas of an app, from the code to the user behavior to partner integrations and more, to fully optimize your mobile app. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Data engineering and architecture
Location: LL21 B Level: Beginner
Shirshanka Das (LinkedIn), Yael Garten (LinkedIn)
Average rating: ****.
(4.75, 4 ratings)
Shirshanka Das and Yael Garten share best practices learned using Kafka and Hadoop as the foundation of a petabyte-scale data warehouse at LinkedIn, offering concrete suggestions to help you process data seamlessly. Along the way, Shirshanka and Yael discuss their experience running governance to empower data teams. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Spark & beyond
Location: LL21 C/D Level: Intermediate
Holden Karau (IBM), Joey Echeverria (Rocana)
Average rating: ***..
(3.67, 3 ratings)
Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Data science & advanced analytics
Location: LL21 E/F Level: Advanced
Secondary topics:  Hardcore Data Science, Media
Chao Zhong (Microsoft)
Average rating: ****.
(4.67, 3 ratings)
Chao Zhong offers an overview of a new predictive model for customer lifetime value (LTV) in a cloud-computing business. This model is also the first known application of the Fader RFM approach to a cloud business—a Bayesian approach that predicts a customer's LTV with a symmetric absolute percentage error (SAPE) of only 3% on an out-of-time testing dataset. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Data science & advanced analytics
Location: 210 A/E Level: Intermediate
Matt Brandwein (Cloudera), Tristan Zajonc (Cloudera)
Average rating: ***..
(3.33, 3 ratings)
Self-service data science is easier said than delivered, especially on Apache Hadoop. Most organizations struggle to balance the diverging needs of the data scientist, data engineer, operator, and architect. Matt Brandwein and Tristan Zajonc cover the underlying root causes of these challenges and introduce new capabilities being developed to make self-service data science a reality. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Sponsored
Location: 210 B/F
Rob Craft (Google)
Average rating: *****
(5.00, 1 rating)
Rob Craft explores machine learning and predictive analytics, explaining how you can leverage the power of ML whether you have a machine-learning team of your own or just want to use ML as a service. Read more.
1:50pm–2:30pm Thursday, 03/16/2017 Secondary topics:  Data for good
Gillian Docherty (The Data Lab)
Average rating: ****.
(4.33, 3 ratings)
Gillian Docherty shares her experience leading The Data Lab, an innovation center focused on helping organizations drive economic and social benefit through data science and analytics. Along the way, Gillian discusses some of the projects her teams have supported, from multinationals to startups, and explains how they leverage academic capability to help drive innovation from data. Read more.
1:50pm–2:30pm Thursday, 03/16/2017 Secondary topics:  Data Platform, Geospatial, Logistics
Mike Koelemay (Sikorsky Aircraft, Lockheed Martin)
Average rating: *****
(5.00, 2 ratings)
Sikorsky collects data onboard thousands of helicopters deployed worldwide that is used for fleet management services, engineering analyses, and business intelligence. Mike Koelemay offers an overview of the data platform that Sikorsky has built to manage the ingestion, processing, and serving of this data so that it can be used to rapidly generate information to drive decision making. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Data science & advanced analytics
Location: 230 A Level: Intermediate
Secondary topics:  ecommerce, Media
Michelangelo D'Agostino (Civis Analytics), Bill Lattner (Civis Analytics)
Average rating: ****.
(4.00, 2 ratings)
How do we know that an advertisement or promotion truly drives incremental revenue? Michelangelo D'Agostino and Bill Lattner share their experience developing machine-learning techniques for predicting treatment responsiveness from randomized controlled experiments and explore the use of these “persuasion” models at scale in politics, social good, and marketing. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Sponsored
Location: 230 B
Jagane Sundar (WANdisco)
Jagane Sundar shares a strongly consistent replication service for replicating between cloud object stores, HDFS, NFS, and other S3- and Hadoop-compatible filesystems. Read more.
1:50pm–2:30pm Thursday, 03/16/2017 Secondary topics:  Media, Text
Grace Huang (Pinterest)
Average rating: ***..
(3.33, 3 ratings)
With over 75 billion pins, the Pinterest content corpus is one of the largest human-curated collections of ideas. Grace Huang walks you through the lifecycle of a piece of content in Pinterest, a portfolio of metrics developed to monitor the health of the content corpus, and the story of creating a cross-functional initiative to preserve a healthy, sustainable content ecosystem. Read more.
1:50pm–2:30pm Thursday, 03/16/2017
Ask Me Anything
Location: 212 A-B
Tyler Akidau (Google), Frances Perry (Google), Kenneth Knowles (Google), Slava Chernyak (Google)
Average rating: ***..
(3.00, 1 rating)
Join Tyler Akidau, Frances Perry, Kenneth Knowles, and Slava Chernyak to discuss anything related to Apache Beam. Read more.

2:40pm

2:40pm–3:20pm Thursday, 03/16/2017 Secondary topics:  Architecture, Streaming
Gwen Shapira (Confluent)
Average rating: *****
(5.00, 3 ratings)
There are many good reasons to run more than one Kafka cluster... and a few bad reasons too. Great architectures are driven by use cases, and multicluster deployments are no exception. Gwen Shapira offers an overview of several use cases, including real-time analytics and payment processing, that may require multicluster solutions to help you better choose the right architecture for your needs. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Data engineering and architecture
Location: LL20 C Level: Beginner
Teresa Tung (Accenture Labs), Jurgen Weichenberger (Accenture Analytics), Ishmeet Grewal (Accenture Technology Labs)
Average rating: ***..
(3.80, 5 ratings)
As Accenture scaled to millions of predictive models, it needed automation to manage models at scale, ensure accuracy, prevent false alarms, and preserve trust as models are created, tested, and deployed into production. Teresa Tung, Jürgen Weichenberger, and Ishmeet Grewal share their approach to implementing DevOps for models and employing a self-healing approach to model lifecycle management. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Real-time applications
Location: LL20 D Level: Beginner
Manny Puentes (Rebel AI)
Average rating: ***..
(3.00, 2 ratings)
In 2016, digital advertising overtook TV in spend, requiring companies to cut through the noise to reach their audience. Manny Puentes explains how Rebel AI decides which ads to serve across devices and how it delivers multidimension reporting in milliseconds. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Christopher Bergh (DataKitchen), Gil Benghiat (DataKitchen)
Average rating: ****.
(4.50, 2 ratings)
Data analysts, data scientists, and data engineers are already working on teams delivering insight and analysis, but how do you get the team to support experimentation and insight delivery without ending up in an IT versus data engineer versus data scientist war? Christopher Bergh and Gil Benghiat present the seven shocking steps to get these groups of people working together. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Business case studies
Location: LL21 B Level: Non-technical
Secondary topics:  Data for good, Smart cities
Joel Gurin (Center for Open Data Enterprise)
Average rating: ****.
(4.67, 3 ratings)
Open government data (free public data that anyone can use and republish) is a major resource for entrepreneurs and innovators. The Center for Open Data Enterprise has partnered with the White House, government, and businesses to show how this resource can create economic value. Joel Gurin and Katherine Garcia share case studies of how open data is being used and a vision for its future. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Platform Security and Cybersecurity, Spark & beyond
Location: LL21 C/D Level: Advanced
Secondary topics:  Hardcore Data Science
Alexander Ulanov (Hewlett Packard Labs), Manish Marwah (Hewlett Packard Labs)
Alexander Ulanov and Manish Marwah explain how they implemented a scalable version of loopy belief propagation (BP) for Apache Spark, applying BP to large web-crawl data to infer the probability that a website is malicious. Applications of BP include fraud detection, malware detection, computer vision, and customer retention. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Stream processing and analytics
Location: LL21 E/F Level: Intermediate
Secondary topics:  Streaming
David Yan (DataTorrent, Inc.)
David Yan offers an overview of Apache Apex, a stream processing engine used in production by several large companies for real-time data analytics. With Apex, you can build applications that scalably and reliably process their data with high throughput and low latency. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Real-time applications
Location: 210 A/E Level: Advanced
Secondary topics:  Financial services, Hardcore Data Science, IoT, Streaming
Jeffrey Yau (Silicon Valley Data Science)
Average rating: ***..
(3.20, 5 ratings)
Thanks to frameworks such as Spark's GraphX and GraphFrames, graph-based techniques are increasingly applicable to anomaly, outlier, and event detection in time series. Jeffrey Yau offers an overview of applying graph-based techniques in fraud detection, IoT processing, and financial data and outlines the benefits of graphs relative to other techniques. Read more.
2:40pm–3:20pm Thursday, 03/16/2017 Secondary topics:  ecommerce, Retail
Eric Colson (Stitch Fix)
Average rating: ****.
(4.36, 14 ratings)
Data scientists blend the skills of statisticians, software engineers, and domain experts to create new roles. Data science isn't merely an amalgam of disciplines but rather a gestalt which synthesizes the ethos of various fields. This merits new thinking when it comes to organization. Eric Colson explores some novel—and often unintuitive—ways to unleash the value of your data science team. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Secondary topics:  Media
Romit Jadhwani (Pinterest)
Average rating: ****.
(4.75, 4 ratings)
Over the course of just six years, Pinterest has helped over 100 million pinners discover and collect more than 75 billion ideas to plan their everyday lives. Romit Jadhwani walks you through the different phases of this hypergrowth journey and explores the focuses, thought processes, and decisions of Pinterest’s data team as they scaled and enabled this growth. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Data science & advanced analytics
Location: 230 A Level: Intermediate
Secondary topics:  Media
Michelle Casbon (Qordoba)
Average rating: *****
(5.00, 1 rating)
Supporting multiple locales involves the maintenance and generation of localized strings. Michelle Casbon explains how machine learning and natural language processing are applied to the underserved domain of localization using primarily open source tools, including Scala, Apache Spark, Apache Cassandra, and Apache Kafka. Read more.
2:40pm–3:20pm Thursday, 03/16/2017 Secondary topics:  Architecture, Data Platform, ecommerce
Gleicon Moraes (luc.id), Arthur Grava (Luizalabs)
Average rating: ****.
(4.00, 3 ratings)
Gleicon Moraes and Arthur Grava share war stories about developing and deploying a cloud-based large-scale recommender system for a top-three Brazilian ecommerce company. The system, which uses Cassandra and graph traversal, led to a more than 15% increase in sales. Read more.
2:40pm–3:20pm Thursday, 03/16/2017
Ask Me Anything
Location: 212 A-B
Vartika Singh (Cloudera), Jayant Shekhar (Sparkflows Inc.), Jeffrey Shmain (Cloudera)
Join Vartika Singh, Jayant Shekhar, and Jeffrey Shmain to ask questions about their tutorial Unraveling Data with Spark Using Machine Learning or anything else Spark related. Read more.

3:20pm

3:20pm–4:20pm Thursday, 03/16/2017
Location: Exhibit Hall
Afternoon break sponsored by IBM (1h)

4:20pm

4:20pm–5:00pm Thursday, 03/16/2017
Data engineering and architecture
Location: LL20 A Level: Intermediate
Secondary topics:  Architecture
Nischal HP (Unnati Data Labs), Raghotham Sripadraj (Unnati Data Labs)
Average rating: ****.
(4.67, 3 ratings)
Not all data science problems are big data problems. Lots of small and medium product companies want to start their journey to become data driven. Nischal HP and Raghotham Sripadraj share their experience building data science platforms for various enterprises, with an emphasis on making the right architecture choices and using distributed and fault-tolerant tools. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Enterprise adoption
Location: LL20 C Level: Intermediate
Average rating: *****
(5.00, 2 ratings)
Visa is transforming the way it manages data: database appliances are giving way to Hadoop and HBase, and proprietary ETL is being replaced by Spark. Nandu Jayakumar and Rajesh Bhargava discuss the adoption of big data practices at this conservative financial enterprise and contrast it with the adoption of the same ideas at Nandu's previous employer, a web/ad-tech company. Read more.
4:20pm–5:00pm Thursday, 03/16/2017 Secondary topics:  Deep learning
Shivnath Babu (Duke University | Unravel Data Systems)
Average rating: *****
(5.00, 1 rating)
Shivnath Babu offers an introduction to using deep learning to solve complex problems in IT operations analytics. Shivnath focuses on how deep learning can derive operations insights automatically for the complex big data application stack composed of systems such as Hadoop, Spark, Cassandra, Elasticsearch, and Impala, using examples of open source tools for deep learning. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Warren Reed (US Treasury’s Office of Financial Research)
Average rating: ***..
(3.50, 2 ratings)
Warren Reed explains how he and his team at the US Treasury’s Office of Financial Research leverage data visualization techniques to build interactive data products for risk measurement and monitoring. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Spark & beyond
Location: LL21 C/D Level: Beginner
Secondary topics:  Financial services
Bryan Cheng (BlockCypher), Karen Hsu (BlockCypher)
Average rating: *****
(5.00, 2 ratings)
Bryan Cheng and Karen Hsu describe how they built machine-learning and graph traversal systems on Apache Spark to help government organizations and private businesses stay informed in the brave new world of blockchain technology. Bryan and Karen also share lessons learned combining these two bleeding-edge technologies and explain how these techniques can be applied to private and federated chains. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Spark & beyond
Location: LL21 E/F
Jiri Simsa (Alluxio), Calvin Jia (Alluxio)
Average rating: ****.
(4.67, 3 ratings)
Alluxio bridges Spark applications with various storage systems and further accelerates data-intensive applications. Gene Pang and Jiri Simsa introduce Alluxio, explain how Alluxio can help Spark be more effective, show benchmark results with Spark RDDs and DataFrames, and describe production deployments with both Alluxio and Spark working together. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Law, ethics, governance
Location: 210 A/E
Max Ogden (Independent)
Average rating: *****
(5.00, 1 rating)
Max Ogden offers an overview of Data Refuge, a nationwide volunteer effort led by librarians, scientists, and coders to discover and back up research data at risk of disappearing. Max discusses his work to uncover hundreds of federal data servers containing petabytes of publicly funded research data and his plan to keep it online and useful to researchers in the future. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Robert Cohen (Economic Strategy Institute)
Programmable enterprises are developing their businesses around cloud computing, big data, and the internet of things. Robert Cohen explores how infrastructure changes will alter corporate use of software, skilled employees, and strategies, the business and economic impacts of these changes, and the broader impacts of these shifts on our economy and society. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Business case studies, Strata Business Summit
Location: 210 D/H Level: Intermediate
Secondary topics:  Data Platform, ecommerce, Streaming, Text
Mahesh Goud T (Ticketmaster)
Average rating: **...
(2.00, 1 rating)
Mahesh Goud shares success stories using Ticketmaster's large-scale contextual bandit platform for SEM, which determines the optimal keyword bids under evolving keyword contexts to meet different business requirements, and explores Ticketmaster's streaming pipeline, consisting of Storm, Kafka, HBase, the ELK Stack, and Spring Boot. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Data science & advanced analytics
Location: 230 A Level: Intermediate
Eduardo Arino de la Rubia (Domino Data Lab)
Average rating: ****.
(4.71, 7 ratings)
The promise of the automated statistician is as old as statistics itself. Eduardo Arino de la Rubia explores the tools created by the open source community to free data scientists from tedium, enabling them to work on the high-value aspects of insight creation. Along the way, Eduardo compares open source tools such as TPOT and auto-sklearn and discusses their place in the DS workflow. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Data science & advanced analytics
Location: 230 C Level: Advanced
Secondary topics:  Hardcore Data Science
Frederick Reiss (IBM Spark Technology Center), Arvind Surve (IBM)
Many iterative machine-learning algorithms can only operate efficiently when a large matrix of training data fits in the main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines. Read more.
4:20pm–5:00pm Thursday, 03/16/2017
Ask Me Anything
Location: 212 A-B
Gwen Shapira (Confluent)
Join Confluent system architect Gwen Shapira to discuss Apache Kafka and its use cases, data streaming platforms, and microservices. Read more.

5:00pm

5:00pm–6:00pm Thursday, 03/16/2017
Location: The Hub
Average rating: ***..
(3.50, 2 ratings)
Join attendees, speakers, and exhibitors as we end the conference on a sweet note with some ice cream. Read more.