Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Monday, 03/05/2018

9:00am

9:00am–5:00pm Monday, 03/05/2018
Training
Data science and machine learning
Location: 212 A-B
Brooke Wenig (Databricks)
Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs. Read more.
9:00am–5:00pm Monday, 03/05/2018
Training
Data science and machine learning
Location: 212 C
Brian Bloechle (Cloudera), Glynn Durham (Cloudera)
Average rating: *****
(5.00, 1 rating)
Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib. Read more.
9:00am–5:00pm Monday, 03/05/2018
Training
Strata Business Summit
Location: 212 D
Angie Ma (ASI), Maria Diaz (ASI Data Science)
Average rating: ****.
(4.00, 2 ratings)
Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization. Read more.
9:00am–5:00pm Monday, 03/05/2018
Training
Data science and machine learning
Location: 111
Delip Rao (R7 Speech Science), Brian McMahan (Joostware)
Average rating: *****
(5.00, 1 rating)
PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems. Read more.
9:00am–5:00pm Monday, 03/05/2018
Jesse Anderson (Big Data Institute)
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company. Read more.
9:00am–5:00pm Monday, 03/05/2018
Training
Data science and machine learning
Location: San Jose Ballroom (salon 1&2), Marriott
Robert Schroll (The Data Incubator)
The TensorFlow library enables the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs, making it ideal for implementing neural networks and other machine learning algorithms. Robert Schroll demonstrates TensorFlow's capabilities and walks you through building machine learning models on real-world data. Read more.
9:00am–5:00pm Monday, 03/05/2018
Training
Data science and machine learning
Location: Willow Glen (1&2), Marriott
Zachary Glassman (The Data Incubator)
Zachary Glassman demonstrates how to build intelligent business applications using machine learning, taking you through each step in developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend your knowledge by building two applications from real-world datasets. Read more.

10:30am

10:30am–11:00am Monday, 03/05/2018
Location: Executive Concourse
Morning break (30m)

12:30pm

12:30pm–1:30pm Monday, 03/05/2018
Location: The Hub
Lunch (1h)

3:00pm

3:00pm–3:30pm Monday, 03/05/2018
Location: Executive Concourse
Afternoon break (30m)

7:00pm

7:00pm–10:00pm Monday, 03/05/2018
Event
Location: Various locations
Get to know your fellow attendees over dinner. We've made reservations for you at some of the most sought-after restaurants in town. This is a great chance to make new connections and sample some of the great cuisine San Jose has to offer. Read more.

Tuesday, 03/06/2018

9:00am

9:00am–5:00pm Tuesday, 03/06/2018
Location: LL20 A
Madhav Madaboosi (BP), Meenakshisundaram Thandavarayan (Infosys), Matt Conners (Microsoft), Katie Malone (Civis Analytics), Mike Prorock (mesur.io), Thomas Miller (Northwestern University), Ann Nguyen (Whole Whale), Jennie Shin (Kaiser Permanente), Val Bercovici (PencilDATA), Wayde Fleener (General Mills), Joe Dumoulin (Next IT), Jules Malin (GoPro), Taylor Martin (O'Reilly Media), Divya Ramachandran (Captricity)
Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.
9:00am–5:00pm Tuesday, 03/06/2018
Location: LL20 B
David Boyle (MasterClass), Violeta Hennessey (Warner Bros.), April Chen (Civis Analytics), Sridhar Alla (Comcast), Noah Gift (UC Davis), Blake Irvine (Netflix), Kevin Lyons (Nielsen Marketing Cloud), Jennifer Webb (SuprFanz), Rizwan Patel (Caesars Entertainment), Anthony Accardo (Disney), Amanda Gerdes (Blizzard Entertainment), Aneesh Karve (Quilt), Peter Skomoroch (SkipFlag)
Hear from innovators in ad tech, measurement, automation, and audience engagement about where the media industry is today—and where it's likely to go next. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Tutorial
Data engineering and architecture, Law, ethics, and governance
Location: LL20 C Level: Intermediate
Mark Donsky (Cloudera), Andre Araujo (Cloudera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera)
Average rating: **...
(2.00, 1 rating)
New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, Andre Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR. Read more.
9:00am–5:00pm Tuesday, 03/06/2018
Tutorial
Data science and machine learning
Location: LL20 D
Joseph Kambourakis (Databricks)
Join Joseph Kambourakis for an introduction to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Secondary topics:  Graphs and Time-series
Mo Patel (Independent), Neejole Patel (Virginia Tech)
Average rating: **...
(2.50, 4 ratings)
Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Tutorial
Big data and data science in the cloud, Data engineering and architecture
Location: LL21 B Level: Intermediate
Jorge A. Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Ryan Nienhuis (Amazon Web Services), Randy Ridgley (Amazon Web Services)
Average rating: ****.
(4.50, 2 ratings)
Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Tutorial
Data science and machine learning
Location: LL21 C/D Level: Intermediate
Mario Inchiosa (Microsoft), Vanja Paunic (Microsoft), Robert Horton (Microsoft), Debraj GuhaThakurta (Microsoft), Ali Zaidi (Microsoft), Tomas Singliar (Microsoft), John-Mark Agosta (Microsoft)
Average rating: ****.
(4.00, 4 ratings)
R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Tutorial
Data science and machine learning
Location: LL21 E/F Level: Intermediate
Martin Görner (Google)
Average rating: *****
(5.00, 3 ratings)
Martin Görner walks you through training and deploying a machine learning system using popular open source library TensorFlow. Martin takes you from a conceptual overview all the way to building complex classifiers and explains how you can apply deep learning to complex problems in science and industry. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)
Average rating: ****.
(4.44, 9 ratings)
Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Secondary topics:  Graphs and Time-series
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (Streamlio), Arun Kejariwal (MZ)
Average rating: *****
(5.00, 2 ratings)
Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Tim Berglund (Confluent)
Average rating: ****.
(4.36, 11 ratings)
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data. Read more.
9:00am–12:30pm Tuesday, 03/06/2018
Tutorial
Big data and data science in the cloud, Data engineering and architecture
Location: 210 D/H Level: Intermediate
Jason Wang (Cloudera), Mala Ramakrishnan (Cloudera), Stefan Salandy (Cloudera), Aishwarya Venkataraman (Cloudera), Vinithra Varadharajan (Cloudera), Aaron Myers (Cloudera)
Average rating: ***..
(3.25, 4 ratings)
Aishwarya Venkataraman, Jason Wang, Mala Ramakrishnan, Stefan Salandy, and Vinithra Varadharajan lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices. Read more.

10:30am

10:30am–11:00am Tuesday, 03/06/2018
Location: Executive Concourse
Morning break (30m)

12:30pm

12:30pm–1:30pm Tuesday, 03/06/2018
Location: 230 A-C
Lunch (1h)

1:30pm

1:30pm–5:00pm Tuesday, 03/06/2018
Tutorial
Data science and machine learning
Location: LL20 C Level: Intermediate
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
Average rating: *****
(5.00, 1 rating)
Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP, using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Tutorial
Data engineering and architecture
Location: LL21 A Level: Intermediate
Juan Yu (Cloudera)
Average rating: ****.
(4.75, 4 ratings)
Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses, shows how Impala optimizes queries, and explains how to identify performance bottlenecks through query plans and profiles and how to drive Impala to its full potential. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Tutorial
Data engineering and architecture
Location: LL21 B Level: Intermediate
Ron Bodkin (Google), Brian Foo (Google)
Average rating: ***..
(3.00, 2 ratings)
TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin and Brian Foo to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Somit Gupta (Microsoft), Paul Raff (Microsoft)
Average rating: ****.
(4.00, 3 ratings)
Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Tutorial
Data science and machine learning, Media, entertainment, and advertising
Location: LL21 E/F Level: Intermediate
Abhishek Kumar (SapientRazorfish), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
Average rating: ****.
(4.00, 3 ratings)
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a deep learning recommender system driven by intent prediction, modeled on a real-world implementation for an ecommerce client. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Tutorial
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Non-technical
Nick Elprin (Domino Data Lab)
Average rating: *****
(5.00, 2 ratings)
The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Tutorial
Data engineering and architecture
Location: 210 B/F Level: Intermediate
Secondary topics:  Graphs and Time-series
Ted Malaska (Blizzard Entertainment)
Average rating: **...
(2.80, 5 ratings)
If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Average rating: ***..
(3.50, 2 ratings)
Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.
1:30pm–5:00pm Tuesday, 03/06/2018
Tutorial
Data science and machine learning, Visualization and user experience
Location: 210 D/H Level: Intermediate
James Bednar (Anaconda), Philipp Rudiger (Anaconda)
Average rating: ****.
(4.50, 2 ratings)
Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints—all in just 30 lines of Python code. Read more.

3:00pm

3:00pm–3:30pm Tuesday, 03/06/2018
Location: Executive Concourse
Afternoon break (30m)

5:00pm

5:00pm–6:30pm Tuesday, 03/06/2018
Event
Location: Hall 1, 2, 3
Join us after tutorials on Tuesday in the Expo Hall. Grab a drink and mingle with fellow Strata attendees while you check out all of the exhibitors. Read more.

6:30pm

6:30pm–8:00pm Tuesday, 03/06/2018
Event
Location: Grand Ballroom 220
Average rating: *****
(5.00, 2 ratings)
Ignite is happening at Strata on Tuesday, March 6. Join us for a fun, high-energy evening of five-minute talks—all aspiring to live up to the Ignite motto: Enlighten us, but make it quick. Read more.

Wednesday, 03/07/2018

8:00am

8:00am–8:30am Wednesday, 03/07/2018
Event
Location: Concourse foyer
Gather before keynotes on Wednesday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with fellow attendees. Read more.

8:45am

8:45am–8:50am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Average rating: ****.
(4.00, 2 ratings)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes. Read more.

8:50am

8:50am–9:05am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220
Hilary Mason (Cloudera Fast Forward Labs)
Average rating: ****.
(4.80, 10 ratings)
The power of machine learning is very real, but so too is the hype and confusion about when, where, and how to apply it. Hilary Mason explores practical business applications for intelligent machines and details the tools and processes required to implement machine learning successfully. Read more.

9:05am

9:05am–9:20am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220
Li Fan (Pinterest)
Average rating: ****.
(4.38, 8 ratings)
Li Fan shares insights into how Pinterest improves products based on usage and explains how the company is using AI to predict what’s in an image, what a user wants, and what they’ll want next, answering subjective questions better than machines or humans alone could achieve. Read more.

9:20am

9:20am–9:30am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220
Tobias Ternstrom (Microsoft)
Average rating: ***..
(3.38, 8 ratings)
The emergence of the cloud combined with open source software ushered in an explosive use of a broad range of technologies. Tobias Ternstrom explains why you should step back and attempt to objectively evaluate the problem you are trying to solve before choosing the tool to fix it. Read more.

9:30am

9:30am–9:40am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220 Level: Non-technical
Ben Lorica (O'Reilly Media)
Average rating: ****.
(4.00, 8 ratings)
Ben Lorica shares emerging security best practices for business intelligence, machine learning, and mobile computing products and explores new tools, methods, and products that can help ease the way for companies interested in deploying secure and privacy-preserving analytics. Read more.

9:40am

9:40am–9:55am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220
Nancy Lublin (Crisis Text Line)
Average rating: ****.
(4.70, 10 ratings)
Nancy Lublin shares insights from Crisis Text Line. Read more.

9:55am

9:55am–10:00am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220
Nikita Shamgunov (MemSQL)
Average rating: ***..
(3.14, 7 ratings)
We live in a world that’s always connected. As a result, today’s intelligent applications need to react immediately to changing conditions. To achieve this, applications require a foundation that is latency free. Nikita Shamgunov shares a vision of latency-free life supported by modern data architectures. Read more.

10:00am

10:00am–10:10am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220
Natalie Evans Harris (BrightHive)
Average rating: ****.
(4.17, 6 ratings)
Natalie Evans Harris explores the Community Principles on Ethical Data Practices (CPEDP), a community-driven code of ethics for data collection, sharing, and utilization that provides people in the data science community a standard set of easily digestible, recognizable principles for guiding their behaviors. Read more.

10:10am

10:10am–10:15am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220
Average rating: ***..
(3.43, 7 ratings)
Machine learning research and incubation projects are everywhere, but less common, and far more valuable, is the innovation unlocked once you bring machine learning out of research and into production. Dinesh Nirmal explains how real-world machine learning reveals assumptions embedded in business processes and in the models themselves that cause expensive and time-consuming misunderstandings. Read more.

10:15am

10:15am–10:30am Wednesday, 03/07/2018
Keynote
Location: Grand Ballroom 220
Alex Smola (Amazon)
Average rating: ***..
(3.40, 5 ratings)
Alex Smola discusses lessons learned from AWS SageMaker, an integrated framework for handling all stages of analysis. AWS uses open source components such as Jupyter, Docker containers, Python, and well-established deep learning frameworks such as Apache MXNet and TensorFlow to provide an easy-to-learn workflow. Read more.

10:30am

10:30am–11:00am Wednesday, 03/07/2018
Location: Hall 1, 2, 3
Morning break sponsored by MemSQL (30m)

11:00am

11:00am–11:40am Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: 212 A-B Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Gwen Shapira (Confluent)
Average rating: ****.
(4.93, 14 ratings)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL20 A Level: Intermediate
Mike Lee Williams (Cloudera Fast Forward Labs)
Average rating: ****.
(4.86, 7 ratings)
Interpretable models result in more accurate, safer, and more profitable machine learning products. But interpretability can be hard to ensure. Mike Lee Williams explores the growing business case for interpretability and its concrete applications, including churn, finance, and healthcare. Along the way, Mike offers an overview of the open source, model-agnostic tool LIME. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Sponsored
Location: LL20 B
Ian Swanson (DataScience.com)
Average rating: ****.
(4.50, 2 ratings)
Ian Swanson shares strategies for leading more productive data science teams, along with steps you can take today to meet growing demands for AI and machine learning use cases. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL20 C Level: Non-technical
Daniel Lurie (1989)
All successful startups thrive on tight product-market fit, which can produce homogenous initial user bases. To become the next big thing, your user base will need to diversify, and your product must change to accommodate new needs. Daniel Lurie explains how Pinterest leverages external data to measure racial and income diversity in its user base and changed user modeling to drive growth. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Wee Hyong Tok (Microsoft), Danielle Dean (Microsoft)
Average rating: ***..
(3.50, 2 ratings)
Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Sponsored
Location: LL21 A
Santosh Rao (NetApp)
Average rating: *****
(5.00, 1 rating)
Santosh Rao explores the architecture of a data pipeline from edge to core to cloud, across various data sources and processing engines, and explains how to build a solution architecture that helps businesses maximize competitive differentiation by unifying data insights in compelling yet efficient ways. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL21 B
Rajat Monga (Google)
Average rating: ****.
(4.40, 5 ratings)
Rajat Monga offers an overview of TensorFlow's progress and adoption in 2017 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Big data and data science in the cloud, Data engineering and architecture
Location: LL21 C/D Level: Intermediate
Tom Fisher (MapR Technologies)
The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to next-generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: LL21 E/F Level: Intermediate
Kinnary Jangla (Pinterest)
Average rating: **...
(2.25, 8 ratings)
Having trouble coordinating development of your production ML system across a team of developers? Are your microservices drifting and causing problems during debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how doing so affected the engineering productivity of the company's ML teams while increasing uptime and ease of deployment. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Strata Business Summit
Location: 210 A/E Level: Intermediate
Mark Madsen (Think Big Analytics), Shant Hovsepian (Arcadia Data)
Average rating: ***..
(3.29, 7 ratings)
There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian outline the trade-offs between a number of architectures that provide self-service access to data and discuss the pros and cons of architectures, deployment strategies, and examples of BI on big data. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Location: 210 B/F
Natalie Evans Harris (BrightHive)
Average rating: *****
(5.00, 2 ratings)
Join Natalie Evans Harris for a brainstorming session on data and ethics. You'll cover the current Community Principles on Ethical Data Practices (CPEDP) and next steps, existing tools that support ethical data practices, how the community can support the needs of the individual, and whether or not the community needs to be held accountable to regulations (or something more like fiduciary duty). Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Law, ethics, and governance, Strata Business Summit
Location: 210 C/G Level: Intermediate
Anne Buff (SAS)
Average rating: ****.
(4.50, 2 ratings)
Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Strata Business Summit
Location: 210 D/H Level: Intermediate
Mauro Damo (Dell EMC), Wei Lin (Dell EMC)
Average rating: ***..
(3.50, 2 ratings)
Image recognition-based classification of diseases can reduce medical mistakes, improve patient treatment, and speed up diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identifying bladder cancer in patients using unsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Secondary topics:  Graphs and Time-series
Shivnath Babu (Duke University | Unravel Data Systems), Dhruv Goel (Microsoft)
Average rating: ****.
(4.50, 2 ratings)
Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Dhruv Goel explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Sponsored
Location: 230 B
Tobias Ternstrom (Microsoft)
Average rating: ****.
(4.00, 1 rating)
Tobias Ternstrom leads a deep dive into case studies from three Microsoft customers who put technology before solutions. Tobias examines the decisions that brought them there and outlines how they got back on track and solved their business problems. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: 230 C Level: Intermediate
Daniel Templeton (Cloudera), Andrew Wang (Cloudera)
Average rating: ****.
(4.67, 6 ratings)
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet. Read more.
11:00am–11:40am Wednesday, 03/07/2018
Session
Data science and machine learning
Location: Expo Hall 1 Level: Intermediate
Secondary topics:  Expo Hall
Siddha Ganju (Deep Vision)
Siddha Ganju explains how the FDL lab at NASA uses artificial intelligence to automate meteor identification from meteor shower images at better-than-human accuracy, to recover known meteor shower streams, and to characterize previously unknown meteor showers from orbital data. The research aims to provide more warning time for long-period comet impacts. Read more.

11:50am

11:50am–12:30pm Wednesday, 03/07/2018
Secondary topics:  Data Integration and Data Pipelines
Eugene Kirpichov (Google)
Average rating: ****.
(4.75, 4 ratings)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Dan Crankshaw (UC Berkeley RISELab)
Average rating: ****.
(4.25, 4 ratings)
Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Sponsored
Location: LL20 B
Vijay Kotu (Oath)
Average rating: ****.
(4.67, 3 ratings)
Vijay Kotu details how Oath is using MicroStrategy to combine elements of data science, enterprise mobility, information design, and data lakes in its transformation into an intelligent enterprise. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL20 C Level: Non-technical
Miryung Kim (UCLA), Muhammad Gulzar (UCLA)
Average rating: ***..
(3.50, 2 ratings)
Even though we know that there are more data scientists in the workforce today, neither what those data scientists actually do nor what we even mean by data scientists has been studied quantitatively. Miryung Kim and Muhammad Gulzar share the results of a large-scale survey with 793 professional data scientists and detail several trends about data scientists in the software engineering context. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Average rating: *****
(5.00, 1 rating)
When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Andrew Mattarella-Micke shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices he's learned along the way. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Sponsored
Location: LL21 A
Average rating: ***..
(3.00, 1 rating)
Companies that want to become truly digital must take a journey of three steps: data transformation, data science transformation, and digital transformation. This also requires transforming the business with machine learning to fundamentally change the relationship with customers. Seth Dobrin explains the detailed steps along the way to digital transformation—and the pitfalls. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Big data and data science in the cloud, Data science and machine learning
Location: LL21 B Level: Intermediate
Sergey Ermolin (Intel), Shivaram Venkataraman (Microsoft Research)
Average rating: ***..
(3.00, 1 rating)
The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Big data and data science in the cloud, Data engineering and architecture
Location: LL21 C/D Level: Intermediate
Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)
Average rating: ****.
(4.00, 1 rating)
Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Manu Mukerji (8x8)
Average rating: ****.
(4.22, 9 ratings)
Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; issues that arise when applying ML at scale in production; lessons learned; and more. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Strata Business Summit
Location: 210 A/E Level: Non-technical
Yishay Carmiel (IntelligentWire | Spoken Labs)
Average rating: ****.
(4.00, 3 ratings)
One of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean and to take particular actions based on that information. This goal is the essence of conversational AI. Yishay Carmiel explores the latest breakthroughs and revolutions in this field and the challenges still to come. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Sponsored
Location: 210 B/F
Billy Liu (Kyligence)
As organizations look to scale their analytics capability, the need to grow beyond a traditional data warehouse becomes critical, and cloud-based solutions allow more flexibility while being more cost efficient. Billy Liu offers an overview of Kyligence Cloud, a managed Apache Kylin online service designed to speed up mission-critical analytics at web scale for big data. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Beginner
Matthew Granade (Domino Data Lab)
Average rating: **...
(2.00, 1 rating)
Predictive analytics and artificial intelligence have become critical competitive capabilities. Yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Ari Gesher (Kairos Aerospace)
A warming planet needs precise, localized predictions about the effects of climate change to support good long-term and medium-term economic decision making. Ari Gesher demonstrates how to use a mix of physical simulation, enhanced scientific models, machine learning verification, and high-scale computing to predict and package climate predictions as data products. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Secondary topics:  Graphs and Time-series
William Chambers (Databricks), Michael Armbrust (Databricks)
Average rating: ****.
(4.60, 5 ratings)
William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Sponsored
Location: 230 B
Adam Ahringer (Disney-ABC TV Digital Media)
Average rating: ***..
(3.20, 5 ratings)
Adam Ahringer explains how Disney-ABC TV leverages Amazon Kinesis and MemSQL to provide real-time insights based on user telemetry as well as the platform for traditional data warehousing activities. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: 230 C Level: Intermediate
Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)
Average rating: ****.
(4.00, 4 ratings)
Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems. Read more.
11:50am–12:30pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: Expo Hall 1 Level: Intermediate
Secondary topics:  Expo Hall
David Talby (Pacific AI), Santosh Kulkarni (Kaiser Permanente)
Average rating: ***..
(3.50, 2 ratings)
David Talby and Santosh Kulkarni explain how Kaiser Permanente uses the open source NLP library for Apache Spark to tackle one of the most common challenges with applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline. Read more.

12:30pm

12:30pm–1:50pm Wednesday, 03/07/2018
Event
Location: Hall 1, 2, 3
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.
12:30pm–1:50pm Wednesday, 03/07/2018
Event
Location: Almaden Ballroom, San Jose Hilton
Average rating: *****
(5.00, 2 ratings)
Want to network with other women attending Strata Data Conference? Then be sure to come to the Women in Big Data Luncheon on Wednesday. Read more.
12:30pm–1:50pm Wednesday, 03/07/2018
Event
Location: San Jose Ballroom, Marriott
Join fellow executives, business leaders, and strategists for a networking lunch on Wednesday for Strata Business Summit attendees and speakers. Read more.

1:50pm

1:50pm–2:30pm Wednesday, 03/07/2018
Secondary topics:  Data Integration and Data Pipelines
Jordan Hambleton (Cloudera), Guru Medasani (Domino Data Lab)
Average rating: ****.
(4.25, 4 ratings)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Big data and data science in the cloud, Data science and machine learning
Location: LL20 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Alexandra Gunderson (Arundo Analytics)
Average rating: *****
(5.00, 1 rating)
Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks or even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Sponsored
Location: LL20 B
Ted Dunning (MapR Technologies)
Average rating: ****.
(4.67, 3 ratings)
Getting value from data at large scale and on a variety of time scales is hard. True, it's not as hard as it used to be, but you still don’t win by default. Ted Dunning explains why it takes good design, the right technology, and a pragmatic approach to succeed. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL20 C Level: Advanced
Secondary topics:  Graphs and Time-series
Andrew Ray (Sam’s Club Technology)
Average rating: ***..
(3.00, 3 ratings)
Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL20 D Level: Intermediate
Weisheng Xie (Orange Finance), Peng Meng (Intel)
Average rating: *****
(5.00, 1 rating)
Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on with Spark ML and introduce the methodology behind Intel's work on SparkML optimization. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL21 B Level: Intermediate
Secondary topics:  Graphs and Time-series
Kyle Grove (Teradata)
Average rating: *****
(5.00, 5 ratings)
Kyle Grove explains how Teradata and some of the world’s largest financial institutions are innovating credit risk ranking with deep learning techniques and AnalyticOps. With the AnalyticOps framework, these organizations have built models with increased accuracy to drive more profitable lending decisions while being explainable to regulators. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: LL21 C/D Level: Intermediate
Kurt Brown (Netflix)
Average rating: ****.
(4.19, 16 ratings)
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Zhen Fan (JD.com), Wei Ting Chen (Intel Corporate)
Average rating: ****.
(4.00, 4 ratings)
Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Non-technical
Frances Haugen (Pinterest), Patrick Phelps (Pinterest)
Average rating: ****.
(4.67, 3 ratings)
Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Law, ethics, and governance, Strata Business Summit
Location: 210 C/G Level: Beginner
John Mertic (The Linux Foundation), Maryna Strelchuk (ING)
John Mertic and Maryna Strelchuk detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share how companies like ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Strata Business Summit
Location: 210 D/H Level: Beginner
Ayin Vala (Foundation for Precision Medicine)
Average rating: ****.
(4.33, 3 ratings)
Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Debasish Ghosh (Lightbend)
Average rating: ***..
(3.33, 3 ratings)
Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Sponsored
Location: 230 B
Advanced analytics and AI workloads require a scalable and optimized architecture, from hardware and storage to software and applications. Kevin Huiskes and Radhika Rangarajan share best practices for accelerating analytics and AI and explain how businesses globally are leveraging Intel’s technology portfolio, along with optimized frameworks and libraries, to build AI workloads at scale. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Siddharth Teotia (Dremio)
Average rating: *****
(5.00, 1 rating)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow. Read more.
1:50pm–2:30pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: Expo Hall 1 Level: Non-technical
Secondary topics:  Expo Hall
Harish Doddi (Datatron Technologies), Jerry Xu (Datatron Technologies)
Average rating: ****.
(4.00, 3 ratings)
Deploying machine learning models and deep learning models in production is hard. Harish Doddi and Jerry Xu outline the enterprise data science lifecycle, covering how production model deployment flow works, challenges, best practices, and lessons learned. Along the way, they explain why monitoring models in production should be mandatory. Read more.

2:40pm

2:40pm–3:20pm Wednesday, 03/07/2018
Secondary topics:  Data Integration and Data Pipelines
Sean Ma (Trifacta)
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Ma discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Secondary topics:  Graphs and Time-series
Baron Schwartz (VividCortex)
Average rating: ****.
(4.80, 5 ratings)
Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Sponsored
Location: LL20 B
Guy Ernest (Amazon Web Services)
Average rating: ****.
(4.50, 4 ratings)
Amazon SageMaker is a platform to build, train, and deploy machine learning models at any scale. Guy Ernest explores the scalable algorithms that SageMaker provides, distributed training with Apache MXNet and TensorFlow, automatic tuning of hyperparameters, and model deployments. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Data engineering and architecture, Data science and machine learning
Location: LL20 C Level: Intermediate
Evan Sparks (Determined AI)
Average rating: *****
(5.00, 1 rating)
Deep learning has shown tremendous improvements in a number of areas and has justifiably generated enormous excitement. However, several key challenges—from prohibitive hardware requirements to immature software offerings—are impeding widespread enterprise adoption. Evan Sparks details fundamental challenges facing organizations looking to adopt deep learning and shares possible solutions. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Big data and data science in the cloud, Data science and machine learning
Location: LL20 D Level: Intermediate
Joseph Bradley (Databricks)
Average rating: *****
(5.00, 1 rating)
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Sponsored
Location: LL21 A
Chuck Yarbrough (Hitachi Vantara)
Intelligently managing the data pipeline is the key to driving business acceleration and reducing costs. Chuck Yarbrough outlines ways to gain control over the data pipeline. Along the way, you’ll learn how cloud, big data, and machine learning models intersect and how streaming and cloud integration can help create the connected enterprise. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)
Average rating: ***..
(3.50, 2 ratings)
Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: LL21 C/D Level: Intermediate
Mark Grover (Lyft), Arup Malakar (Lyft)
Average rating: ****.
(4.00, 2 ratings)
Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: LL21 E/F Level: Non-technical
Ellen Friedman (MapR Technologies)
Average rating: ****.
(4.43, 7 ratings)
DataOps—a culture and practice for building data-intensive applications, including machine learning pipelines—expands DevOps philosophy to include data-heavy roles such as data engineering and data science. DataOps is based on cross-functional collaboration resulting in fast time to value and an agile workflow. Ellen Friedman offers an overview of DataOps and explains how to implement it. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Strata Business Summit
Location: 210 A/E
Michael Chui (McKinsey Global Institute)
Average rating: ****.
(4.83, 6 ratings)
After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Non-technical
Katie Malone (Civis Analytics), Skipper Seabold (Civis Analytics)
Average rating: *****
(5.00, 1 rating)
A huge challenge for data science managers is determining priorities for their teams, which often have more good ideas than they have time. Katie Malone and Skipper Seabold share a framework that their large and diverse data science team uses to identify, discuss, select, and manage data science projects for a fast-moving startup. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Law, ethics, and governance, Strata Business Summit
Location: 210 D/H Level: Beginner
Or Herman-Saffar (Dell), Ran Taig (Dell EMC)
Average rating: *....
(1.67, 3 ratings)
What if we could predict when and where crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly available dataset of reported crime incidents in Chicago since 2001. Or and Ran explain how to use this data to explore past crimes, find interesting trends, and make predictions for the future. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Henry Cai (Pinterest), Yi Yin (Pinterest)
Average rating: ***..
(3.00, 1 rating)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Sponsored
Location: 230 B
Dave Abercrombie (Sharethrough)
Average rating: ***..
(3.50, 2 ratings)
Dave Abercrombie explains how Sharethrough used Snowflake to build an analytic and reporting platform that handles petabyte-scale data with ease. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Session
Big data and data science in the cloud, Data engineering and architecture
Location: 230 C Level: Intermediate
Ritesh Agrawal (Uber), Anirban Deb (Uber)
Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money. Read more.
2:40pm–3:20pm Wednesday, 03/07/2018
Secondary topics:  Expo Hall, Graphs and Time-series
Yu Xu (TigerGraph)
Average rating: *****
(5.00, 2 ratings)
Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups. Read more.

3:20pm

3:20pm–4:20pm Wednesday, 03/07/2018
Location: Hall 1, 2, 3
Afternoon break sponsored by IBM (1h)

4:20pm

4:20pm–5:00pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: 212 A-B
Secondary topics:  Data Integration and Data Pipelines
Dorna Bandari (Jetlore)
Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL20 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Joseph Richards (GE Digital)
Average rating: *****
(5.00, 1 rating)
Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Sponsored
Location: LL20 B
Procter & Gamble relies heavily on data, particularly for BI. Running compute where the data lives is critical for performance, and the company has found added benefits to this architecture, which complements its Hadoop and BI needs. Terry McFadden offers an overview of P&G's modern analytics architecture and explains how it differs from traditional approaches. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Big data and data science in the cloud, Data science and machine learning
Location: LL20 C Level: Intermediate
Secondary topics:  Graphs and Time-series
Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)
Average rating: ****.
(4.00, 1 rating)
Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing it to be rapidly accessed. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL20 D Level: Intermediate
Rachita Chandra (IBM Watson Health)
Average rating: ***..
(3.00, 1 rating)
Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL21 B Level: Intermediate
Secondary topics:  Graphs and Time-series
Andrea Pasqua (Uber), Anny Chen (Uber)
Average rating: ****.
(4.60, 5 ratings)
Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Carlo Torniai (Pirelli Tyre)
Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of contributions across cross-functional teams. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Data engineering and architecture, Visualization and user experience
Location: LL21 E/F Level: Beginner
Rahim Daya (Pinterest)
Average rating: ***..
(3.50, 4 ratings)
Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user. Read more.
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Intermediate
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
Average rating: ****.
(4.67, 3 ratings)
Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams that can leverage them effectively to achieve ROI with your data initiatives. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Strata Business Summit
Location: 210 C/G
Moderated by:
Lisha Li (Amplify Partners)
Panelists:
Katherine Boyle (General Catalyst), Wayne Hu (SignalFire), Andrew Parker (Spark Capital), Brandon Reeves (Lux Capital)
Average rating: ****.
(4.00, 1 rating)
To anticipate who will succeed and to invest wisely, investors spend a lot of time trying to understand the longer-term trends within an industry. In this panel discussion, top-tier VCs look over the horizon to consider the big trends in how data is being put to work in startups and share what they think the field will look like in a few years (or more). Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 D/H Level: Intermediate
Kapil Surlaker (LinkedIn), Ya Xu (LinkedIn)
Average rating: *****
(5.00, 3 ratings)
Metrics measurement and experimentation play crucial roles in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/07/2018
Secondary topics:  Graphs and Time-series
Sijie Guo (Streamlio)
Average rating: ***..
(3.67, 3 ratings)
Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Sponsored
Location: 230 B
Alexander Ryabov (Wargaming), Jonathan Crow (Wargaming)
Alexander Ryabov and Jonathan Crow explain how Wargaming is winning the battle for bigger profits in the virtual world of online gaming using a best-in-class business intelligence solution to equip its business units with decision-making tools. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: 230 C Level: Intermediate
Gian Merlino (Imply)
Average rating: ****.
(4.00, 2 ratings)
Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects. Read more.
Add to your personal schedule
4:20pm–5:00pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: Expo Hall 1
Secondary topics:  Expo Hall
One of the key application domains leveraging live data is smart cities, but success depends on the availability of generic platforms that support high throughput and ultralow latency. Arun Kejariwal and Francois Orsini offer an overview of Satori's live data platform and walk you through a country-scale case study of its implementation. Read more.

5:10pm

Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Big data and data science in the cloud, Data engineering and architecture
Location: 212 A-B Level: Non-technical
Secondary topics:  Data Integration and Data Pipelines
Abe Gong (Superconductive Health), James Campbell (USG)
Average rating: *****
(5.00, 4 ratings)
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Secondary topics:  Graphs and Time-series
Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
Average rating: *****
(5.00, 8 ratings)
Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: LL20 C Level: Intermediate
Mike Conover (SkipFlag)
Average rating: *****
(5.00, 4 ratings)
Mike Conover offers an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)
Average rating: ***..
(3.00, 2 ratings)
Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one built from the entire dataset. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Big data and data science in the cloud, Data science and machine learning
Location: LL21 B Level: Intermediate
Sergey Ermolin (Intel), Suqiang Song (Mastercard)
Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Data engineering and architecture, Platform security and cybersecurity
Location: LL21 C/D Level: Beginner
Thomas Phelan (BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: LL21 E/F Level: Beginner
Ted Dunning (MapR Technologies)
Average rating: *****
(5.00, 1 rating)
Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot, zero-latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Strata Business Summit
Location: 210 A/E
Alysa Z. Hutnik (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)
Average rating: *****
(5.00, 1 rating)
Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.” Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: 210 B/F Level: Intermediate
Secondary topics:  Data Integration and Data Pipelines
Gwen Shapira (Confluent)
Average rating: *****
(5.00, 3 ratings)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices, and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Non-technical
Stephanie Beben (Booz Allen Hamilton)
How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Stephanie Beben shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.” Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Strata Business Summit
Location: 210 D/H
Derek Ruths (CAI)
Unreasonable sales forecasts, badly overstocked inventory, misguided investments . . . bad analyses happen all the time, leading to bad decisions and costing businesses millions of dollars. Derek Ruths shares the five most common issues that lead to bad data-informed thinking. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Fabian Hueske (data Artisans), Shuyi Chen (Uber)
Average rating: *****
(5.00, 1 rating)
Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Sponsored
Location: 230 B
Average rating: ***..
(3.00, 1 rating)
As the internet of things grows, there is an increasing need for sophisticated but lightweight analytics at the edge. Evan Guarnaccia walks you through a multiphase analytics approach to IoT data, analyzing data at rest to discover patterns of interest and develop analytical models that can be easily deployed into a streaming analytics engine out at the edge, in the fog, or in the cloud. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Data engineering and architecture
Location: 230 C Level: Advanced
Ash Munshi (Pepperdata)
Average rating: *****
(5.00, 1 rating)
Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series. Read more.
Add to your personal schedule
5:10pm–5:50pm Wednesday, 03/07/2018
Session
Data science and machine learning
Location: Expo Hall 1
Secondary topics:  Expo Hall
Rodney Mullen (Almost Skateboards)
Average rating: *****
(5.00, 2 ratings)
The essence of modern skating is learning tricks that couple with specific terrain. Activision’s video game franchise testifies to the nearly endless possibilities. Rodney Mullen offers a nuanced look at how skaters nudge the endpoints of disparate submovements to create new combinations that may shine a different light on ideas in machine learning—plus it’s a lot of fun. Read more.

5:50pm

Add to your personal schedule
5:50pm–6:50pm Wednesday, 03/07/2018
Event
Location: Hall 1, 2, 3
Average rating: *****
(5.00, 1 rating)
Join us for vendor-hosted libations (plus snacks) after sessions on Wednesday. Read more.

7:00pm

Add to your personal schedule
7:00pm–9:30pm Wednesday, 03/07/2018
Event
Location: San Pedro Market
Average rating: *****
(5.00, 3 ratings)
Join us at San Pedro Square Market for an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata in San Jose. Read more.

Thursday, 03/08/2018

8:00am

Add to your personal schedule
8:00am–8:30am Thursday, 03/08/2018
Event
Location: Concourse foyer
Average rating: ****.
(4.00, 1 rating)
Gather before keynotes on Thursday morning for a speed networking event. Enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking. Read more.

8:45am

Add to your personal schedule
8:45am–8:50am Thursday, 03/08/2018
Keynote
Location: Grand Ballroom 220
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Average rating: *****
(5.00, 1 rating)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes. Read more.

8:50am

Add to your personal schedule
8:50am–9:00am Thursday, 03/08/2018
Keynote
Location: Grand Ballroom 220
Janelle Shane (aiweirdness.com)
Average rating: ****.
(4.89, 9 ratings)
At AIweirdness.com, Janelle Shane posts the results of neural network experiments gone delightfully wrong. But machine learning mistakes can also be very embarrassing or even dangerous. Using silly datasets as examples, Janelle talks about some ways that algorithms fail. Read more.

9:00am

Add to your personal schedule
9:00am–9:10am Thursday, 03/08/2018
Keynote
Location: Grand Ballroom 220
Anoop Dawar (MapR Technologies)
Average rating: ***..
(3.20, 5 ratings)
We are inundated with ideas and technology news in today’s data-rich but attention-deficit economy. In this environment, competitive advantage comes not from what is abundant (i.e., data) but from what is scarce—the ability to deploy insights in real time. Anoop Dawar explains how your peers are succeeding in shrinking the insight-to-action cycle and achieving great results. Read more.

9:10am

Add to your personal schedule
9:10am–9:30am Thursday, 03/08/2018
Keynote
Location: Grand Ballroom 220 Level: Intermediate
Eric Colson (Stitch Fix)
Average rating: ****.
(4.50, 10 ratings)
While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate by data science and explains why companies must now think very differently about the role and placement of data science in the organization. Read more.

9:30am

Add to your personal schedule
9:30am–9:40am Thursday, 03/08/2018
Keynote
Location: Grand Ballroom 220
Amr Awadallah (Cloudera), Sangeeth Ponathil (Pizza Hut)
Average rating: **...
(2.62, 8 ratings)
Amr Awadallah explains why the cloud requires a different approach to machine learning and analytics and what you can do about it. Read more.

9:40am

Add to your personal schedule
9:40am–9:45am Thursday, 03/08/2018
Keynote
Location: Grand Ballroom 220
Average rating: ***..
(3.40, 5 ratings)
William Vambenepe explains how a pivot toward machine learning and artificial intelligence has created clearer separation among clouds than ever before. William walks you through an interesting use case of machine learning in action and discusses the central role AI will play in big data analysis moving forward. Read more.

9:45am

Add to your personal schedule
9:45am–10:05am Thursday, 03/08/2018
Keynote
Location: Grand Ballroom 220
Ajey Gore (GO-JEK)
Average rating: ****.
(4.10, 10 ratings)
Ajey Gore details GO-JEK's evolution from a small bike-hailing startup to a technology-focused unicorn in the areas of transportation, lifestyle, payments, and social enterprise and explains how the company is focusing its attention beyond urban Indonesia to impact more than a million people across the country's rural areas. Read more.

10:05am

Add to your personal schedule
10:05am–10:25am Thursday, 03/08/2018
Keynote
Location: Grand Ballroom 220
Seth Stephens-Davidowitz (New York Times)
Average rating: *****
(5.00, 6 ratings)
Seth Stephens-Davidowitz explains how to use Google searches to uncover behaviors or attitudes that may be hidden from traditional surveys, such as racism, sexuality, child abuse, and abortion. Read more.

10:30am

10:30am–11:00am Thursday, 03/08/2018
Location: Hall 1, 2, 3
Morning break (30m)

11:00am

Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Ask Me Anything
Location: 212 A-B
Burcu Baran (LinkedIn), Wei Di (LinkedIn)
Join Burcu Baran and Wei Di to discuss big data in business analytics, machine learning in business analytics, and achieving actionable insights from big data. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Data science and machine learning, Data-driven business management
Location: LL20 A Level: Intermediate
Clare Gollnick (Terbium Labs)
Average rating: ****.
(4.86, 7 ratings)
At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL20 B
Secondary topics:  Graphs and Time-series
Ryan Boyd (Neo4j)
Average rating: *****
(5.00, 1 rating)
Ryan Boyd explains how he and his team reconstructed a subset of the Twitter network of Russian troll accounts and applied graph analytics to the data using the Neo4j graph database to uncover how these accounts were spreading fake news. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL20 C Level: Beginner
Evan Kriminger (ZestFinance)
Average rating: ****.
(4.40, 5 ratings)
What does it mean to explain a machine learning model, and why is it important? Mike Ruberry offers an overview of several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques presents a different perspective, and their clever application can reveal new insights and solve business requirements. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL20 D Level: Intermediate
Josh Wills (Slack)
Average rating: ****.
(4.00, 3 ratings)
Josh Wills describes recent data science and machine learning projects at Slack. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Sponsored
Location: LL21 A
Jim Scott (MapR Technologies)
Average rating: **...
(2.00, 1 rating)
The value of data is not strictly a function of its size; it lies in what can be extracted from it. Jim Scott explains how to identify the right data for monitoring the pulse of fast-changing business environments, the best way to integrate analytics into your business processes, and the importance of cross-application data flows. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL21 B Level: Intermediate
Karthik Ramasamy (Uber), Lenny Evans (Uber)
Average rating: *****
(5.00, 1 rating)
Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Data engineering and architecture
Location: LL21 C/D Level: Intermediate
Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)
Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Big data and data science in the cloud, Data engineering and architecture
Location: LL21 E/F Level: Intermediate
Greg Rahn (Cloudera)
Average rating: ***..
(3.40, 5 ratings)
For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Strata Business Summit
Location: 210 A/E
Mike Olson (Cloudera)
Average rating: ****.
(4.75, 4 ratings)
Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production—including the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands—and outlines proven ways to meet them. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Secondary topics:  Graphs and Time-series
Michael Schrenk (Self-Employed)
Average rating: ****.
(4.00, 5 ratings)
Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Law, ethics, and governance, Strata Business Summit
Location: 210 D/H Level: Beginner
Average rating: **...
(2.00, 1 rating)
Sugreev Chawla offers an overview of Spotlight, a tool created by Thorn, a nonprofit that uses technology to fight online child sexual exploitation. It allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking, using graph analysis, time series analysis, and NLP techniques to surface important networks of ads and characterize their behavior over time. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Tyler Akidau (Google)
Average rating: *****
(5.00, 4 ratings)
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Sponsored
Location: 230 B
Ryan Lippert (Google Cloud)
Average rating: *****
(5.00, 2 ratings)
If your company isn't good at analytics, it's not ready for AI. Ryan Lippert explains how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Data engineering and architecture
Location: 230 C Level: Intermediate
Jiangjie Qin (LinkedIn)
Average rating: ****.
(4.00, 3 ratings)
LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention. Read more.
Add to your personal schedule
11:00am–11:40am Thursday, 03/08/2018
Session
Data engineering and architecture, Streaming systems and real-time applications
Location: Expo Hall 1 Level: Intermediate
Secondary topics:  Expo Hall
Dean Wampler (Lightbend)
Average rating: *****
(5.00, 1 rating)
Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams, microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead. Read more.

11:50am

Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Session
Ask Me Anything
Location: 212 A-B
Dr. Vijay Srinivas Agneeswaran (SapientRazorfish), Abhishek Kumar (SapientRazorfish)
Join Vijay Srinivas Agneeswaran and Abhishek Kumar to discuss recommender systems—particularly deep learning-based recommender systems in TensorFlow—or ask any other questions you have about deep learning. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Session
Data science and machine learning, Platform security and cybersecurity
Location: LL20 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))
Average rating: ****.
(4.00, 1 rating)
How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Session
Data engineering and architecture
Location: LL20 B Level: Beginner
Umur Cubukcu (Citus Data)
Average rating: ****.
(4.00, 3 ratings)
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you'll learn how PostgreSQL's extension APIs are fueling innovations in relational databases. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL20 C Level: Intermediate
Stephen O'Sullivan (Data Whisperers)
Average rating: ****.
(4.25, 4 ratings)
Stephen O'Sullivan takes you along the data science journey, from onboarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL20 D
Average rating: ***..
(3.00, 3 ratings)
Ashivni Shekhawat explains how Lyft uses a mix of online learning, optimization, and control theory to operate its ride-sharing marketplace at an efficient price point. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Session
Data engineering and architecture
Location: LL21 A Level: Intermediate
Kurt Brown (Netflix)
Average rating: *****
(5.00, 2 ratings)
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Jennie Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)
Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Dong Meng (MapR)
Average rating: ***..
(3.33, 3 ratings)
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Average rating: ****.
(4.50, 2 ratings)
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Intermediate
David Talby (Pacific AI)
Average rating: ***..
(3.50, 4 ratings)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Beginner
Paco Nathan (O'Reilly Media)
Average rating: ****.
(4.25, 4 ratings)
Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 D/H Level: Intermediate
Average rating: *****
(5.00, 2 ratings)
With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components. Read more.
Add to your personal schedule
11:50am–12:30pm Thursday, 03/08/2018
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)
Average rating: ****.
(4.00, 1 rating)
Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and demonstrate how your applications will benefit from using them. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Session
Data engineering and architecture
Location: 230 C Level: Intermediate
Secondary topics:  Graphs and Time-series
Alexis Roos (Salesforce), Noah Burbank (Salesforce)
Average rating: ***..
(3.00, 1 rating)
In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time to surface insights and recommendations using a graph that models contextual data. Read more.
11:50am–12:30pm Thursday, 03/08/2018
Location: Expo Hall 1
TBC

12:30pm

12:30pm–1:50pm Thursday, 03/08/2018
Event
Location: Hall 1, 2, 3
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.
12:30pm–1:50pm Thursday, 03/08/2018
Event
Location: San Jose Ballroom, Marriott
Join Strata Business Summit speakers and attendees for a networking lunch on Thursday. Read more.

1:50pm

1:50pm–2:30pm Thursday, 03/08/2018
Session
Ask Me Anything
Location: 212 A-B
Nick Elprin (Domino Data Lab)
Join Nick Elprin to discuss the challenges associated with evolving from random acts of data science to data science as a core competency; common pitfalls and best practices for implementing process, hiring people, and deploying diverse technology; designing and running data science organizations; and more. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Session
Big data and data science in the cloud, Data science and machine learning
Location: LL20 A Level: Intermediate
Ram Sriharsha (Databricks)
Average rating: ****.
(4.75, 4 ratings)
How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Session
Data engineering and architecture
Location: LL20 B Level: Intermediate
Josef Viehhauser (BMW Group), Tobias Bürger (BMW Group)
Average rating: *****
(5.00, 1 rating)
The BMW Group IT team drives the usage of data-driven technologies and forms the nucleus of a data-centric culture inside of the organization. Josef Viehhauser and Tobias Bürger discuss the E-to-E relationship of data and models and share best practices for scaling applications in real-world environments. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Session
Data science and machine learning, Data-driven business management
Location: LL20 C Level: Non-technical
Veronica Mapes (Pinterest), Garner Chung (Pinterest)
Average rating: *****
(5.00, 3 ratings)
Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Jennifer Prendki (Atlassian)
Average rating: ***..
(3.00, 1 rating)
Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Session
Sponsored
Location: LL21 A
Jeff Smits (RingCentral)
Jeff Smits explains how RingCentral is utilizing the cloud, data integration, self-service, and APIs to harvest the immense potential of connected systems. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Location: LL21 B
TBC
1:50pm–2:30pm Thursday, 03/08/2018
Emre Velipasaoglu (Lightbend)
Average rating: ****.
(4.00, 1 rating)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Session
Data engineering and architecture
Location: LL21 E/F Level: Intermediate
Chris Harland (Textio)
The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Session
Law, ethics, and governance, Strata Business Summit
Location: 210 A/E Level: Intermediate
Mark Donsky (Cloudera), Steven Ross (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Session
Strata Business Summit
Location: 210 C/G
Alex Rosenblat (Data & Society Research Institute)
Average rating: *****
(5.00, 1 rating)
Ride-hail drivers work alone, but they’re banding together online to compare notes, uncover new policies, and help each other navigate a workplace characterized by information scarcity. Alex Rosenblat explores how ride-hail workers are using online forums to create their own workplace culture as employment relationships grow more remote and algorithms replace human managers. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 D/H Level: Intermediate
Chris Chapo (Gap Inc.)
Average rating: ****.
(4.20, 5 ratings)
Chris Chapo walks you through real-world examples of companies that are driving transformational change by leveraging data science and analytics, paying particular attention to established organizations where these capabilities are newer concepts. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Session
Data engineering and architecture
Location: 230 A Level: Intermediate
Secondary topics:  Graphs and Time-series
Michael Freedman (TimescaleDB)
Average rating: ****.
(4.50, 4 ratings)
Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads yet engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries. Read more.
1:50pm–2:30pm Thursday, 03/08/2018
Holden Karau (Google), Rachel Warren (Salesforce Einstein)
Average rating: ***..
(3.40, 5 ratings)
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka). Read more.
1:50pm–2:30pm Thursday, 03/08/2018 Secondary topics:  Expo Hall, Graphs and Time-series
Roy Ben-Alta (Amazon Web Services), Ira Cohen (Anodot)
Average rating: *****
(5.00, 1 rating)
Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution. Read more.

2:40pm

2:40pm–3:20pm Thursday, 03/08/2018
Session
Ask Me Anything
Location: 212 A-B
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Average rating: *****
(5.00, 1 rating)
Join Dean Wampler and Boris Lublinsky to discuss all things streaming, from architecture and implementation to streaming engines and frameworks. Be sure to bring your questions about techniques for serving machine learning models in production, traditional big data systems, or software architecture in general. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL20 A Level: Intermediate
Ian Cook (Cloudera)
Average rating: ****.
(4.75, 4 ratings)
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Sponsored
Location: LL20 B
Ivan Jibaja (Pure Storage)
Pure Storage redefined QA testing. Using open source technologies like Spark and Kafka, the company deployed a streaming big data analytics pipeline that processes over 70 billion events per day to prioritize, classify, deduplicate, and understand test failures. Ivan Jibaja discusses use cases for big data analytics technologies, the underlying elastic infrastructure, and lessons learned. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Data science and machine learning, Law, ethics, and governance
Location: LL20 C Level: Intermediate
Pramit Choudhary (DataScience.com)
Average rating: *****
(5.00, 3 ratings)
Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if...and else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL20 D Level: Intermediate
Simon Hughes (Dice.com), Yuri Bykov (Dice.com)
Average rating: ****.
(4.00, 1 rating)
Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Sponsored
Location: LL21 A
Tendu Yogurtcu (Syncsort)
Average rating: *....
(1.00, 1 rating)
Chefs must be able to trust the authenticity, quality, and origin of their ingredients; data analysts must be able to do the same of their data—and what happens to it along the way. Tendü Yoğurtçu explains how to seamlessly track the lineage and quality of your data—on and off the cluster, on-premises or in the cloud—to deliver meaningful insights and meet regulatory compliance requirements. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL21 B Level: Intermediate
Patrick Harrison (S&P Global)
Average rating: ****.
(4.33, 3 ratings)
Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Big data and data science in the cloud, Data engineering and architecture
Location: LL21 C/D Level: Intermediate
Michelle Casbon (Google Cloud Platform Developer Relations)
Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Ajay Mothukuri (Sapient), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
Ajay Mothukuri and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR). Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Non-technical
Anjali Thakur (Accenture)
Average rating: *....
(1.00, 5 ratings)
Whether you are a technology or a services provider, understanding your value in the ecosystem and focusing on the right partners to reach your market goals is critical. Anjali Thakur shares examples of teaming models and leading practices for accelerating value from your ecosystem strategy. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 C/G Level: Beginner
Brian Karfunkel (Pinterest)
Average rating: ****.
(4.50, 2 ratings)
When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Matt Derda (Trifacta), Harrison Lynch (Consensus Corporation)
Average rating: **...
(2.00, 1 rating)
Matt Derda and Harrison Lynch explain how Consensus leverages the combined power of data wrangling and machine learning to more efficiently identify and reduce retail fraud and how adopting data wrangling technology has helped Trifacta reduce time spent data wrangling from six weeks to one week. Read more.
2:40pm–3:20pm Thursday, 03/08/2018 Secondary topics:  Graphs and Time-series
Fabian Hueske (data Artisans), Flavio Junqueira (Dell EMC)
Average rating: ***..
(3.33, 3 ratings)
Flavio Junqueira and Fabian Hueske detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The stack offers a new way of handling “everything as a stream”: it provides unbounded streaming storage and a unified batch and streaming abstraction, and it dynamically accommodates workload variations. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Sponsored
Location: 230 B
Andreas Pfadler (TalkingData)
Andreas Pfadler offers an overview of current technological trends for on-device deep learning and edge computing. Along the way, Andreas explores major players and platforms and computational challenges and solutions. Andreas concludes with a discussion of TalkingData's vision for the future of mobile deep learning. Read more.
2:40pm–3:20pm Thursday, 03/08/2018
Session
Big data and data science in the cloud, Data engineering and architecture
Location: 230 C Level: Intermediate
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Average rating: *****
(5.00, 3 ratings)
Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies. Read more.
2:40pm–3:20pm Thursday, 03/08/2018 Secondary topics:  Expo Hall
Chris Fregly (PipelineAI)
Average rating: *****
(5.00, 1 rating)
Chris Fregly demonstrates how to extend existing Spark-based data pipelines to include TensorFlow model training and deployment and offers an overview of TensorFlow’s TFRecord format, including libraries for converting to and from other popular file formats such as Parquet, CSV, JSON, and Avro stored in HDFS and S3. Read more.

3:20pm

3:20pm–4:20pm Thursday, 03/08/2018
Location: Hall 1, 2, 3
Afternoon break (1h)

4:20pm

4:20pm–5:00pm Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL20 A Level: Intermediate
Keno Fischer (Julia Computing)
Julia is rapidly becoming a popular language at the forefront of scientific discovery. Keno Fischer explores one of the most ambitious use cases for Julia: using machine learning to derive a catalog of astronomical objects from multiterabyte astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Location: LL20 C
TBC
4:20pm–5:00pm Thursday, 03/08/2018
Session
Big data and data science in the cloud, Data science and machine learning
Location: LL20 D Level: Intermediate
Goodman Gu (Atlassian)
Average rating: *****
(5.00, 3 ratings)
Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Session
Data science and machine learning
Location: LL21 B
Jeff Dean (Google)
Average rating: ****.
(4.89, 9 ratings)
The Google Brain team conducts research on difficult problems in artificial intelligence and builds large-scale computer systems for machine learning research, both of which have been applied to dozens of Google products. Jeff Dean highlights some of Google Brain's projects with an eye toward how they can be used to solve challenging problems. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)
Average rating: *****
(5.00, 1 rating)
There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Session
Big data and data science in the cloud, Data engineering and architecture
Location: LL21 E/F Level: Intermediate
Shenghu Yang (Lyft)
Average rating: *****
(5.00, 1 rating)
Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 A/E Level: Non-technical
Jesse Anderson (Big Data Institute)
Average rating: ****.
(4.00, 1 rating)
There's been an explosion of new architectures, but is this because engineers love new things or is there a good business reason for these changes? Jesse Anderson explores new architectures and the actual business problems they solve. You may find out that your team would be far more productive if you moved to these architectures. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Felix Gorodishter (GoDaddy)
Average rating: ***..
(3.00, 2 ratings)
GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Session
Data-driven business management, Strata Business Summit
Location: 210 D/H Level: Beginner
Marcin Pilarczyk (Ryanair)
Average rating: *****
(5.00, 2 ratings)
Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management at a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Matteo Merli (Streamlio)
Average rating: *....
(1.00, 1 rating)
Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty. Read more.
4:20pm–5:00pm Thursday, 03/08/2018
Session
Big data and data science in the cloud, Data engineering and architecture
Location: 230 C Level: Intermediate
Tomer Kaftan (University of Washington)
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time. Read more.