Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

If you are looking for slides and video from 2017, visit the Strata conference in San Jose 2017 site.

Data engineering and architecture
Kurt Brown (Netflix)
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.
Big data and data science in the cloud, Data engineering and architecture
Philip Langdale (Cloudera), Eugene Fratkin (Cloudera), Vinithra Varadharajan (Cloudera), Mala Ramakrishnan (Cloudera)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Vinithra Varadharajan, Philip Langdale, and Eugene Fratkin lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.
Big data and data science in the cloud, Data science and machine learning, Data-driven business management
Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Somit Gupta (Microsoft), Paul Raff (Microsoft)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Pavel Dmitriev, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.
Big data and data science in the cloud, Data science and machine learning
Shivaram Venkataraman (UC Berkeley), Sergey Ermolin (Intel), Ding Ding (Intel)
The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman, Sergey Ermolin, and Ding Ding outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.
Data engineering and architecture
Kinnary Jangla (Pinterest)
Having trouble coordinating development of your production ML system among a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.
Data engineering and architecture, Data-driven business management, Law, ethics, and governance
Ajay Mothukuri (Sapient), Arunkumar Ramanatha (Sapient), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
Ajay Mothukuri, Arunkumar Ramanatha, and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR).
Law, ethics, and governance, Strata Business Summit
Or Herman-Saffar (Dell), Ran Taig (Dell)
What if we could predict when and where the next crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly published dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future.
Big data and data science in the cloud, Data engineering and architecture
Greg Rahn (Cloudera)
For many organizations, the cloud will likely be the destination of their next big data warehouse. Greg Rahn shares considerations when evaluating the cloud for analytics and big data warehousing in order to help you get the most from the cloud. You'll leave with an understanding of different architectural approaches and impacts for moving analytic workloads to the cloud.
Data science and machine learning
Brooke Wenig (Databricks)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
Data engineering and architecture, Streaming systems and real-time applications
Debasish Ghosh (Lightbend)
Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and how they can be used to implement solutions for the fast and streaming architectures.
Data-driven business management, Strata Business Summit
With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components.
Big data and data science in the cloud, Data science and machine learning
Jennie Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)
Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.
Data science and machine learning, Data-driven business management, Visualization and user experience
Wayde Fleener (General Mills)
Decision makers are busy. Businesses can hire people to analyze data for them, but most companies are resource constrained and can’t hire a small army to look through all their data. Wayde Fleener explains how General Mills implemented automation to enable decision makers to quickly focus on the metrics that matter and cut through everything else that does not.
Data science and machine learning
Siddha Ganju (Deep Vision)
Siddha Ganju explains how the FDL lab at NASA uses artificial intelligence to improve and automate the identification of meteors above human-level performance using meteor shower images, recover known meteor shower streams, and characterize previously unknown meteor showers using orbital data. The research is aimed at providing more warning time for long-period comet impacts.
Big data and data science in the cloud, Data science and machine learning
Joseph Bradley (Databricks)
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.
Data engineering and architecture
Ted Dunning (MapR Technologies)
Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot, zero-latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime.
Data science and machine learning, Data-driven business management, Strata Business Summit
Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.
Rizwan Patel (Caesars Entertainment)
Rizwan Patel explains how the gaming industry can leverage Cloudera’s big data platform to adapt to the change in patron dynamics (both in terms of demographics as well as in spending patterns) to create a new paradigm for customer (micro) segmentation.
Strata Business Summit
Mauro Damo (Dell EMC), Wei Lin (Dell EMC)
Image recognition classification of diseases will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identify bladder cancer in patients using nonsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.
Data science and machine learning
Daniel Lurie (Pinterest)
All successful startups thrive on tight product-market fit, which can produce homogeneous initial user bases. To become the next big thing, your user base will need to diversify and your product must change to accommodate new needs. Daniel Lurie explains how Pinterest leverages external data to measure racial and income diversity in its user base and changed its user modeling to drive growth.
Data engineering and architecture
Alexis Roos (Salesforce), Noah Burbank (Salesforce)
In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Alexis Roos and Noah Burbank explain how Salesforce uses data science and engineering to enable salespeople to monitor their emails in real time to surface insights and recommendations using a graph that models contextual data.
Data-driven business management, Strata Business Summit
Katie Malone (Civis Analytics), Skipper Seabold (Civis Analytics)
A huge challenge for data science managers is determining priorities for their teams, which often have more good ideas than they have time. Katie Malone and Skipper Seabold share a framework that their large and diverse data science team uses to identify, discuss, select, and manage data science projects for a fast-moving startup.
Data engineering and architecture, Data science and machine learning, Data-driven business management
Rajiv Synghal (Kaiser Permanente)
As healthcare data becomes increasingly digitized, medical centers are able to leverage data in new ways to improve patient care. Each year, as many as 49,000 people die from the flu in the US alone. Rajiv Synghal explains how Kaiser Permanente developed a sophisticated flu predictor model to better determine where resources were needed and how to reduce outbreaks.
Data science and machine learning
Simon Hughes (Dice.com), Yuri Bykov (Dice.com)
Dice.com recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon Hughes and Yuri Bykov offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production.
Big data and data science in the cloud, Data engineering and architecture
Jorge A. Lopez (Amazon Web Services)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.
Data engineering and architecture
Dorna Bandari (Jetlore)
Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.
Data science and machine learning
Keno Fischer (Julia Computing)
Julia is rapidly becoming a popular language at the forefront of scientific discovery. Keno Fischer explores one of the most ambitious use cases for Julia: using machine learning to derive a catalog of astronomical objects from multiterabyte astronomical image datasets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing.
Data engineering and architecture
Ash Munshi (Pepperdata)
Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and are the first approach to classify multivariate time series.
Big data and data science in the cloud, Data engineering and architecture
Tom Fisher (MapR Technologies)
The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to the next generation platforms and applications is the challenge today. Tom Fisher explores technological approaches and solutions that make this possible while delivering data-driven applications and operations.
Keynotes
Details to come.
Big data and data science in the cloud, Data science and machine learning
Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)
Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably on distributed graph databases over the web while allowing them to be rapidly accessed.
Big data and data science in the cloud, Data engineering and architecture
Michelle Casbon (Qordoba)
Michelle Casbon explains how to speed up the development of ML models by using open source tools such as Kubernetes, Docker, Scala, Apache Spark, and Weave Flux, detailing how to build resilient systems so that you can spend more of your time on product improvement rather than triage and uptime.
Data science and machine learning, Streaming systems and real-time applications
Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data, focusing on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs.
Data science and machine learning, Visualization and user experience
James Bednar (Anaconda), Philipp Rudiger (Anaconda)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Python lets you solve data science problems by stitching together packages from its ecosystem, but it can be difficult to choose packages that work well together. James Bednar and Philipp Rudiger walk you through a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints—all in just 30 lines of Python code.
Big data and data science in the cloud, Data engineering and architecture
Tomer Kaftan (University of Washington)
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.
Data engineering and architecture
Chris Harland (Textio)
The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models, creating a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, Chris Harland walks you through building a data product.
Big data and data science in the cloud, Data engineering and architecture
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Most organizations manage 5 to 15 copies of their data in multiple systems and formats to support different analytical use cases. Tomer Shiran and Jacques Nadeau introduce a new approach called data reflections, which dramatically reduces the need for data copies, demonstrate an open source implementation built with Apache Calcite, and explore two production case studies.
Brian Bloechle (Cloudera, Inc.)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.
Data science and machine learning
Josh Wills (Slack)
Josh Wills describes recent data science and machine learning projects at Slack.
Angie Ma (ASI)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.
Media, entertainment, and advertising, Strata Business Summit
Ray Bernard (SuprFanz), Jennifer Webb (SuprFanz)
Ray Bernard and Jennifer Webb explain how cloud-based marketing company SuprFanz uses data science techniques and graph theory with Neo4j to generate live event attendance from social media platforms, email, and SMS.
Data-driven business management, Strata Business Summit
Matt Derda (Trifacta), Jonathon Whitton (PRGX USA)
PRGX is a global leader in recovery audit and source-to-pay (S2P) analytics services, serving around 75% of the top 20 global retailers. Matt Derda and Jonathon Whitton explain how PRGX uses Trifacta and Cloudera to scale current processes and increase revenue for the products and services it offers clients.
Data-driven business management, Strata Business Summit
Marcin Pilarczyk (Ryanair)
Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management of a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions.
Data engineering and architecture
Crystal Valentine (MapR Technologies)
DataOps—a methodology for developing and deploying data-intensive applications, especially those involving data science and machine learning pipelines—supports cross-functional collaboration and fast time to value with an Agile, self-service workflow. Crystal Valentine offers an overview of this emerging field and explains how to implement a DataOps process.
Big data and data science in the cloud, Data science and machine learning
Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)
Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.
Keynotes, Data science and machine learning
Jeff Dean (Google)
Keynote with Jeff Dean
Data science and machine learning, Media, entertainment, and advertising
Abhishek Kumar (SapientRazorfish), Dr. Vijay Srinivas Agneeswaran (SapientRazorfish)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.
Data science and machine learning, Streaming systems and real-time applications
Dan Crankshaw (UC Berkeley RISE Lab)
Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes.
Data engineering and architecture
Ron Bodkin (Google)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
Data science and machine learning
Andrea Pasqua (Uber), Anny Chen (Uber)
Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.
Keynotes, Data-driven business management, Media, entertainment, and advertising, Strata Business Summit
Eric Colson (Stitch Fix)
While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate by data science and explains why companies must now think very differently about the role and placement of data science in the organization.
Big data and data science in the cloud, Data science and machine learning, Platform security and cybersecurity
Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Daniel Rubin (Stanford)
Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Daniel Rubin outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset.
Big data and data science in the cloud, Data engineering and architecture
Dong Meng (MapR)
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters.
Data engineering and architecture
Mark Grover (Lyft), Arup Malakar (Lyft)
Mark Grover and Arup Malakar offer an overview of how Lyft leverages application metrics, logs, and auditing to monitor and troubleshoot its data platform and share how the company dogfoods the platform to provide security, auditing, alerting, and replayability. They also detail some of the internal services and tools Lyft has developed to make its data more robust, scalable, and self-serving.
Data engineering and architecture, Data science and machine learning, Visualization and user experience
Jules Malin (GoPro)
Drones and smart devices are generating billions of event logs for companies, presenting the opportunity to discover insights that inform product, engineering, and marketing team decisions. Jules Malin explains how technologies like Spark and analytics and visualization tools like Python and Plotly enable those insights to be discovered in the data.
Data engineering and architecture, Streaming systems and real-time applications
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)
Stream processing systems must support a number of different types of processing semantics due to the diverse nature of streaming applications. Karthik Ramasamy and Sanjeev Kulkarni explore effectively once, exactly once, and other types of stateful processing techniques, explain how they are implemented in Heron, and discuss how applications can benefit.
Data science and machine learning
Stephen O'Sullivan (Data Whisperers)
Stephen O'Sullivan takes you along the data science journey, from on-boarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team.
Strata Business Summit
Michael Chui (McKinsey Global Institute)
After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge.
Strata Business Summit
Mark Madsen (Third Nature)
If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. A panel of experts details the trade-offs between a number of architectures that provide self-service access to data.
Data-driven business management, Strata Business Summit
Frances Haugen (Pinterest), Patrick Phelps (Pinterest)
Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.
Law, ethics, and governance, Strata Business Summit
Mark Donsky (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky outlines the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
Strata Business Summit
Alysa Z. Hutnik (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)
Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.”
Strata Business Summit
Mike Olson (Cloudera)
Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production (including the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands), and outlines proven ways to meet them.
Data-driven business management, Strata Business Summit
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.
Strata Business Summit
Yishay Carmiel (IntelligentWire | Spoken Labs)
One of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean and to take particular actions based on that information. This goal is the essence of conversational AI. Yishay Carmiel explores the latest breakthroughs and revolutions in this field and the challenges still to come.
Data-driven business management, Strata Business Summit
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
Data science and machine learning
Mike Ruberry (ZestFinance)
What does it mean to explain a machine learning model, and why is it important? Mike Ruberry offers an overview of several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques presents a different perspective, and their clever application can reveal new insights and solve business requirements.
Data science and machine learning, Platform security and cybersecurity
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))
How should you best debug a security data science system: change the ML approach, redefine the security scenario, or start over from scratch? Ram Shankar answers this question by sharing the results of failed experiments and the lessons learned when building ML detections for cloud lateral movement, identifying anomalous executables, and automating the incident response process.
Data science and machine learning
Mike Conover (SkipFlag)
Mike Conover offers an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis.
Law, ethics, and governance, Strata Business Summit
Sugreev Chawla (Thorn)
Sugreev Chawla offers an overview of Spotlight, a tool created by Thorn, a nonprofit that uses technology to fight online child sexual exploitation. It allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking, using graph analysis, time series analysis and NLP techniques to surface important networks of ads and characterize their behavior over time.
Data engineering and architecture, Streaming systems and real-time applications
Tyler Akidau (Google)
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.
Data engineering and architecture, Data-driven business management, Streaming systems and real-time applications
Rishi Ranjan (Freddie Mac)
Rishi Ranjan explains how Freddie Mac used Apache NiFi and Apache Atlas to build a centralized production operational data store on a Hadoop cluster. NiFi reduced the time to build a new data pipeline from months to hours and provided a robust data governance capability at the same time.
Data-driven business management
Katie Malone (Civis Analytics)
The 2012 Obama campaign ran the first personalized presidential campaign in history. The data team was made up of people from diverse backgrounds who embraced data science in service of the goal. Civis Analytics emerged from this team and today enables organizations to use the same methods outside politics. Katie Malone shares lessons learned from these experiences for building effective teams.
Data science and machine learning
Yufeng Guo (Google), Amy Unruh (Google)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from a conceptual overview all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.
Big data and data science in the cloud, Data science and machine learning, Media, entertainment, and advertising, Platform security and cybersecurity
Delip Rao (R7 Speech Science)
Spoken conversations have rich information beyond what was said in words. Delip Rao details the potential of spoken conversational datasets, including identifying speakers and their demographic attributes, understanding intent and dynamics between speakers, and so on. Delip also discusses some of the latest science, including some of the work developed at R7.
Zachary Glassman (The Data Incubator)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
Instructors from the Data Incubator demonstrate how to build intelligent business applications using machine learning, taking you through each step in developing a machine learning pipeline, from prototyping to production. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend your knowledge by building two applications from real-world datasets.
Big data and data science in the cloud, Data engineering and architecture, Media, entertainment, and advertising
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.
Big data and data science in the cloud, Data science and machine learning
Wee Hyong Tok (Microsoft), Danielle Dean (Microsoft)
Artificial intelligence (AI) has tremendous potential to extend our capabilities and empower organizations to accelerate their digital transformation. Wee Hyong Tok and Danielle Dean demystify AI for big data professionals and explain how they can leverage and evolve their valuable big data skills by getting started with AI.
Derek Ruths (CAI)
Unreasonable sales forecasts, badly overstocked inventory, misguided investments . . . bad analyses happen all the time, leading to bad decisions and costing businesses millions of dollars. Derek Ruths shares the five most common issues that lead to bad data-informed thinking.
Data engineering and architecture, Streaming systems and real-time applications
Jordan Hambleton (Cloudera), Guru Medasani (Cloudera)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
Data engineering and architecture, Platform security and cybersecurity
Thomas Phelan (BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with transparent data encryption (TDE). However, TDE can be difficult to configure and manage, issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them.
Data-driven business management
Meagan O'Leary (Microsoft)
Microsoft’s finance organization is reinventing forecasting using machine learning that its leaders describe as game changing. Meagan O'Leary shares the lessons the data sciences and finance teams learned while bringing machine learning forecasting to the office of the CFO by improving forecast accuracy and frequency and driving cultural change through a finance center of excellence.
Data engineering and architecture
Juan Yu (Cloudera)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu explores the cost model the Impala planner uses, how Impala optimizes queries, how to identify performance bottlenecks through query plans and profiles, and how to drive Impala to its full potential.
Big data and data science in the cloud, Data-driven business management, Strata Business Summit, Streaming systems and real-time applications
Steven Levine (Weight Watchers), Nicolas Chikhani (Weight Watchers International)
For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning seems impossible. Michael Lysaght, Steven Levine, and Nicolas Chikhani discuss Weight Watchers's transition from a traditional BI organization to one that uses data effectively, covering the company's needs, the changes that were required, and the technologies and architecture used to achieve its goals.
Big data and data science in the cloud, Strata Business Summit, Visualization and user experience
Mike Driscoll (Metamarkets)
There’s a make-or-break step ahead for AI development. AI tools shouldn’t be designed to replace humans; they should be built with them in mind. We need to focus on translating data from machine learning models into beautiful, intuitive visuals. Mike Driscoll shares advice for creators of next-gen predictive algorithms from his experience turning big data into interactive visualizations.
Data-driven business management, Strata Business Summit
Paco Nathan (O'Reilly Media)
Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media.
Data science and machine learning, Law, ethics, and governance
Pramit Choudhary (DataScience.com)
Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if. . .and else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation.
Data science and machine learning, Data-driven business management
Veronica Mapes (Pinterest), Garner Chung (Pinterest)
Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform.
Data-driven business management, Strata Business Summit
Kapil Surlaker (LinkedIn), Ya Xu (LinkedIn)
Metrics measurement and experimentation play a crucial role in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data.
In this talk, we will briefly explore some of the technologies and methodologies we can use to gain insight into the customer's experience on the platform, to understand which content works better than other content, and to see how we could personalize content to enhance the customer experience.
Sridhar Alla (Comcast)
Personalization is becoming prevalent on all sorts of interactive platforms and is increasingly used as an effective method to enhance the customer experience. Nowadays, set-top boxes are more advanced than ever, with a lot of interactive, personalizable content.
Big data and data science in the cloud, Data science and machine learning
Sergey Ermolin (Intel), Suqiang Song (MasterCard)
Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach.
Data science and machine learning
Michael Lee Williams (Fast Forward Labs)
Interpretable models result in more accurate, safer, and more profitable machine learning products. But interpretability can be hard to ensure. Michael Lee Williams explores the growing business case for interpretability and its concrete applications, including churn, finance, and healthcare. Along the way, Michael offers an overview of the open source, model-agnostic tool LIME.
Data engineering and architecture, Streaming systems and real-time applications
Dean Wampler (Lightbend)
Dean Wampler explores two microservice streaming applications based on Kafka to compare and contrast using Akka Streams and Kafka Streams for data processing. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.
Keynotes
Li Fan (Pinterest)
Keynote with Li Fan
Keynotes
Nancy Lublin (Crisis Text Line)
Keynote with Nancy Lublin
Keynotes
Natalie Evans Harris (BrightHive)
Keynote with Natalie Evans Harris
Keynotes
Wayne Peacock (Blizzard Entertainment)
Keynote with Wayne Peacock
Big data and data science in the cloud, Data science and machine learning
Mo Patel (Independent), Neejole Patel (Virginia Tech)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.
Keynotes, Strata Business Summit
Seth Stephens-Davidowitz (Everybody Lies | NY Times)
Seth Stephens-Davidowitz explains how to use Google searches to uncover behaviors or attitudes that may be hidden from traditional surveys, such as racism, sexuality, child abuse, and abortion.
Data science and machine learning
Vartika Singh (Cloudera), Jeffrey Shmain (Cloudera)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Vartika Singh and Jeffrey Shmain outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.
Data science and machine learning
Joseph Richards (GE Digital)
Deploying ML software applications for use cases in the industrial internet presents a unique set of challenges. Data-driven problems require approaches that are highly accurate, robust, fast, scalable, and fault tolerant. Joseph Richards shares GE's approach to building production-grade ML applications and explores work across GE in industries such as power, aviation, and oil and gas.
Big data and data science in the cloud, Data science and machine learning
Alexandra Gunderson (Arundo Analytics)
Heavy industries, such as oil and gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks and even months to create a comprehensive dataset from all of the various data sources. Alexandra Gunderson details the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources.
Big data and data science in the cloud, Data engineering and architecture, Media, entertainment, and advertising, Streaming systems and real-time applications
Manu Mukerji (Criteo)
Criteo is a global leader in commerce marketing. Manu Mukerji walks you through Criteo's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, evaluated (automatically), and used; the issues that arise when applying ML at scale in production; lessons learned; and more.
Delip Rao (R7 Speech Science), Brian McMahan (Joostware)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.
Data science and machine learning
Robert Schroll (The Data Incubator), Dana Mastropole (The Data Incubator)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
The TensorFlow library enables the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs, making it ideal for implementing neural networks and other machine learning algorithms. Robert Schroll and Dana Mastropole demonstrate TensorFlow's capabilities and walk you through building machine learning models on real-world data.
Data engineering and architecture, Streaming systems and real-time applications
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications.
Big data and data science in the cloud, Data science and machine learning
Ram Sriharsha (Databricks)
How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity.
Data-driven business management, Strata Business Summit
Matthew Granade (Domino Data Lab)
Predictive analytics and artificial intelligence have become critical competitive capabilities. Yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale.
Data-driven business management, Strata Business Summit
Nick Elprin (Domino Data Lab)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.
Data-driven business management, Media, entertainment, and advertising
Kevin Lyons (Nielsen Marketing Cloud)
Consumer behavior is in a constant state of flux. Adapting to these changes is especially hard, given the staggering amount of big data marketers need to understand and act on. In this session, Kevin Lyons, SVP of Data Science at Nielsen, will introduce ‘Online Learning’, a cutting-edge AI technology that uses event-level data streams to build and adapt models in real time.
Big data and data science in the cloud, Data-driven business management
Madhav Madaboosi and Meenakshisundaram Thandavarayan offer an overview of BP's self-service operational data lake, which improved operational efficiency, boosting productivity through fully identifiable data and reducing the risk of a data swamp. They cover the path and big data technologies that BP chose, lessons learned, and pitfalls encountered along the way.
Data engineering and architecture, Streaming systems and real-time applications
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (Streamlio)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.
Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications
Henry Cai (Pinterest), Yi Yin (Pinterest)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.
Data science and machine learning
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Natural language processing is a key component in many data science systems. David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial on scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
Data engineering and architecture
Gian Merlino (Imply)
Gian Merlino discusses the SQL layer recently added to the open source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database." Gian explains how Druid and Calcite are integrated and why you should stop worrying and learn to love relational algebra in your own projects.
Big data and data science in the cloud, Data science and machine learning
Goodman Gu (Atlassian)
Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker.
Alistair Croll (Solve For Interesting)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Strata Data Conference program chair Alistair Croll welcomes you to the Data Case Studies tutorial.
Data engineering and architecture
Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)
Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.
Data-driven business management, Strata Business Summit
Mauro Augusto Sartin (Cartao ELO)
Elo is the first Brazilian credit card brand, created to deliver broad payment services. To achieve success, Elo needed to quickly process and analyze their data in a secure environment, which was challenging with their traditional data warehouse. This session will cover how Elo implemented a PCI-compliant Hadoop environment to gain insights across their business in ways not previously possible.
Data engineering and architecture, Visualization and user experience
Rahim Daya (Pinterest)
Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user.
Big data and data science in the cloud, Data engineering and architecture
Abe Gong (Superconductive Health), James Campbell (USG)
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.
Big data and data science in the cloud, Data engineering and architecture
Carlo Torniai (Pirelli Tyre)
Carlo Torniai shares the architectural challenges Pirelli faced in building Pirelli Connesso, an IoT cloud-based system providing information on tire operating conditions, consumption, and maintenance, and highlights the operative approaches that enabled the integration of different contributions across cross-functional teams.
Data engineering and architecture, Streaming systems and real-time applications
Holden Karau (Google), Rachel Warren (Independent)
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Big data and data science in the cloud, Data engineering and architecture
Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)
Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.
Big data and data science in the cloud, Data engineering and architecture
Ritesh Agrawal (Uber), Anirban Deb (Uber)
Presto has emerged as the de facto query engine to quickly process petabytes of data. However, rogue SQL queries can waste a significant amount of critical compute resource and reduce Presto's throughput. Ritesh Agrawal and Anirban Deb explain how Uber uses machine learning to identify and stop rogue queries, saving both computational power and money.
Law, ethics, and governance, Strata Business Summit
Anne Buff (SAS Institute)
Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement.
Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications
Eugene Kirpichov (Google)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive SplittableDoFn.
Big data and data science in the cloud, Data engineering and architecture, Data-driven business management, Platform security and cybersecurity, Streaming systems and real-time applications
Yu Xu (TigerGraph)
Graph databases are the fastest growing category in data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics that traverse far more than three hops. Yu Xu offers an overview of a fraud detection system that manages 100 billion graph elements to detect risk and fraudulent groups.
Big data and data science in the cloud
Jesse Anderson (Big Data Institute)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.
Strata Business Summit
Ayin Vala (Foundation for Precision Medicine)
Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient.
Big data and data science in the cloud, Data engineering and architecture, Visualization and user experience
Sean Kandel (Trifacta)
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. Sean Kandel discusses methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.
Data science and machine learning, Media, entertainment, and advertising
April Chen (Civis Analytics), John Davis (Civis Analytics)
Which of your ad campaigns lead to the most sales? In the absence of A/B testing, marketers often turn to simple touch attribution models. April Chen and John Davis detail shortcomings of these models and propose a new approach that uses matching methods from causal inference to more accurately measure marketing effectiveness.
Data science and machine learning
Adam Greenhall explains how Lyft uses simulation to test out new algorithms, help develop new features, and study the economics of ride-sharing markets as they grow.
Data-driven business management, Strata Business Summit, Streaming systems and real-time applications, Visualization and user experience
Mike Prorock (mesur.io)
Mike Prorock offers an overview of mesur.io, a game-changing climate awareness solution that combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market. Mesur.io enables growers to monitor areas of concern, providing immediate benefits to crop yield, supply costs, farm labor overhead, and water consumption.
Data science and machine learning
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Join in for an introduction to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case.
Data engineering and architecture
Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)
Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.
Data science and machine learning
Vincent Xie (Intel), Peng Meng (Intel)
Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on in Spark ML and introduce the methodology behind Intel's work on Spark ML optimization.
Data science and machine learning
David Talby (Pacific AI), Ganesh Thondikulam (Kaiser Permanente)
David Talby and Ganesh Thondikulam explain how Kaiser Permanente uses the open source NLP library for Apache Spark to tackle one of the most common challenges with applying natural language processing in practice: integrating domain-specific NLP as part of a scalable, performant, measurable, and reproducible machine learning pipeline.
Big data and data science in the cloud, Data engineering and architecture, Visualization and user experience
Zhen Fan (JD.com), Wei Ting Chen (Intel)
Zhen Fan and Wei Ting Chen explain how JD.com uses Spark on Kubernetes in a production environment and why the company chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.
Data science and machine learning
Ian Cook (Cloudera)
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.
Keynotes
Janelle Shane (aiweirdness.com)
At AIweirdness.com Janelle Shane posts the results of neural network experiments gone delightfully wrong. But machine learning mistakes can also be very embarrassing, or even dangerous. Using silly datasets as examples, Shane talks about some ways that algorithms fail.
Data engineering and architecture, Streaming systems and real-time applications
Tim Berglund (Confluent)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.
Data engineering and architecture, Streaming systems and real-time applications
Sijie Guo (Streamlio)
Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads, has been widely adopted by enterprises like Twitter, Yahoo, and Salesforce to store and serve mission-critical data. Sijie Guo explains how Apache BookKeeper satisfies the needs of stream storage.
Data engineering and architecture, Streaming systems and real-time applications
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.
Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications
William Chambers (Databricks), Michael Armbrust (Databricks)
William Chambers and Michael Armbrust discuss the motivation and basics of Apache Spark's Structured Streaming processing engine and share lessons they've learned running hundreds of Structured Streaming workloads in the cloud.
Data engineering and architecture, Streaming systems and real-time applications
Fabian Hueske (data Artisans), Shuyi Chen (Uber)
Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.
Data science and machine learning
Rajat Monga (Google)
Rajat Monga offers an overview of TensorFlow progress and adoption in 2017 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas.
Data engineering and architecture
Gwen Shapira (Confluent)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.
Data science and machine learning, Data-driven business management
Clare Gollnick (Terbium Labs)
At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project.
Data-driven business management, Strata Business Summit
Angela Zutavern (Booz Allen Hamilton)
How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Angela Zutavern shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.”
Data engineering and architecture, Data science and machine learning, Streaming systems and real-time applications
Roy Ben-Alta (Amazon Web Services), Ira Cohen (Anodot)
Many domains, such as mobile, web, the IoT, ecommerce, and more, have turned to analyzing streaming data. However, this presents challenges both in transforming the raw data into metrics and in automatically analyzing the metrics to produce insights. Roy Ben-Alta and Ira Cohen share a solution implemented using Amazon Kinesis as the real-time pipeline feeding Anodot's anomaly detection solution.
Law, ethics, and governance, Strata Business Summit
John Mertic (The Linux Foundation), Ferd Scheepers (ING)
John Mertic and Ferd Scheepers detail the benefits of a vendor-neutral approach to data governance, explain the need for an open metadata standard, and share insight into how companies such as ING, IBM, Hortonworks, and more are delivering solutions to this challenge as an open source initiative.
Data science and machine learning, Law, ethics, and governance, Streaming systems and real-time applications
Jennifer Prendki (Atlassian)
Jennifer Prendki explains how to develop machine learning models even if the data is protected by privacy and compliance laws and cannot be used without anonymizing, covering techniques ranging from contextual bandits to document vector representation.
Data engineering and architecture
Jiangjie Qin (LinkedIn)
LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.
Keynotes
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
Data engineering and architecture
Ted Malaska (Blizzard Entertainment)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
If you have data that has a time factor to it, then you need to think in terms of time series datasets. Ted Malaska explores time series in all of its forms, from tumbling windows to sessionization in batch or in streaming. You'll gain exposure to the tools and background you need to be successful in the world of time-oriented data.
Data engineering and architecture
Michael Freedman (TimescaleDB | Princeton)
Michael Freedman offers an overview of TimescaleDB, a new open source scale-out database designed for time series workloads and engineered as a plugin to Postgres. Unlike most time series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries.
Data science and machine learning
Rachita Chandra (IBM Watson Health)
Rachita Chandra outlines challenges and considerations for transforming a research prototype built for a single machine to a deployable healthcare solution that leverages Spark in a distributed environment.
Data-driven business management, Strata Business Summit
Brian Karfunkel (Pinterest)
When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users.
Big data and data science in the cloud, Data-driven business management, Strata Business Summit
Michael Schrenk (Self-Employed)
Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create needed context through the use of metadata. Michael Schrenk explains how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers.
Data engineering and architecture, Streaming systems and real-time applications
Stephan Ewen (data Artisans), Flavio Junqueira (Dell EMC)
Stephan Ewen and Flavio Junqueira detail an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams) that offers an unprecedented way of handling “everything as a stream.” The stack includes unbounded streaming storage and a unified batch and streaming abstraction and dynamically accommodates workload variations in a novel way.
Data science and machine learning
Karthik Ramasamy (Uber), Lenny Evans (Uber)
Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities.
Data engineering and architecture, Streaming systems and real-time applications
Shivnath Babu (Duke University | Unravel Data Systems), Sumit Jindal (Unravel Data Systems)
Getting the best performance, predictability, and reliability for Kafka-based applications is a complex art. Shivnath Babu and Sumit Jindal explain how to simplify the process by leveraging recent advances in machine learning and AI and outline a methodology for applying statistical learning to the rich and diverse monitoring data that is available from Kafka.
Media, entertainment, and advertising, Visualization and user experience
Ann Nguyen (Whole Whale)
Power Poetry is the largest online platform for young poets, with over 350K users. Ann Nguyen explains how Power Poetry is extending the learning potential with machine learning and covers the technical elements of the Poetry Genome, a series of ML tools to analyze and break down similarity scores of the poems added to the site.
Data science and machine learning
Mario Inchiosa (Microsoft), Vanja Paunic (Microsoft), Robert Horton (Microsoft), Debraj GuhaThakurta (Microsoft), Ali Zaidi (Microsoft), Tomas Singliar (Microsoft), John-Mark Agosta (Microsoft)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
R and Python top the list of languages used in data science and machine learning, and data scientists and engineers fluent in one of these languages are increasingly marketable. Come learn how to build and operationalize machine learning models using distributed functions and do scalable, end-to-end data science in R and Python on single machines, Spark clusters, and cloud-based infrastructure.
Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications
Siddharth Teotia (Dremio)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.
Data science and machine learning, Streaming systems and real-time applications
Diane Chang (Intuit)
When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. Diane Chang shares how Intuit's data science team preps, cleans, organizes, and augments training data along with best practices she's learned along the way.
Keynotes
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.
Data science and machine learning, Media, entertainment, and advertising
Noah Gift (UC Davis)
Noah Gift uses data science and machine learning to explore NBA team valuation and attendance as well as individual player performance. Questions include: What drives the valuation of teams (e.g., attendance, local real estate market)? Does winning bring more fans to games? Does salary correlate with social media performance?
Data engineering and architecture
Daniel Templeton (Cloudera), Andrew Wang (Cloudera)
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
Data science and machine learning
Miryung Kim (UCLA), Muhammad Gulzar (UCLA)
Even though we know that there are more data scientists in the workforce today, what those data scientists actually do and what we mean by data scientists have not been studied quantitatively. Miryung Kim and Muhammad Gulzar present a large-scale survey of 793 professional data scientists. Their study should inform managers on how to leverage data science capability effectively within their teams.
Big data and data science in the cloud, Data science and machine learning, Visualization and user experience
Baron Schwartz (VividCortex)
Anomaly detection is white hot in the monitoring industry, but many don't really understand or care about it, while others repeat the same pattern many times. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view.
Data science and machine learning
Patrick Harrison (S&P Global)
Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Along the way, Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more.
Data-driven business management
Thomas Miller (Northwestern University)
Sports analytics today is more than a matter of analyzing box scores and play-by-play statistics. Faced with detailed on-field or on-court data from every game, sports teams face challenges in data management, data engineering, and analytics. Thomas Miller details the challenges faced by a Major League Baseball team as it sought competitive advantage through data science and deep learning.
Data science and machine learning
Andrew Ray (Sam’s Club Technology)
Andrew Ray offers a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX, drawing on real-world examples, and provides historical context for the evolution between these three abstractions.
Data-driven business management
AI is transformative for business, but it’s not magic; it’s data. The instructor shares how Next IT's global enterprise customers have transformed their businesses with AI solutions and outlines how companies should build AI strategies, utilize data to develop and evolve conversational intelligence and business intents, and ultimately increase ROI.