Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

If you are looking for slides and video from 2017, visit the Strata conference in San Jose 2017 site.

Data engineering and architecture
Kurt Brown (Netflix)
How can you get the most out of your data infrastructure? Come and find out what we do at Netflix and why. We'll run through 20 principles & practices that we've refined and embraced over time. For each one, we'll weave in how it interplays with the technologies we use at Netflix (e.g., S3, Spark, Presto, Druid, Python, Jupyter, ...).
Big data and data science in the cloud, Data engineering and architecture
Philip Langdale (Cloudera), Eugene Fratkin (Cloudera), Vinithra Varadharajan (Cloudera), Jennifer Wu (Cloudera)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Vinithra Varadharajan, Philip Langdale, Eugene Fratkin, and Jennifer Wu lead a deep dive into running data analytic workloads in a managed service capacity in the public cloud and highlight cloud infrastructure best practices.
Big data and data science in the cloud, Data science and machine learning, Data-driven business management
Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Pavel Dmitriev (Microsoft), Paul Raff (Microsoft)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Controlled experiments, including A/B tests, have revolutionized the way software is being developed, with new ideas objectively evaluated with real users. We provide an intro and lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft and executing over 10K experiments/year.
Big data and data science in the cloud, Data science and machine learning
Shivaram Venkataraman (UC Berkeley), Sergey Ermolin (Intel Santa Clara), Ding Ding (Intel)
The BigDL framework scales deep learning for large data sets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. In this talk, we propose a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models such as Inception and VGG.
Data engineering and architecture
Kinnary Jangla (Pinterest)
Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.
Data engineering and architecture, Data-driven business management, Law, ethics, and governance
We explain how we have used open source blockchain technologies such as Hyperledger to implement compliance with the European Union's General Data Protection Regulation (GDPR). The key takeaways are: (1) an introduction to GDPR, a step further on data privacy; (2) why blockchain is a suitable candidate for implementing GDPR; and (3) lessons learned from our blockchain implementation of GDPR compliance.
Law, ethics, and governance, Strata Business Summit
Or Herman-Saffar (Dell), Ran Taig (Dell)
What if we could predict when and where the next crimes will be committed? Crimes in Chicago is a publicly available data set that reflects the reported incidents of crime that have occurred in Chicago since 2001. Using this data, we would like not only to explore specific crimes to find interesting trends but also to predict how many crimes will take place next week, and even next month.
Big data and data science in the cloud, Data engineering and architecture
Greg Rahn (Cloudera)
For many organizations, the cloud will likely be the destination of their next big data warehouse. The speakers will discuss considerations when evaluating the cloud for analytics and big data warehousing in order to steer attendees down the path of success, allowing them to get the most from the cloud. Attendees will leave with an understanding of different architectural approaches and their impacts.
Data science and machine learning
Brooke Wenig (Databricks)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
Data engineering and architecture, Streaming systems and real-time applications
Debasish Ghosh (Lightbend)
This talk discusses the role that approximate data structures such as Bloom filters, sketches, and HyperLogLog play in processing streaming data. Streams are typically unbounded in space and time, so any processing has to be online and use sublinear space. I discuss the probabilistic bounds that these data structures offer and how they can be used to implement solutions for fast and streaming architectures.
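As a rough illustration of the kind of structure this abstract refers to (a sketch, not material from the talk), the following minimal Bloom filter answers membership queries in a fixed number of bits, with no false negatives and a tunable false-positive rate. The class name, method names, and the SHA-256 double-hashing scheme are assumptions made for this example; the sizing uses the standard formulas m = -n ln p / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter (hypothetical class, not from the talk)."""

    def __init__(self, expected_items: int, false_positive_rate: float):
        # Standard sizing: m = -n ln p / (ln 2)^2 bits, k = (m / n) ln 2 hashes.
        self.num_bits = max(1, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(
            1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item: str):
        # Double hashing: derive k bit positions from two 64-bit slices
        # of a single SHA-256 digest (an assumption of this sketch).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd, never zero
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        # False means "definitely not seen"; True means "probably seen".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Usage: size for 1,000 stream events at a target 1% false-positive rate.
bf = BloomFilter(expected_items=1000, false_positive_rate=0.01)
for i in range(1000):
    bf.add(f"event-{i}")
```

Every added item is always reported as present; unseen items are occasionally reported as present too, at roughly the configured rate, which is the trade made for keeping memory fixed regardless of how long the stream runs.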
Data engineering and architecture
John Akred (Silicon Valley Data Science), Stephen O'Sullivan (Silicon Valley Data Science)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop, Spark and big data ecosystems fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.
Data-driven business management, Strata Business Summit
With so many business intelligence tools in the Hadoop ecosystem and no common measure to identify the efficiency of each tool, where do you begin to build or modify your enterprise data lake strategy? Sagar Kewalramani shares real-world BI problems and how they were resolved with Hadoop tools and demonstrates how to build an effective data lake strategy with open source tools and components.
Big data and data science in the cloud, Data science and machine learning
Jennie Wang (Intel), Valentina Pedoia (Radiology and Biomedical Imaging, UCSF; Center of Digital Health Innovation, UCSF), Berk Norman (Radiology and Biomedical Imaging, UCSF; Center of Digital Health Innovation, UCSF), Yulia Tell (Intel Corporation)
Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage offers the advantage of quicker and more accurate diagnosis at the time of an MRI scan. We will talk about building this classification system with 3D convolutional neural networks using BigDL on Apache Spark.
Data science and machine learning, Data-driven business management, Visualization and user experience
Wayde Fleener (General Mills)
Decision makers are busy. Businesses can hire people to analyze data for them, but most companies are resource constrained and can’t hire a small army to look through all their data. In this session, General Mills will share how we built automation so decision makers can quickly focus on the metrics that matter and cut through everything else that does not.
Data science and machine learning
Siddha Ganju (Deep Vision)
We will talk about how the FDL lab at NASA uses artificial intelligence to (1) improve and automate the identification of meteors above human-level performance using meteor shower images, and (2) recover known meteor shower streams and characterize previously unknown meteor showers using orbital data. This is aimed at providing more warning time for long-period comet impacts.
Data engineering and architecture
Ted Dunning (MapR Technologies)
Ted Dunning offers an overview of the rendezvous architecture, developed to be the "continuous integration" system for machine learning models. It allows always-hot zero latency rollout and rollback of new models and supports extensive metrics and diagnostics so models can be compared as they process production data. It can even hot-swap the framework itself with no downtime.
Rizwan Patel (Caesars Entertainment)
This session covers leveraging Cloudera’s big data platform to adapt to the change in patron dynamics (in both demographics and spending patterns, with a tilt toward non-gaming spending such as shows, retail, dining, and entertainment) to create a new paradigm for customer (micro)segmentation.
Strata Business Summit
Mauro Damo (Dell), Wei Lin (Dell EMC)
Image-recognition-based classification of diseases is expected to improve and support physicians' decisions. Applying deep learning techniques to recognize diseases in organs will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis.
Data science and machine learning
Daniel Lurie (Pinterest)
All successful start-ups thrive on tight product-market fit, which can produce homogeneous initial user bases. To become the Next Big Thing, your user base will need to diversify and your product will need to change to accommodate new needs. This talk discusses how Pinterest leveraged external data to begin measuring racial & income diversity in our user base and changed user modeling to drive growth.
Data engineering and architecture
Alexis Roos (Salesforce), Noah Burbank (Salesforce)
In the customer age, being able to extract relevant communications information in real time and cross-reference it with context is key. Salesforce is using data science and engineering to enable salespeople to monitor their emails in real time to surface insights and recommendations using a graph that models contextual data.
Data engineering and architecture, Data science and machine learning, Data-driven business management
Rajiv Synghal (Kaiser Permanente)
As healthcare data becomes increasingly digitized, medical centers are leveraging data in new ways to improve patient care. At Kaiser Permanente, one such initiative is focused on the flu. Each year, as many as 49,000 people die from it in the U.S. alone. Kaiser will discuss how they developed a sophisticated flu predictor model to better determine where resources were needed and how to reduce outbreaks.
Data science and machine learning
Simon Hughes (Dice.com), Yuri Bykov (Dice.com)
At Dice.com we recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. We will discuss how we applied different machine learning algorithms to solve each of these problems, and the technologies used to build, deploy and monitor these solutions in production.
Big data and data science in the cloud, Data engineering and architecture
Jorge A. Lopez (Amazon Web Services)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Want to learn how to use Amazon's big data web services to launch your first big data application on the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.
Data engineering and architecture
Dorna Bandari (Jetlore)
Dorna Bandari offers an overview of the machine learning pipeline at B2B AI startup Jetlore and explains why even small B2B startups in AI should invest in a flexible machine learning pipeline. Dorna covers the design choices, the trade-offs made when implementing and maintaining the pipeline, and how it has accelerated Jetlore's product development and growth.
Data-driven business management, Media, entertainment, and advertising
Kevin Lyons (Nielsen Marketing Cloud)
Consumer behavior is in a constant state of flux. Adapting to these changes is especially hard given the staggering amount of “Big Data” marketers need to understand & act on. Learn how financial news publisher InvestingChannel and its technology partner, Nielsen, are using an advanced form of AI, online machine learning, to adapt to real-time changes in audience behavior & market conditions.
Data science and machine learning
Keno Fischer (Julia Computing)
Julia is rapidly becoming a prevalent language at the forefront of scientific discovery. This talk will highlight one of the most ambitious recent use cases for Julia: using machine learning to derive a catalog of astronomical objects from multi-terabyte astronomical image data sets. This work was a collaboration between MIT, UC Berkeley, LBNL, and Julia Computing.
Big data and data science in the cloud, Data engineering and architecture
Tom Fisher (MapR Technologies)
The monolithic cloud is dying. Delivering capabilities across multiple clouds and, simultaneously, transitioning to the next generation platforms and applications is the challenge today. This presentation will talk through technological approaches and solutions that make this possible while delivering data driven applications and operations.
Big data and data science in the cloud, Data science and machine learning
Vlad A Ionescu (ShiftLeft Inc), Fabian Yamaguchi (ShiftLeft Inc./GmbH)
In the earlier days, code would generate data; with the code property graph (CPG), we now generate data for the code so that we can understand it better.
Big data and data science in the cloud, Data engineering and architecture
Michelle Casbon (Qordoba)
Michelle Casbon describes how to speed up the development of ML models by using open-source tools such as Kubernetes, Docker, Scala, Apache Spark, & Weave Flux. Her lessons-learned approach details how to build resilient systems so that engineers and data scientists can spend more of their time on product improvement rather than triage & uptime.
Data science and machine learning, Streaming systems and real-time applications
Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
In this talk, we present continuous machine learning algorithms that discover useful information in streaming data. We focus on explainable machine learning, including anomaly detection with attribution, ability to reduce false positives through user feedback, and detection of anomalies in directed graphs. We describe the algorithms, implementation, and application in real customer use cases.
Data science and machine learning, Visualization and user experience
James Bednar (Anaconda, Inc.), Philipp Rudiger (Anaconda, Inc.)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Python lets you solve data-science problems by stitching together packages from the Python ecosystem, but it can be difficult to choose packages that work well together. Here we take you through a small number of lines of Python code that provide a concise, fast, easily customizable, and fully reproducible recipe for interactive visualization of millions or billions of datapoints.
Big data and data science in the cloud, Data engineering and architecture
Tomer Kaftan (University of Washington)
Cuttlefish is a lightweight framework, prototyped in Apache Spark, for developers to adaptively improve the performance of data processing applications. Developers use Cuttlefish by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.
Data-driven business management, Strata Business Summit
Marcin Pilarczyk (Ryanair)
Managing fuel at a company flying 120 million passengers yearly is not a trivial task. This session highlights the main aspects of fuel management at a modern airline and provides an overview of machine learning methods supporting long-term planning and daily decisions.
Data engineering and architecture
Chris Harland (Microsoft)
The number of resources explaining how to build a machine learning model from data greatly overshadows information on how to make real data products from such models. This creates a gap between what machine learning engineers and data scientists know is possible and what users experience. Using examples from Textio's augmented writing platform, we'll go end to end in building a data product.
Brian Bloechle (Cloudera, Inc.)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
Brian Bloechle demonstrates how to implement typical data science workflows using Apache Spark. You'll learn how to wrangle and explore data using Spark SQL DataFrames and how to build, evaluate, and tune machine learning models using Spark MLlib.
Data science and machine learning
Josh Wills (Slack)
Josh Wills describes recent data science and machine learning projects at Slack.
Angie Ma (ASI)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
Angie Ma offers a condensed introduction to key data science and machine learning concepts and techniques, showing you what is (and isn't) possible with these exciting new tools and how they can benefit your organization.
Media, entertainment, and advertising, Strata Business Summit
Ray Bernard (SuprFanz), Jennifer Webb (SuprFanz)
Ray Bernard and Jennifer Webb explain how cloud-based marketing company SuprFanz uses data science techniques and graph theory with Neo4j to generate live event attendance from social media platforms, email, and SMS.
Data engineering and architecture, Data-driven business management, Strata Business Summit
Matt Derda (Trifacta), Jonathon Whitton (PRGX USA Inc)
PRGX is a global leader in Recovery Audit and Source-to-Pay (S2P) Analytics services, serving around 75% of the top 20 global retailers. During this session, PRGX will explain how they’ve adopted Trifacta and Cloudera to scale their current processes, and increase revenue for the products and services they offer clients.
Data engineering and architecture
Crystal Valentine (MapR Technologies)
DataOps—a methodology for developing and deploying data-intensive applications, especially those involving data science and machine learning pipelines—supports cross-functional collaboration and fast time to value with an Agile, self-service workflow. Crystal Valentine offers an overview of this emerging field and explains how to implement a DataOps process.
Data science and machine learning, Media, entertainment, and advertising
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
The key takeaways are: (1) an introduction to deep learning, covering different networks such as RBMs, convolutional networks, and autoencoders; (2) an introduction to recommendation systems and why deep learning is required for hybrid systems; (3) a complete hands-on TensorFlow tutorial, including TensorBoard; and (4) an end-to-end view of deep-learning-based recommendation and learning-to-rank systems.
Big data and data science in the cloud, Data science and machine learning
Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)
Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.
Keynotes, Data science and machine learning
Jeff Dean (Google)
Keynote with Jeff Dean
Data science and machine learning, Streaming systems and real-time applications
Dan Crankshaw (UC Berkeley RISE Lab)
Clipper is an open source, general-purpose model-serving system that provides low-latency predictions under heavy serving workloads for interactive applications. Dan Crankshaw offers an overview of the Clipper serving system and explains how to use it to serve Apache Spark and TensorFlow models on Kubernetes.
Data engineering and architecture
Ron Bodkin (Google)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
Data science and machine learning
Andrea Pasqua (Uber), Anny Chen (Uber)
Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.
Data-driven business management, Strata Business Summit
John Akred (Silicon Valley Data Science), Cindi Thompson (Silicon Valley Data Science)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those key aspirations that will define an organization’s future vision. In this tutorial, we explain how to create a modern data strategy that powers data-driven business.
Keynotes, Data-driven business management, Media, entertainment, and advertising, Strata Business Summit
Eric Colson (Stitch Fix)
Data Science has historically been leveraged as a supportive function. But for some business models and companies, Data Science can be the primary means for competitive differentiation. This requires a different way of working and organizing. For it to thrive, Data Science needs its own department reporting directly to the CEO with a workflow completely different from any other department.
Big data and data science in the cloud, Data science and machine learning, Platform security and cybersecurity
Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Daniel Rubin (Stanford)
Clinical collaboration benefits from pooling data to learn models from large datasets, but it's hampered by concerns about sharing data. We've developed a privacy-preserving alternative to create statistical models equivalent to one built from the entire dataset. We've built this as a cloud application, where each collaborator installs their own instance, and the installations self-assemble into a star network.
Big data and data science in the cloud, Data engineering and architecture
Dong Meng (MapR)
Deep learning (DL) model performance relies on the underlying data. We use a converged data platform as the data infrastructure, providing a distributed file system, key-value storage, and streams, with Kubernetes as the orchestration layer that manages containers to train and deploy DL models on GPU clusters. We also publish and subscribe to streams on the platform to build next-gen applications with DL models.
Data engineering and architecture
Mark Grover (Lyft), Arup Malakar (Lyft)
This talk gives an overview of how we leverage application metrics, logs & auditing to monitor and troubleshoot our data platform at Lyft. We share how we dogfood our own platform to provide security, auditing, alerting & replayability in our platform. We also detail some of the services & tools we have developed internally to make our data more robust, scalable & self-serving.
Data engineering and architecture, Data science and machine learning, Visualization and user experience
Jules Malin (GoPro)
Drones and smart devices are generating billions of event logs for companies, presenting the opportunity to discover insights that inform product, engineering, and marketing team decisions. Jules Malin explains how technologies like Spark and analytics and visualization tools like Python and Plotly enable those insights to be discovered in the data.
Data engineering and architecture, Streaming systems and real-time applications
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio)
Stream processing systems need to support different types of processing semantics due to the diverse nature of streaming applications. In this talk, we methodically cover effectively-once and exactly-once semantics and different types of state and consistency, how they are implemented in Heron, and how applications can benefit.
Data science and machine learning
Stephen O'Sullivan (Silicon Valley Data Science), Silvia Oliveros (Silicon Valley Data Science)
So how much data engineering should a data scientist know? To get to the fun part of their job, data scientists normally have to do a bit of data engineering, like onboarding data and doing a little bit of “wrangling,” before they get to the fun part: the data science! In most cases, this takes 50%-80% of their time.
Strata Business Summit
Michael Chui (McKinsey Global Institute)
After decades of extravagant promises, artificial intelligence is finally starting to deliver real-life benefits to early adopters. However, we're still early in the cycle of adoption. Michael Chui explains where investment is going, patterns of AI adoption and value capture by enterprises, and how the value potential of AI across sectors and business functions is beginning to emerge.
Strata Business Summit
Mark Madsen (Third Nature)
If your goal is to provide data to an analyst rather than a data scientist, what’s the best way to deliver analytics? There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. This briefing will present the tradeoffs between different architectures to provide self-service access to data.
Data-driven business management, Strata Business Summit
Frances Haugen (Pinterest), Patrick Phelps (Pinterest)
Data science is most powerful when combined with deep domain knowledge, but those with domain knowledge don't work on data-focused teams. So how do you empower employees with diverse backgrounds and skill sets to be effective users of data? Frances Haugen and Patrick Phelps dive into the social side of data and share strategies for unlocking otherwise unobtainable insights.
Law, ethics, and governance, Strata Business Summit
Mark Donsky (Cloudera)
General Data Protection Regulation (GDPR) will go into effect in May 2018 for firms doing any business in the EU. Yet many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). This session will explore the capabilities your data environment needs in order to simplify GDPR compliance, as well as future regulations.
Strata Business Summit
Alysa Z. Hutnik (Kelley Drye & Warren LLP), Crystal Skelton (Kelley Drye & Warren LLP)
Big data promises enormous benefits for companies. But what about privacy, data protection, and consumer laws? Having a solid understanding of the legal and self-regulatory rules of the road is key to maximizing the value of your data while avoiding data disasters. Alysa Hutnik and Crystal Skelton share legal best practices and practical tips to avoid becoming a big data “don’t.”
Strata Business Summit
Mike Olson (Cloudera)
Mike Olson shares examples of real-world machine learning applications, explores a variety of challenges in putting these capabilities into production (including the speed with which technology is moving, cloud versus in-data-center consumption, security and regulatory compliance, and skills and agility in getting data and answers into the right hands), and outlines proven ways to meet them.
Data-driven business management, Strata Business Summit
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
Recent years have seen dramatic advancements in the technologies available for managing and processing data. While these technologies provide powerful tools to build data applications, they also require new skills. Ted Malaska and Jonathan Seidman explain how to evaluate these new technologies and build teams to effectively leverage these technologies and achieve ROI with your data initiatives.
Strata Business Summit
Yishay Carmiel (IntelligentWire)
For years, one of the most important tasks of AI has been to understand humans. People want machines to understand not only what they say but also what they mean, and to take particular actions based on that information. This goal is the essence of conversational AI. We will explore the latest breakthroughs and revolutions in this field and the challenges that are still to come.
Data-driven business management, Strata Business Summit
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
Data science and machine learning
Mike Ruberry (ZestFinance)
What does it mean to explain a machine learning model, and why is it important? Mike Ruberry of ZestFinance will address those questions while discussing several modern explainability methods, including traditional feature contributions, LIME, and DeepLift. Each of these techniques offers a different perspective, and their clever application can reveal new insights and solve business requirements.
Data science and machine learning, Platform security and cybersecurity
Ram Shankar Siva Kumar (Microsoft (Azure Security Data Science))
How do you debug a security data science system when it doesn’t work as intended: change the ML approach, redefine the security scenario, or start from scratch again? We answer this question by sharing the failed experiments and the lessons learned when building ML detections for three security scenarios: cloud lateral movement, identifying anomalous executables, and automating the incident response process.
Data science and machine learning
Mike Conover (SkipFlag)
Cut to the chase with an overview of the essential techniques for understanding and working with natural language. From off-the-shelf neural networks and snappy preprocessing libraries to architectural patterns for bulletproof productionization, this talk will be of interest to anyone who uses language on a regular basis.
Law, ethics, and governance, Strata Business Summit
Sugreev Chawla (Thorn)
Thorn is a nonprofit that uses technology to fight online child sexual exploitation. This talk describes Spotlight, a tool created by Thorn that allows law enforcement to process millions of escort ads per month in an effort to fight sex trafficking. Graph analysis, time series analysis and NLP techniques are used to surface important networks of ads and characterize their behavior over time.
Data engineering and architecture, Streaming systems and real-time applications
Tyler Akidau (Google)
What does it mean to execute streaming queries in SQL? What is the relationship of streaming queries to classic relational queries? Are streams and tables the same thing? And how does all of this relate to the programmatic frameworks we’re all familiar with? Tyler Akidau answers these questions and more as he walks you through key concepts underpinning data processing in general.
Data engineering and architecture, Data-driven business management, Streaming systems and real-time applications
Rishi Ranjan (Freddie Mac)
Using Apache NiFi and Apache Atlas, Freddie Mac built a centralized production operational data store on a Hadoop cluster. NiFi reduced the time to build a new data pipeline from months to hours and provided a robust data governance capability at the same time.
Data science and machine learning
Yufeng Guo (Google), Amy Unruh (Google)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Yufeng Guo and Amy Unruh walk you through training and deploying a machine learning system using TensorFlow, a popular open source library. Yufeng and Amy take you from a conceptual overview all the way to building complex classifiers and explain how you can apply deep learning to complex problems in science and industry.
Big data and data science in the cloud, Data science and machine learning, Media, entertainment, and advertising, Platform security and cybersecurity
Delip Rao (Joostware)
Spoken conversations have rich information beyond what was said in words. In this talk, you will learn the potential of spoken conversational datasets, including identifying speakers, their demographic attributes, understanding intent, dynamics between speakers, and so on. We will share with you some of the latest science behind this, including some of the work developed at R7.
Robert Schroll (The Data Incubator)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
Robert Schroll offers an introduction to machine learning in Python, as he walks you through building an anomaly detection model and a recommendation engine. You'll gain hands-on experience from prototyping to production, and everything in between, including data cleaning, feature engineering, model building and evaluation, and deployment.
Big data and data science in the cloud, Data engineering and architecture, Media, entertainment, and advertising
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Hundreds of analysts and thousands of automated jobs run Hive queries at Criteo every day. As Hive is the main data transformation tool at Criteo, we spent a year evolving Hive's platform from an error-prone add-on installed on some spare machines, to a best-in-class installation capable of self-healing and automatically scaling to handle our growing load.
Big data and data science in the cloud, Data science and machine learning
Wee Hyong Tok (Microsoft), Danielle Dean (Microsoft)
Artificial Intelligence (AI) has tremendous potential to extend our capabilities and to empower organizations to accelerate their digital transformation by infusing apps and experiences with AI. This session will help big data professionals demystify AI and show how they can leverage and evolve their valuable big data skills toward doing AI.
Keynotes
Keynote with John Rauser
Data engineering and architecture, Streaming systems and real-time applications
Jordan Hambleton (Cloudera, Inc.), Guru Medasani (Cloudera, Inc.)
Streaming data continuously from Kafka allows users to gain insights faster, but a failure can leave users panicked about data loss when they restart their application. Offset management gives users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve the accuracy of results.
Data engineering and architecture, Platform security and cybersecurity
Thomas Phelan (BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for Big Data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage, issues that are only compounded when running on Docker containers. This session will discuss these challenges and how to overcome them.
Data-driven business management
Meagan O'Leary (Microsoft)
Microsoft’s Finance organization is reinventing forecasting using machine learning that its leaders describe as “game changing”. This session covers the learnings the data sciences and finance teams experienced in bringing machine learning forecasting to the office of the CFO, by improving forecast accuracy and frequency and driving cultural change through an ML in Finance Center of Excellence.
Data engineering and architecture
Juan Yu (Cloudera)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine and a fundamental component of the big data software stack. In this workshop, the speaker will discuss the cost model the Impala planner uses, how Impala optimizes queries, how to identify performance bottlenecks through the query plan and profile, and how to drive Impala to its full potential.
Big data and data science in the cloud, Data-driven business management, Strata Business Summit, Streaming systems and real-time applications
Michael Lysaght (Weight Watchers), Steven Levine (Weight Watchers), Nicolas Chikhani (Weight Watchers)
For organizations stuck in a myriad of legacy infrastructure, the path to AI and deep learning seems impossible. In this talk, we discuss how Weight Watchers was able to move from a traditional BI organization to one that uses data effectively. We look at where we were, what our needs were, the changes that were required, and the technologies and architecture we use to achieve our goals.
Media, entertainment, and advertising
During Hulu’s journey to becoming a major player in the subscription video on demand industry with yearly revenues over $1B, the company also began an all-too-familiar Big Data journey. Hulu will discuss its modern approach to delivering business intelligence, where users can do targeted analysis on a large Hadoop Data Lake while supporting high concurrency in a self-service manner across the globe.
Big data and data science in the cloud, Strata Business Summit, Visualization and user experience
Mike Driscoll (Metamarkets)
There’s a make-or-break step ahead for AI development – we need to focus on translating data from machine learning models into beautiful, intuitive visuals. AI tools shouldn’t be designed to replace humans; they should be built with their eyes in mind. This session will offer advice for creators of next-gen predictive algorithms from our experiences turning big data into interactive visualizations.
Data-driven business management, Strata Business Summit
Paco Nathan (O'Reilly Media)
Human in the loop (HITL) has emerged as a key design pattern for managing teams where people and machines collaborate. Such systems are mostly automated, with exceptions referred to human experts, who help train the machines further. Paco Nathan offers an overview of HITL from the perspective of a business manager, focusing on use cases within O'Reilly Media.
Data science and machine learning, Law, ethics, and governance
Pramit Choudhary (DataScience.com)
Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if. . .and else" statements. These human interpretable decision lists with high posterior probabilities might be the right way to balance between model interpretability, performance, and computation.
Data science and machine learning, Data-driven business management
Veronica Mapes (Pinterest), Garner Chung (Pinterest)
Veronica Mapes and Garner Chung detail the human evaluation platform Pinterest developed to better serve its deep learning and operational teams when its needs grew beyond platforms like Mechanical Turk. Along the way, they cover tricks for increasing data reliability and judgement reproducibility and explain how Pinterest integrated end-user-sourced judgements into its in-house platform.
Data-driven business management, Strata Business Summit
Kapil Surlaker (LinkedIn), Ya Xu (LinkedIn)
Metrics measurement and experimentation play a crucial role in every product decision at LinkedIn. Kapil Surlaker and Ya Xu explain why, to meet the company's needs, LinkedIn built the UMP and XLNT platforms for metrics computation and experimentation, respectively, which have allowed the company to perform measurement and experimentation very efficiently at scale while preserving trust in data.
Big data and data science in the cloud, Data science and machine learning
Sergey Ermolin (Intel Santa Clara), Suqiang Song (MasterCard Corp)
We are going to demonstrate the use of RNNs on BigDL to predict a user’s probability of shopping at a particular offer merchant during a “campaign period”. We will compare and contrast the RNN-based method with traditional ones, such as logistic regression and random forests.
Data science and machine learning
Michael Lee Williams (Fast Forward Labs)
Interpretable models result in more accurate, safer and more profitable machine learning products. But interpretability can be hard to ensure. In this talk, we'll look closely at the growing business case for interpretability, concrete applications including churn, finance and healthcare, and demonstrate the use of LIME, an open source, model-agnostic tool you can apply to your models today.
Data engineering and architecture, Streaming systems and real-time applications
Dean Wampler (Lightbend)
This talk uses two "microservice" streaming applications based on Kafka to compare and contrast using Akka Streams and Kafka Streams for data processing. I'll discuss the strengths and weaknesses of each tool for particular design needs, so you'll feel better informed when making choices. I'll also contrast them with Spark Streaming and Flink, including when to choose them instead.
Keynotes
Nancy Lublin (Crisis Text Line)
Keynote with Nancy Lublin
Keynotes, Strata Business Summit
Seth Stephens-Davidowitz (Everybody Lies | NY Times)
Keynote with Seth Stephens-Davidowitz
Big data and data science in the cloud, Data science and machine learning
Mo Patel (Independent), Neejole Patel (Virginia Tech)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Since its arrival in early 2017, PyTorch has won over many deep learning researchers and developers due to its dynamic computation framework. Mo Patel and Neejole Patel walk you through using PyTorch to build a content recommendation model.
Data engineering and architecture, Streaming systems and real-time applications
Emre Velipasaoglu (Lightbend, Inc.)
Most machine learning algorithms are designed to work on stationary data. Yet, real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Here, we review the monitoring methods and evaluate them for applicability in modern fast data and streaming applications.
Data science and machine learning
Joseph Richards (Wise.io @ GE Digital)
Deploying ML software applications for use cases in the Industrial Internet presents a unique set of challenges. Data-driven problems at GE require approaches that are highly accurate, robust, fast, scalable and fault tolerant. I'll discuss our approach to building production-grade ML applications and will talk about our work across GE in industries such as Power, Aviation and Oil & Gas.
Big data and data science in the cloud, Data science and machine learning
Alexandra Gunderson (Arundo Analytics)
Heavy industries, such as oil & gas, have tremendous amounts of data from which predictive models could be built, but it takes weeks and even months to build a comprehensive dataset from all of the various data sources. We will discuss the methodology behind an industry-tested approach that incorporates machine learning to structure and link data from different sources.
Big data and data science in the cloud, Data engineering and architecture, Media, entertainment, and advertising, Streaming systems and real-time applications
Manu Mukerji (Criteo)
Most machine learning talks are about the actual algorithm; this talk is about how you take that algorithm, scale it, and make it production grade. It covers how the training set and test set are generated and annotated; how the model is pushed to production, evaluated automatically, and finally used in production; and how the model works for other countries and languages.
Delip Rao (Joostware), Brian McMahan (Joostware)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.
Data science and machine learning
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
The instructors demonstrate TensorFlow's capabilities through its Python interface and explore TFLearn, a high-level deep learning library built on TensorFlow. Join in to learn how to use TFLearn and TensorFlow to build machine learning models on real-world data.
Big data and data science in the cloud, Data science and machine learning
Ram Sriharsha (Databricks)
How do you scale geospatial analytics on big data? And while we are at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Join us to learn about the internals of Magellan and how it provides scalability and performance without sacrificing simplicity.
Data-driven business management, Strata Business Summit
Matthew Granade (Domino Data Lab)
Predictive analytics and artificial intelligence have become critical competitive capabilities. Yet IT teams struggle to provide the support data science teams need to succeed. Matthew Granade explains how leading banks, insurance and pharmaceutical companies, and others manage data science at scale.
Data-driven business management, Strata Business Summit
Nick Elprin (Domino Data Lab)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise's KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.
Big data and data science in the cloud, Data-driven business management
A self-service operational data lake to improve operational efficiency, boost productivity through fully identifiable data, and reduce the risk of a data swamp: these were the objectives that drove BP to create a strategic and methodical approach to a Data Lake architecture. Through this approach, BP provides a template for turning insights, hidden risks, and unseen opportunities into actionable solutions.
Data engineering and architecture, Streaming systems and real-time applications
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (Streamlio)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Across diverse segments in industry, there has been a shift in focus from Big Data to Fast Data. This, in part, stems from the deluge of high velocity data streams and, more importantly, the need for instant data-driven insights. In this tutorial, we walk the audience through the state-of-the-art streaming systems and provide an in-depth review of modern streaming algorithms.
Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications
Henry Cai (Pinterest), Yi Yin (Pinterest)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.
Data science and machine learning
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Natural language processing is a key component in many data science systems that must understand or reason about text. This is a hands-on tutorial for scalable NLP using spaCy for building annotation pipelines, Spark NLP for building distributed natural language machine-learned pipelines, and Spark ML and TensorFlow for using deep learning to build and apply word embeddings.
Data engineering and architecture
Gian Merlino (Imply)
In this talk we'll discuss the SQL layer recently added to the open-source Druid project. It's based on Apache Calcite, which bills itself as "the foundation for your next high-performance database". We'll discuss how Druid and Calcite are integrated, and how you too can learn to stop worrying and love relational algebra in your own projects.
Data engineering and architecture
Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)
Deep learning has shown superior performance in domains such as object recognition and image classification, where time-series data plays an important role. Predictive Maintenance is also a domain where data is collected over time to monitor the state of an asset to predict failures. In this talk we show how to operationalize LSTM networks that predict remaining useful life of aircraft engines.
Data engineering and architecture, Visualization and user experience
Rahim Daya (Pinterest)
Personalization is a powerful tool for building sticky and impactful product experiences. Rahim Daya shares Pinterest's frameworks for building personalized user experiences, from sourcing the right contextual data to designing and evaluating personalization algorithms that can delight the user.
Big data and data science in the cloud, Data engineering and architecture
Abe Gong (Superconductive Health), James Campbell
Data science and engineering have been missing out on one of the biggest productivity boosters in modern software development: automated testing. Abe Gong and James Campbell discuss the concept of pipeline tests and offer an overview of Great Expectations, an open source Python framework for bringing data pipelines and products under test.
Big data and data science in the cloud, Data engineering and architecture
Carlo Torniai (Tesla Motors)
In this talk we explore the architectural challenges we faced in building Pirelli Connesso: an IoT cloud-based system providing information on tyre operating conditions, consumption, and maintenance. We will highlight the operative approaches that enabled the integration of different contributions across cross-functional teams.
Data engineering and architecture, Streaming systems and real-time applications
Holden Karau (Google), Rachel Warren (Independent)
This talk will explore the state of the current big data ecosystem and how best to work with it in non-JVM languages. Since the presenter works extensively on PySpark, much of the focus will be on Python + Spark, but the talk will also include interesting anecdotes about how this applies to other systems (including Kafka).
Big data and data science in the cloud, Data engineering and architecture
Bin Fan (Alluxio), Shaoshan Liu (PerceptIn)
Bin Fan and Shaoshan Liu explain how PerceptIn designed and implemented a cloud architecture to support video streaming and online object recognition tasks and demonstrate how Alluxio delivers high throughput, low latency, and a unified namespace to support these emerging cloud architectures.
Big data and data science in the cloud, Data engineering and architecture
Ritesh Agrawal (Uber), Anirban Deb (Uber)
Presto has emerged as the de facto query engine to quickly process petabytes of data. However, a few rogue SQL queries can waste a significant amount of critical compute resources and reduce Presto's throughput. At Uber, we use machine learning to identify such rogue queries and stop them early. This has led to significant savings in terms of both computational power and money.
Big data and data science in the cloud, Data science and machine learning
Joseph Bradley (Databricks)
We discuss common paths to productionizing Apache Spark MLlib models: engineering challenges and corresponding best practices. We cover several deployment scenarios including batch scoring, Structured Streaming, and real-time low-latency serving.
Law, ethics, and governance, Strata Business Summit
Anne Buff (SAS Institute)
Emerging technologies such as IoT, AI, and ML present businesses with enormous opportunities for innovation. But, to maximize the potential of these technologies, the approach to governance must radically shift. This session will look at what it takes to shift the focus of governance from standards, conformity and control to accountability, extensibility and enablement.
Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications
Eugene Kirpichov (Google)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. One key aspect is Beam's DoFn operation for data integration, which makes it suitable for fully unified batch/streaming data ingestion and enables new, highly modular data ingestion design patterns.
Big data and data science in the cloud, Data engineering and architecture, Data-driven business management, Platform security and cybersecurity, Streaming systems and real-time applications
Yu Xu (TigerGraph)
Graph databases are the fastest-growing category in all of data management. However, most graph queries only traverse two hops in big graphs due to limitations in most graph databases. Real-world applications require deep link analytics, which traverses far more than three hops. We present a real-world fraud detection system managing 100 billion graph elements to detect risk and fraudulent groups.
Big data and data science in the cloud
Jesse Anderson (Big Data Institute)
2-Day Training Please note: to attend, you must be registered for a Platinum or Training pass.
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks (both open source and managed cloud services), discusses the leading cloud providers, and explains how to choose the right one for your company.
Big data and data science in the cloud, Data engineering and architecture
Tomer Shiran (Dremio), Jacques Nadeau (Dremio)
Most organizations manage 5-15 copies of their data in multiple systems and formats to support different analytical use cases, including BI and machine learning. In this talk we introduce a new approach called Data Reflections, which dramatically reduces the need for data copies. We demonstrate an open source implementation built with Apache Calcite and explore two production case studies.
Strata Business Summit
Ayin Vala (Foundation for Precision Medicine)
Complex diseases like Alzheimer’s cannot be cured by pharmaceutical or genetic sciences alone, and current treatments and therapies lead to mixed successes. Ayin Vala explains how to use the power of big data and AI to treat challenging diseases with personalized medicine, which takes into account individual variability in medicine intake, lifestyle, and genetic factors for each patient.
Data engineering and architecture
Jiangjie Qin (LinkedIn)
LinkedIn runs more than 1,800 Kafka brokers that deliver more than 2 trillion messages/day. Running Kafka at such a scale makes automated operations a necessity. We will share the lessons we learned from operating Kafka at scale with minimum human intervention.
Big data and data science in the cloud, Data engineering and architecture, Visualization and user experience
Sean Kandel (Trifacta)
Organizations leverage reporting, analytic, and machine learning pipelines to drive decision making and power critical operational systems. In this talk we present methods for detecting, visualizing, and resolving inconsistencies between source and target data models across these pipelines.
Data science and machine learning, Media, entertainment, and advertising
April Chen (Civis Analytics), John Davis (Civis Analytics, Inc.)
Which of your ad campaigns lead to the most sales? In the absence of A/B testing, marketers often turn to simple touch attribution models. This talk covers the shortcomings of these models and proposes a new approach that uses matching methods from causal inference to more accurately measure marketing effectiveness.
Data science and machine learning
Adam Greenhall explains how Lyft uses simulation to test out new algorithms, help develop new features, and study the economics of ride-sharing markets as they grow.
Data-driven business management, Strata Business Summit, Streaming systems and real-time applications, Visualization and user experience
Mike Prorock (mesur.io)
mesur.io is transforming the agricultural and turf management market with a combination of IoT sensor technology, an advanced analytic platform, and self-service visualization. Growers are able to monitor areas of concern, from water conservation to soil conditions and beyond. This is a climate awareness solution for managing the modern farm, plantation, or golf course.
Data science and machine learning
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Join in for an introduction to Apache Spark 2.0 core concepts with a focus on Spark's machine learning library, using text mining on real-world data as the primary end-to-end use case.
Data engineering and architecture
Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)
Spark applications need to be well tuned so that individual applications run quickly and reliably, and cluster resources are efficiently utilized. In this talk, we will present a fast, reliable and automated process for tuning Spark applications, enabling users to quickly identify and fix problems.
Data science and machine learning
Vincent Xie (Intel), Peng Meng (Intel)
Intel has been deeply involved in Spark from its earliest moments. Vincent Xie and Peng Meng share what Intel has been working on with Spark ML and introduce the methodology behind Intel's work on Spark ML optimization.
Big data and data science in the cloud, Data engineering and architecture, Visualization and user experience
Zhen Fan (JD.com), Wei Ting Chen (Intel)
This talk uses JD.com as an example to show how the company runs Spark on Kubernetes in a production environment and why it chose Spark on Kubernetes for its AI workloads. You will learn how to run Spark with Kubernetes and the advantages you can get from Spark on Kubernetes.
Data science and machine learning
Ian Cook (Cloudera)
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.
Data engineering and architecture, Streaming systems and real-time applications
Tim Berglund (Confluent)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.
Data engineering and architecture, Streaming systems and real-time applications
Sijie Guo (Streamlio)
Apache BookKeeper is a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads. It is widely adopted, including by enterprises like Twitter, Yahoo, and Salesforce, to store and serve mission-critical data. We will present how Apache BookKeeper satisfies the needs of stream storage.
Data engineering and architecture, Streaming systems and real-time applications
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
This hands-on tutorial builds several streaming applications as "microservices" based on Kafka with Akka Streams and Kafka Streams for data processing. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll feel better informed when choosing tools for your needs. We'll also contrast them with Spark Streaming and Flink, including when to choose them instead.
Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications
William Chambers (Databricks), Michael Armbrust (Databricks)
This talk will cover two core topics. Firstly, we'll cover the motivation and basics of the Structured Streaming processing engine in Apache Spark. Secondly, we'll cover the core lessons that we've learned running hundreds of Structured Streaming workloads in the cloud.
Data engineering and architecture, Streaming systems and real-time applications
Fabian Hueske (data Artisans), Shuyi Chen (Uber)
In this talk we discuss SQL in the world of streaming data, and its implementation in Apache Flink. We will cover the concepts (streaming semantics, event time, incremental results) and discuss practical experiences of using Flink SQL in production at Uber. We will cover how Uber leverages Flink SQL to solve its unique business challenges.
Data engineering and architecture
Gwen Shapira (Confluent)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering and details how Apache Kafka, microservices and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable, and built to evolve.
Data science and machine learning, Data-driven business management
Clare Gollnick (Terbium Labs)
At the heart of the reproducibility crisis in the sciences is the widespread misapplication of statistics. Data science relies on the same statistical methodology as these scientific fields. Can we avoid the same crisis of integrity? Clare Gollnick considers the philosophy of data science and shares a framework that explains (and even predicts) the likelihood of success of a data project.
Data-driven business management, Strata Business Summit
Angela Zutavern (Booz Allen Hamilton)
How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. In this talk, Angela Zutavern will share insights from her work with pioneering companies, government agencies and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.”
Law, ethics, and governance, Strata Business Summit
John Mertic (The Linux Foundation), Ferd Scheepers (ING)
In this joint presentation, John Mertic, director of ODPi, and Ferd Scheepers, global chief information architect of ING, will address the benefits of a vendor-neutral approach to data governance and the need for an open metadata standard, along with insight into how companies such as ING, IBM, and Hortonworks are delivering solutions to this challenge as an open source initiative.
Data science and machine learning, Law, ethics, and governance, Streaming systems and real-time applications
Jennifer Prendki (Atlassian)
This talk reviews the options for developing machine learning models even if the data is protected by privacy and compliance laws and cannot be used without being anonymized. Dr. Prendki will discuss how techniques ranging from contextual bandits to document vector representation offer data scientists the opportunity to build models even when the data can't be used in its entirety.
Data science and machine learning
Rajat Monga (Google)
Rajat Monga offers an overview of TensorFlow progress and adoption in 2016 before looking ahead to the areas of importance in the future—performance, usability, and ubiquity—and the efforts TensorFlow is making in those areas.
Keynotes
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
Data engineering and architecture
Michael Freedman (TimescaleDB | Princeton)
I offer an overview of TimescaleDB, a new open-source database designed for time-series workloads and engineered as a plugin to PostgreSQL. Unlike most time-series newcomers, TimescaleDB supports full SQL while achieving fast ingest and complex queries. This enables developers to avoid today’s polyglot architectures and their corresponding operational and application complexity.
Data science and machine learning
Rachita Chandra (IBM Watson Health)
In this talk, we describe the challenges and considerations in transforming a research prototype built for a single machine into a deployable healthcare solution that leverages Spark in a distributed environment.
Data-driven business management, Strata Business Summit
Brian Karfunkel (Pinterest)
When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. This talk will show how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at non-core users.
Data science and machine learning
Vartika Singh (Cloudera), Jeffrey Shmain (Cloudera)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
We walk through approaches for preprocessing, training, inference, and deployment across data sets (time series, audio, video, and text), leveraging Spark, its extended ecosystem of libraries, and deep learning frameworks. We use sample data and code for each data type to illustrate implementation nuances, then highlight the bottlenecks and solutions for data and models at scale.
Big data and data science in the cloud, Data-driven business management, Strata Business Summit
Big data becomes much more powerful when it has context. Fortunately, creative data scientists can create the needed context through the use of metadata. This talk describes how metadata is created and used to gain competitive advantages, predict troop strength, or even guess Social Security numbers.
Data engineering and architecture, Streaming systems and real-time applications
Stephan Ewen (data Artisans), Flavio Junqueira (DellEMC)
We present an open source streaming data stack consisting of Pravega (stream storage) and Apache Flink (computation on streams). The combination of these two systems offers an unprecedented way of handling “everything as a stream,” with unbounded streaming storage, a unified batch and streaming abstraction, and a novel way of dynamically accommodating workload variations.
Data science and machine learning
Karthik Ramasamy (Uber), Lenny Evans (Uber)
Stolen credit cards are a major problem faced by many companies, including Uber. Karthik Ramasamy and Lenny Evans detail a new weapon against stolen credit cards that uses computer vision to scan credit cards, verifying possession of the physical card with basic fake card detection capabilities.
Data engineering and architecture, Streaming systems and real-time applications
Shivnath Babu (Duke University | Unravel Data Systems), Sumit Jindal (Unravel Data Systems)
Getting the best performance, predictability, and reliability for Kafka-based applications is an art today. We aim to change that by leveraging recent advances in machine learning and AI. This talk will describe our methodology of applying statistical learning to the rich and diverse monitoring data that is available from Kafka.
Media, entertainment, and advertising, Visualization and user experience
Ann Nguyen (Whole Whale)
Power Poetry is the largest online platform for young poets, with over 350k users. In 2017, we started building the Poetry Genome, a series of ML tools that analyze and break down similarity scores of the poems added to the site. The most recent is a Rap Poetry similarity tool that matches young poets' work to rap artists and then shows them the educational value of that connection.
Data science and machine learning
Mario Inchiosa (Microsoft), Vanja Paunic (Microsoft), Robert Horton (Microsoft), Debraj GuhaThakurta (Microsoft Corporation), Ali Zaidi (Microsoft), Tomas Singliar (Microsoft), John-Mark Agosta (Microsoft)
Tutorial Please note: to attend, you must be registered for a Gold or Silver pass.
Join us to learn how to do scalable, end-to-end data science in R and Python on single machines as well as on Spark clusters and cloud-based infrastructure. You'll be assigned an individual virtual machine with all contents preloaded and software installed and use it to gain experience building and operationalizing machine learning models using distributed functions in both R and Python.
Big data and data science in the cloud, Data engineering and architecture, Streaming systems and real-time applications
Siddharth Teotia (Dremio)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.
Data science and machine learning, Streaming systems and real-time applications
Diane Chang (Intuit)
When building a chatbot, it’s important to develop one that is humanized, has contextual responses, and can simulate true empathy for the end users. In this session, Diane Chang, Principal Data Scientist at Intuit, shares how her team preps, cleans, organizes, and augments its training data, along with some of the best practices she has learned along the way.
Keynotes
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.
Data science and machine learning, Media, entertainment, and advertising
Noah Gift (UC Davis)
Explore NBA team valuation and attendance, as well as individual player performance, using data science and machine learning. Questions that will be discussed: What drives the valuation of teams: attendance or the local real estate market? Does winning bring more fans to games? Does salary correlate with social media performance?
Data engineering and architecture
Daniel Templeton (Cloudera), Andrew Wang (Cloudera)
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
Big data and data science in the cloud, Data science and machine learning, Visualization and user experience
Baron Schwartz (VividCortex)
Anomaly detection is super hot in my industry (~monitoring). I've built anomaly detection, I've seen customers not really understand or care about it, and I've seen others repeat the same pattern many times. Why? And what can we do about it? This is my story of arriving at a "post-anomaly-detection" point of view.
Data science and machine learning
Patrick Harrison (S&P Global)
Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. In this talk, we’ll open up the black box of a popular word embedding algorithm and embark on an end-to-end walk-through of how it works its magic. Along the way, we’ll dig into many core neural network concepts, including hidden layers, loss gradients, backpropagation, and more.
Data-driven business management
Thomas Miller (Northwestern University)
Sports analytics today is more than a matter of analyzing box scores and play-by-play statistics. Faced with detailed on-field or on-court data from every game, teams seek competitive advantage through data science and data engineering. We provide a review of the data challenges that teams face and information technologies useful in addressing those challenges.
Data science and machine learning
Andrew Ray (Sam’s Club Technology)
This talk will give a brief introduction to the distributed graph algorithm abstractions provided by Pregel, PowerGraph, and GraphX. We will discuss the implementations of three key examples in each abstraction and provide historical context for the evolution between these three abstractions.
Data-driven business management
Tracy Malingo (Next IT)
AI is transformative for business, but it’s not magic; it’s data. Tracy Malingo, President of Next IT, presents recent work with global enterprise customers on how they have helped transform their businesses with AI solutions. She’ll outline how companies should build AI strategies, utilize data to develop and evolve conversational intelligence and business intents, and ultimately increase ROI.