Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY


If you’re looking to find like minds and make new professional connections, come to the women's networking lunch on Wednesday.
Sophie Watson (Red Hat)
Recommender systems enhance user experience and business revenue every day. Sophie Watson demonstrates how to develop a robust recommendation engine using a microservice architecture.
Maryam Jahanshahi (TapRecruit)
Hiring teams have long relied on intuition and experience to scout talent. Increased data and data-science techniques give us a chance to test common recruiting wisdom. Drawing on results from her recent behavioral experiments and analyses of over 10 million jobs and their outcomes, Maryam Jahanshahi illustrates how often innocuous recruiting decisions have dramatic impacts on hiring outcomes.
Jeroen Janssens (Data Science Workshops B.V.)
"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems.
Jason Wang (Cloudera), Suraj Acharya (Cloudera), Tony Wu (Cloudera)
The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies now face the difficult decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. Jason Wang, Suraj Acharya, and Tony Wu compare the relative strengths and weaknesses of AWS and Azure.
Minh Chau Nguyen (ETRI), Heesun Won (ETRI)
Minh Chau Nguyen and Heesun Won explain how to implement analytics services in data marketplace systems on a single Hadoop cluster across distributed data centers. The solution extends the overall architecture of the Hadoop ecosystem with the blockchain so that multiple tenants and authorized third parties can securely access data while still maintaining privacy, scalability, and reliability.
Francesca Lazzeri (Microsoft), Jaya Mathew (Microsoft)
With the growing buzz around data science, many professionals want to learn how to become a data scientist—the role Harvard Business Review called the "sexiest job of the 21st century." Francesca Lazzeri and Jaya Mathew explain what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.
Jun Rao (Confluent)
The controller is the brain of Apache Kafka and is responsible for maintaining the consistency of the replicas. Jun Rao outlines the main data flow in the controller, then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster.
Alexander Heye (Cray), Ding Ding (Intel)
Precipitation nowcasting is used to predict the future rainfall intensity over a relatively short timeframe. The forecasting resolution and time accuracy required are much higher than for other traditional forecasting tasks. Alexander Heye and Ding Ding explain how to build a precipitation nowcasting system with recurrent neural networks using BigDL on Apache Spark.
Anand Raman (Microsoft), Wee Hyong Tok (Microsoft)
Anand Raman and Wee Hyong Tok walk you through applying AI technologies in the cloud. You'll learn how to add prebuilt AI capabilities like object detection, face understanding, translation, and speech to applications, build cognitive search applications that understand deep content in images, text, and other data, use the Azure platform to accelerate machine learning, and more.
Moty Fania (Intel), Sergei Kom (Intel)
Moty Fania and Sergei Kom share their experience and lessons learned implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming, and online actuation.
Bethann Noble (Cloudera), Daniel Huss (State Street), Abhishek Kodi (State Street)
Bethann Noble, Abhishek Kodi, and Daniel Huss share their experience and best practices for designing and executing on a roadmap for open data science and AI for business.
Randy Lea (Arcadia Data)
The use of data lakes continues to grow, and the right business intelligence (BI) and analytics tools on data lakes are critical to data lake success. Randy Lea explains why existing BI tools work well for data warehouses but not data lakes and why every organization should have two BI standards: one for data warehouses and one for data lakes.
Milene Darnis (Uber)
Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis explains how the team built a scalable and self-serve platform that lets users plug in any metric to analyze.
Renee Yao (NVIDIA)
Renee Yao explains how generative adversarial networks (GANs) are successfully used to improve data generation and explores specific real-world examples where customers have deployed GANs to solve challenges in the healthcare, space, transportation, and retail industries.
As the data authority for hybrid cloud for big data analytics and AI, NetApp understands the value of the access, management, and control of data. Karthikeyan Nagalingam discusses the NetApp Data Fabric, which provides a unified data management environment that spans edge devices, data centers, and multiple hyperscale clouds using ONTAP software, all-flash systems, ONTAP Select, and cloud volumes.
Joshua Patterson (NVIDIA), Onur Yilmaz (NVIDIA)
GPUs have allowed financial firms to accelerate their computationally demanding workloads. Today, the bottleneck has moved completely to ETL. The GPU Open Analytics Initiative (GoAi) is helping accelerate ETL while keeping the entire workflow on GPUs. Joshua Patterson and Onur Yilmaz discuss several GPU-accelerated data science tools and libraries.
Ankit Jain (Uber)
Personalization is a common theme in social networks and ecommerce businesses. Personalization at Uber involves an understanding of how each driver and rider is expected to behave on the platform. Ankit Jain explains how Uber employs deep learning using LSTMs and its huge database to understand and predict the behavior of each and every user on the platform.
Occhio Orsini (Aetna)
Occhio Orsini offers an overview of Aetna's Data Fabric platform. Join in to learn the needs and desires that led to the creation of the advanced analytics platform, explore the platform's architecture, technology, and capabilities, and understand the key technologies and capabilities that made it possible to build a hybrid solution across on-premises and cloud-hosted data centers.
Jennifer Prendki (Figure Eight)
Agile methodologies have been widely successful for software engineering teams but seem inappropriate for data science teams, because data science is part engineering, part research. Jennifer Prendki demonstrates how, with a minimum amount of tweaking, data science managers can adapt Agile techniques and establish best practices to make their teams more efficient.
DD Dasgupta (Cisco)
DD Dasgupta explores the exciting development of the edge-cloud continuum, which is redefining business models and technology strategies while creating a vast array of new applications that will power the digital age. The continuum is also destroying what we know about the centralized data centers and cloud computing infrastructures that were so vital to the success of the previous computing eras.
Harry Glaser (Periscope Data)
What is the moral responsibility of a data team today? As AI and machine learning technologies become part of our everyday life and as data becomes accessible to everyone, CDOs and data teams are taking on a very important moral role as the conscience of the corporation. Harry Glaser highlights the risks companies will face if they don't empower data teams to lead the way for ethical data use.
Bill Franks (International Institute For Analytics)
Drawing on a recent study of the analytics maturity level of large enterprises by the International Institute for Analytics, Bill Franks discusses how maturity varies by industry, shares key steps organizations can take to move up the maturity scale, and explains how the research correlates analytics maturity with a wide range of success metrics, including financial and reputational measures.
Masha Westerlund (Investopedia)
Businesses rely on user data to power their sites, products, and sales. Can we give back by sharing those insights with users? Masha Westerlund explains how Investopedia harnessed reader data to build an index that tracks market anxiety and moves with the VIX, a proprietary measure of market volatility. You'll see how thinking outside the box helps turn data into tools for users, not stakeholders.
Jay Kreps (Confluent)
Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes or customer experience. Jay Kreps explores some of the difficulties of building production machine learning systems and explains how Apache Kafka and stream processing can help.
Carolyn Duby (Hortonworks)
Carolyn Duby shows you how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable open source platform. After this interactive overview of the platform's major features, you'll be ready to analyze your own haystack back at the office.
Ken Jones (Databricks, Inc.)
Ken Jones walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
Andrew Montalenti
What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how his company has gathered a unique dataset of news reading sessions from billions of devices, peaking at over two million sessions per minute on thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content.
Brian Wu (AppNexus)
Automating the success of digital ad campaigns is complicated and comes with the risk of wasting the advertiser's budget or a trader's margin and time. Brian Wu describes the evolution of Inventory Discovery, a streaming control system of eligibility, prioritization, and real-time evaluation that helps digital advertisers hit their performance goals with AppNexus.
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.
Dan Harple (Context Labs)
Dan Harple explains how distributed systems are being influenced by and are influencing operational, financial, and social impact requirements of a wide range of enterprises and how trust in these distributed systems is being challenged, elevated, and resolved by engineers and architects today.
Jennifer Shin (8 Path Solutions | NYU Stern | IBM)
Common wisdom dictates that we should never make assumptions, but assumptions are essential in the creation of statistical models. Jennifer Shin explores how assumptions fit into the creation of a statistical model, the pitfalls of applying a model to data without taking the underlying assumptions into account, and how to identify datasets where the model and its assumptions are applicable.
Arun Murugan (GE Digital), Jeff Miller (GE)
Arun Murugan and Jeff Miller detail how complex relationships are discovered and modeled to simplify analytics while keeping an Agile architecture for data acquisition. You’ll see how GE uses machine learning (powered by Io-Tahoe) for data discovery and profiling, supporting the data engineering behind a standard data model essential to enterprise use cases.
Bob Levy (Virtual Cove, Inc.)
Augmented reality opens a completely new lens on your data through which you see and accomplish amazing things. Bob Levy explains how to use simple Python scripts to leverage completely new plot types. You'll explore use cases revealing new insight into financial markets data as well as new ways of interacting with data that build trust in otherwise “black box” machine learning solutions.
Mike Tung (Diffbot)
Mike Tung offers an overview of available open source and commercial knowledge graphs and explains how consumer and business applications are already taking advantage of them to provide intelligent experiences and enhanced business efficiency. Mike then discusses what's coming in the future.
James Psota (Panjiva)
Businesses are pouring massive amounts of money into data science projects, and expectations are sky-high. But how many of those projects will deliver real value to customers? The history of other hyped new technologies predicts that many will fail, leaving a sense of disillusionment in their wake.
LaVonne Reimer (Lumenous)
GDPR asks us to rethink personal data systems—viewing UI/UX, consent management, and value-add data services through the eyes of subjects of the data. LaVonne Reimer explains why the opportunity in the $150B credit and risk industry is to deploy data governance technologies that balance the interests of individuals to control their own data with requirements for trusted data.
Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)
Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices for and lessons learned about topics such as schema evolution and network architecture.
Bruno Faria (Amazon Web Services)
Bruno Faria explains how to identify the components and workflows in your current environment and shares best practices to migrate these workloads to AWS.
Andrew Burt (Immuta)
Machine learning is becoming prevalent across industries, creating new types of risk. Managing this risk is quickly becoming the central challenge of major organizations, one that strains data science teams, legal personnel, and the C-suite alike. Andrew Burt shares lessons from past regulations focused on similar technology along with a proposal for new ways to manage risk in ML.
Ted Malaska (Capital One), Mark Grover (Lyft)
Many details go into building a big data system for speed, from determining an acceptable latency for data access and deciding where to store the data to solving multiregion problems, or even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed.
Atul Kale (Airbnb), Xiaohan Zeng (Airbnb)
Atul Kale and Xiaohan Zeng offer an overview of Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Built on Python, Spark, and Kubernetes, Bighead integrates popular libraries like TensorFlow, XGBoost, and PyTorch and is designed to be used in modular pieces.
Jacob Ward (CNN | Al Jazeera | PBS)
For most of us, our own mind is a black box—an all-powerful and utterly mysterious device that runs our lives for us, using rules and shortcuts of which we aren’t even aware. Jacob Ward reveals the relationship between the unconscious habits of our minds and the way that AI is poised to amplify them, alter them, maybe even reprogram them.
Daniel Kang (Stanford University)
Daniel Kang offers an overview of exploratory video analytics engine BlazeIt, which offers FrameQL, a declarative SQL-like language for querying video, and a query optimizer for executing these queries. You'll see how FrameQL can capture a large set of real-world queries ranging from aggregation to scrubbing and how BlazeIt can execute certain queries up to 2,000x faster than a naive approach.
Make your way from booth to booth while you check out all the exhibitors in the Expo Hall on Wednesday after sessions end.
Amanda Pustilnik (University of Maryland School of Law | Center for Law, Brain & Behavior, Mass. General Hospital)
Have you ever dreamed you could read minds? Do telekinesis? Maybe fly a magic carpet by thought alone? Until now, these powers have existed only in the realm of imagination or, more recently, video, AR, and VR games. Join Amanda Pustilnik to learn how brain-based human-machine interfaces are beginning to offer these powers in near-commercially-viable forms.
Olga Cuznetova (Optum), Manna Chang (Optum)
Olga Cuznetova and Manna Chang demonstrate supervised and unsupervised learning methods to work with claims data and explain how the methods complement each other. The supervised method looks at chronic kidney disease (CKD) patients at risk of developing end-stage renal disease (ESRD), while the unsupervised approach looks at the classification of patients that tend to develop this disease faster than others.
Intelligent enterprises—fueled by rapid advances in artificial intelligence (AI), machine learning (ML), and the internet of things (IoT)—promise significant business value. Richard Mooney explains how to achieve the game-changing outcomes of an intelligent enterprise, delivering value across business functions with the synergy of machine and human intelligence.
Chris Fregly (PipelineAI)
Chris Fregly details a full-featured, open source end-to-end TensorFlow model training and deployment system, using the latest advancements with Kubernetes, TensorFlow, and GPUs.
David Arpin (Amazon Web Services)
David Arpin walks you through building a machine learning application, from data manipulation to algorithm training to deployment to a real-time prediction endpoint, using Spark and Amazon SageMaker.
Karthik Ramasamy (Streamlio), Andrew Jorgensen (Google)
Streaming systems like Apache Heron are being used for an increasingly broad array of applications. Karthik Ramasamy and Andrew Jorgensen offer an overview of Fabric Answers, which provides real-time insights to mobile developers to improve their product experience at Google Fabric using Apache Heron.
Joshua Laurito (Squarespace)
Joshua Laurito explores systems Squarespace built for acquiring and enforcing consistency on obtained data and for inferring conclusions from a company’s marketing and product initiatives. Joshua discusses the intricacies of gathering and evaluating marketing and user data, from raising awareness to driving purchases, and shares results of previous analyses.
Bob Bradley (Geotab), Chad W. Jennings (Google)
If your company isn’t good at analytics, it’s not ready for AI. Bob Bradley and Chad W. Jennings explain how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value. You'll then see an in-depth demonstration of Google technology from smart cities innovator Geotab.
Nir Yungster (JW Player), Kamil Sindi (JW Player)
JW Player—the world’s largest network-independent video platform, representing 5% of global internet video—provides on-demand recommendations as a service to thousands of media publishers. Nir Yungster and Kamil Sindi explain how the company is systematically improving model performance while navigating the many engineering challenges and unique needs of the diverse publishers it serves.
Jorge A. Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Bruno Faria (Amazon Web Services)
Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez, Radhika Ravirala, Paul Sears, and Bruno Faria walk you through building a big data application using a combination of open source technologies and AWS managed services.
Robin Way (Corios)
Robin Way shares case study examples of next-best offer strategies, predictive customer journey analytics, and behavior-driven time-to-event targeting for mathematically optimal customer messaging that drives incremental margins.
Kaushik Deka (Novantas), Ted Gibson (Novantas)
Kaushik Deka and Ted Gibson share a large-scale optimization architecture in Spark for a consumer product portfolio optimization use case in retail banking. The architecture combines a simulator that distributes computation of complex real-world scenarios and a constraint optimizer that uses business rules as constraints to meet growth targets.
Jonathan Ellis (DataStax)
Is open source Apache Cassandra still relevant in an era of hosted cloud databases? Jonathan Ellis discusses Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.
Andreas Kohlmaier (Munich Re)
Munich Re is increasing client resilience against economic, political, and cyberrisks while setting and shaping trends in the insurance market. Recently, Munich Re successfully launched a data catalog as the driver for analyst adoption of a data lake. Andreas Kohlmaier explains how cataloging new data encouraged users to explore new ideas, developed new business, and enhanced customer service.
Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines, similar to the common design pattern used in service architectures, that detects and corrects problems and ensures always reliable insights.
Ash Munshi (Pepperdata)
Ash Munshi outlines a technique for labeling applications using runtime measurements of CPU, memory, and network I/O along with a deep neural network. This labeling groups the applications into buckets that have understandable characteristics, which can then be used to reason about the cluster and its performance.
Paul Curtis (MapR Technologies)
Once the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure to provide the business insights? Paul Curtis explores three customer deployments that leverage the best of the private clouds and containers to provide a flexible big data environment.
Paul Kent (SAS)
Software is eating the world, and open source is eating the software. Most contemporary analytics shops use a lot of open source software in their analytics platform. So where does commercial software like SAS fit? Paul Kent explains how you can achieve the best of both worlds by combining your favorite open source software with the power of SAS analytics.
Mathew Lodge (Anaconda)
The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Welcome to the future. Containers and Kubernetes make great language-agnostic distributed computing clusters: it's just as easy to deploy Python as it is Java. Mathew Lodge shows you how.
Roger Barga (Amazon Web Services), Sudipto Guha (Amazon Web Services), Kapil Chhabra (Amazon Web Services)
Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.
Arun Kejariwal (Independent), Francois Orsini (MZ)
The rate of growth of data volume and velocity has been accelerating along with increases in the variety of data sources. This poses a significant challenge to extracting actionable insights in a timely fashion. Arun Kejariwal and Francois Orsini explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
Don't miss an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata in New York.
Nuria Ruiz (Wikimedia)
The Wikipedia community feels strongly that you shouldn’t have to provide personal information to participate in the free knowledge movement. Nuria Ruiz discusses the challenges that this strong privacy stance poses for the Wikimedia Foundation, including how it affects data collection, and details some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way.
Michelle Ufford (Netflix)
Michelle Ufford shares some of the cool things Netflix is doing with data and the big bets the company is making on data infrastructure, covering workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, platform intelligence, and more.
Paco Nathan
Program chair Alistair Croll welcomes you to the Data Case Studies tutorial.
Barbara Eckman (Comcast)
Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro.
Dan Adams (Pitney Bowes)
The role of data and the demand to get it right, coupled with competitive pressures to move faster, have dramatically increased. Companies now recognize data as an asset and need to manage it that way. Join Dan Adams for the insights you need to ensure that your data addresses current and future needs and that your organization is set up for success.
Andrew J Brust (ZDNet | Blue Badge Insights)
Data governance has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging, and new products are rising to meet the challenge. Andrew Brust tracks data governance's past and present and offers a glimpse of the future.
Jim Scott (MapR Technologies)
Drawing on his experience working with customers across many industries, including chemical sciences, healthcare, and oil and gas, Jim Scott details the major impediments to successful completion of deep learning projects and solutions while walking you through a customer use case.
Sam Helmich (Deere & Company)
Sam Helmich explains how data science can benefit from borrowing Agile principles. These benefits are compounded by structuring team roles in a way that enables success without relying on full-stack expert “unicorns.”
Jeroen Janssens (Data Science Workshops B.V.)
The Unix command line remains an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful command-line tools, you can quickly scrub, explore, and model your data as well as hack together prototypes. Join Jeroen Janssens for a hands-on workshop based on his book Data Science at the Command Line.
Erin Coffman (Airbnb)
Airbnb has open-sourced many high-leverage data tools, including Airflow, Superset, and the Knowledge Repo, but adoption of these tools across the company was relatively low. Erin Coffman offers an overview of Data University, launched to make data more accessible and utilized in decision making at Airbnb.
Anna Nicanorova (Annalect)
Data visualization is supposed to be our map to information. However, contemporary charting techniques have a few shortcomings, including context reduction, hard numeric grasp, and perceptual dehumanization. Anna Nicanorova explains how augmented reality can solve these issues by presenting an intuitive and interactive environment for data exploration.
Mike Berger (Mount Sinai Health System)
Mount Sinai Health has moved up the analytics maturity chart to deliver business value in new risk models around population health. Mike Berger explains how Mount Sinai designed a team, built a data factory, and generated the analytics to drive decision-centricity and explores examples of mixing Tableau, SQL, Hive, APIs, Python, and R into a cohesive ecosystem supported by a data factory.
Garrett Hoffman (StockTwits)
Garrett Hoffman walks you through deep learning methods for natural language processing and natural language understanding tasks, using a live example in Python and TensorFlow with StockTwits data. Methods include word2vec, recurrent neural networks and variants (LSTM, GRU), and convolutional neural networks.
Swetha Machanavajhala (Microsoft), Xiaoyong Zhu (Microsoft)
In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, over 360 million people in the world are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure.
Wangda Tan (Hortonworks)
In order to train deep learning and machine learning models, you must leverage applications such as TensorFlow, MXNet, Caffe, and XGBoost. Wangda Tan discusses new features in Apache Hadoop 3.x to better support deep learning workloads and demonstrates how to run these applications on YARN.
Dr. Vijay Srinivas Agneeswaran (SapientRazorfish), Abhishek Kumar (SapientRazorfish)
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.
Ward Eldred (NVIDIA)
Ward Eldred offers an overview of the types of analytical problems that can be solved using deep learning and shares a set of heuristics that can be used to evaluate the feasibility of analytical AI projects.
Swatee Singh (American Express)
Artificial intelligence (AI) is now being adopted in the financial world at an unprecedented scale. Swatee Singh discusses the need to “democratize” AI in the company beyond the purview of "unicorn" data scientists and offers a framework to do this by stitching AI with the cloud and big data at its backend.
Lars Hulstaert (Microsoft)
Transfer learning allows data scientists to leverage insights from large labeled datasets. The general idea of transfer learning is to use knowledge learned from tasks for which a lot of labeled data is available in settings where little labeled data is available. Lars Hulstaert explains what transfer learning is and how it can boost your NLP or CV pipelines.
Diego Oppenheimer (Algorithmia)
After big investments in collecting and cleaning data and building machine learning (ML) models, enterprises face big challenges in deploying models to production and managing a growing portfolio of ML models. Diego Oppenheimer covers the strategic and technical hurdles each company must overcome and the best practices developed while deploying over 4,000 ML models for 70,000 engineers.
Data is the fuel for analytics and AI workloads, but the challenges in using it are constant. Ziya Ma discusses how recent innovations from Intel in high-capacity persistent memory and open source software are accelerating production-scale deployments, delivering breakthrough optimizations and faster insights to a wide range of opportunities in the digital enterprise.
Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)
Arun Kejariwal and Karthik Ramasamy lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, covering messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. They also share case studies from the IoT, gaming, and healthcare and their experience operating these systems at internet scale.
What does your emoji say about you? Using an off-the-shelf camera and NVIDIA DGX Station—the world’s fastest personal AI workstation—NVIDIA engineers capture and analyze dozens of faces and emotions at the same time. The system then displays people’s emotions using emojis.
Chiny Driscoll (MetiStream), Jawad Khan (Rush University Medical Center)
Chiny Driscoll and Jawad Khan offer an overview of a solution by Cloudera and MetiStream that lets healthcare providers automate the extraction, processing, and analysis of clinical notes within an electronic health record in batch or real time, improving care, identifying errors, and recognizing efficiencies in billing and diagnoses.
Ahsan Ashraf (Pinterest)
Online recommender systems often rely heavily on user engagement features. This can cause a bias toward exploitation over exploration, overoptimizing on users' interests. Content diversification is important for user satisfaction, but measuring and evaluating impact is challenging. Ahsan Ashraf outlines techniques used at Pinterest that drove ~2–3% impression gains and a ~1% time-spent gain.
Cory Minton (Dell EMC), Colm Moynihan (Cloudera)
Cory Minton and Colm Moynihan explain how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble.
James Dreiss (Reuters)
James Dreiss discusses the challenges in building a content recommendation system for one of the largest news sites in the world. The particularities of the system include developing a scrolling newsfeed and the use of document vectors for semantic representation of content.
Steve Otto (Navistar)
Navistar built an IoT-enabled remote diagnostics platform, OnCommand Connection, to bring together data from 375,000+ vehicles in real time, in order to drive predictive analytics. This service is now being offered to fleet owners, who can monitor the health and performance of their trucks from smartphones or tablets. Join Steven Otto to learn more about Navistar's IoT and data journey.
Basil Faruqui (BMC Software)
Basil Faruqui demonstrates how to simplify the automation and orchestration of an IoT-driven data pipeline in a cloud environment where machine learning algorithms predict failures.
GDPR is more than another regulation to be handled by your back office. Enacting the GDPR's Data Subject Access Rights (DSAR) requires practical actions. Jean-Michel Franco outlines the practical steps to deploy governed data services.
Brandy Freitas (Pitney Bowes)
Data science is an approachable field given the right framing. Often, though, practitioners and executives are describing opportunities using completely different languages. Join Brandy Freitas to develop context and vocabulary around data science topics to help build a culture of data within your organization.
Paco Nathan (
Deep learning works well when you have large labeled datasets, but not every team has those assets. Paco Nathan offers an overview of active learning, an ML variant that incorporates human-in-the-loop computing. Active learning focuses input from human experts, leveraging intelligence already in the system, and provides systematic ways to explore and exploit uncertainty in your data.
Sanjeev Mohan (Gartner)
If the last few years were spent proving the value of data lakes, the emphasis now is to monetize the big data architecture investments. The rallying cry is to onboard new workloads efficiently. But how do you do so if you don’t know what data is in the lake, the level of its quality, or the trustworthiness of models? Sanjeev Mohan explains why data governance is the linchpin to success.
Mikio Braun (Zalando SE)
In order to become "AI ready," an organization not only has to provide the right technical infrastructure for data collection and processing but also must learn new skills. Mikio Braun highlights three pieces companies often miss when trying to become AI ready: making the connection between business problems and AI technology, implementing AI-driven development, and running AI-based projects.
Mark Donsky (Okera), Steven Ross (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) went into effect for firms doing business in the EU, but many companies still aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
Ted Malaska (Capital One), Jonathan Seidman (Cloudera)
Creating a successful big data practice in your organization presents new challenges in managing projects and teams. Ted Malaska and Jonathan Seidman share guidance and best practices to help technical leaders deliver successful projects from planning to implementation.
Cassie Kozyrkov (Google)
Many organizations aren’t aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. Cassie Kozyrkov examines what it takes to build a truly data-driven organizational culture and highlights a vital yet often neglected job function: the data science manager.
Tony Baer (Ovum), Florian Douetteau (DATAIKU)
Tony Baer and Florian Douetteau share the results of research cosponsored by Ovum and Dataiku that surveyed a specially selected sample of chief data officers and data scientists on how to map roles and processes to make success with AI in the business repeatable.
Dean Wampler (Lightbend)
Streaming data systems, so-called "fast data," promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler shares what you need to know to exploit fast data successfully.
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
Ian Cook (Cloudera)
Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools.
Mridul Mishra (Fidelity Investments)
Currently, most ML models—and particularly those for deep learning—work like a black box. As a result, a key challenge in their adoption is the need for explainability. Mridul Mishra discusses the need for explainability and its current state. Mridul then provides a framework for considering these needs and offers potential solutions.
We've added some fun areas in and around the Expo Hall for you to experience and enjoy at Strata NY.
Interested in how Ebates is using a hybrid on-premises and cloud implementation to scale out its centralized business intelligence and data hub? Mark Stange-Tregear shares the history, business context, and technical plan around Ebates’s hybrid Hadoop-AWS cloud approach.
Alistair Croll (Solve For Interesting), Robert Passarella (Alpha Features)
Program chairs Alistair Croll and Robert Passarella welcome you to Findata Day.
Deborah Reynolds (Pfizer), Kurt Muehmel (Dataiku)
By creating a collaborative and interactive analytic environment, a forward-thinking company may harness the best capabilities of its business analysts and data scientists to answer the company’s most pressing business questions. Deborah Reynolds and Kurt Muehmel explain how large enterprises can successfully put data at the core of everyday business decisions.
Stephanie Fischer (datanizing GmbH)
Whether customer emails, product reviews, company wikis, or support communities, user-generated content (UGC) as a form of unstructured text is everywhere, and it’s growing exponentially. Stephanie Fischer explains how to discover meaningful insights from the UGC of a famous New York discussion forum.
JF Gagne (Element AI)
JF Gagne explains why the CIO is going to need a broader mandate in the company to better align their AI training and outcomes with business goals and compliance. This mandate should include an AI governance team that is well staffed and deeply established in the company, in order to catch biases that can develop from faulty goals or flawed data.
Sam Chance (Cambridge Semantics), Partha Bhattacharjee (Cambridge Semantics)
Ben Szekely shares a vision for digital innovation: The data fabric connects enterprise data for unprecedented access in an overlay fashion that does not disrupt current investments. Interconnected and reliable data drives business outcomes by automating scalable AI and ML efforts. Graph technology is the way forward to realize this future.
Andreea Kremm (Netex Group), Mohammed Ibraaz Syed (UCLA)
Narrative economics studies the impact of popular narratives and stories on economic fluctuations in the context of human interests and emotions. Andreea Kremm and Mohammed Ibraaz Syed describe the use of emotion analysis, entity relationship extraction, and topic modeling in modeling narratives from written human communication.
Julien Le Dem (WeWork)
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.
Friederike Schuur (Cloudera), Rita Ko (USA for UNHCR)
Friederike Schuur and Rita Ko explain how the Hive (an internal group at USA for UNHCR) and Cloudera Fast Forward Labs transformed USA for UNHCR, enabling the agency to use data science and machine learning (DS/ML) to address the refugee crisis. Along the way, they cover the development and implementation of a DS/ML strategy, identify use cases and success metrics, and showcase the value of DS/ML.
Janet Forbes, Danielle Leighton, and Lindsay Brin lead a primer on crafting well-conceived data science projects that uncover valuable business insights. Using case studies and hands-on skills development, Janet, Danielle, and Lindsay walk you through essential techniques for effecting real business change.
Brian Foo (Google), Holden Karau (Google), Jay Smith (Google)
TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Brian Foo, Jay Smith, and Holden Karau explain how to bring deep learning models from training to serving in a cloud production environment. You'll learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
David Huh (Hitachi Vantara)
Data in most organizations today is massive, messy, and often found in silos. With so many sources to analyze, data engineers need to construct robust data pipelines using automation and minimize duplicate processes, as computation is costly for big data. David Huh shares strategies to construct data pipelines for machine learning, including one to reduce time to insight from weeks to hours.
Nick Curcuru (Mastercard)
Data—in part, harvested personal data—brings industries unprecedented insights about customer behavior. We know more about our customers and neighbors than at any other time in history, but we need to avoid crossing the "creepy" line. Laura Eisenhardt discusses how ethical behavior drives trust, especially in today's IoT age.
Mark Donsky (Okera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)
New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR.
Patrick Nussbaumer (Alteryx)
There is a lot of buzz around data science and machine learning in the world today. Unfortunately, to truly innovate with data and advanced capabilities, organizations need to expand their focus beyond just a few specialists. Patrick Nussbaumer details how focusing on people can help improve analytic value and drive innovation.
Ben Sharma (Zaloni), Selwyn Collaco (TMX)
Selwyn Collaco and Ben Sharma share insights from their real-world experience, discuss best practices for architecture, technology, data management, and governance to enable centralized data services, and explain how to leverage the Zaloni Data Platform (ZDP), an integrated self-service data platform, to operationalize the enterprise data lake.
Alen Capalik (, Jim McHugh (NVIDIA), SriSatish Ambati (, Tim Delisle (Datalogue)
Explore case studies from Datalogue,, and that demonstrate how GPU-accelerated analytics, machine learning, and ETL help companies overcome slow queries and tedious data preparation process, dynamically correlate among data, and enjoy automatic feature engineering.
Tim Davis (IBM)
Tim Davis discusses key pain points and solutions to problems many enterprises face with data in silos, poor-quality data that cannot always be trusted, and managing and making large volumes of data available to derive more accurate insights and machine learning models.
Paul Scott-Murphy (WANdisco)
Every organization is considering its storage options, with an eye toward the cloud. Paul Scott-Murphy explores what makes different large-scale storage systems and services unique, their clear (and unexpected) differences, the options you have to use them, and the surprises you can expect along the way.
Zachary Glassman (The Data Incubator)
Zachary Glassman leads a hands-on dive into building intelligent business applications using machine learning, walking you through all the steps of developing a machine learning pipeline. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend these models into two applications using a real-world dataset.
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. You'll also discover a few ML model serving ideas along the way.
Longqi Yang (Cornell Tech, Cornell University)
State-of-the-art recommendation algorithms are increasingly complex and no longer one size fits all. Current monolithic development practice poses significant challenges to rapid, iterative, and systematic experimentation. Longqi Yang explains how to use OpenRec to easily customize state-of-the-art solutions for diverse scenarios.
Karthik Ramasamy (Streamlio), Matteo Merli (Streamlio)
Apache Pulsar is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it's very important to ensure that the system can make use of all the available resources. Karthik Ramasamy and Matteo Merli share insights into the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.
Mark Huang (Bell Canada)
Like all telecommunication giants, Bell Canada relies on huge volumes of data to make accurate business decisions and deliver better services. Mark Huang discusses why Bell Canada chose Kyvos’s OLAP on big data technology to achieve multidimensional analytics and how it helped the company deliver to its growing business reporting demands.
Shawn Terry (Komatsu Mining Corp)
Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment to ultimately improve mine performance and efficiencies. Shawn Terry details the company's data journey and explains how it is using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment.
Arakere Ramesh (Intel), Bharath Yadla (Aerospike)
Persistent memory accelerates analytics, database, and storage workloads across a variety of use cases, bringing new levels of speed and efficiency to the data center and to in-memory computing. Arakere Ramesh and Bharath Yadla offer an overview of the newly announced Intel Optane data center persistent memory and share the exciting potential of this technology in analytics solutions.
Ivan Jibaja (Pure Storage)
Pure Storage runs over 70,000 tests per day. Using Spark’s flexible computing platform, the company can write a single application for both streaming and batch jobs so the company's team of triage engineers can understand the state of the continuous integration pipeline. Ivan Jibaja discusses the use case for big data analytics technologies, the architecture of the solution, and lessons learned.
Ann Nguyen (Whole Whale)
The for-profit system lacks a conscience and empathetic thinking. Ann Nguyen takes a look at the good, the bad, and the ugly of data culture, explores successes in the nonprofit sector, and shows how all companies can adopt a “for-benefit” mindset, merging their data culture with an empathy economy and using data to create and share value among their core audiences.
Aileen Nielsen (Skillman Consulting)
There is mounting evidence that the widespread deployment of machine learning and artificial intelligence in business and government applications is reproducing or even amplifying existing prejudices and social inequalities. Aileen Nielsen demonstrates how to identify and avoid bias and other unfairness in your analyses.
Osman Sarood (Mist Systems)
Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million.
Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.
John Thuma (Arcadia Data)
Forget about the fake news; data and analytics in politics is what drives elections. John Thuma shares ethical dilemmas he faced while proposing analytical solutions to the RNC and DNC. Not only did he help causes he disagreed with, but he also armed politicians with real-time data to manipulate voters.
Ian Brooks (Hortonworks)
The power of big data continues to modernize traditional industries, including healthcare. Ian Brooks explains how to implement intelligent preventive screening for conditions by applying electronic medical records (EMR) to predictive analytics via supervised machine learning techniques.
Harish Doddi (Datatron Technologies), Jerry Xu (Datatron Technologies)
Large financial institutions have many data science teams (e.g., those for fraud, credit risk, and marketing), each often using a diverse set of tools to build predictive models. There are many challenges involved in productionizing these predictive AI models. Harish Doddi and Jerry Xu share challenges and lessons learned deploying AI models to production in large financial institutions.
Emily Riederer (Capital One)
Emily Riederer explains how best practices from data science, open source, and open science can solve common business pain points. Using a case example from Capital One, Emily illustrates how designing empathetic analytical tools and fostering a vibrant InnerSource community are keys to developing reproducible and extensible business analysis.
Srikanth Desikan (Oracle)
SparklineData is an in-memory distributed scale-out analytics platform built on Apache Spark to enable enterprises to query on data lakes directly with instant response times. Srikanth Desikan offers an overview of SparklineData and explains how it can enable new analytics use cases working on the most granular data directly on data lakes.
Owen O'Malley (Hortonworks), Ryan Blue (Netflix)
Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.
Timothy Spann (DZone)
Timothy Spann leads a hands-on deep dive into using Apache MiNiFi with Apache MXNet and other deep learning libraries on edge devices.
Guoqiong Song (Intel), Wenjing Zhan (Talroo), Jacob Eisinger (Talroo)
Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? Guoqiong Song, Wenjing Zhan, and Jacob Eisinger demonstrate how to leverage the distributed deep learning framework BigDL on Apache Spark to predict a candidate’s probability of applying to specific jobs based on their résumé.
Kevin Lu (PayPal), Maulin Vasavada (PayPal), Na Yang (PayPal)
PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing.
Anand Raman (Impetus Technologies)
Is a single source of truth across the enterprise possible, or is it just an expensive myth? Anand Raman explains why you need a holistic decision framework that addresses multiple facets from platform to processes. Join in to explore EDW modernization strategies, self-service analytics, and interactive insights on big data and discover a process to get to a unified data model.
Michelle Casbon (Google)
Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project.
Michael Balint (NVIDIA)
Michael Balint explains how NVIDIA employs its own distribution of Kubernetes, in conjunction with DGX hardware, to make the most efficient use of GPU resources and scale its efforts across a cluster, allowing multiple users to run experiments and push their finished work to production.
Skyler Thomas (MapR)
In the past, there have been major challenges in quickly creating machine learning training environments and deploying trained models into production. Skyler Thomas details how Kubernetes helps data scientists and IT work in concert to speed model training and time-to-value.
Viviana Acquaviva (CUNY New York City College of Technology)
Using interesting, diverse publicly available datasets and actual problems in astronomy research, Viviana Acquaviva leads an intermediate tutorial on machine learning. You'll learn how to customize algorithms and evaluation metrics required by scientific applications and discover best practices for choosing, developing, and evaluating machine learning algorithms in "real-world" datasets.
Yaroslav Tkachenko (Activision)
What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse. . .wait, this looks like a lot of things. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision.
Archana Anandakrishnan (American Express)
Building accurate machine learning models hinges on the quality of the data. Errors and anomalies get in the way of data scientists doing their best work. Archana Anandakrishnan explains how American Express created an automated, scalable system for measurement and management of data quality. The methods are modular and adaptable to any domain where accurate decisions from ML models are critical.
Vartika Singh (Cloudera), Alan Silva (Cloudera), Alex Bleakley (Cloudera), Steven Totman (Cloudera), Mirko Kämpf (Cloudera), Syed Nasar (Cloudera)
Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.
Drew Paroski (MemSQL), Aatif Din (Fanatics)
Today’s successful businesses utilize data better than their competitors; however, data sprawl and inefficient data infrastructure restrict what’s possible. Blending the best of the past with the software innovations of today will solve future data challenges. Drew Paroski shares how to develop modern database applications without sacrificing cost savings, data familiarity, and flexibility.
Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)
The StreamDM library provides the largest collection of data stream mining algorithms for Spark. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drifts).
Mikio Braun (Zalando SE)
Time series data has many applications in industry, from analyzing server metrics to monitoring IoT signals and outlier detection. Mikio Braun offers an overview of time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations for what works and what doesn’t, and industry use cases.
Dylan Bargteil (The Data Incubator)
The TensorFlow library provides for the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. Dylan Bargteil introduces TensorFlow's capabilities through its Python interface.
James Bednar (Anaconda)
Python lets you solve data science problems by stitching together packages from the Python ecosystem, but it can be difficult to assemble the right tools to solve real-world problems. James Bednar walks you through using the 15+ packages covered by the new initiative to make it simple to build interactive plots and dashboards, even for large, streaming, and highly multidimensional data.
Oleksii Kachaiev (Attendify)
When we talk about microservices, we usually focus on the communication layer. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity, such as structural and semantic changes, knowledge sharing, and data discovery. Join Alexey Kachayev to explore emerging technologies created to tackle these challenges.
Joshua Poduska (Domino Data Lab), Patrick Harrison (S&P Global)
The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Joshua Poduska and Patrick Harrison detail how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.
Ben Lorica (O'Reilly Media)
As companies begin adopting machine learning, important considerations, including fairness, transparency, privacy, and security, need to be accounted for. Ben Lorica offers an overview of recent tools for building privacy-preserving and secure machine learning products and services.
Anand S (Gramener)
Answering simple questions about India's geography can be a nightmare. Official shape files are not publicly available. Worse, each ministry uses its own maps. But an active group of volunteers is crafting open maps. Anand S explains what it takes for a grass-roots initiative to transform a country's data infrastructure.
Danny Chen (Uber Technologies), Omkar Joshi (Uber Technologies), Eric Sayle (Uber Technologies)
Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works.
Jerry Overton (DXC), Ashim Bose (DXC), Samir Sehovic (DXC)
Acquiring machine learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp that is equal parts hackathon, presentation, and group participation, Jerry Overton, Ashim Bose, and Samir Sehovic teach you how to apply advanced analytics in ways that reshape the enterprise and improve outcomes.
Mani Parkhe (Databricks), Andrew Chen (Databricks)
Successfully building and deploying a machine learning model is difficult to do once. Enabling other data scientists to reproduce your pipeline, compare the results of different versions, track what's running where, and redeploy and roll back updated models is much harder. Mani Parkhe and Andrew Chen offer an overview of MLflow, a new open source project from Databricks that simplifies this process.
Dan Crankshaw (UC Berkeley RISELab)
Dan Crankshaw offers an overview of the current challenges in deploying machine learning applications into production and the current state of prediction serving infrastructure. He then leads a deep dive into the Clipper serving system and shows you how to get started.
Jared Lander (Lander Analytics)
Temporal data is being produced in ever-greater quantity, but fortunately our time series capabilities are keeping pace. Jared Lander explores techniques for modeling time series, from traditional methods such as ARMA to more modern tools such as Prophet and machine learning models like XGBoost and neural nets. Along the way, Jared shares theory and code for training these models.
Jennifer Lim (Cerner)
The use of data throughout Cerner had taxed the company's legacy operational data store, data warehouse, and enterprise reporting pipeline to the point where it would no longer scale to meet needs. Jennifer Lim explains how Cerner modernized its corporate data platform with the use of a hybrid cloud architecture.
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.
Thomas Weise (Lyft), Mark Grover (Lyft)
Thomas Weise and Mark Grover explain how Lyft uses its streaming platform to detect and respond to anomalous events, using data science tools for machine learning and a process that allows for fast and predictable deployment.
Zachary Hanif (Capital One)
An understanding of graph-based analytical techniques can be extremely powerful when applied to modern practical problems, and modern frameworks and analytical techniques are making graph analysis methods viable for increasingly large, complex tasks. Zachary Hanif examines three prominent graph analytic methods, including graph convolutional networks, and applies them to concrete use cases.
Usama Fayyad (Open Insights & OODA Health, Inc.), Troels Oerting (WEF Global Cybersecurity Center)
Usama Fayyad and Troels Oerting share outcomes and lessons learned from building and deploying a global data fusion, incident analysis/visualization, and effective cybersecurity defense based on big data and AI at a major EU bank, in collaboration with several financial services institutions.
Ian Swanson (Oracle)
Ian Swanson explores why and how data scientists and line-of-business leaders must treat AI as a team sport and explains what tools are needed to deploy models and applications that truly inform decision making.
Enjoy delicious snacks and beverages with fellow Strata attendees, speakers, and sponsors at the Opening Reception, happening immediately after tutorials on Tuesday.
Greg Rahn (Cloudera)
Cloud object stores are becoming the bedrock of cloud data warehouses for modern data-driven enterprises, and it's become a necessity for data teams to have the ability to directly query data stored in S3 or ADLS. Greg Rahn and Mostafa Mokhtar discuss optimal end-to-end workflows and technical considerations for using Apache Impala over object stores for your cloud data warehouse.
Michael Freedman (TimescaleDB)
Michael Freedman explains how to leverage Postgres for high-volume time series workloads using TimescaleDB, an open source time series database built as a Postgres plug-in. Michael covers the general architectural design principles and new time series data management features, including adaptive time partitioning and near-real-time continuous aggregations.
Bonnie Barrilleaux (LinkedIn)
As LinkedIn encouraged members to join conversations, it found itself in danger of creating a "rich get richer" economy in which a few creators got an increasing share of all feedback. Bonnie Barrilleaux explains why you must regularly reevaluate metrics to avoid perverse incentives—situations where efforts to increase the metric cause unintended negative side effects.
Pop-up Talks is an "unconference" within Strata—an open, community-driven format where you can connect with other attendees on topics that you want to discuss. Attendees schedule discussions on the sign-up board and lead the conversations, emphasizing participation over presentations.
Hilary Mason (Cloudera Fast Forward Labs)
Machine learning and artificial intelligence are exciting technologies, but real value comes from marrying those capabilities with the right business problems. Hilary Mason explores the current state of these technologies, investigates what's coming next in applied machine learning, and explains how to identify and execute on the right business opportunities at the right time.
Patrick Hall (George Washington University), Avni Wadhwa, Mark Chan
Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. Patrick Hall, Avni Wadhwa, and Mark Chan share practical and productizable approaches for explaining, testing, and visualizing machine learning models using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.
Cristobal Lowery (Baringa), Marc Warner (ASI)
In EU households, heating and hot water alone account for 80% of energy usage. Cristobal Lowery and Marc Warner explain how future home energy management systems could improve their energy efficiency by predicting resident needs through utilities data, with a particular focus on the key data features, the need for data compression, and the data quality challenges.
Chris Wojdak (Symcor)
Chris Wojdak explains how Symcor has transformed its big data architecture using Informatica’s comprehensive machine learning-based solutions for data integration, data quality, data cataloging, and data governance.
Les McMonagle (BlueTalon)
Privacy by design is a fundamentally important approach to achieving compliance with GDPR and other data privacy or data protection regulations. Les McMonagle outlines how organizations can save time and money while improving data security and regulatory compliance and dramatically reduce the risk of a data breach or expensive penalties for noncompliance.
Gerard Maas (Lightbend)
Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences with regard to key aspects of a streaming application: API usability, dealing with time, dealing with state, machine learning capabilities, and more. You'll learn when to pick one over the other or combine both to implement resilient streaming pipelines.
Sumit Gulwani (Microsoft)
Programming by input-output examples (PBE) is a new frontier in AI, set to revolutionize the programming experience for the masses. It can enable end users—99% of whom are nonprogrammers—to create small scripts and make data scientists 10–100x more productive for many data wrangling tasks. Sumit Gulwani leads a deep dive into this new programming paradigm and explores the science behind it.
Ted Dunning (MapR)
Stateful containers are a well-known anti-pattern, but the standard solution—managing state in a separate storage tier—is costly and complex. Recent developments have changed things dramatically for the better. In particular, you can now manage a high-performance software-defined-storage tier entirely in Kubernetes. Ted Dunning describes what's new and how it makes big data easier on Kubernetes.
Felipe Hoffa (Google), Damien Desfontaines (Google | ETH Zürich)
Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa and Damien Desfontaines explore how to handle massive public datasets, taking you from theory to real life as they showcase newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm.
Julia Angwin (ProPublica)
Algorithms are increasingly arbiters of forgiveness. Julia Angwin discusses what she has learned about forgiveness in her series of articles on algorithmic accountability and the lessons we all need to learn for the coming AI future.
Shivnath Babu (Unravel Data Systems, Duke University), Madhusudan Tumma (TIAA)
Operationalizing big data apps in a quick, reliable, and cost-effective manner remains a daunting task. Shivnath Babu and Madhusudan Tumma outline common problems and their causes and share best practices to find and fix these problems quickly and prevent such problems from happening in the first place.
Kimberly Nevala (SAS Institute)
Too often, the discussion of AI and ML includes an expectation—if not a requirement—for infallibility. But as we know, this expectation is not realistic. So what’s a company to do? While risk can’t be eliminated, it can be rationalized. Kimberly Nevala demonstrates how an unflinching risk assessment enables AI/ML adoption and deployment.
Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components.
Amro Alkhatib (National Health Insurance Company-Daman)
Processing claims is central to every insurance business. Amro Alkhatib shares a successful business case for automating claims processing, from idea to production. The machine learning-based claim automation model uses NLP methods on non-text data and allows auditable automated claims decisions to be made.
Yasuyuki Kataoka (NTT Innovation Institute, Inc.)
One of the challenges of sports data analytics is how to deliver machine intelligence beyond a mere real-time monitoring tool. Yasuyuki Kataoka highlights various real-time machine learning models in both IndyCar and the Tour de France, sharing real-time data processing architectures, machine learning models, and demonstrations that deliver meaningful insights for players and fans.
Jesse Anderson (Big Data Institute)
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks and explains how to choose the right one for your company.
Lawrence Cowan (Cicero Group)
Firms are struggling to leverage their data. Lawrence Cowan outlines a methodology for assessing four critical areas that firms must consider when looking to make the analytical leap: data strategy, data culture, data analysis and implementation, and data management and architecture.
Bruno Gonçalves (JPMorgan Chase & Co.)
Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Join Bruno Gonçalves to learn how to use recurrent neural networks to model and forecast time series and discover the advantages and disadvantages of recurrent neural networks with respect to more traditional approaches.
Kyle Davis (Redis Labs)
Kyle Davis explains how Redis can be used for ingesting high-velocity data from large-scale platforms and IoT data collections as well as for storing and querying data using probabilistic data structures that trade some precision for both higher speed and lower storage requirements. Along the way, Kyle shares examples and a demo of the solution.
Zhi Zhu (China Construction Bank), Luke Han (Kyligence)
When China Construction Bank wanted to migrate 23,000+ reports to mobile, it chose Apache Kylin as the high-performance and high-concurrency platform to refactor its data warehouse architecture to serve 400K+ users. Zhi Zhu and Luke Han detail the necessary architecture and best practices for refactoring a data warehouse for mobile analytics.
Sudhanshu Arora (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera), Brandon Freeman (Cloudera), Jason Wang (Cloudera), Shravan Pabba (Cloudera)
Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud, integrate data engineering and data analytics workflows, and explore considerations and best practices for data analytics pipelines in the cloud. Along the way, you'll see how to share metadata across workloads in a big data PaaS.
Ihab Ilyas (University of Waterloo | Tamr)
Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.
Francesco Mucio (Zalando SE)
Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead.
Katharina Warzel (EveryMundo)
Airlines want to know what happens after a user interacts with their websites. Do they convert? Do they close the browser and come back later? Airlines traditionally have depended on analytics tools to prove value. Katharina Warzel explores how to implement a client-independent end-to-end tracking system.
Ramesh Krishnan (Lockheed Martin), Steve Morgan (Lockheed Martin)
Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from its information assets, the company is constantly exploring ways to enable effective self-service scenarios. Ramesh Krishnan and Steve Morgan discuss Lockheed Martin's journey into modern analytics and explore its analytics platform focused on leveraging AWS GovCloud.
Shioulin Sam (Cloudera Fast Forward Labs)
Recent advances in deep learning allow us to use the semantic content of items in recommendation systems, addressing a weakness of traditional methods. Shioulin Sam explores the limitations of classical approaches and explains how using the content of items can help solve common recommendation pitfalls, such as the cold start problem, and open up new product possibilities.
Jacques Nadeau (Dremio)
Jacques Nadeau leads a deep dive into a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture—including the cache life cycle, update patterns, cache cohesion, and appropriate use cases—learn how it all works, and see it in action.
Greg Quist (SmartCover Systems)
Sewers can talk. Water levels in sewers have a signature, analogous to a human EKG. Greg Quist explains how this signature can be analyzed in real time, using pattern recognition techniques, revealing distressed pipelines and allowing users of this technology to take appropriate steps for maintenance and repair.
Darrin Johnson (NVIDIA)
While every enterprise is on a mission to infuse its business with deep learning, few know how to build the infrastructure to get them there. Darrin Johnson shares insights and best practices learned from NVIDIA's deep learning deployments around the globe that you can leverage to shorten deployment timeframes, improve developer productivity, and streamline operations.
Chad W. Jennings (Google)
Cities all over the world are using data and analytics to optimize infrastructure, but city planners are often held back by outdated data gathering methods and legacy analysis tools. Chad Jennings details how Geotab, a leader in IoT fleet logistics, brought BigQuery's unique machine learning and geospatial capabilities to its existing datasets to deliver a more capable solution to city planners.
Chang Liu (Georgian Partners)
Chang Liu offers an overview of a common problem faced by many software companies, the cold-start problem, and explains how Georgian Partners has been successful at solving this problem by transferring knowledge from existing data through differentially private data aggregation.
Amber Case (MIT Media Lab)
Amber Case outlines several methods that product designers and managers can use to improve everyday interactions through an understanding and application of sound design.
David Talby (Pacific AI), Alberto Andreotti (John Snow Labs), Stacy Ashworth (SelectData), Tawny Nichols (Select Data)
David Talby, Alberto Andreotti, Stacy Ashworth, and Tawny Nichols outline a question-answering system for accurately extracting facts from free-text patient records and share best practices for training domain-specific deep learning NLP models. The solution is based on Spark NLP, an extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding.
Gather before keynotes on Wednesday morning to enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with other attendees.
Gather before keynotes on Thursday morning to enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with other attendees.
Michael Mahoney (Kinetica)
Michael Mahoney demonstrates how to leverage the power of GPUs to converge streaming data analysis, location analysis, and streamlined machine learning with a single engine. Along the way, Michael shares real-world case studies on how Kinetica is used to solve complex data challenges.
Revant Nayar (FMI Technologies LLC)
Machine learning has so far underperformed in time series prediction because of slowness and overfitting, and classical methods are ineffective at capturing nonlinearity. Revant Nayar shares an alternative approach that is faster and more transparent and does not overfit. It can also pick up regime changes in the time series and systematically captures all the nonlinearity of a given dataset.
Brent Dykes (Domo)
Companies collect all kinds of data and use advanced tools and techniques to find insights, but they often fail in the last mile: communicating insights effectively to drive change. Brent Dykes discusses the power that stories wield over statistics and explores the art and science of data storytelling—an essential skill in today’s data economy.
The Strata Data Awards recognize the most innovative startups, leaders, and data science projects from Strata sponsors and exhibitors around the world.
Join the Cloudera Foundation and O’Reilly Media in assembling seed kits to benefit the Humane Society of New York.
Tim Berglund (Confluent)
Tim Berglund leads this solid introduction to Apache Kafka as a streaming data platform. You'll cover the internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams, then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka.
William Chambers (Databricks)
Streaming big data is a rapidly growing field but currently involves a lot of operational complexity and expertise. Bill Chambers shares a decision-making framework for determining the best tools and technologies for successfully deploying and maintaining streaming data pipelines to solve business problems and offers an overview of Apache Spark’s Structured Streaming processing engine.
TD Bank’s data analytics team has undertaken a multiyear journey to modernize its data infrastructure for today and future needs. Joseph DosSantos explains how the team built a governed data lake foundation, enabling business users to leverage its big data environment to extract analytical insights while minimizing risks.
Canadian AI startup wrnch demonstrates a real-time deep learning software platform that can read body language from standard video.
Ted Dunning (MapR)
There’s real value in big data, and more awaits when you add real time, but to get the payoff, you need successful deployments of your AI and data-intensive applications. You need to be ready with your current applications in production but must have an architecture and infrastructure that are ready for the next ones as well. Ted Dunning explores how others have fared in this journey.
Jane Tran (Unqork)
Data’s role in financial services has been elevated. However, often the rollout of data solutions fails when an organization’s existing culture is misaligned with its capabilities. Unqork is increasing adoption by honoring existing capabilities. Jane Tran explores methods to finally implement data solutions through both qualitative and quantitative discoveries.
Chris Stirrat (Eagle Investment Systems)
Eagle Investment Systems, a leading provider of financial services technology, is building a new Hadoop and cloud-based data management solution. Chris Stirrat explains how Eagle went from incubation to an enterprise-scale solution in just 10 months, using a Hadoop-based big data stack and multitenant architecture, transforming software creation, delivery, quality, technology, and culture.
Data scientists are hard to hire. But too often, companies struggle to find the right talent only to make avoidable mistakes that cause their best data scientists to leave. From org structure and leadership to tooling, infrastructure, and more, Michelangelo D'Agostino shares concrete (and inexpensive) tips for keeping your data scientists engaged, productive, and adding business value.
Ben Sharma (Zaloni)
Once, a company could live 60-70 years on the S&P 500. Now it averages 15 years. If companies were people, this would be an epidemic on par with the Black Plague. But the same things that dragged humanity out of that dark age can drag companies out of this one.
Ryan Blue (Netflix), Daniel Weeks (Netflix)
In the last few years, Netflix's data warehouse has grown to more than 100 PB in S3. Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3.
Anupam Singh (Cloudera), Brian Coyne (PNC)
Data volumes don’t translate to business value. What matters is your data platform’s ability to support unprecedented numbers of business users and use cases. Anupam Singh and Brian Coyne look at some of the challenges posed by data-hungry organizations and share new techniques to extract meaningful insights at the speed of today’s modern business.
Gwen Shapira (Confluent)
Gwen Shapira shares design and architecture patterns that are used to modernize data engineering. You'll learn how modern engineering organizations use Apache Kafka, microservices, and event streams to efficiently build data pipelines that are scalable, reliable, and built to evolve.
Antonio Fragoso (Globant)
Antonio Fragoso explores the key aspects of implementing a natural language processing project within your organization and reveals the necessary steps for making it a success. Antonio focuses on how to leverage an iterative process that can pave the way toward building a successful product.
Adil Aijaz (Split Software)
Many products, whether data driven or not, chase “the one metric that matters.” It may be engagement, revenue, or conversion, but the common theme is the pursuit of improvement in one metric. Product development teams should instead focus on the design of metrics that measure our goals. Adil Aijaz shares an approach to designing metrics and discusses best practices and common pitfalls.
Cassie Kozyrkov (Google)
Why do businesses fail at machine learning despite its tremendous potential and the excitement it generates? Is the answer always in data, algorithms, and infrastructure, or is there a subtler problem? Will things improve in the near future? Let's talk about some lessons learned at Google and what they mean for applied data science.
Amandeep Khurana shares critical data management practices for easy and unified data access that meets security and regulatory compliance, helping you avoid the pitfalls that could lead to complex expensive architectures.
Joseph Lubin (Consensus Systems)
Ethereum is a world computer on top of a peer-to-peer network that runs smart contracts: applications that run exactly as programmed without the possibility of censorship, fraud, or third-party interference. Until now, businesses had to build their systems on database technologies that resulted in siloed and redundant information in typically adversarial contexts.
Theresa Johnson (Airbnb)
Theresa Johnson explains how Airbnb is building its next-generation end-to-end revenue forecasting platform, leveraging machine learning, Bayesian inference, TensorFlow, Hadoop, and web technology.
Umur Cubukcu (Citus Data)
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases.
Jeffrey Heer (Trifacta | University of Washington)
Jeffrey Heer offers an overview of Vega and Vega-Lite—high-level declarative languages for interactive visualization that support exploratory data analysis, communication, and the development of new visualization tools.
Join Strata Business Summit speakers and attendees for a networking lunch on Thursday.
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
Meet the Experts sessions give you the chance to meet face-to-face with Strata presenters in a small-group setting. Drop in to discuss their sessions, ask questions, or make suggestions.
Author Book Signings will be held in the O’Reilly booth during the conference. This is a great opportunity for you to meet O’Reilly authors and get a free copy of one of their books. Complimentary copies will be provided to the first 35 attendees. Limit one free book per attendee.
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.
Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Zhe Zhang (LinkedIn)
Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop.
Patrick Angeles (Cloudera)
The financial crisis of 2008 exposed systemic issues in the financial system that resulted in the failures of several established institutions and a bailout of the entire industry. Patrick Angeles explains why banks and regulators are turning to big data solutions to avoid a repeat of history.
Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases.
Manoj Kumar (LinkedIn), Pralabh Kumar (LinkedIn), Arpan Agrawal (LinkedIn)
Have you ever tuned a Spark or MR job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.
Han Yang (Cisco Systems)
Data is the lifeblood of an enterprise, and it's being generated everywhere. To overcome the challenges of data gravity, data analytics, including machine learning, is best done where the data is located: ubiquitous machine learning. Han Yang explains how to overcome the challenges of machine learning everywhere.
Holden Karau (Google), Rachel Warren (Salesforce Einstein), Anya Bida (Salesforce)
Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
Sara Alavi (Bell Canada)
Bell Canada, Canada's largest communications company, leads the industry in providing world-class broadband communications services to consumers and business customers. Join Sara Alavi to learn how the network big data and AI team within Bell is using modern data environments and applying a startup mindset to transform traditional networks into insight-driven intelligent networks.
Tao Huang, Mang Zhang, 白冰
Tao Huang, Mang Zhang, and 白冰 explain how their company uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. To give just one example, one framework, JDPresto, has seen a 10x performance improvement on average.
Tim Walpole (BJSS)
Financial service clients demand increased data-driven personalization, faster insight-based decisions, and multichannel real-time access. Tim Walpole details how organizations can deliver real-time, vendor-agnostic, personalized chat services and explores issues around security, privacy, legal sign-off, data compliance, and how the internet of things can be used as a delivery platform.
Dave Shuman (Cloudera), Bryan Dean (Red Hat)
The focus on the IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing them back out to the edge. Dave Shuman and Bryan Dean explain how Cloudera and Red Hat executed this architecture at one of Europe's leading manufacturers, along with a demo highlighting this architecture.
Petrus Smith (PwC)
Peet Smith explains how PwC is using modern database tools with a combination of open source technologies to automate and scale data ingestion and transformation to get data to engagement teams to help them streamline and accelerate client service delivery.
Jim Scott (MapR Technologies)
Jim Scott details relevant use cases for blockchain-based solutions across a variety of industries, focusing on a suggested architecture to achieve high-transaction-rate private blockchains and decentralized applications backed by a blockchain. Along the way, Jim compares public and private blockchain architectures.
Brian O'Neill (Designing for Analytics)
Gartner says 85%+ of big data projects will fail, despite the fact that your company may have invested millions in engineering implementation. Why are customers and employees not engaging with these products and services? Brian O'Neill explains why a "people first, technology second" mission (a design strategy, in other words) enables the best UX and business outcomes possible.
Sarah Catanzaro (Amplify Partners), Rama Sekhar (Norwest Venture Partners), Zavain Dar (Lux Capital), Jonathan Lehr (Work-Bench), Crystal Huang (NEA)
In this panel discussion, venture capital investors explain how startups can accelerate enterprise adoption of machine learning and explore the new tech trends that will give rise to the next transformation in the big data landscape.
Paul Lashmet (Arcadia Data)
Artificial intelligence and deep learning are used to generate and execute trading strategies. Regulators and investors demand transparency into investment decisions, but the decision-making processes of machine learning technologies are opaque. Paul Lashmet explains how these same machines generate data that can be visualized to spot new trading opportunities.
Jeffrey Wecker (Goldman Sachs)
Jeffrey Wecker leads a deep dive on data in financial services, with perspectives on the evolving landscape of data science, the advent of alternative data, the importance of data centricity, and the future for machine learning and AI.
IBM Analytics’s Dinesh Nirmal solves school lunch and the struggle to keep ahead of regulations. With AI tech like deep learning and NLG, supplying meals to California’s kids leaps from enriching metadata for compliance to actionable insights for the business.
Join fellow executives, business leaders, and strategists for a networking lunch on Wednesday for Strata Business Summit attendees and speakers.
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.
Meet the Experts sessions give you the chance to meet face-to-face with Strata presenters in a small-group setting. Drop in to discuss their sessions, ask questions, or make suggestions.
Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.
Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)
Kubernetes (K8s)—the open source container orchestration system for modern big data workloads—is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s.
Patty Ryan (Microsoft), CY Yam (Microsoft), Elena Terenzi (Microsoft)
Large online fashion retailers must efficiently maintain catalogues of millions of items. Due to human error, it's not unusual that some items have duplicate entries. Since manually trawling such a large catalogue is next to impossible, how can you find these entries? Patty Ryan, CY Yam, and Elena Terenzi explain how they applied deep learning for image segmentation and background removal.
Fabian Hueske (data Artisans)
Fabian Hueske discusses why SQL is a great approach to unify batch and stream processing. He gives an update on Apache Flink's SQL support and shares some interesting use cases from large-scale production deployments. Finally, Fabian presents Flink's new query service that enables users and applications to submit streaming and batch SQL queries and retrieve low-latency updated results.
William Benton (Red Hat)
Containers are a hot technology for application developers, but they also provide key benefits for data scientists. William Benton details the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively.
Ajay Kulkarni (TimescaleDB)
Ajay Kulkarni explores the underlying changes that are characterizing the next wave of computing and shares several ways in which individual businesses and overall industries will be transformed.
Felix Cheung (Uber)
Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.
Varant Zanoyan (Airbnb)
Zipline is Airbnb’s soon-to-be open-sourced data management platform, designed specifically for ML use cases. It has taken the task of feature generation from months to days and offers features to support end-to-end data management for machine learning. Varant Zanoyan covers Zipline's architecture and dives into how it solves ML-specific problems.