Presented By O’Reilly and Cloudera
Make Data Work
September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

If you are looking for slides and video from 2017, visit the Strata Data Conference in New York 2017 site.

Data science and machine learning
Jeroen Janssens (Data Science Workshops B.V.)
"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems.
Data engineering and architecture, Emerging technologies and case studies
Minh Chau Nguyen (ETRI), Heesun Won (ETRI)
This session will address how analytics services in data marketplace systems can be performed on a single Hadoop cluster across distributed data centers. We extend the overall architecture of the Hadoop ecosystem with blockchain so that multiple tenants and authorized third parties can securely access data to perform various analytics while still maintaining privacy, scalability, and reliability.
Data-driven business management, Strata Business Summit
Francesca Lazzeri (Microsoft), Jaya Mathew (Microsoft)
What profession did Harvard Business Review call the Sexiest Job of the 21st Century? With the growing buzz around data science, several professionals have approached us at various events to learn more about how to become a data scientist. This session aims to raise awareness of what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses.
Streaming systems and real-time applications
Jun Rao (Confluent)
The controller is the brain of Apache Kafka and is responsible for maintaining the consistency of the replicas. We will first describe the main data flow in the controller. We will then describe some of the recent improvements in the controller that handle certain edge cases correctly and allow for more partitions in a Kafka cluster.
Data science and machine learning
Alexander Heye (Cray, Inc), Ding Ding (Intel)
Precipitation nowcasting is used to predict the future rainfall intensity over a relatively short timeframe. The forecasting resolution and time accuracy required are much higher than other traditional forecasting tasks. We will talk about building a precipitation nowcasting system with recurrent neural networks using BigDL on Apache Spark.
Data science and machine learning
Moty Fania (Intel)
In this session, Moty Fania will share Intel’s IT experience from implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming, and online actuation. This session highlights the key learnings from this work with a thorough review of the platform’s architecture.
Data science and machine learning
Matt Brandwein (Cloudera)
An overview of considerations and tradeoffs for choosing an open approach to enterprise data science. In this talk we’ll share a model to help organizations begin the journey, build momentum, and reduce reliance on legacy software. This includes such things as executive leadership, cost transparency, and clear metrics of user adoption and success with open data science tools.
Data engineering and architecture
Milene Darnis (Uber)
Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis, who leads Product Management for the Experimentation platform, will talk about how the team built a scalable and self-serve platform that lets users plug in any metric to analyze.
Data engineering and architecture
Joshua Patterson (NVIDIA), Mike Wendt (NVIDIA)
GPUs have allowed financial firms to run complex simulations, train myriads of models, and data mine at unparalleled speeds. Today, the bottleneck has moved completely to ETL. With the GPU Open Analytics Initiative (GoAi), we’re accelerating ETL and keeping the entire workflow on GPUs. We’ll discuss real-world examples, benchmarks, and how we’re accelerating our largest FS customers.
Data science and machine learning
Ankit Jain (Uber)
Personalization is a common theme in social networks and e-commerce businesses. However, personalization at Uber involves understanding how each driver or rider is expected to behave on the platform. In this talk, we will focus on how deep learning (LSTMs) and Uber's huge database can be used to understand and predict the future behavior of each and every user on the platform.
Data engineering and architecture
Occhio Orsini (Aetna)
Aetna's Data Fabric platform is based on the Hadoop technology stack but has integrated many different technologies to create a robust Data Lake and Advanced Analytics platform to meet the needs of Aetna's Data Scientists and analytics practitioners.
Data engineering and architecture
Preeti Vaidya (Viacom), Mark Cohen (Viacom, Inc.)
Data Products, as distinct from Data-Driven Products, are finding their own place in organizational Data-Driven Decision Making. Shifting the focus to “data” opens up new opportunities. The presentation, with case studies, dives deeper into a layered implementation architecture and provides intuitive learnings and solutions that allow for more agile, reusable data modules for a data product team.
Law, ethics, governance, Strata Business Summit
Harry Glaser (Periscope Data)
What is the moral responsibility of a data team today? As AI & machine learning technologies become part of our everyday life, and as data becomes accessible to everyone, CDOs and data teams are taking on a very important moral role as the conscience of the corporation. This session will highlight the risks companies will face if they don't empower data teams to lead the way for ethical data use.
Data science and machine learning
William Fehlman (USAA), Spencer Kirn (College of William & Mary)
We compare topic modeling algorithms used to identify latent topics in large volumes of text data, and then present coherence scores that illustrate which method shows high consistency with human judgments on the quality of topics. We will then discuss the importance of the coherence scores in choosing topic modeling algorithms that best support different use cases.
Data-driven business management, Strata Business Summit
Bill Franks (International Institute For Analytics)
The International Institute For Analytics studied the analytics maturity level of large enterprises. The talk will cover how maturity varies by industry and some of the key steps organizations can take to move up the maturity scale. The research also correlates analytics maturity with a wide range of corporate success metrics including financial and reputational measures.
Data science and machine learning
Masha Westerlund (Investopedia)
As our businesses rely more heavily on user data to power our sites, products, and sales, can we give back by sharing those insights with users? Learn how Investopedia harnessed reader data to build an index that tracks market anxiety and moves with the VIX, a proprietary measure of market volatility. We’ll focus on thinking outside the box to turn data into tools for users, not just stakeholders.
Data engineering and architecture
Jay Kreps (Confluent)
Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes or customer experience. This talk will explain some of the difficulties of building production machine learning systems and talk about how Apache Kafka and stream processing can help.
Data engineering and architecture, Platform security and cybersecurity
Carolyn Duby (Hortonworks)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Learn how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable, open-source platform. After this interactive overview of the platform's major features, you will be ready to analyze your own haystack back at the office.
1-Day Training Please note: to attend a training course, you must be registered for a Platinum or Training pass; does not include access to tutorials.
The instructor walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
Data science and machine learning
Andrew Montalenti (Parse.ly)
What can we learn from a one-billion-person live poll of the Internet? Parse.ly has gathered a unique data set of news reading sessions of billions of devices, peaking at over 2 million sessions per minute on thousands of high-traffic news and information websites. Our team of data scientists and machine learning engineers has used this data to unearth the secrets behind online content.
Data engineering and architecture, Streaming systems and real-time applications
Brian Wu (AppNexus)
Automating the success of digital ad campaigns is complicated and comes with the risk of wasting the advertiser's budget or a trader's margin and time. This talk describes the evolution of Inventory Discovery, a streaming control system of eligibility, prioritization and real-time evaluation that helps digital advertisers hit their performance goals with AppNexus.
Data engineering and architecture
Mark Madsen (Think Big Analytics), Todd Walter (Teradata)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build multi-use data infrastructure that is not subject to past constraints. We will explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
Data engineering and architecture
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Using Customer 360 and the Internet of Things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics.
Visualization and user experience
Bob Levy (Virtual Cove, Inc.)
Augmented Reality opens a completely new lens on your data through which you see and accomplish amazing things. Learn how simple Python scripts can leverage completely new plot types. See use cases revealing new insight into financial markets data. Explore new ways of seeing & interacting with data to shed light on & build trust in otherwise “black box” machine learning solutions.
Law, ethics, governance
LaVonne Reimer (Lumenous)
GDPR asks us to rethink personal data systems, viewing UI/UX, consent management, and value-add data services through the eyes of the subjects of the data. The opportunity in the $150B credit and risk industry is to deploy data governance technologies that balance individuals' interest in controlling their own data with requirements for trusted data.
Big data and data science in the cloud, Data engineering and architecture
Kenji Hayashida (Recruit Lifestyle Co., Ltd.), Toru Sasaki (NTT DATA Corporation)
Recruit Group and NTT DATA Corporation developed a platform based on a "Datahub" utilizing Apache Kafka. This platform must handle around 1 TB/day of application logs generated by many services in Recruit Group. This session explains some of the best practices and know-how learned during this project, such as schema evolution and network architecture.
Law, ethics, governance, Strata Business Summit
Andrew Burt (Immuta)
Machine learning is becoming prevalent across industries, creating new types of risk. Managing this risk is quickly becoming the central challenge of major organizations, one that strains data science teams, legal personnel, and the C-suite alike. This talk will highlight lessons from past regulations focused on similar technology, and conclude with a proposal for new ways to manage risk in ML.
Data science and machine learning
Atul Kale (Airbnb), Xiaohan Zeng (Airbnb)
We introduce Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Bighead integrates popular libraries including TensorFlow, XGBoost, and PyTorch. It is built on Python, Spark, and Kubernetes, and is designed to be used in modular pieces. It has reduced the overall model development time from many months to days at Airbnb.
Data science and machine learning
Daniel Kang (Stanford University)
As video volumes grow, automatic methods are required to prioritize human attention. However, these methods do not scale and are cumbersome to deploy. In response, we introduce BlazeIt, an exploratory video analytics engine. We show our declarative language, FrameQL, can capture a range of real-world queries and BlazeIt's optimizer can execute these queries over 2000x faster than naive approaches.
Data science and machine learning
Olga Cuznetova (Optum), Manna Chang (Optum UHG)
This presentation will focus on showing both supervised and unsupervised learning methods for working with claims data and how they can complement each other. A supervised method will look at CKD patients at risk of developing ESRD, and an unsupervised approach will look at classification of patients that tend to develop this disease faster than others.
Data engineering and architecture
Chris Fregly (PipelineAI)
Applying my Netflix experience to a real-world problem in the ML and AI world, I will demonstrate a full-featured, open-source, end-to-end TensorFlow Model Training and Deployment System using the latest advancements with Kubernetes, TensorFlow, and GPUs.
Data science and machine learning
David Arpin (Amazon Web Services)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Outline:
- What is Amazon SageMaker? A quick product overview of AWS's newest ML platform
- Create a Spark EMR cluster
- Integrate SageMaker algorithms into Spark pipelines
- Ensemble multiple models for a real-time prediction task
Data engineering and architecture
Yaroslav Tkachenko (Activision)
What can be easier than building a data pipeline? You add a few Apache Kafka clusters, some way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse... wait, it does start to look like A LOT of things, doesn't it? Join this talk to learn about the best practices we've been using for all the above.
Big data and data science in the cloud, Data engineering and architecture
Nir Yungster (JW Player), Kamil Sindi (JW Player)
Building a video recommendation model that serves millions of monthly visitors is a challenge in itself. At JW Player, we face the challenge of providing on-demand recommendations as a service to thousands of media publishers. We focus on how to systematically improve model performance while navigating the many engineering challenges and unique needs of the diverse publishers we serve.
Big data and data science in the cloud, Data engineering and architecture
Jorge A. Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services (AWS)), Paul Sears (Amazon Web Services), Ryan Nienhuis (Amazon Web Services), Randy Ridgley (Amazon Web Services)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez walks you through building a big data application using a combination of open source technologies and AWS managed services.
Data engineering and architecture
Kaushik Deka (Novantas), Ingrid Liu (Novantas)
We discuss a large-scale optimization architecture in Spark for a consumer product portfolio optimization case study in retail banking—which combines a simulator that distributes computation of complex real-world scenarios given varying macro-economic factors, consumer behavior and competitor landscape, and a constraint optimizer that uses business rules as constraints to meet growth targets.
Big data and data science in the cloud
Jonathan Ellis (DataStax)
Is open-source Apache Cassandra still relevant in an era of hosted cloud databases? DataStax CTO Jonathan Ellis will discuss Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner.
Data-driven business management
Andreas Kohlmaier (MunichRe)
MunichRe is increasing client resilience against economic, political, and cyber risks while setting and shaping trends in the insurance market. Recently MunichRe successfully launched a data catalog as the driver for analyst adoption of a data lake. Cataloging new data encouraged users to effectively and collaboratively explore new ideas, develop new business, and enhance customer service.
Big data and data science in the cloud, Data engineering and architecture
Do your analysts always trust the insights generated by your Data Platform? Ensuring insights are always reliable is critical for use-cases in the Financial Sector. Similar to a circuit-breaker design pattern used in Service Architectures, this talk describes a circuit-breaker pattern we developed for data pipelines. We are able to detect/correct problems and ensure always reliable insights!
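As a rough illustration of the idea only (the abstract does not describe the speakers' implementation), a circuit breaker for a data pipeline might run quality checks on each batch and stop publishing downstream once a check fails. Every name and threshold below (`PipelineCircuitBreaker`, `column`, `min_rows`, `max_null_rate`) is a hypothetical assumption made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class PipelineCircuitBreaker:
    """Hypothetical breaker: blocks publishing after a failed quality check."""
    column: str = "amount"       # assumed column to validate
    min_rows: int = 1            # assumed threshold, not from the talk
    max_null_rate: float = 0.1   # assumed threshold, not from the talk
    is_open: bool = False        # open circuit = downstream publishing blocked
    quarantined: list = field(default_factory=list)

    def check(self, batch):
        """Return a list of human-readable problems found in the batch."""
        problems = []
        if len(batch) < self.min_rows:
            problems.append("too few rows")
        if batch:
            nulls = sum(1 for row in batch if row.get(self.column) is None)
            if nulls / len(batch) > self.max_null_rate:
                problems.append("null rate too high")
        return problems

    def publish(self, batch, sink):
        """Append the batch to sink only if the circuit is closed and checks pass."""
        if self.is_open or self.check(batch):
            # Trip (or stay tripped) and hold the batch back for inspection.
            self.is_open = True
            self.quarantined.append(batch)
            return False
        sink.extend(batch)
        return True

    def reset(self):
        """Close the circuit again, e.g. after an operator fixes the issue."""
        self.is_open = False
```

In this sketch a tripped breaker quarantines every later batch until `reset()` is called, so consumers see stale but trusted data rather than fresh but suspect data.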
Data science and machine learning
Ash Munshi (Pepperdata)
In this talk we will describe a technique for labeling applications using runtime measurements of CPU, memory, I/O, and network and a deep neural network. This labeling groups the applications into buckets that have understandable characteristics and that can then be used to reason about the cluster and its performance.
Data engineering and architecture
Paul Curtis (MapR Technologies)
Now that the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure to provide the business insights? This discussion explores three customer deployments that leverage the best of private clouds and containers to provide a flexible big data environment.
Data engineering and architecture
Mala Ramakrishnan (Cloudera), Jason Wang (Cloudera), Tony Wu (Cloudera)
The largest infrastructure paradigm change of the 21st Century is the shift to the cloud. Companies are faced with the difficult and daunting decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. In this talk we use our experience from building production services on AWS and Azure to compare their strengths and weaknesses.
Data science and machine learning
One current application, anomaly detection, is a necessary but insufficient step, because anomaly detection over a set of live data streams may result in anomaly fatigue, limiting effective decision making. We shall walk the audience through how marrying correlation analysis with anomaly detection can help, and share techniques to guide effective decision making.
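One way to picture the combination, as a minimal sketch rather than the speakers' method: flag anomalies per stream with a simple z-score test, then merge anomalies that occur at the same time step on strongly correlated streams into a single alert. The detector, the Pearson-based grouping, and the function names and thresholds (`zscore_anomalies`, `grouped_alerts`, `threshold`, `min_corr`) are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def zscore_anomalies(series, threshold=3.0):
    """Return indices whose z-score magnitude exceeds the threshold."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return set()
    return {i for i, x in enumerate(series) if abs(x - mu) / sigma > threshold}


def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def grouped_alerts(streams, threshold=3.0, min_corr=0.8):
    """Merge same-index anomalies from strongly correlated streams into one alert."""
    names = sorted(streams)
    anomalies = {n: zscore_anomalies(streams[n], threshold) for n in names}
    alerts = []
    for t in sorted(set().union(*anomalies.values())):
        flagged = [n for n in names if t in anomalies[n]]
        groups = []
        for n in flagged:
            for g in groups:
                # Join the first group containing a strongly correlated stream.
                if any(abs(pearson(streams[n], streams[m])) >= min_corr for m in g):
                    g.append(n)
                    break
            else:
                groups.append([n])
        alerts.extend((t, tuple(g)) for g in groups)
    return alerts
```

With this grouping, a spike that shows up simultaneously on two correlated metrics produces one alert instead of two, which is the kind of reduction in alert volume the abstract alludes to.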
Data-driven business management
Abhimanyu Verma (Novartis)
A case study on how a transformational business opportunity was realized on the foundation of an integrated data, process, culture, organization, and technology strategy.
Data engineering and architecture
Michelle Ufford (Netflix)
In this talk, Michelle Ufford will share some cool things Netflix is doing with data and the big bets we’re making on data infrastructure. Topics will include workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, and platform intelligence.
Law, ethics, governance, Strata Business Summit
Nuria Ruiz (Wikimedia)
The Wikipedia community feels strongly that you shouldn’t have to provide personal information to participate in the free knowledge movement. In this talk we will go into the challenges that this strong privacy stance poses for the Wikimedia Foundation, including how it affects data collection and some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way.
Strata Business Summit
Alistair Croll (Solve For Interesting)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Welcome to the Data Case Studies tutorial.
Data engineering and architecture, Law, ethics, governance
Barbara Eckman (Comcast)
Comcast’s Streaming Data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. We were recently challenged to integrate on-prem data sources, including traditional data warehouses and RDBMSs. Our data governance strategy must now include relational and JSON schemas in addition to Apache Avro. Here’s how we did it!
Data engineering and architecture
Andrew J Brust (ZDNet)
Data governance is a product category that has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging and new products are rising to meet the challenge. This session tracks data governance's past, present, and future.
Data-driven business management
Sam Helmich (Deere & Company)
Data science can benefit by borrowing some principles of Agile. These benefits can be compounded by structuring the team roles in such a manner to enable success without relying on employing full-stack expert “unicorns”.
Data science and machine learning
Jeroen Janssens (Data Science Workshops B.V.)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
The Unix command line remains an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful command-line tools, you can quickly scrub, explore, and model your data as well as hack together prototypes. This hands-on workshop is based on the O’Reilly book Data Science at the Command Line, written by instructor Jeroen Janssens.
Data-driven business management, Strata Business Summit
Erin Coffman (Airbnb)
Airbnb has open-sourced many high-leverage data tools: Airflow, Superset, and the Knowledge Repo. However, adoption of these tools across Airbnb was relatively low. To make data more accessible and utilized in decision-making, Airbnb launched Data University in early 2017. Since the launch, over a quarter of the company has participated in the program, and data tool utilization rates have doubled.
Visualization and user experience
Anna Nicanorova (Annalect)
Data visualization is supposed to be our map to information. However, contemporary charting techniques have a few shortcomings: context reduction, hard numeric grasp, and perceptual dehumanization. Augmented reality can potentially solve all of these issues by presenting an intuitive and interactive environment for data exploration.
Data-driven business management
Mike Berger (Mount Sinai Health System)
Hear how Mount Sinai Health has moved up the analytics maturity chart to deliver business value in new risk models around Population Health. Learn how to design a team, build a data factory, and generate the analytics to drive decision-centricity. See examples of mixing Tableau, SQL, Hive, APIs, Python, and R into a cohesive ecosystem supported by our data factory.
Data science and machine learning
Garrett Hoffman (StockTwits)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
This workshop will review deep learning methods used for natural language processing and natural language understanding tasks while working on a live example with StockTwits data using Python and TensorFlow. Methods we review include Word2Vec, recurrent neural networks and variants (LSTM, GRU), and convolutional neural networks.
Big data and data science in the cloud
Swetha Machanavajhala (Microsoft), Xiaoyong Zhu (Microsoft)
In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds (dog barks, alarms, people calling from behind, etc.). We all take this for granted, yet there are over 360 million people in this world who are deaf or hard of hearing. How can we make the auditory world inclusive as well as meet the great demand in other sectors by applying deep learning on audio in Azure?
Data engineering and architecture
Wangda Tan (Hortonworks Inc.)
Applications such as TensorFlow, MXNet, Caffe, and XGBoost can be leveraged to train deep learning and machine learning models. We introduced new features in Apache Hadoop 3.x, such as GPU isolation and Docker support, to better support deep learning workloads. In this talk, we will take a closer look at these improvements and show how to run these applications on YARN with demos.
Data science and machine learning
Dr. Vijay Srinivas Agneeswaran (SapientRazorfish), Abhishek Kumar (SapientRazorfish)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.
Data science and machine learning
Lars Hulstaert (Microsoft)
Transfer learning allows data scientists to leverage insights from large labelled data sets. The general idea of transfer learning is to use knowledge learned from tasks for which a lot of labelled data is available in settings where only little labelled data is available. In this talk, you’ll learn what transfer learning is and how it can boost your NLP or CV pipelines.
Data science and machine learning
Diego Oppenheimer (Algorithmia)
After big investments in collecting & cleaning data, and building Machine Learning models, enterprises discover the big challenges in deploying models to production and managing a growing portfolio of ML models. This talk covers the strategic and technical hurdles each company must overcome and the best practices we've developed while deploying over 4,000 ML models for 70,000 engineers.
Data engineering and architecture, Streaming systems and real-time applications
Arun Kejariwal (MZ), Karthik Ramasamy (Streamlio)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
In this tutorial, we will walk the audience through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, viz., messaging frameworks, streaming computing frameworks, and storage frameworks for real-time data. We will also walk through case studies from IoT, gaming, and healthcare, and share our experiences operating these systems at Internet scale.
Data science and machine learning, Strata Business Summit
Chiny Driscoll (MetiStream Inc.), Jawad Khan (Rush University Medical Center)
This Cloudera/MetiStream solution lets healthcare providers automate the extraction, processing and analysis of clinical notes within the Electronic Health Record in batch or real-time. Improve care, identify errors, and recognize efficiencies in billing and diagnoses by leveraging NLP capabilities to conduct fast analytics in a distributed environment. Use case by Rush University Medical Center.
Data science and machine learning
Ahsan Ashraf (Pinterest)
Online recommender systems often rely heavily on user engagement features. This can cause a bias towards exploitation over exploration, over-optimizing on users' interests. Content diversification is important for user satisfaction; however, measuring and evaluating its impact is challenging. This work outlines techniques used at Pinterest that drove ~2-3% impression gains and a ~1% time spent gain.
Data engineering and architecture
Cory Minton (Dell EMC), Nikki Rouda (Cloudera)
How to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble.
Data science and machine learning
James Dreiss (Reuters)
A discussion of the challenges in building a content recommendation system for one of the largest news sites in the world, Reuters.com. The particularities of the system include developing a scrolling newsfeed and the use of document vectors for semantic representation of content.
Data-driven business management, Emerging technologies and case studies
Steve Otto (Navistar)
Navistar built an IoT-enabled remote diagnostics platform, called OnCommand®™ Connection, to bring together data from 375,000+ vehicles in real time to drive predictive analytics. This service is being offered to fleet owners, who can now monitor the health and performance of their trucks from smartphones or tablets. Join Steven Otto from Navistar to learn more about their IoT & data journey.
Law, ethics, governance, Strata Business Summit
GDPR is more than another regulation to be handled by your back office. Enacting the Data Subject Access Rights (DSAR) requires practical actions. In this session, we will discuss the practical steps to deploy governed data services.
Law, ethics, governance, Strata Business Summit
Anthony Hsu (LinkedIn), Issac Buenrostro (LinkedIn)
With over 100 million LinkedIn members in the EU, enforcing GDPR compliance is challenging. In this talk, we explain the architecture of our system and how we leverage Hive, Kafka, Gobblin, and WhereHows to ensure compliance.
Data-driven business management, Strata Business Summit
Jennifer Prendki (Atlassian)
The Agile Methodology has been widely successful for Software Engineering teams, but seems inappropriate for Data Science teams. This is because Data Science is part-engineering, part-research. In this talk, I will show how, with a minimum amount of tweaking, Data Science managers can adapt the techniques used in Agile and establish best practices to make their teams more efficient.
Data-driven business management, Strata Business Summit
Brandy Freitas (Pitney Bowes)
Data science is an approachable field given the right framing. Often, though, practitioners and executives describe opportunities using completely different languages. In this session, Harvard biophysicist turned data scientist Brandy Freitas will work with participants to develop context and vocabulary around data science topics to help build a culture of data within their organization.
Data-driven business management, Strata Business Summit
Sanjeev Mohan (Gartner)
If the last few years were spent proving the value of data lakes, the emphasis now is on monetizing the big data architecture investments. The rallying cry is to onboard new workloads efficiently. But how does one do so without knowing what data is in the lake, the level of its quality, and the trustworthiness of models? This is why data governance becomes the linchpin to the success of lakes.
Data-driven business management, Strata Business Summit
Mikio Braun (Zalando SE)
In order to become "AI ready", an organization has to not only provide the right technical infrastructure for data collection and processing, but also learn new skills. In this talk I will highlight three such missing pieces: making the connection between business problems and AI technology, AI-driven development, and how to run AI-based projects.
Law, ethics, governance, Strata Business Summit
Mark Donsky (Cloudera), Steven Ross (Cloudera)
The General Data Protection Regulation (GDPR) goes into effect in May 2018 for firms doing any business in the EU. However, many companies aren't prepared for the strict regulation or the fines for noncompliance (up to €20 million or 4% of global annual revenue). This session will explore the capabilities your data environment needs in order to simplify compliance with GDPR and with future regulations.
Data engineering and architecture, Strata Business Summit
Ted Malaska (Blizzard Entertainment), Jonathan Seidman (Cloudera)
Creating a successful big data practice in your organization presents new challenges in managing projects and teams. In this session we'll provide guidance and best practices to help technical leaders deliver successful projects from planning to implementation.
Data-driven business management, Strata Business Summit
Cassie Kozyrkov (Google)
Many organizations aren’t aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. This session examines what it takes to build a truly data-driven organizational culture and highlights a vital, yet often neglected, job function: the data science manager.
Data-driven business management, Strata Business Summit
Tony Baer (Ovum), Florian Douetteau (DATAIKU)
Ovum will present the results of research cosponsored by Dataiku, surveying a specially selected sample of chief data officers and data scientists, on how to map roles and processes to make success with AI in the business repeatable.
Strata Business Summit, Streaming systems and real-time applications
Dean Wampler (Lightbend)
Streaming data systems, so-called "Fast Data", promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just "faster" versions of Big Data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. This talk tells you what you need to know to exploit Fast Data successfully.
Data science and machine learning, Strata Business Summit
David Talby (Pacific AI)
Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.
Ian Cook (Cloudera)
1-Day Training Please note: to attend a training course, you must be registered for a Platinum or Training pass; does not include access to tutorials.
Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, with different syntaxes, conventions, and terminology. The instructor will simplify the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, participants will overcome obstacles to getting started using new tools.
Data-driven business management, Law, ethics, governance
Mridul Mishra (Fidelity Investments)
Currently, most ML models (deep learning models in particular) work like a black box, and a key challenge in their adoption is the need for explainability. In this talk, we will explore why explainability is needed, review the current state, and provide a framework for thinking about these needs and the potential solutions.
Strata Business Summit
Alistair Croll (Solve For Interesting)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Program Chair, Alistair Croll, welcomes you to Findata Day.
Data-driven business management
Stephanie Fischer (datanizing GmbH)
Users generate text all over the internet - 24 hours a day, 7 days a week. This text often contains complaints, wishes and clever ideas. Using both unsupervised and supervised machine learning, we show you what insight can be derived from 100,000 user comments related to New York. We will uncover the most exciting trends and sentiments with interactive visualisations.
Data-driven business management, Strata Business Summit
JF Gagne (Element AI)
The CIO is going to need a broader mandate in the company to better align their AI training and outcomes with business goals and compliance. This mandate should include an AI governance team that is well staffed and deeply established in the company in order to catch biases that can develop from faulty goals or flawed data.
Data engineering and architecture
Julien Le Dem (WeWork)
Over the past 10 years, Big Data infrastructure has evolved from flat files in a distributed file system into an efficient ecosystem that is turning into a fully deconstructed, open source database with reusable components. We started from a system that was good at looking for a needle in a haystack using snowplows: a lot of horsepower and scalability, but lacking the efficiency of relational databases.
Data-driven business management, Strata Business Summit
Friederike Schuur (Cloudera), Rita Ko (USA for UNHCR)
The Hive and Cloudera Fast Forward Labs share how they transformed USA for UNHCR (UN Refugee Agency) to use data science and machine learning (DS/ML) to address the refugee crisis. From identifying use cases and success metrics to showcasing the value of DS/ML, we cover the development and implementation of a DS/ML strategy hoping to inspire other organizations looking to derive value from data.
Data-driven business management, Strata Business Summit
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
This tutorial is a primer on crafting well-conceived data science projects on course toward uncovering valuable business insights. Using case studies and hands-on skills development, we will teach techniques that are essential for a variety of audiences invested in effecting real business change.
Data engineering and architecture
Brian Foo (Google), Ron Bodkin (Google), Jay Smith (Google), David Aronchick (Google)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Join Ron Bodkin and Brian Foo to learn how to bring deep learning models from training to serving in a cloud production environment. You will learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
Laura Eisenhardt (iKnow Solutions)
Data brings industries unprecedented insights about customer behavior, and personal data is being harvested. We know more about our customers and neighbors than at any other time in history but need to avoid "crossing the creepy line". Governance and security experts from Cloudera, Mastercard, and iKnow Solutions discuss how ethical behavior drives trust, especially in today's IoT age.
Data engineering and architecture, Law, ethics, governance
Mark Donsky (Cloudera), Andre Araujo (Cloudera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span on-prem, private cloud, multi-cloud, and hybrid cloud. We will share hands-on best practices for meeting these challenges, with special attention to GDPR.
Zachary Glassman (The Data Incubator)
1-Day Training Please note: to attend a training course, you must be registered for a Platinum or Training pass; does not include access to tutorials.
The Data Incubator offers a foundation in building intelligent business applications using machine learning. We will walk through all the steps - from prototyping to production - of developing a machine learning pipeline. We’ll look at data cleaning, feature engineering, model building/evaluation, and deployment. Students will extend these models into an application using a real-world dataset.
Data engineering and architecture, Streaming systems and real-time applications
Dean Wampler (Lightbend), Boris Lublinsky (Lightbend)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
This hands-on tutorial builds streaming apps as microservices using Kafka with Akka Streams and Kafka Streams. We'll assess the strengths and weaknesses of each tool for particular needs, so you'll be better informed when choosing between them. We'll contrast them with Spark Streaming and Flink, including when to choose those instead. The sample apps demonstrate ML model serving ideas.
Emerging technologies and case studies
Karthik Ramasamy (Streamlio), Matteo Merli (Streamlio)
Apache Pulsar, a messaging system, is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it is very important to ensure that the system can make use of all the available resources. This talk will provide insight on the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees.
Keynotes
Jacob Ward (CNN | Al Jazeera | PBS)
For most of us, our own mind is a black box: an all-powerful and utterly mysterious device that runs our lives for us. And not only do we humans just barely understand how it works, science is now revealing that it makes most of our decisions for us using rules and shortcuts of which you and I aren’t even aware.
Data engineering and architecture
Shawn Terry (Komatsu), Anthony Reid (Komatsu)
Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment to ultimately improve mine performance and efficiencies. Join Shawn Terry and Anthony Reid to learn more about their data journey and how they are using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment.
Data-driven business management
Ann Nguyen (Whole Whale)
Google returns 97,900,000 results for “data-driven business.” Innovation is the key to survival, and data, combined with design thinking and iteration, is a proven path. The problem is that this system lacks a conscience; it lacks empathetic thinking.
Data science and machine learning
Aileen Nielsen (One Drop)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
There is mounting evidence that the widespread deployment of machine learning and artificial intelligence in business and government applications is likely reproducing or even amplifying existing prejudices and social inequalities. This tutorial is designed to give knowledge and tools to data scientists so they can identify and avoid bias and other unfairness in their analyses.
Data engineering and architecture
Osman Sarood (Mist Systems)
Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, we saw 10X infrastructure growth. Learn how we are reliably running 75% of our production infrastructure on AWS EC2 spot instances, which has kept our annual AWS cost to $1 million vs. $3 million (a 66% reduction).
Data engineering and architecture, Streaming systems and real-time applications
Uber has a real need to provide faster, fresher data to data consumers and products, running hundreds of thousands of analytical queries every day. Uber engineers share the design, architecture, and use cases of the second generation of ‘Hudi’, an analytical storage engine designed to serve such needs and beyond.
Law, ethics, governance, Strata Business Summit
John Thuma (Arcadia Data)
Forget about the fake news: data and analytics are what drive elections. While proposing analytical solutions to the RNC and DNC, I faced ethical dilemmas. Not only did I help causes I disagreed with, but I also armed politicians with “REAL-TIME” data to manipulate voters. Politics is a business, and today’s modern data infrastructure optimizes campaign funds more effectively than ever.
Data engineering and architecture
Owen O'Malley (HortonWorks), Ryan Blue (Netflix)
Iceberg is a new open source project that defines a new table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.
Data engineering and architecture
Timothy Spann (DZone)
A hands-on deep dive into using Apache MiNiFi with Apache MXNet and other deep learning libraries on edge devices.
Data engineering and architecture
Michelle Casbon (Google Cloud Platform Developer Relations)
Learn how to build a Machine Learning application with Kubeflow, which makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere. Kubeflow supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Find out what Kubeflow currently supports and the long-term vision for the project, presented by a project contributor.
Emerging technologies and case studies
Robin Way (Corios)
This session will present case study examples of next best offer strategies, predictive customer journey analytics, and behavior-driven time-to-event targeting for mathematically-optimal customer messaging that drives incremental margins.
Data science and machine learning
Viviana Acquaviva (CUNY New York City College of Technology)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
We present an intermediate Machine Learning tutorial based on actual problems in Astronomy research. Our strengths are that we use interesting, diverse, publicly available data sets; we feature students' feedback as "best and worst" content; we focus on the customization of algorithms and evaluation metrics required by scientific applications; and we propose open problems to our participants.
Data science and machine learning
Vartika Singh (Cloudera), Suyash Ramineni (Cloudera), Juan Yu (Cloudera), Steven Totman (Cloudera), Marton Balassi (Cloudera)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Vartika Singh and Juan Yu outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks.
Data engineering and architecture
Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Huawei)
The StreamDM library provides the largest collection of data stream mining algorithms for Spark. This talk will cover how StreamDM can be used alongside Structured Streaming to build incremental models, especially for non-stationary streams (i.e., those with concept drift). Concretely, we will cover how to develop, apply, and evaluate learning models using StreamDM and Structured Streaming.
Data science and machine learning
Mikio Braun (Zalando SE)
Time series data has many applications in industry, from analyzing server metrics to monitoring IoT signals and outlier detection. Mikio Braun offers an overview of time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations for what works and what doesn’t, and industry use cases.
Dylan Bargteil (The Data Incubator)
1-Day Training Please note: to attend a training course, you must be registered for a Platinum or Training pass; does not include access to tutorials.
The TensorFlow library provides for the use of data flow graphs for numerical computations, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. This training will introduce TensorFlow's capabilities through its Python interface.
Delip Rao (R7 Speech Science)
1-Day Training Please note: to attend a training course, you must be registered for a Platinum or Training pass; does not include access to tutorials.
Explore machine learning and deep learning with PyTorch, and learn how to build effective models for real-world data.
Visualization and user experience
James Bednar (Anaconda)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Python lets you solve data-science problems by stitching together packages from the Python ecosystem, but it can be difficult to assemble the right tools to solve real-world problems. Here we show how to use the 15+ packages covered by the new PyViz.org initiative to make it simple to build interactive plots and dashboards, even for large, streaming, and highly multidimensional data.
Data engineering and architecture
Alexey Kachayev (Attendify)
When we talk about microservices we usually focus on the communication layer and rarely on the data. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity: structural and semantic changes, knowledge sharing and data discovery. We'll discuss emerging technologies created to tackle these challenges.
Data-driven business management, Strata Business Summit
Nick Elprin (Domino Data Lab)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
The honeymoon era of data science is ending, and accountability is coming. Not content to wait for results that may or may not arrive, successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Nick Elprin details how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage.
Law, ethics, governance
Anand S (Gramener)
Answering simple questions about India's geography can be a nightmare. What is the boundary of the postal code? Or a census block? Or even a constituency? The official answer resides in a set of manually drawn PDFs. But an active group of volunteers are crafting open maps. Their coverage and quality are such that it may enable the largest census exercise in the world in 2020.
Data engineering and architecture
Danny Chen (Uber Technologies), Omkar Joshi (Uber Technologies), Eric Sayle (Uber Technologies)
Marmaray is a generic Hadoop ingestion and dispersal framework recently released to production at Uber. We will introduce the main features of Marmaray and business needs met, share how Marmaray can help a team's data needs by ensuring data can be reliably ingested into Hive or dispersed into online data stores, and give a deep dive into the architecture to show you how it all works.
Sponsored, Strata Business Summit
1-Day Training Please note: to attend a training course, you must be registered for a Platinum or Training pass; does not include access to tutorials.
Acquiring machine-learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp, we teach students how to apply advanced analytics in ways that reshape the enterprise and improve outcomes. This training is equal parts hackathon, presentation, and group participation.
Data science and machine learning
Dan Crankshaw (UC Berkeley RISELab)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
This tutorial consists of three parts. First, I will present an overview of the current challenges in deploying machine learning applications into production and provide a survey of the current state of prediction serving infrastructure. Next, I will provide a deep dive on the Clipper serving system. Finally, I will run a hands-on workshop for getting started with Clipper.
Data science and machine learning
Jared Lander (Lander Analytics)
Temporal data is being produced in ever greater quantity and, fortunately, our time series capabilities are growing as well. We look at a number of techniques for modeling time series. We start with traditional methods such as ARMA, then go over more modern tools such as Prophet and machine learning models like XGBoost and neural nets. Along the way we look at a bit of theory and code for training these models.
Data-driven business management
Jennifer Lim (Cerner)
Big Data expectations can no longer be technical requirements managed with bubble systems. They impact our entire architecture, including operational assets in areas like HR, Finance, Marketing, and Service Management. Share in the approaches used to create our modern architecture strategy, realigning big data expectations with our business goals to increase our efficiency and innovation.
Data science and machine learning
David Talby (Pacific AI), Claudiu Branzan (G2 Web Services), Alexander Thomas (Indeed)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable, open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve.
Data engineering and architecture, Streaming systems and real-time applications
Thomas Weise (Lyft), Mark Grover (Lyft)
Consumer facing real-time processing poses a number of challenges to protect from fraudulent transactions and other risks. The streaming platform at Lyft seeks to support this with an architecture that brings together a data science friendly programming environment with a deployment stack for the reliability, scalability and other SLA requirements of a mission critical stream processing system.
Data science and machine learning
Zachary Hanif (Capital One)
Modern frameworks and analytical techniques are making graph analysis methods viable for increasingly large complex tasks, while an understanding of graph based analytical techniques can be extremely powerful when applied to modern practical problems. This talk examines three prominent graph analytic methods, including graph convolutional networks, and applies them to concrete use cases.
Data-driven business management, Strata Business Summit
Usama Fayyad (Open Insights), Troels Oerting (Barclays UK)
This presentation will share the main outcomes and learnings from building and deploying a global data fusion, incident analysis/visualization, and effective cybersecurity defense based on big data and AI at a major EU bank and in collaboration with several financial services institutions. The focus is on learnings and breakthroughs gleaned from making the systems work.
Big data and data science in the cloud
Greg Rahn (Cloudera), Mostafa Mokhtar (Cloudera)
Cloud object stores are becoming the bedrock of a cloud data warehouse for modern data-driven enterprises. Given today's data sizes, it's become a necessity for data teams to have the ability to directly query data stored in S3 or ADLS. In this talk, we'll discuss the optimal end-to-end workflows and technical considerations of using Apache Impala over object stores for your cloud data warehouse.
Data engineering and architecture
Michael Freedman (TimescaleDB)
I describe how to leverage Postgres even for high-volume time-series workloads using TimescaleDB, an open-source time-series database built as a Postgres plugin. I explain its general architectural design principles, as well as new time-series data management features including adaptive time partitioning and near-real-time continuous aggregations.
Data science and machine learning
Bonnie Barrilleaux (LinkedIn)
Following metrics blindly leads to unintended negative side-effects. At LinkedIn as we encouraged members to join conversations, we found ourselves in danger of creating a "rich get richer" economy in which a few creators got an increasing share of all feedback. This example reminds us to regularly re-evaluate metrics, because creating value for users is more important than driving any metric.
Patrick Hall (H2O.ai | George Washington University), Navdeep Gill (H2O.ai), Megan Kurka (H2O.ai), Mark Chan (H2O.ai)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. This technical tutorial will share practical and productizable approaches for explaining, testing, and visualizing machine learning models through a series of publicly available examples using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost.
Data science and machine learning
Cristobal Lowery (Baringa), Marc Warner (ASI)
Future Home Energy Management Systems could improve their energy efficiency by predicting resident needs through utilities data. This session discusses the opportunity with a particular focus on the key data features, the need for data compression and the data quality challenges.
Platform security and cybersecurity, Strata Business Summit
Les McMonagle (BlueTalon)
"Privacy by Design" is a fundamentally important approach to achieving compliance with GDPR and other data privacy or data protection regulations. This session will outline how organizations can save time and money while improving data security and regulatory compliance and dramatically reduce the risk of a data breach or expensive penalties for non-compliance.
Streaming systems and real-time applications
Gerard Maas (Lightbend)
Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. We will provide a critical view of their differences in key aspects of a streaming application: API usability, dealing with time, dealing with state, and machine learning capabilities. We will wrap up with practical guidance on picking one or combining both to implement resilient streaming pipelines.
Data science and machine learning
Sumit Gulwani (Microsoft)
Programming by input-output examples (PBE) is a new frontier in AI, set to revolutionize the programming experience for the masses. It can enable end users (99% of whom are non-programmers) to create small scripts, and make data scientists 10-100x more productive for many data wrangling tasks. Come learn about this new programming paradigm: its applications, its form factors, and the science behind it.
Emerging technologies and case studies
Ted Dunning (MapR Technologies)
Stateful containers are a well-known antipattern, but the standard answer of managing state in a separate storage tier is costly and complex. Recent developments have changed things dramatically for the better. In particular, you can now manage a high-performance software defined storage tier entirely in Kubernetes. I will describe what's new and how it makes big data easier on Kubernetes.
Data engineering and architecture, Platform security and cybersecurity
Felipe Hoffa (Google)
Before releasing a public dataset, practitioners need to strike a balance between utility and the protection of individuals. In this talk we'll move from theory to real life while handling massive public datasets. We'll showcase newly available tools that help with PII detection, and bring concepts like k-anonymity and l-diversity to a practical realm.
Data science and machine learning, Strata Business Summit
Kimberly Nevala (SAS Institute)
Too often, the discussion of AI and ML includes an expectation - if not a requirement - for infallibility. But as we know, this expectation is not realistic. So what’s a company to do? While risk can’t be eliminated, it can be rationalized. This session will demonstrate how an unflinching risk assessment enables AI/ML adoption and deployment.
Data engineering and architecture
Mauricio Aristizabal (Impact Inc)
Lessons learned from migrating Impact's traditional ETL platform to real-time on Hadoop (leveraging the full Cloudera EDH stack). A Data Lake in HBase, Spark Streaming jobs (with Spark SQL), Kudu for 'fast data' BI queries, and Kafka data bus for loose coupling between components are some of the topics we'll explore in detail.
Data-driven business management
Amro Alkhatib (National Health Insurance Company - Daman)
Processing claims is central to every insurance business. We present a successful business case for automating claims processing from idea to production. The machine learning based claim automation model uses NLP methods on non-text data and allows auditable automated claims decisions to be made.
Data-driven business management, Strata Business Summit
Yasuyuki Kataoka (NTT Innovation Institute, Inc.)
One of the challenges of sports data analytics is how to deliver machine intelligence beyond a mere real-time monitoring tool. This session highlights various real-time machine learning models in both IndyCar and the Tour de France. The talk covers the real-time data processing architecture, the machine learning models, and a demonstration that delivers meaningful insights for players and fans.
Jesse Anderson (Big Data Institute)
1-Day Training Please note: to attend a training course, you must be registered for a Platinum or Training pass; does not include access to tutorials.
To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks and explains how to choose the right one for your company.
Data-driven business management, Strata Business Summit
Lawrence Cowan (Cicero Group)
We've worked with firms and seen over and over that they are struggling to leverage their data. We've developed a methodology for assessing 4 critical areas that firms must consider when looking to make the analytical leap: Data Strategy; Data Culture; Data Analysis & Implementation; Data Management & Architecture.
Data science and machine learning
Bruno Gonçalves (New York University)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
The world is ever changing. As a result, many of the systems and phenomena we are interested in evolve over time, resulting in time-evolving datasets. Time series often display many interesting properties and levels of correlation. In this tutorial we will introduce students to the use of recurrent neural networks and LSTMs to model and forecast different kinds of time series.
Data-driven business management
Theresa Johnson (Airbnb)
How Airbnb builds its next-generation end-to-end Revenue Forecasting Platform, leveraging machine learning, Bayesian inference, TensorFlow, Hadoop, and web technology.
Big data and data science in the cloud, Data engineering and architecture
Eugene Fratkin (Cloudera), Vinithra Varadharajan (Cloudera), Stefan Salandy (Cloudera), Brandon Freeman (Cloudera, Inc.)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
Organizations now run diverse, multidisciplinary big data workloads that span data engineering, analytic database, and data science applications. Many of these workloads operate on the same underlying data and the workloads themselves can be transient or long-running in nature. One of the challenges we will explore is keeping the data context consistent across these various workloads.
Data science and machine learning
Ihab Ilyas (University of Waterloo | Tamr)
Machine-learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions.
Data engineering and architecture, Strata Business Summit
Francesco Mucio (Zalando SE)
The story of how Zalando went from old-school BI to an AI-driven company built on a solid data platform, what we learned in the process, and the challenges we still see ahead of us.
Data engineering and architecture
Changshu Liu (Pinterest)
At Pinterest, we build our data lake on S3, where we read and write data directly. The footprint of the data lake exceeds 100 PB as the business expands rapidly, which makes Pinterest one of the biggest customers of AWS. This massive scale and the nature of S3 bring a lot of technical challenges for processing engines like Hive, Presto, and Spark SQL.
Data-driven business management
Katharina Warzel (EveryMundo)
Airlines want to know what happens after a user interacts with our technology on their website. Do they convert? Do they close the browser and come back later? Where EveryMundo previously depended on airlines’ analytics tools to prove value, Katharina explores how to implement a client-independent, end-to-end tracking system.
Data science and machine learning
Shioulin Sam (Cloudera Fast Forward Labs)
Recent advances in deep learning allow us to use the semantic content of items in recommendation systems, addressing a weakness of traditional methods. In this talk we explore limitations of classical approaches and look at how using the content of items can help solve common recommendation pitfalls such as the cold start problem, and open up new product possibilities.
Data engineering and architecture
Jacques Nadeau (Dremio)
This talk will deep-dive on a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. We'll start with an overview of the system design and deployment architecture. This includes coverage of cache lifecycle, update patterns, cache cohesion and appropriate use cases.
Emerging technologies and case studies
Greg Quist (SmartCover Systems)
The first step in solving this crisis is knowing the extent and severity of the problem. Water levels in sewers have a signature, analogous to a human EKG. This signature can be analyzed in real time using pattern recognition techniques, revealing distressed pipelines and allowing users of this technology to take appropriate steps for maintenance and repair. Sewers can talk!
Data science and machine learning
Chang Liu (Georgian Partners)
This talk outlines a common problem faced by many software companies, the cold-start problem, and how Georgian Partners has been successful at solving this problem by transferring knowledge from existing data through differentially private data aggregation.
Data science and machine learning
David Talby (Pacific AI)
This case study describes a question answering system for accurately extracting facts from free-text patient records. The solution is based on Spark NLP, an open source extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding. We'll share best practices for training the domain-specific deep learning NLP models that such problems usually require.
Visualization and user experience
Brent Dykes (Domo)
Companies collect all kinds of data and use advanced tools and techniques to find insights, yet they often fail in the last mile: communicating insights effectively to drive change. This session will look at the power that stories wield over statistics and explore the art and science of data storytelling, an essential skill that everyone must have in today’s data economy.
Data engineering and architecture, Streaming systems and real-time applications
Tim Berglund (Confluent)
Tutorial Please note: to attend, your registration must include Tutorials on Tuesday.
A solid introduction to Apache Kafka as a streaming data platform. We'll cover its internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams—then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka.
Streaming systems and real-time applications
William Chambers (Databricks)
Streaming big data is a rapidly growing field, and one that currently demands a lot of operational complexity and expertise. This talk presents a decision-making framework to help attendees reason about the tools and technologies with which they can successfully deploy and maintain streaming data pipelines to solve business problems.
Jane Tran (Unqork)
Data’s role in financial services has been elevated. However, the rollout of data solutions often fails when an organization’s existing culture is misaligned with its capabilities. With Unqork, we’re increasing adoption by honoring existing capabilities. This discussion will explore methods to finally implement data solutions through both qualitative and quantitative discoveries.
Data-driven business management, Strata Business Summit
Data scientists are hard to hire. But too often, companies struggle to find the right talent only to make avoidable mistakes that cause their best data scientists to leave. From org structure and leadership to tooling and infrastructure to continuing education, this talk will offer concrete (and inexpensive) tips for keeping your data scientists engaged, productive, and adding business value.
Big data and data science in the cloud
Ryan Blue (Netflix), Daniel Weeks (Netflix)
In the last few years, Netflix's data warehouse has grown to more than 100PB in S3. This talk will summarize what we've learned, the tools we currently use and those we've retired, as well as the improvements we are rolling out, including Iceberg, a new table format for S3.
Data engineering and architecture
Gwen Shapira (Confluent)
Gwen Shapira will share design and architecture patterns that are used to modernize data engineering. We will see how Apache Kafka, microservices and event streams are used by modern engineering organizations to efficiently build data pipelines that are scalable, reliable and built to evolve.
Data science and machine learning, Strata Business Summit
Adil Aijaz (Split Software)
Many products, whether data driven or not, chase “the one metric that matters”. It may be engagement, revenue, or conversion, but the common theme is the pursuit of improvement in one metric. Product development teams should instead focus on designing metrics that measure their goals. Adil will present an approach to designing metrics and discuss best practices and common pitfalls that you may run into.
Data engineering and architecture
Amandeep Khurana (Cerebro Data)
Critical data management practices for easy, unified data access that meets security and regulatory compliance requirements.
Data engineering and architecture
Umur Cubukcu (Citus Data)
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases.
Visualization and user experience
Jeffrey Heer (Trifacta | University of Washington)
Jeffrey Heer introduces Vega and Vega-Lite, high-level declarative languages for interactive visualization that support exploratory data analysis, communication, and the development of new visualization tools.
Keynotes
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes.
Data engineering and architecture
Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Zhe Zhang (LinkedIn)
We have developed TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. TonY's native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop including MapReduce and Spark.
Data-driven business management
Patrick Angeles (Cloudera)
The financial crisis of 2008 exposed systemic issues in the financial system that resulted in the failures of several established institutions and a bailout of the entire industry. In the aftermath, banks and regulators are turning to big data solutions in order to avoid a repeat of history.
Data engineering and architecture
This talk explains how we at Stitch Fix built a service to better understand the movement and evolution of data within the data warehouse, from the initial ingestion from outside sources through all of our ETLs. We talk about why we built the service, how we built it, and the use cases that benefit from it.
Data engineering and architecture
Shankar Manian (LinkedIn), Manoj Kumar (LinkedIn), Arpan Agrawal (LinkedIn)
Have you ever tuned a Spark or MR job? If the answer is yes, then you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. With Dr. Elephant we introduced heuristic-based tuning recommendations. Now, we introduce TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage.
Data engineering and architecture
Holden Karau (Google), Rachel Warren (Salesforce Einstein), Anya Bida (Salesforce)
Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads.
Data engineering and architecture
Tim Walpole (BJSS)
Financial services clients demand increased data-driven personalization, faster insight-based decisions, and multichannel, real-time access. BJSS discusses how organizations can deliver real-time, vendor-agnostic, personalized chat services. We discuss the issues around security, privacy, legal sign-off, and data compliance, and how the Internet of Things can be used as a delivery platform.
Big data and data science in the cloud, Strata Business Summit
Niraj Nagrani (Ancestry)
Ancestry has more than 10 petabytes of structured and unstructured data. Ancestry’s SVP of platform, Niraj Nagrani, will discuss how companies can build a data platform that uses cloud computing, data science, artificial intelligence, and machine learning to analyze complex data sets at scale and provide personalized insights and a relationship graph to consumers.
Data engineering and architecture
Dave Shuman (Cloudera), James Kirkland (Red Hat)
The focus on IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing those learnings back out to the edge. Join Cloudera and Red Hat as they showcase how they implemented this architecture at one of the world’s leading manufacturers in Europe, including a demo highlighting the architecture.
Data engineering and architecture
Jim Scott (MapR Technologies)
Jim Scott details relevant use cases for blockchain-based solutions across a variety of industries, focusing on a suggested architecture to achieve high-transaction-rate private blockchains and decentralized applications backed by a blockchain. Along the way, Jim compares public and private blockchain architectures.
Visualization and user experience
Brian O'Neill (Designing for Analytics)
Gartner says 85%+ of big data projects will fail, despite the fact that your company may have invested millions in engineering implementation. Why are customers and employees not engaging with these products and services? CDOs, CIOs, product managers, and analytics leaders with a "people first, technology second" mission (a design strategy) will realize the best UX and business outcomes possible.
Data-driven business management
Paul Lashmet (Arcadia Data)
Artificial intelligence and deep learning are used to generate and execute trading strategies today. Meanwhile, regulators and investors demand transparency into investment decisions. The challenge is that the decision-making processes of machine learning technologies are opaque. The opportunity is that these same machines generate data that can be visualized to spot new trading opportunities.
Keynotes
Ben Lorica (O'Reilly Media), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)
Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.
Data engineering and architecture, Emerging technologies and case studies
Anant Chintamaneni (BlueData)
There is increased interest in using Kubernetes (K8s), the open-source container orchestration system, for modern big data workloads. The promised land is a unified platform for cloud-native stateless and stateful data services. However, stateful, multi-service big data cluster orchestration brings unique challenges. This session will delve into the considerations for big data services for K8s.
Data science and machine learning
Patty Ryan (Microsoft), CY Yam (Microsoft), Elena Terenzi (Microsoft)
Large online fashion retailers face the problem of efficiently maintaining a catalogue of millions of items. Due to human error, it is not unusual for some items to have duplicate entries. Trawling through such a large catalogue manually is nearly impossible. How would you prevent such errors? Find out how we applied deep learning as part of the solution.
Data engineering and architecture
William Benton (Red Hat)
Containers are a hot technology for application developers, but they also provide key benefits for data scientists. In this talk, you'll learn about the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively.
Data engineering and architecture
Felix Cheung (Uber)
Do you know how your Uber rides are powered by Apache Spark? Come to this talk to learn how Uber builds its data platform with Apache Spark at enormous scale, and the unique challenges we have faced and overcome.
Data engineering and architecture
Varant Zanoyan (Airbnb)
Zipline is Airbnb’s soon-to-be-open-sourced data management platform, designed specifically for ML use cases. It has cut the task of feature generation from months to days and offers features to support end-to-end data management for machine learning. This talk covers the architecture and dives into how Zipline solves ML-specific problems.
Data-driven business management
Maryam Jahanshahi (TapRecruit)
Hiring teams have long relied on intuition and experience to scout talent. Increased data and data science techniques give us a chance to test common recruiting wisdom. Maryam will draw on results from her recent behavioral experiments and analyses of over 10 million jobs and their outcomes to illustrate how often innocuous recruiting decisions have dramatic impacts on hiring outcomes.