Presented By O’Reilly and Cloudera

San Francisco • London • New York

Make Data Work

September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

1A 06/07

11:20am Semantic recommendations Shioulin Sam (Cloudera Fast Forward Labs)

1:15pm Document vectors in the wild: Building a content recommendation system for Reuters.com James Dreiss (Reuters)

2:05pm Diversification in recommender systems: Using topical variety to increase user satisfaction Ahsan Ashraf (Pinterest)

2:55pm Perverse incentives in metrics: Inequality in the like economy Bonnie Barrilleaux (LinkedIn)

4:35pm Anxiety at scale: How Investopedia used readership data to track market volatility Masha Westerlund (Investopedia)

5:25pm Network effects: Working with modern graph analytic systems Zachary Hanif (Capital One)

1A 08

11:20am BlazeIt: An exploratory video analytics engine Daniel Kang (Stanford University)

1:15pm Why data scientists should love Linux containers William Benton (Red Hat)

2:05pm Bighead: Airbnb's end-to-end machine learning platform Atul Kale (Airbnb), Xiaohan Zeng (Airbnb)

2:55pm Solving the cold start problem: Data and model aggregation using differential privacy Chang Liu (Georgian Partners)

4:35pm Programming by input-output examples Sumit Gulwani (Microsoft)

5:25pm From emotion analysis and topic extraction to narrative modeling Andreea Kremm (Netex Group), Mohammed Ibraaz Syed (UCLA)

1A 12/14

11:20am Breaking the rules: End-stage renal disease prediction Olga Cuznetova (Optum), Manna Chang (Optum)

1:15pm Correlation analysis on live data streams Arun Kejariwal (Independent), Francois Orsini (MZ)

2:05pm Continuous machine learning over streaming data: The story continues. Roger Barga (Amazon Web Services), Sudipto Guha (Amazon Web Services), Kapil Chhabra (Amazon Web Services)

2:55pm 50 reasons to learn the shell for doing data science Jeroen Janssens (Data Science Workshops)

4:35pm VC trends in machine learning and data science Sarah Catanzaro (Amplify Partners), Rama Sekhar (Norwest Venture Partners), Zavain Dar (Lux Capital), Jonathan Lehr (Work-Bench), Crystal Huang (NEA)

5:25pm A roadmap for open data science and AI for business: Panel discussion with State Street Bethann Noble (Cloudera), Daniel Huss (State Street), Abhishek Kodi (State Street)

1A 15/16

11:20am Machine learning for time series: What works and what doesn't Mikio Braun (Zalando)

1:15pm Harnessing and customizing state-of-the-art recommendation solutions with OpenRec Longqi Yang (Cornell Tech, Cornell University)

2:05pm Achieving personalization with LSTMs Ankit Jain (Uber)

2:55pm A deep learning approach for precipitation nowcasting with RNN using Analytics Zoo on BigDL Alex Heye (Cray), Ding Ding (Intel)

4:35pm When Tiramisu meets online fashion retail Patty Ryan (Microsoft), CY Yam (Microsoft), Elena Terenzi (Microsoft)

5:25pm Accelerating financial data science workflows with GPUs Joshua Patterson (NVIDIA), Onur Yilmaz (NVIDIA)

1A 10

11:20am Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber Felix Cheung (Uber)

1:15pm The evolution of Netflix's S3 data warehouse Ryan Blue (Netflix), Daniel Weeks (Netflix)

2:05pm Building a recommendation engine Sophie Watson (Red Hat)

2:55pm Optimizing Apache Impala for a cloud-based data warehouse Greg Rahn (Cloudera)

4:35pm Setting up a lightweight distributed caching layer using Apache Arrow Jacques Nadeau (Dremio)

5:25pm From flat files to deconstructed database: The evolution and future of the big data ecosystem Julien Le Dem (WeWork)

1A 21/22

11:20am Protecting sensitive data in huge datasets: Cloud tools you can use Felipe Hoffa (Google), Damien Desfontaines (Google | ETH Zürich)

1:15pm A data marketplace case study with the blockchain and advanced multitenant Hadoop in a smart open data platform Minh Chau Nguyen (ETRI), Heesun Won (ETRI)

2:05pm Using the blockchain in the enterprise Jim Scott (NVIDIA)

2:55pm Zipline: Airbnb's data management platform for machine learning Varant Zanoyan (Airbnb)

4:35pm How to cost-effectively and reliably build infrastructure for machine learning Osman Sarood (Mist Systems)

5:25pm Apache Kafka and the four challenges of production machine learning systems Jay Kreps (Confluent)

1A 23/24

11:20am The future of ETL isn’t what it used to be. Gwen Shapira (Confluent)

1:15pm Lessons learned building a scalable and extendable data pipeline for Call of Duty Yaroslav Tkachenko (Activision)

2:05pm Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework Danny Chen (Uber Technologies), Omkar Joshi (Uber), Eric Sayle (Uber Technologies)

2:55pm Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned Mauricio Aristizabal (Impact)

4:35pm Tracking data lineage at Stitch Fix Neelesh Salian (Stitch Fix)

5:25pm Circuit breakers to safeguard for garbage in, garbage out Sandeep Uttamchandani (Intuit)

1E 07/08

11:20am Processing fast data with Apache Spark: A tale of two APIs Gerard Maas (Lightbend)

1:15pm Why and how to leverage the power and simplicity of SQL on Apache Flink Fabian Hueske (Ververica)

2:05pm Building Fabric Answers using Apache Heron Karthik Ramasamy (Streamlio), Andrew Jorgensen (Google)

2:55pm Streaming big data in the cloud: What to consider and why Bill Chambers (Databricks)

4:35pm AppNexus's stream-based control system for automated buying of digital ads Brian Wu (AppNexus)

5:25pm Hudi: Unifying storage and serving for batch and near-real-time analytics Nishith Agarwal (Uber), Balaji Varadarajan (Uber), Vinoth Chandar (Apache Hudi)

1E 09

11:20am DIY versus designer approaches to deploying data center infrastructure for machine learning and analytics Cory Minton (Dell EMC), Colm Moynihan (Cloudera)

1:15pm Data governance: A big job that's getting bigger Andrew Brust (Blue Badge Insights | ZDNet)

2:05pm What's the Hadoop-la about Kubernetes? Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)

2:55pm Clouds and containers: Case studies for big data Paul Curtis (Weaveworks)

4:35pm Using machine learning to drive intelligence at the edge Dave Shuman (Cloudera), Bryan Dean (Red Hat)

5:25pm Introducing Iceberg: Tables designed for object stores Owen O'Malley (Cloudera), Ryan Blue (Netflix)

1E 10/11

11:20am From data governance to AI governance: The CIO's new role JF Gagne (Element AI)

1:15pm Data University: How Airbnb democratized data Erin Coffman (Airbnb)

2:05pm Realizing the true value in your data: Data-drivenness assessment Lawrence Cowan (Cicero Group)

2:55pm From strategy to implementation: Putting data to work at USA for UNHCR Friederike Schuur (Cloudera), Rita Ko (USA for UNHCR)

4:35pm The lure of "the one metric that matters" Adil Aijaz (Split Software)

5:25pm Deploying machine learning models in the enterprise Diego Oppenheimer (Algorithmia)

1E 12/13

11:20am Agile for data science teams Jennifer Prendki (Figure Eight)

1:15pm Privacy by design: Building in data privacy and protection versus bolting it on later Les McMonagle (BlueTalon)

2:05pm An ethical foundation for the AI-driven future Harry Glaser (Periscope Data)

2:55pm Beyond explainability: Regulating machine learning in practice Andrew Burt (bnh.ai)

4:35pm Rationalizing risk in AI and ML Kimberly Nevala (SAS)

5:25pm If you thought politics was dirty, you should see the analytics behind it. John Thuma (Arcadia Data)

1E 14

11:20am Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations Mark Donsky (Okera), Steven Ross (Cloudera)

1:15pm Executive Briefing: Profit from AI and machine learning—The best practices for people and process Tony Baer (dbInsight), Florian Douetteau (DATAIKU)

2:05pm Executive Briefing: Why machine-learned models crash and burn in production and what to do about it David Talby (Pacific AI)

2:55pm Executive Briefing: Managing successful data projects—Technology selection and team building Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

4:35pm Executive Briefing: Most data-driven cultures aren’t Cassie Kozyrkov (Google)

5:25pm Executive Briefing: Enhance your data lake with comprehensive data governance to improve adoption and meet compliance needs Sanjeev Mohan (Gartner)

Expo Hall

11:20am Next-generation cybersecurity via data fusion, AI, and big data: Pragmatic lessons from the front lines in financial services Usama Fayyad (Open Insights & OODA Health, Inc.), Troels Oerting (WEF Global Cybersecurity Center)

1:15pm A comparative analysis of the fundamentals of AWS and Azure Jason Wang (Cloudera), Suraj Acharya (Cloudera), Tony Wu (Cloudera)

2:05pm MLflow: An open platform to simplify the machine learning lifecycle Mani Parkhe (Databricks), Andrew Chen (Databricks)

2:55pm Performant time series data management and analytics with Postgres Michael Freedman (TimescaleDB)

4:35pm Architectural principles for building trusted, real-time, distributed IoT systems Dan Harple (Context Labs)

5:25pm Automating business processes with large-scale knowledge graphs Mike Tung (Diffbot)

6:05pm Booth Crawl | Room: Expo Hall

1E 15

11:20am Deep learning: Assessing analytics project feasibility and requirements (sponsored by NVIDIA) Ward Eldred (NVIDIA)

1:15pm Simplifying AI infrastructure: Lessons in scaling a deep learning enterprise (sponsored by NVIDIA) Darrin Johnson (NVIDIA)

2:05pm Kubernetes on GPUs (sponsored by NVIDIA) Michael Balint (NVIDIA)

4:35pm GPU-accelerated analytics and machine learning ecosystems (Inception Showcase sponsored by NVIDIA) Alen Capalik (FASTDATA.io), Jim McHugh (NVIDIA), SriSatish Ambati (H2O.ai), Tim Delisle (Datalogue)

5:25pm Accelerate AI with synthetic data using generative adversarial networks (GAN) (sponsored by NVIDIA) Renee Yao (NVIDIA)

1E 17

11:20am Guidebook to unwind the enterprise "data hairball" and get ready for AI (sponsored by IBM) Tim Davis (IBM)

1:15pm Refactor your data warehouse with mobile analytics products (sponsored by Kyligence) Zhi Zhu (China Construction Bank), Luke Han (Kyligence)

2:05pm Bringing together machine and human intelligence (sponsored by SAP) Richard Mooney (SAP)

2:55pm The big data makeover: 10 months from ideation to enterprise-scale solution (sponsored by Infoworks) Chris Stirrat (Eagle Investment Systems)

4:35pm Hadoop-compatible filesystems: The limits of "compatible" (sponsored by WANdisco) Paul Scott-Murphy (WANdisco)

5:25pm From data lakes to the data fabric: Our vision for digital strategy (sponsored by Cambridge Semantics) Sam Chance (Cambridge Semantics), Partha Bhattachargee (Cambridge Semantics)

1A 01/02

11:20am Data operations problems created by deep learning and how to fix them (sponsored by MapR) Jim Scott (NVIDIA)

1:15pm Kubernetes plays Cupid for data scientists and IT (sponsored by MapR) Skyler Thomas (MapR)

2:05pm A developer's guide to building AI applications (sponsored by Microsoft) Anand Raman (Microsoft), Wee Hyong Tok (Microsoft)

2:55pm Use of modern data environments in telecom (sponsored by Microstrategy) Sara Alavi (Bell Canada)

4:35pm TD Bank’s journey to turn its big data environment into a true data lake (sponsored by Talend) Joseph (Joe) DosSantos (TD Bank)

5:25pm How to avoid drowning in logs: Streaming 80 billion events and batch processing 40 TB/hour (sponsored by Pure Storage) Ivan Jibaja (Pure Storage)

1A 03

11:20am Ubiquitous machine learning (sponsored by Cisco) Chiang Yang (Cisco)

1:15pm Interactive business intelligence and OLAP on big data lakes using a Spark-native fast data mart (sponsored by Oracle + DataScience.com) Srikanth Desikan (Oracle)

2:05pm Preventing more fraud in less time with machine learning-driven data management (sponsored by Informatica) Chris Wojdak (Symcor)

2:55pm Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda) Mathew Lodge (Anaconda)

4:35pm Accelerate big data analytics and AI with NetApp hybrid cloud architecture (sponsored by NetApp) Karthikeyan Nagalingam (NetApp)

5:25pm Data for posterity: Nobody licenses or builds data just to have it. (sponsored by Pitney Bowes) Dan Adams (Pitney Bowes)

1A 04/05

11:20am Using modern database and open source tools to accelerate client service delivery (sponsored by MemSQL) Petrus Smith (PwC)

1:15pm How the blurring of memory and storage is revolutionizing the data era (sponsored by Intel) Arakere Ramesh (Intel), Bharath Yadla (Aerospike)

2:05pm Governing your cloud-based enterprise data lake (sponsored by Zaloni) Ben Sharma (Zaloni), Selwyn Collaco (TMX)

2:55pm Speed, scale, smarts: GPU-powered analytics for the extreme data economy (sponsored by Kinetica) Michael Mahoney (Kinetica)

4:35pm Keys to operationalize enterprise 360 (sponsored by Impetus) Anand Raman (Impetus Technologies)

5:25pm How Bell Canada increased the scale of BI exponentially with OLAP on big data (sponsored by Kyvos Insights) Mark Huang (Bell Canada)

1E 06

11:20am Commercial software in an increasingly open source ecosystem (sponsored by SAS) Paul Kent (SAS)

1:15pm Feet on the ground, head in the clouds (sponsored by AtScale) Mark Stange-Tregear (Ebates)

2:05pm A tale of two BI standards: Data warehouses and data lakes (sponsored by Arcadia Data) Randy Lea (Arcadia Data)

2:55pm Enabling predictive maintenance using automated IoT data pipelines (sponsored by BMC) Basil Faruqui (BMC Software)

4:35pm Best practices for migrating big data workloads to Amazon Web Services (sponsored by Amazon Web Services) Faria Bruno (Amazon Web Services)

5:25pm A tale of two BI standards: Data warehouses and data lakes (sponsored by Arcadia Data) Randy Lea (Arcadia Data)

3E
8:50am Wednesday keynotes Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

9:00am The future of data warehousing Anupam Singh (Cloudera), Brian Coyne (PNC)

9:15am Managing risk in machine learning Ben Lorica (O'Reilly)

9:25am The answer to life, the universe, and everything: But can you get that into production? (sponsored by MapR) Ted Dunning (MapR, now part of HPE)

9:35am Von Neumann to deep learning: Data revolutionizing the future Jeffrey Wecker (Goldman Sachs)

9:50am AI, ML, and the IoT will destroy the data center and the cloud (just not in the way you think) (sponsored by Cisco) DD Dasgupta (Cisco)

9:55am The Missing Piece Cassie Kozyrkov (Google)

10:15am Leveraging the best of the past to power a better future (sponsored by MemSQL) Drew Paroski (MemSQL), Aatif Din (Fanatics)

10:25am The power of Ethereum Joseph Lubin (Consensus Systems)

10:50am Morning break sponsored by Cisco | Room: 3B | Expo Hall

3:35pm Afternoon Break sponsored by Intel | Room: 3B | Expo Hall

7:30am Morning Coffee | Room: 3E Foyer

8:00am Speed Networking | Room: Crystal Palace

12:00pm Wednesday Topic Tables at Lunch (sponsored by MapR) | Room: Expo Hall (Hall 3B)

12:00pm Wednesday Business Summit Lunch | Room: 3D 09

12:00pm Better Together: Women in Big Data Luncheon (sponsored by SAP and Intel) | Room: 3D 10/11

7:30pm Data After Dark (sponsored by Cloudera and Cisco) | Room: TAO Downtown

11:20am-12:00pm (40m) Data science and machine learning Deep Learning, Recommendation Systems

Semantic recommendations

Shioulin Sam (Cloudera Fast Forward Labs)

Recent advances in deep learning allow us to use the semantic content of items in recommendation systems, addressing a weakness of traditional methods. Shioulin Sam explores the limitations of classical approaches and explains how using the content of items can help solve common recommendation pitfalls, such as the cold start problem, and open up new product possibilities.

1:15pm-1:55pm (40m) Data science and machine learning Media, Marketing, Advertising, Recommendation Systems, Text and Language processing and analysis

Document vectors in the wild: Building a content recommendation system for Reuters.com

James Dreiss (Reuters)

James Dreiss discusses the challenges in building a content recommendation system for one of the largest news sites in the world, Reuters.com. The particularities of the system include developing a scrolling newsfeed and the use of document vectors for semantic representation of content.

2:05pm-2:45pm (40m) Data science and machine learning Media, Marketing, Advertising, Recommendation Systems

Diversification in recommender systems: Using topical variety to increase user satisfaction

Ahsan Ashraf (Pinterest)

Online recommender systems often rely heavily on user engagement features. This can cause a bias toward exploitation over exploration, overoptimizing on users' interests. Content diversification is important for user satisfaction, but measuring and evaluating impact is challenging. Ahsan Ashraf outlines techniques used at Pinterest that drove ~2–3% impression gains and a ~1% time-spent gain.

2:55pm-3:35pm (40m) Data science and machine learning Ethics and Privacy, Media, Marketing, Advertising, Recommendation Systems

Perverse incentives in metrics: Inequality in the like economy

Bonnie Barrilleaux (LinkedIn)

As LinkedIn encouraged members to join conversations, it found itself in danger of creating a "rich get richer" economy in which a few creators got an increasing share of all feedback. Bonnie Barrilleaux explains why you must regularly reevaluate metrics to avoid perverse incentives—situations where efforts to increase the metric cause unintended negative side effects.

4:35pm-5:15pm (40m) Data science and machine learning Financial Services, Text and Language processing and analysis

Anxiety at scale: How Investopedia used readership data to track market volatility

Masha Westerlund (Investopedia)

Businesses rely on user data to power their sites, products, and sales. Can we give back by sharing those insights with users? Masha Westerlund explains how Investopedia harnessed reader data to build an index that tracks market anxiety and moves with the VIX, a proprietary measure of market volatility. You'll see how thinking outside the box helps turn data into tools for users, not stakeholders.

5:25pm-6:05pm (40m) Data science and machine learning Financial Services

Network effects: Working with modern graph analytic systems

Zachary Hanif (Capital One)

An understanding of graph-based analytical techniques can be extremely powerful when applied to modern practical problems, and modern frameworks and analytical techniques are making graph analysis methods viable for increasingly large, complex tasks. Zachary Hanif examines three prominent graph analytic methods, including graph convolutional networks, and applies them to concrete use cases.

11:20am-12:00pm (40m) Data science and machine learning Media, Marketing, Advertising

BlazeIt: An exploratory video analytics engine

Daniel Kang (Stanford University)

Daniel Kang offers an overview of the exploratory video analytics engine BlazeIt, which provides FrameQL, a declarative SQL-like language for querying video, and a query optimizer for executing these queries. You'll see how FrameQL can capture a large set of real-world queries ranging from aggregation to scrubbing and how BlazeIt can execute certain queries up to 2,000x faster than a naive approach.

1:15pm-1:55pm (40m) Data engineering and architecture, Data science and machine learning Model lifecycle management

Why data scientists should love Linux containers

William Benton (Red Hat)

Containers are a hot technology for application developers, but they also provide key benefits for data scientists. William Benton details the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively.

2:05pm-2:45pm (40m) Data science and machine learning Data Platforms, Model lifecycle management, Retail and e-commerce

Bighead: Airbnb's end-to-end machine learning platform

Atul Kale (Airbnb), Xiaohan Zeng (Airbnb)

Atul Kale and Xiaohan Zeng offer an overview of Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Built on Python, Spark, and Kubernetes, Bighead integrates popular libraries like TensorFlow, XGBoost, and PyTorch and is designed to be used in modular pieces.

2:55pm-3:35pm (40m) Data science and machine learning Ethics and Privacy

Solving the cold start problem: Data and model aggregation using differential privacy

Chang Liu (Georgian Partners)

Chang Liu offers an overview of a common problem faced by many software companies, the cold-start problem, and explains how Georgian Partners has been successful at solving this problem by transferring knowledge from existing data through differentially private data aggregation.

4:35pm-5:15pm (40m) Data science and machine learning

Programming by input-output examples

Sumit Gulwani (Microsoft)

Programming by input-output examples (PBE) is a new frontier in AI, set to revolutionize the programming experience for the masses. It can enable end users—99% of whom are nonprogrammers—to create small scripts and make data scientists 10–100x more productive for many data wrangling tasks. Sumit Gulwani leads a deep dive into this new programming paradigm and explores the science behind it.

5:25pm-6:05pm (40m) Data science and machine learning Text and Language processing and analysis

From emotion analysis and topic extraction to narrative modeling

Andreea Kremm (Netex Group), Mohammed Ibraaz Syed (UCLA)

Narrative economics studies the impact of popular narratives and stories on economic fluctuations in the context of human interests and emotions. Andreea Kremm and Mohammed Ibraaz Syed describe the use of emotion analysis, entity relationship extraction, and topic modeling in modeling narratives from written human communication.

11:20am-12:00pm (40m) Data science and machine learning Health and Medicine

Breaking the rules: End-stage renal disease prediction

Olga Cuznetova (Optum), Manna Chang (Optum)

Olga Cuznetova and Manna Chang demonstrate supervised and unsupervised learning methods to work with claims data and explain how the methods complement each other. The supervised method looks at CKD patients at risk of developing end-stage renal disease (ESRD), while the unsupervised approach looks at the classification of patients that tend to develop this disease faster than others.

1:15pm-1:55pm (40m) Data science and machine learning Media, Marketing, Advertising, Temporal data and time-series analytics

Correlation analysis on live data streams

Arun Kejariwal (Independent), Francois Orsini (MZ)

The rate of growth of data volume and velocity has been accelerating along with increases in the variety of data sources. This poses a significant challenge to extracting actionable insights in a timely fashion. Arun Kejariwal and Francois Orsini explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.

2:05pm-2:45pm (40m) Data science and machine learning Retail and e-commerce, Temporal data and time-series analytics

Continuous machine learning over streaming data: The story continues.

Roger Barga (Amazon Web Services), Sudipto Guha (Amazon Web Services), Kapil Chhabra (Amazon Web Services)

Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams.

2:55pm-3:35pm (40m) Data science and machine learning

50 reasons to learn the shell for doing data science

Jeroen Janssens (Data Science Workshops)

"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems.

4:35pm-5:15pm (40m) Data science and machine learning, Data-driven business management

VC trends in machine learning and data science

Sarah Catanzaro (Amplify Partners), Rama Sekhar (Norwest Venture Partners), Zavain Dar (Lux Capital), Jonathan Lehr (Work-Bench), Crystal Huang (NEA)

In this panel discussion, venture capital investors explain how startups can accelerate enterprise adoption of machine learning and explore the new tech trends that will give rise to the next transformation in the big data landscape.

5:25pm-6:05pm (40m) Data science and machine learning

A roadmap for open data science and AI for business: Panel discussion with State Street

Bethann Noble (Cloudera), Daniel Huss (State Street), Abhishek Kodi (State Street)

Bethann Noble, Abhishek Kodi, and Daniel Huss share their experience and best practices for designing and executing on a roadmap for open data science and AI for business.

11:20am-12:00pm (40m) Data science and machine learning Deep Learning, Retail and e-commerce, Temporal data and time-series analytics

Machine learning for time series: What works and what doesn't

Mikio Braun (Zalando)

Time series data has many applications in industry, from analyzing server metrics to monitoring IoT signals and outlier detection. Mikio Braun offers an overview of time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations for what works and what doesn’t, and industry use cases.

1:15pm-1:55pm (40m) Data science and machine learning Deep Learning, Media, Marketing, Advertising, Recommendation Systems, Retail and e-commerce

Harnessing and customizing state-of-the-art recommendation solutions with OpenRec

Longqi Yang (Cornell Tech, Cornell University)

State-of-the-art recommendation algorithms are increasingly complex and no longer one size fits all. Current monolithic development practice poses significant challenges to rapid, iterative, and systematic experimentation. Longqi Yang explains how to use OpenRec to easily customize state-of-the-art solutions for diverse scenarios.

2:05pm-2:45pm (40m) Data science and machine learning Deep Learning, Recommendation Systems, Temporal data and time-series analytics, Transportation and Logistics

Achieving personalization with LSTMs

Ankit Jain (Uber)

Personalization is a common theme in social networks and ecommerce businesses. Personalization at Uber involves an understanding of how each driver and rider is expected to behave on the platform. Ankit Jain explains how Uber employs deep learning using LSTMs and its huge database to understand and predict the behavior of each and every user on the platform.

2:55pm-3:35pm (40m) Data science and machine learning Deep Learning, Temporal data and time-series analytics

A deep learning approach for precipitation nowcasting with RNN using Analytics Zoo on BigDL

Alex Heye (Cray), Ding Ding (Intel)

Precipitation nowcasting is used to predict the future rainfall intensity over a relatively short timeframe. The forecasting resolution and time accuracy required are much higher than for other traditional forecasting tasks. Alexander Heye and Ding Ding explain how to build a precipitation nowcasting system with recurrent neural networks using BigDL on Apache Spark.

4:35pm-5:15pm (40m) Data science and machine learning Deep Learning, Media, Marketing, Advertising, Retail and e-commerce

When Tiramisu meets online fashion retail

Patty Ryan (Microsoft), CY Yam (Microsoft), Elena Terenzi (Microsoft)

Large online fashion retailers must efficiently maintain catalogues of millions of items. Due to human error, it's not unusual that some items have duplicate entries. Since manually trawling such a large catalogue is next to impossible, how can you find these entries? Patty Ryan, CY Yam, and Elena Terenzi explain how they applied deep learning for image segmentation and background removal.

5:25pm-6:05pm (40m) Data science and machine learning Financial Services

Accelerating financial data science workflows with GPUs

Joshua Patterson (NVIDIA), Onur Yilmaz (NVIDIA)

GPUs have allowed financial firms to accelerate their computationally demanding workloads. Today, the bottleneck has moved completely to ETL. The GPU Open Analytics Initiative (GoAi) is helping accelerate ETL while keeping the entire workflow on GPUs. Joshua Patterson and Onur Yilmaz discuss several GPU-accelerated data science tools and libraries.

11:20am-12:00pm (40m) Data engineering and architecture Data Integration and Data Pipelines, Transportation and Logistics

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber

Felix Cheung (Uber)

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame.

1:15pm-1:55pm (40m) Big data and data science in the cloud Data Platforms

The evolution of Netflix's S3 data warehouse

Ryan Blue (Netflix), Daniel Weeks (Netflix)

In the last few years, Netflix's data warehouse has grown to more than 100 PB in S3. Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3.

2:05pm-2:45pm (40m) Data engineering and architecture

Building a recommendation engine

Sophie Watson (Red Hat)

Recommender systems enhance user experience and business revenue every day. Sophie Watson demonstrates how to develop a robust recommendation engine using a microservice architecture.

2:55pm-3:35pm (40m) Big data and data science in the cloud

Optimizing Apache Impala for a cloud-based data warehouse

Greg Rahn (Cloudera)

Cloud object stores are becoming the bedrock of cloud data warehouses for modern data-driven enterprises, and it's become a necessity for data teams to have the ability to directly query data stored in S3 or ADLS. Greg Rahn and Mostafa Mokhtar discuss optimal end-to-end workflows and technical considerations for using Apache Impala over object stores for your cloud data warehouse.

4:35pm-5:15pm (40m) Data engineering and architecture

Setting up a lightweight distributed caching layer using Apache Arrow

Jacques Nadeau (Dremio)

Jacques Nadeau leads a deep dive into a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture—including the cache life cycle, update patterns, cache cohesion, and appropriate use cases—learn how it all works, and see it in action.

5:25pm-6:05pm (40m) Data engineering and architecture

From flat files to deconstructed database: The evolution and future of the big data ecosystem

Julien Le Dem (WeWork)

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

11:20am-12:00pm (40m) Data engineering and architecture, Platform security and cybersecurity Ethics and Privacy

Protecting sensitive data in huge datasets: Cloud tools you can use

Felipe Hoffa (Google), Damien Desfontaines (Google | ETH Zürich)

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa and Damien Desfontaines explore how to handle massive public datasets, taking you from theory to real life as they showcase newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm.

1:15pm-1:55pm (40m) Data engineering and architecture, Emerging technologies & case studies Blockchain and decentralization, Data preparation, governance and privacy

A data marketplace case study with the blockchain and advanced multitenant Hadoop in a smart open data platform

Minh Chau Nguyen (ETRI), Heesun Won (ETRI)

Minh Chau Nguyen and Heesun Won explain how to implement analytics services in data marketplace systems on a single Hadoop cluster across distributed data centers. The solution extends the overall architecture of the Hadoop ecosystem with the blockchain so that multiple tenants and authorized third parties can securely access data while still maintaining privacy, scalability, and reliability.

2:05pm-2:45pm (40m) Data engineering and architecture Blockchain and decentralization, Financial Services

Using the blockchain in the enterprise

Jim Scott (NVIDIA)

Jim Scott details relevant use cases for blockchain-based solutions across a variety of industries, focusing on a suggested architecture to achieve high-transaction-rate private blockchains and decentralized applications backed by a blockchain. Along the way, Jim compares public and private blockchain architectures.

2:55pm-3:35pm (40m) Data engineering and architecture Data Platforms, Retail and e-commerce

Zipline: Airbnb's data management platform for machine learning

Varant Zanoyan (Airbnb)

Zipline is Airbnb’s soon-to-be open-sourced data management platform specifically designed for ML use cases. It has taken the task of feature generation from months to days and offers features to support end-to-end data management for machine learning. Varant Zanoyan covers Zipline's architecture and dives into how it solves ML-specific problems.

4:35pm-5:15pm (40m) Data engineering and architecture Data Platforms

How to cost-effectively and reliably build infrastructure for machine learning

Osman Sarood (Mist Systems)

Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million.

5:25pm-6:05pm (40m) Data engineering and architecture Model lifecycle management

Apache Kafka and the four challenges of production machine learning systems

Jay Kreps (Confluent)

Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their product, processes or customer experience. Jay Kreps explores some of the difficulties of building production machine learning systems and explains how Apache Kafka and stream processing can help.

11:20am-12:00pm (40m) Data engineering and architecture Data Integration and Data Pipelines

The future of ETL isn’t what it used to be.

Gwen Shapira (Confluent)

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering. You'll learn how modern engineering organizations use Apache Kafka, microservices, and event streams to efficiently build data pipelines that are scalable, reliable, and built to evolve.

1:15pm-1:55pm (40m) Data engineering and architecture Data Integration and Data Pipelines

Lessons learned building a scalable and extendable data pipeline for Call of Duty

Yaroslav Tkachenko (Activision)

What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse. . .wait, this looks like a lot of things. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision.

2:05pm-2:45pm (40m) Data engineering and architecture Data Integration and Data Pipelines

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

Danny Chen (Uber Technologies), Omkar Joshi (Uber), Eric Sayle (Uber Technologies)

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works.

2:55pm-3:35pm (40m) Data engineering and architecture Data Integration and Data Pipelines

Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned

Mauricio Aristizabal (Impact)

Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components.

4:35pm-5:15pm (40m) Data engineering and architecture Data Integration and Data Pipelines, Data preparation, governance and privacy

Tracking data lineage at Stitch Fix

Neelesh Salian (Stitch Fix)

Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases.

5:25pm-6:05pm (40m) Big data and data science in the cloud, Data engineering and architecture Data Integration and Data Pipelines, Financial Services

Circuit breakers to safeguard for garbage in, garbage out

Sandeep Uttamchandani (Intuit)

Do your analysts always trust the insights generated by your data platform? Ensuring insights are always reliable is critical for use cases in the financial sector. Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines, similar to the common design pattern used in service architectures, that detects and corrects problems and ensures always reliable insights.
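As a rough illustration of the idea, the sketch below shows one way a circuit breaker might gate a data pipeline. It is an assumption-laden toy, not the pattern Intuit developed: the class `DataCircuitBreaker`, the `null_fraction_ok` check, the `amount` column, and the thresholds are all hypothetical. Each batch passes a quality check before it is published; after a set number of consecutive failures the breaker opens and blocks every batch until someone resets it.

```python
class DataCircuitBreaker:
    """Blocks publishing after too many consecutive bad batches."""

    def __init__(self, check, max_failures=2):
        self.check = check              # callable: batch -> bool
        self.max_failures = max_failures
        self.failures = 0
        self.is_open = False            # open = pipeline halted

    def process(self, batch, publish):
        """Publish the batch if allowed; return True when published."""
        if self.is_open:
            return False                # halted: nothing flows downstream
        if self.check(batch):
            self.failures = 0
            publish(batch)
            return True
        self.failures += 1              # bad batch is held back
        if self.failures >= self.max_failures:
            self.is_open = True
        return False

    def reset(self):
        """Close the breaker once the upstream problem is fixed."""
        self.failures = 0
        self.is_open = False


def null_fraction_ok(batch, column="amount", max_null_fraction=0.1):
    """Hypothetical quality check: reject batches with too many nulls."""
    if not batch:
        return False
    nulls = sum(1 for row in batch if row.get(column) is None)
    return nulls / len(batch) <= max_null_fraction


published = []
breaker = DataCircuitBreaker(null_fraction_ok, max_failures=2)
good = [{"amount": 1.0}] * 10
bad = [{"amount": None}] * 10

for batch in (good, bad, bad, good):
    breaker.process(batch, published.append)

print(len(published), breaker.is_open)
```

The last good batch is not published because the breaker is already open. Holding back good data is the intended trade-off here: downstream reports stay stale rather than becoming wrong.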

11:20am-12:00pm (40m) Streaming systems & real-time applications

Processing fast data with Apache Spark: A tale of two APIs

Gerard Maas (Lightbend)

Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences with regard to key aspects of a streaming application: API usability, dealing with time, dealing with state, machine learning capabilities, and more. You'll learn when to pick one over the other or combine both to implement resilient streaming pipelines.

1:15pm-1:55pm (40m) Streaming systems & real-time applications

Why and how to leverage the power and simplicity of SQL on Apache Flink

Fabian Hueske (Ververica)

Fabian Hueske discusses why SQL is a great approach to unify batch and stream processing. He gives an update on Apache Flink's SQL support and shares some interesting use cases from large-scale production deployments. Finally, Fabian presents Flink's new query service that enables users and applications to submit streaming and batch SQL queries and retrieve low-latency updated results.

2:05pm-2:45pm (40m) Streaming systems & real-time applications

Building Fabric Answers using Apache Heron

Karthik Ramasamy (Streamlio), Andrew Jorgensen (Google)

Streaming systems like Apache Heron are being used for an increasingly broad array of applications. Karthik Ramasamy and Andrew Jorgensen offer an overview of Fabric Answers, a Google Fabric service built on Apache Heron that gives mobile developers real-time insights to improve their product experience.

2:55pm-3:35pm (40m) Streaming systems & real-time applications

Streaming big data in the cloud: What to consider and why

Bill Chambers (Databricks)

Streaming big data is a rapidly growing field but currently involves a lot of operational complexity and expertise. Bill Chambers shares a decision-making framework for determining the best tools and technologies for successfully deploying and maintaining streaming data pipelines to solve business problems and offers an overview of Apache Spark’s Structured Streaming processing engine.

4:35pm-5:15pm (40m) Data engineering and architecture, Streaming systems & real-time applications

AppNexus's stream-based control system for automated buying of digital ads

Brian Wu (AppNexus)

Automating the success of digital ad campaigns is complicated and comes with the risk of wasting the advertiser's budget or a trader's margin and time. Brian Wu describes the evolution of Inventory Discovery, a streaming control system of eligibility, prioritization, and real-time evaluation that helps digital advertisers hit their performance goals with AppNexus.

5:25pm-6:05pm (40m) Data engineering and architecture, Streaming systems & real-time applications Data Integration and Data Pipelines

Hudi: Unifying storage and serving for batch and near-real-time analytics

Nishith Agarwal (Uber), Balaji Varadarajan (Uber), Vinoth Chandar (Apache Hudi)

Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond.

11:20am-12:00pm (40m) Data engineering and architecture Data Platforms

DIY versus designer approaches to deploying data center infrastructure for machine learning and analytics

Cory Minton (Dell EMC), Colm Moynihan (Cloudera)

Cory Minton and Colm Moynihan explain how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble.

1:15pm-1:55pm (40m) Data engineering and architecture Data preparation, governance and privacy

Data governance: A big job that's getting bigger

Andrew Brust (Blue Badge Insights | ZDNet)

Data governance has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging, and new products are rising to meet the challenge. Andrew Brust tracks data governance's past and present and offers a glimpse of the future.

2:05pm-2:45pm (40m) Data engineering and architecture, Emerging technologies & case studies

What's the Hadoop-la about Kubernetes?

Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)

Kubernetes (K8s)—the open source container orchestration system for modern big data workloads—is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s.

2:55pm-3:35pm (40m) Data engineering and architecture

Clouds and containers: Case studies for big data

Paul Curtis (Weaveworks)

Once the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure to provide the business insights? Paul Curtis explores three customer deployments that leverage the best of private clouds and containers to provide a flexible big data environment.

4:35pm-5:15pm (40m) Data engineering and architecture Model lifecycle management

Using machine learning to drive intelligence at the edge

Dave Shuman (Cloudera), Bryan Dean (Red Hat)

The focus on the IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing them back out to the edge. Dave Shuman and Bryan Dean explain how Cloudera and Red Hat executed this architecture at one of Europe's leading manufacturers and give a demo highlighting it.

5:25pm-6:05pm (40m) Data engineering and architecture

Introducing Iceberg: Tables designed for object stores

Owen O'Malley (Cloudera), Ryan Blue (Netflix)

Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet.

11:20am-12:00pm (40m) Data-driven business management, Strata Business Summit Data preparation, governance and privacy, Machine Learning in the enterprise

From data governance to AI governance: The CIO's new role

JF Gagne (Element AI)

JF Gagne explains why the CIO is going to need a broader mandate in the company to better align their AI training and outcomes with business goals and compliance. This mandate should include an AI governance team that is well staffed and deeply established in the company, in order to catch biases that can develop from faulty goals or flawed data.

1:15pm-1:55pm (40m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise, Retail and e-commerce

Data University: How Airbnb democratized data

Erin Coffman (Airbnb)

Airbnb has open-sourced many high-leverage data tools, including Airflow, Superset, and the Knowledge Repo, but adoption of these tools across the company was relatively low. Erin Coffman offers an overview of Data University, launched to make data more accessible and utilized in decision making at Airbnb.

2:05pm-2:45pm (40m) Data-driven business management, Strata Business Summit

Realizing the true value in your data: Data-drivenness assessment

Lawrence Cowan (Cicero Group)

Firms are struggling to leverage their data. Lawrence Cowan outlines a methodology for assessing four critical areas that firms must consider when looking to make the analytical leap: data strategy, data culture, data analysis and implementation, and data management and architecture.

2:55pm-3:35pm (40m) Data-driven business management, Strata Business Summit

From strategy to implementation: Putting data to work at USA for UNHCR

Friederike Schuur (Cloudera), Rita Ko (USA for UNHCR)

Friederike Schuur and Rita Ko explain how the Hive (an internal group at USA for UNHCR) and Cloudera Fast Forward Labs transformed USA for UNHCR, enabling the agency to use data science and machine learning (DS/ML) to address the refugee crisis. Along the way, they cover the development and implementation of a DS/ML strategy, identify use cases and success metrics, and showcase the value of DS/ML.

4:35pm-5:15pm (40m) Strata Business Summit

The lure of "the one metric that matters"

Adil Aijaz (Split Software)

Many products, whether data driven or not, chase “the one metric that matters.” It may be engagement, revenue, or conversion, but the common theme is the pursuit of improvement in one metric. Product development teams should instead focus on the design of metrics that measure their goals. Adil Aijaz shares an approach to designing metrics and discusses best practices and common pitfalls.

5:25pm-6:05pm (40m) Data science and machine learning Model lifecycle management

Deploying machine learning models in the enterprise

Diego Oppenheimer (Algorithmia)

After big investments in collecting and cleaning data and building machine learning (ML) models, enterprises face big challenges in deploying models to production and managing a growing portfolio of ML models. Diego Oppenheimer covers the strategic and technical hurdles each company must overcome and the best practices developed while deploying over 4,000 ML models for 70,000 engineers.

11:20am-12:00pm (40m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise

Agile for data science teams

Jennifer Prendki (Figure Eight)

Agile methodologies have been widely successful for software engineering teams but seem inappropriate for data science teams, because data science is part engineering, part research. Jennifer Prendki demonstrates how, with a minimum amount of tweaking, data science managers can adapt Agile techniques and establish best practices to make their teams more efficient.

1:15pm-1:55pm (40m) Platform security and cybersecurity, Strata Business Summit Data preparation, governance and privacy, Ethics and Privacy

Privacy by design: Building in data privacy and protection versus bolting it on later

Les McMonagle (BlueTalon)

Privacy by design is a fundamentally important approach to achieving compliance with GDPR and other data privacy or data protection regulations. Les McMonagle outlines how organizations can save time and money while improving data security and regulatory compliance and dramatically reduce the risk of a data breach or expensive penalties for noncompliance.

2:05pm-2:45pm (40m) Law, ethics, governance, Strata Business Summit Ethics and Privacy

An ethical foundation for the AI-driven future

Harry Glaser (Periscope Data)

What is the moral responsibility of a data team today? As AI and machine learning technologies become part of our everyday life and as data becomes accessible to everyone, CDOs and data teams are taking on a very important moral role as the conscience of the corporation. Harry Glaser highlights the risks companies will face if they don't empower data teams to lead the way for ethical data use.

2:55pm-3:35pm (40m) Law, ethics, governance, Strata Business Summit Data preparation, governance and privacy, Ethics and Privacy

Beyond explainability: Regulating machine learning in practice

Andrew Burt (bnh.ai)

Machine learning is becoming prevalent across industries, creating new types of risk. Managing this risk is quickly becoming the central challenge of major organizations, one that strains data science teams, legal personnel, and the C-suite alike. Andrew Burt shares lessons from past regulations focused on similar technology along with a proposal for new ways to manage risk in ML.

4:35pm-5:15pm (40m) Strata Business Summit Ethics and Privacy, Machine Learning in the enterprise

Rationalizing risk in AI and ML

Kimberly Nevala (SAS)

Too often, the discussion of AI and ML includes an expectation—if not a requirement—for infallibility. But as we know, this expectation is not realistic. So what’s a company to do? While risk can’t be eliminated, it can be rationalized. Kimberly Nevala demonstrates how an unflinching risk assessment enables AI/ML adoption and deployment.

5:25pm-6:05pm (40m) Law, ethics, governance, Strata Business Summit Media, Marketing, Advertising

If you thought politics was dirty, you should see the analytics behind it.

John Thuma (Arcadia Data)

Forget about the fake news; data and analytics in politics is what drives elections. John Thuma shares ethical dilemmas he faced while proposing analytical solutions to the RNC and DNC. Not only did he help causes he disagreed with, but he also armed politicians with real-time data to manipulate voters.

11:20am-12:00pm (40m) Law, ethics, governance, Strata Business Summit Data preparation, governance and privacy, Ethics and Privacy

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations

Mark Donsky (Okera), Steven Ross (Cloudera)

In May 2018, the General Data Protection Regulation (GDPR) went into effect for firms doing business in the EU, but many companies still aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
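The fine ceiling quoted above can be made concrete with a small calculation. Under GDPR Article 83(5), the upper tier of administrative fines is capped at €20 million or 4% of total worldwide annual turnover for the preceding financial year, whichever is higher. The function name and the revenue figures below are hypothetical, and the result is a ceiling, not a prediction of any actual penalty.

```python
FLAT_CAP_EUR = 20_000_000   # fixed ceiling in euros
REVENUE_PERCENT = 4         # percentage of global annual revenue


def max_gdpr_fine(global_annual_revenue_eur):
    """Return the upper-tier fine ceiling: the higher of the two caps."""
    revenue_cap = global_annual_revenue_eur * REVENUE_PERCENT / 100
    return max(FLAT_CAP_EUR, revenue_cap)


# Hypothetical firms with EUR 100 million and EUR 1 billion in revenue.
for revenue in (100_000_000, 1_000_000_000):
    print(f"revenue {revenue:,} -> ceiling {max_gdpr_fine(revenue):,.0f}")
```

The flat €20 million cap applies to any firm with revenue below €500 million; above that, the percentage cap takes over.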

1:15pm-1:55pm (40m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise

Executive Briefing: Profit from AI and machine learning—The best practices for people and process

Tony Baer (dbInsight), Florian Douetteau (DATAIKU)

Tony Baer and Florian Douetteau share the results of research cosponsored by Ovum and Dataiku that surveyed a specially selected sample of chief data officers and data scientists on how to map roles and processes to make success with AI in the business repeatable.

2:05pm-2:45pm (40m) Strata Business Summit Machine Learning in the enterprise, Model lifecycle management

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it

David Talby (Pacific AI)

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries.

2:55pm-3:35pm (40m) Data engineering and architecture, Strata Business Summit Machine Learning in the enterprise, Media, Marketing, Advertising

Executive Briefing: Managing successful data projects—Technology selection and team building

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. Ted Malaska and Jonathan Seidman share guidance and best practices to help technical leaders deliver successful projects from planning to implementation.

4:35pm-5:15pm (40m) Data-driven business management, Strata Business Summit Machine Learning in the enterprise, Media, Marketing, Advertising

Executive Briefing: Most data-driven cultures aren’t

Cassie Kozyrkov (Google)

Many organizations aren’t aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. Cassie Kozyrkov examines what it takes to build a truly data-driven organizational culture and highlights a vital yet often neglected job function: the data science manager.

5:25pm-6:05pm (40m) Data-driven business management, Strata Business Summit Data preparation, governance and privacy

Executive Briefing: Enhance your data lake with comprehensive data governance to improve adoption and meet compliance needs

Sanjeev Mohan (Gartner)

If the last few years were spent proving the value of data lakes, the emphasis now is to monetize the big data architecture investments. The rallying cry is to onboard new workloads efficiently. But how do you do so if you don’t know what data is in the lake, the level of its quality, or the trustworthiness of models? Sanjeev Mohan explains why data governance is the linchpin to success.

11:20am-12:00pm (40m) Data-driven business management, Expo Hall Data Integration and Data Pipelines, Financial Services

Next-generation cybersecurity via data fusion, AI, and big data: Pragmatic lessons from the front lines in financial services

Usama Fayyad (Open Insights & OODA Health, Inc.), Troels Oerting (WEF Global Cybersecurity Center)

Usama Fayyad and Troels Oerting share outcomes and lessons learned from building and deploying a global data fusion, incident analysis/visualization, and effective cybersecurity defense based on big data and AI at a major EU bank, in collaboration with several financial services institutions.

1:15pm-1:55pm (40m) Data engineering and architecture, Expo Hall

A comparative analysis of the fundamentals of AWS and Azure

Jason Wang (Cloudera), Suraj Acharya (Cloudera), Tony Wu (Cloudera)

The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies now face the difficult decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. Jason Wang, Suraj Acharya, and Tony Wu compare the relative strengths and weaknesses of AWS and Azure.

2:05pm-2:45pm (40m) Data engineering and architecture, Expo Hall Model lifecycle management

MLflow: An open platform to simplify the machine learning lifecycle

Mani Parkhe (Databricks), Andrew Chen (Databricks)

Successfully building and deploying a machine learning model is difficult to do once. Enabling other data scientists to reproduce your pipeline, compare the results of different versions, track what's running where, and redeploy and roll back updated models is much harder. Mani Parkhe and Andrew Chen offer an overview of MLflow, a new open source project from Databricks that simplifies this process.

2:55pm-3:35pm (40m) Data engineering and architecture, Expo Hall

Performant time series data management and analytics with Postgres

Michael Freedman (TimescaleDB)

Michael Freedman explains how to leverage Postgres for high-volume time series workloads using TimescaleDB, an open source time series database built as a Postgres plug-in. Michael covers the general architectural design principles and new time series data management features, including adaptive time partitioning and near-real-time continuous aggregations.

4:35pm-5:15pm (40m) Data engineering and architecture, Expo Hall, Streaming systems & real-time applications Blockchain and decentralization, Data Platforms

Architectural principles for building trusted, real-time, distributed IoT systems

Dan Harple (Context Labs)

Dan Harple explains how distributed systems are being influenced by and are influencing operational, financial, and social impact requirements of a wide range of enterprises and how trust in these distributed systems is being challenged, elevated, and resolved by engineers and architects today.

5:25pm-6:05pm (40m) Expo Hall Machine Learning in the enterprise, Text and Language processing and analysis

Automating business processes with large-scale knowledge graphs

Mike Tung (Diffbot)

Mike Tung offers an overview of available open source and commercial knowledge graphs and explains how consumer and business applications are already taking advantage of them to provide intelligent experiences and enhanced business efficiency. Mike then discusses what's coming in the future.

6:05pm-7:05pm (1h)

Booth Crawl

Make your way from booth to booth while you check out all the exhibitors in the Expo Hall on Wednesday after sessions end.

11:20am-12:00pm (40m) Data science and machine learning, Deep Learning sponsored by NVIDIA, Sponsored

Deep learning: Assessing analytics project feasibility and requirements (sponsored by NVIDIA)

Ward Eldred (NVIDIA)

Ward Eldred offers an overview of the types of analytical problems that can be solved using deep learning and shares a set of heuristics that can be used to evaluate the feasibility of analytical AI projects.

1:15pm-1:55pm (40m) Data science and machine learning, Deep Learning sponsored by NVIDIA, Sponsored

Simplifying AI infrastructure: Lessons in scaling a deep learning enterprise (sponsored by NVIDIA)

Darrin Johnson (NVIDIA)

While every enterprise is on a mission to infuse its business with deep learning, few know how to build the infrastructure to get them there. Darrin Johnson shares insights and best practices learned from NVIDIA's deep learning deployments around the globe that you can leverage to shorten deployment timeframes, improve developer productivity, and streamline operations.

2:05pm-2:45pm (40m) Data science and machine learning, Deep Learning sponsored by NVIDIA, Sponsored

Kubernetes on GPUs (sponsored by NVIDIA)

Michael Balint (NVIDIA)

Michael Balint explains how NVIDIA employs its own distribution of Kubernetes, in conjunction with DGX hardware, to make the most efficient use of GPU resources and scale its efforts across a cluster, allowing multiple users to run experiments and push their finished work to production.

4:35pm-5:15pm (40m) Data science and machine learning, Deep Learning sponsored by NVIDIA, Sponsored

GPU-accelerated analytics and machine learning ecosystems (Inception Showcase sponsored by NVIDIA)

Alen Capalik (FASTDATA.io), Jim McHugh (NVIDIA), SriSatish Ambati (H2O.ai), Tim Delisle (Datalogue)

Explore case studies from Datalogue, FASTDATA.io, and H2O.ai that demonstrate how GPU-accelerated analytics, machine learning, and ETL help companies overcome slow queries and tedious data preparation processes, dynamically correlate data, and enjoy automatic feature engineering.

5:25pm-6:05pm (40m) Data science and machine learning, Deep Learning sponsored by NVIDIA, Sponsored

Accelerate AI with synthetic data using generative adversarial networks (GAN) (sponsored by NVIDIA)

Renee Yao (NVIDIA)

Renee Yao explains how generative adversarial networks (GANs) are successfully used to improve data generation and explores specific real-world examples where customers have deployed GANs to solve challenges in the healthcare, space, transportation, and retail industries.

11:20am-12:00pm (40m) Sponsored

Guidebook to unwind the enterprise "data hairball" and get ready for AI (sponsored by IBM)

Tim Davis (IBM)

Tim Davis discusses key pain points and solutions to problems many enterprises face with data in silos, poor-quality data that cannot always be trusted, and managing and making large volumes of data available to derive more accurate insights and machine learning models.

1:15pm-1:55pm (40m) Sponsored

Refactor your data warehouse with mobile analytics products (sponsored by Kyligence)

Zhi Zhu (China Construction Bank), Luke Han (Kyligence)

When China Construction Bank wanted to migrate 23,000+ reports to mobile, it chose Apache Kylin as the high-performance and high-concurrency platform to refactor its data warehouse architecture to serve 400K+ users. Zhi Zhu and Luke Han detail the necessary architecture and best practices for refactoring a data warehouse for mobile analytics.

2:05pm-2:45pm (40m) Sponsored

Bringing together machine and human intelligence (sponsored by SAP)

Richard Mooney (SAP)

Intelligent enterprises—fueled by rapid advances in artificial intelligence (AI), machine learning (ML), and the internet of things (IoT)—promise significant business value. Richard Mooney explains how to achieve the game-changing outcomes of an intelligent enterprise, delivering value across business functions with the synergy of machine and human intelligence.

2:55pm-3:35pm (40m) Sponsored

The big data makeover: 10 months from ideation to enterprise-scale solution (sponsored by Infoworks)

Chris Stirrat (Eagle Investment Systems)

Eagle Investment Systems, a leading provider of financial services technology, is building a new Hadoop and cloud-based data management solution. Chris Stirrat explains how Eagle went from incubation to an enterprise-scale solution in just 10 months, using a Hadoop-based big data stack and multitenant architecture, transforming software creation, delivery, quality, technology, and culture.

4:35pm-5:15pm (40m) Sponsored

Hadoop-compatible filesystems: The limits of "compatible" (sponsored by WANdisco)

Paul Scott-Murphy (WANdisco)

Every organization is considering its storage options, with an eye toward the cloud. Paul Scott-Murphy explores what makes different large-scale storage systems and services unique, their clear (and unexpected) differences, the options you have to use them, and the surprises you can expect along the way.

5:25pm-6:05pm (40m) Sponsored

From data lakes to the data fabric: Our vision for digital strategy (sponsored by Cambridge Semantics)

Sam Chance (Cambridge Semantics), Partha Bhattachargee (Cambridge Semantics)

Ben Szekely shares a vision for digital innovation: The data fabric connects enterprise data for unprecedented access in an overlay fashion that does not disrupt current investments. Interconnected and reliable data drives business outcomes by automating scalable AI and ML efforts. Graph technology is the way forward to realize this future.

11:20am-12:00pm (40m) Sponsored

Data operations problems created by deep learning and how to fix them (sponsored by MapR)

Jim Scott (NVIDIA)

Drawing on his experience working with customers across many industries, including chemical sciences, healthcare, and oil and gas, Jim Scott details the major impediments to successful completion of deep learning projects and solutions while walking you through a customer use case.

1:15pm-1:55pm (40m) Sponsored

Kubernetes plays Cupid for data scientists and IT (sponsored by MapR)

Skyler Thomas (MapR)

In the past, there have been major challenges in quickly creating machine learning training environments and deploying trained models into production. Skyler Thomas details how Kubernetes helps data scientists and IT work in concert to speed model training and time-to-value.

2:05pm-2:45pm (40m) Sponsored

A developer's guide to building AI applications (sponsored by Microsoft)

Anand Raman (Microsoft), Wee Hyong Tok (Microsoft)

Anand Raman and Wee Hyong Tok walk you through applying AI technologies in the cloud. You'll learn how to add prebuilt AI capabilities like object detection, face understanding, translation, and speech to applications, build cognitive search applications that understand deep content in images, text, and other data, use the Azure platform to accelerate machine learning, and more.

2:55pm-3:35pm (40m) Sponsored

Use of modern data environments in telecom (sponsored by Microstrategy)

Sara Alavi (Bell Canada)

Bell Canada, Canada's largest communications company, leads the industry in providing world-class broadband communications services to consumers and business customers. Join Sara Alavi to learn how the network big data and AI team within Bell is using modern data environments and applying a startup mindset to transform traditional networks into insight-driven intelligent networks.

4:35pm-5:15pm (40m) Sponsored

TD Bank’s journey to turn its big data environment into a true data lake (sponsored by Talend)

Joseph (Joe) DosSantos (TD Bank)

TD Bank’s data analytics team has undertaken a multiyear journey to modernize its data infrastructure for today and future needs. Joseph DosSantos explains how the team built a governed data lake foundation, enabling business users to leverage its big data environment to extract analytical insights while minimizing risks.

5:25pm-6:05pm (40m) Sponsored

How to avoid drowning in logs: Streaming 80 billion events and batch processing 40 TB/hour (sponsored by Pure Storage)

Ivan Jibaja (Pure Storage)

Pure Storage runs over 70,000 tests per day. Using Spark’s flexible computing platform, the company can write a single application for both streaming and batch jobs so that its team of triage engineers can understand the state of the continuous integration pipeline. Ivan Jibaja discusses the use case for big data analytics technologies, the architecture of the solution, and lessons learned.

11:20am-12:00pm (40m) Sponsored

Ubiquitous machine learning (sponsored by Cisco)

Chiang Yang (Cisco)

Data is the lifeblood of an enterprise, and it's being generated everywhere. To overcome the challenges of data gravity, data analytics, including machine learning, is best done where the data is located: ubiquitous machine learning. Han Yang explains how to overcome the challenges of machine learning everywhere.

1:15pm-1:55pm (40m) Sponsored

Interactive business intelligence and OLAP on big data lakes using a Spark-native fast data mart (sponsored by Oracle + DataScience.com)

Srikanth Desikan (Oracle)

SparklineData is an in-memory distributed scale-out analytics platform built on Apache Spark to enable enterprises to query on data lakes directly with instant response times. Srikanth Desikan offers an overview of SparklineData and explains how it can enable new analytics use cases working on the most granular data directly on data lakes.

2:05pm-2:45pm (40m) Sponsored

Preventing more fraud in less time with machine learning-driven data management (sponsored by Informatica)

Chris Wojdak (Symcor)

Chris Wojdak explains how Symcor has transformed its big data architecture using Informatica’s comprehensive machine learning-based solutions for data integration, data quality, data cataloging, and data governance.

2:55pm-3:35pm (40m) Sponsored

Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda)

Mathew Lodge (Anaconda)

The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Welcome to the future. Containers and Kubernetes make great language-agnostic distributed computing clusters: it's just as easy to deploy Python as it is Java. Mathew Lodge shows you how.

4:35pm-5:15pm (40m) Sponsored

Accelerate big data analytics and AI with NetApp hybrid cloud architecture (sponsored by NetApp)

Karthikeyan Nagalingam (NetApp)

As the data authority for hybrid cloud for big data analytics and AI, NetApp understands the value of the access, management, and control of data. Karthikeyan Nagalingam discusses the NetApp Data Fabric, which provides a unified data management environment that spans edge devices, data centers, and multiple hyperscale clouds using ONTAP software, all-flash systems, ONTAP Select, and cloud volumes.

5:25pm-6:05pm (40m) Sponsored

Data for posterity: Nobody licenses or builds data just to have it. (sponsored by Pitney Bowes)

Dan Adams (Pitney Bowes)

The role of data and the demand to get it right, coupled with competitive pressures to move faster, have dramatically increased. Companies now recognize data as an asset and need to manage it that way. Join Dan Adams for the insights you need to ensure that your data addresses current and future needs and that your organization is set up for success.

11:20am-12:00pm (40m) Sponsored

Using modern database and open source tools to accelerate client service delivery (sponsored by MemSQL)

Petrus Smith (PwC)

Peet Smith explains how PwC combines modern database tools with open source technologies to automate and scale data ingestion and transformation, getting data to engagement teams so they can streamline and accelerate client service delivery.

1:15pm-1:55pm (40m) Sponsored

How the blurring of memory and storage is revolutionizing the data era (sponsored by Intel)

Arakere Ramesh (Intel), Bharath Yadla (Aerospike)

Persistent memory accelerates analytics, database, and storage workloads across a variety of use cases, bringing new levels of speed and efficiency to the data center and to in-memory computing. Arakere Ramesh and Bharath Yadla offer an overview of the newly announced Intel Optane data center persistent memory and share the exciting potential of this technology in analytics solutions.

2:05pm-2:45pm (40m) Sponsored

Governing your cloud-based enterprise data lake (sponsored by Zaloni)

Ben Sharma (Zaloni), Selwyn Collaco (TMX)

Selwyn Collaco and Ben Sharma share insights from their real-world experience and discuss best practices for architecture, technology, data management, and governance to enable centralized data services. They also explain how to leverage the Zaloni Data Platform (ZDP), an integrated self-service data platform, to operationalize the enterprise data lake.

2:55pm-3:35pm (40m) Sponsored

Speed, scale, smarts: GPU-powered analytics for the extreme data economy (sponsored by Kinetica)

Michael Mahoney (Kinetica)

Michael Mahoney demonstrates how to leverage the power of GPUs to converge streaming data analysis, location analysis, and streamlined machine learning with a single engine. Along the way, Michael shares real-world case studies on how Kinetica is used to solve complex data challenges.

4:35pm-5:15pm (40m) Sponsored

Keys to operationalize enterprise 360 (sponsored by Impetus)

Anand Raman (Impetus Technologies)

Is a single source of truth across the enterprise possible, or is it just an expensive myth? Anand Raman explains why you need a holistic decision framework that addresses multiple facets from platform to processes. Join in to explore EDW modernization strategies, self-service analytics, and interactive insights on big data and discover a process to get to a unified data model.

5:25pm-6:05pm (40m) Sponsored

How Bell Canada increased the scale of BI exponentially with OLAP on big data (sponsored by Kyvos Insights)

Mark Huang (Bell Canada)

Like all telecommunication giants, Bell Canada relies on huge volumes of data to make accurate business decisions and deliver better services. Mark Huang discusses why Bell Canada chose Kyvos’s OLAP on big data technology to achieve multidimensional analytics and how it helped the company meet its growing business reporting demands.

11:20am-12:00pm (40m) Sponsored

Commercial software in an increasingly open source ecosystem (sponsored by SAS)

Paul Kent (SAS)

Software is eating the world, and open source is eating the software. Most contemporary analytics shops use a lot of open source software in their analytics platform. So where does commercial software like SAS fit? Paul Kent explains how you can achieve the best of both worlds by combining your favorite open source software with the power of SAS analytics.

1:15pm-1:55pm (40m) Sponsored

Feet on the ground, head in the clouds (sponsored by AtScale)

Mark Stange-Tregear (Ebates)

Interested in how Ebates is using a hybrid on-premises and cloud implementation to scale out its centralized business intelligence and data hub? Mark Stange-Tregear shares the history, business context, and technical plan around Ebates’s hybrid Hadoop-AWS cloud approach.

2:05pm-2:45pm (40m) Sponsored

A tale of two BI standards: Data warehouses and data lakes (sponsored by Arcadia Data)

Randy Lea (Arcadia Data)

The use of data lakes continues to grow, and the right business intelligence (BI) and analytics tools on data lakes are critical to data lake success. Randy Lea explains why existing BI tools work well for data warehouses but not data lakes and why every organization should have two BI standards: one for data warehouses and one for data lakes.

2:55pm-3:35pm (40m) Sponsored

Enabling predictive maintenance using automated IoT data pipelines (sponsored by BMC)

Basil Faruqui (BMC Software)

Basil Faruqui demonstrates how to simplify the automation and orchestration of an IoT-driven data pipeline in a cloud environment where machine learning algorithms predict failures.

4:35pm-5:15pm (40m) Sponsored

Best practices for migrating big data workloads to Amazon Web Services (sponsored by Amazon Web Services)

Bruno Faria (Amazon Web Services)

Bruno Faria explains how to identify the components and workflows in your current environment and shares best practices to migrate these workloads to AWS.

5:25pm-6:05pm (40m) Sponsored

A tale of two BI standards: Data warehouses and data lakes (sponsored by Arcadia Data)

Randy Lea (Arcadia Data)

8:50am-9:00am (10m)

Wednesday keynotes

Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes.

9:00am-9:15am (15m)

The future of data warehousing

Anupam Singh (Cloudera), Brian Coyne (PNC)

Data volumes don’t translate to business value. What matters is your data platform’s ability to support unprecedented numbers of business users and use cases. Anupam Singh and Brian Coyne look at some of the challenges posed by data-hungry organizations and share new techniques to extract meaningful insights at the speed of today’s modern business.

9:15am-9:25am (10m) Ethics and Privacy

Managing risk in machine learning

Ben Lorica (O'Reilly)

As companies begin adopting machine learning, important considerations, including fairness, transparency, privacy, and security, need to be accounted for. Ben Lorica offers an overview of recent tools for building privacy-preserving and secure machine learning products and services.

9:25am-9:35am (10m) Sponsored

The answer to life, the universe, and everything: But can you get that into production? (sponsored by MapR)

Ted Dunning (MapR, now part of HPE)

There’s real value in big data and more waiting when you add real-time, but to get the payoff, you need successful deployments of your AI and data-intensive applications. You need to be ready with your current applications in production but must have an architecture and infrastructure that are ready for the next ones as well. Ted Dunning explores how others have fared in this journey.

9:35am-9:50am (15m) Financial Services, Machine Learning in the enterprise

Von Neumann to deep learning: Data revolutionizing the future

Jeffrey Wecker (Goldman Sachs)

Jeffrey Wecker leads a deep dive on data in financial services, with perspectives on the evolving landscape of data science, the advent of alternative data, the importance of data centricity, and the future for machine learning and AI.

9:50am-9:55am (5m) Sponsored

AI, ML, and the IoT will destroy the data center and the cloud (just not in the way you think) (sponsored by Cisco)

DD Dasgupta (Cisco)

DD Dasgupta explores the exciting development of the edge-cloud continuum, which is redefining business models and technology strategies while creating a vast array of new applications that will power the digital age. The continuum is also destroying what we know about the centralized data centers and cloud computing infrastructures that were so vital to the success of the previous computing eras.

9:55am-10:15am (20m)

The Missing Piece

Cassie Kozyrkov (Google)

Why do businesses fail at machine learning despite its tremendous potential and the excitement it generates? Is the answer always in data, algorithms, and infrastructure, or is there a subtler problem? Will things improve in the near future? Let's talk about some lessons learned at Google and what they mean for applied data science.

10:15am-10:25am (10m) Sponsored

Leveraging the best of the past to power a better future (sponsored by MemSQL)

Drew Paroski (MemSQL), Aatif Din (Fanatics)

Today’s successful businesses utilize data better than their competitors; however, data sprawl and inefficient data infrastructure restrict what’s possible. Blending the best of the past with the software innovations of today will solve future data challenges. Drew Paroski shares how to develop modern database applications without sacrificing cost savings, data familiarity, and flexibility.

10:25am-10:45am (20m) Blockchain and decentralization, Financial Services

The power of Ethereum

Joseph Lubin (Consensus Systems)

Ethereum is a world computer on top of a peer-to-peer network that runs smart contracts: applications that run exactly as programmed without the possibility of censorship, fraud, or third-party interference. Until now, businesses had to build their systems on database technologies that resulted in siloed and redundant information in typically adversarial contexts.

10:50am-11:20am (30m)

Break: Morning break sponsored by Cisco

3:35pm-4:35pm (1h)

Break: Afternoon Break sponsored by Intel

7:30am-8:45am (1h 15m)

Break: Morning Coffee

8:00am-8:30am (30m)

Speed Networking

Gather before keynotes on Wednesday morning to enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with other attendees.

12:00pm-1:15pm (1h 15m)

Wednesday Topic Tables at Lunch

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics.

12:00pm-1:15pm (1h 15m)

Wednesday Business Summit Lunch

Join fellow executives, business leaders, and strategists for a networking lunch on Wednesday for Strata Business Summit attendees and speakers.

12:00pm-1:15pm (1h 15m)

Better Together: Women in Big Data Luncheon (sponsored by SAP and Intel)

If you’re looking to find like minds and make new professional connections, come to the women's networking lunch on Wednesday.

7:30pm-10:30pm (3h)

Data After Dark

Don't miss an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata in New York.

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of Strata Data Conference contacts

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com
