Presented By O’Reilly and Cloudera

San Francisco • London • New York

Make Data Work

September 11, 2018: Training & Tutorials
September 12–13, 2018: Keynotes & Sessions
New York, NY

Tuesday, 09/11/2018

9:00am

Expand your data science and machine learning skills with Python, R, SQL, Spark, and TensorFlow

9:00am–5:00pm Tuesday, 09/11/2018

Location: 1A 01/02 Level: Beginner

Ian Cook (Cloudera)

Average rating:

(4.86, 7 ratings)

Advancing your career in data science requires learning new languages and frameworks—but learners face an overwhelming array of choices, each with different syntaxes, conventions, and terminology. Ian Cook simplifies the learning process by elucidating the abstractions common to these systems. Through hands-on exercises, you'll overcome obstacles to getting started using new tools. Read more.

Machine learning from scratch in TensorFlow

9:00am–5:00pm Tuesday, 09/11/2018

Location: 1A 03

Secondary topics: Deep Learning

Dylan Bargteil (The Data Incubator)

The TensorFlow library uses data flow graphs for numerical computation, with automatic parallelization across several CPUs or GPUs. This architecture makes it ideal for implementing neural networks and other machine learning algorithms. Dylan Bargteil introduces TensorFlow's capabilities through its Python interface. Read more.

Minimum viable machine learning: The applied data science bootcamp (sponsored by DXC Technology)

9:00am–5:00pm Tuesday, 09/11/2018

Location: 1A 04/05

Jerry Overton (DXC), Ashim Bose (DXC), Samir Sehovic (DXC)

Average rating:

(5.00, 1 rating)

Acquiring machine learning (ML) technology is relatively straightforward, but ML must be applied to be useful. In this one-day boot camp that is equal parts hackathon, presentation, and group participation, Jerry Overton, Ashim Bose, and Samir Sehovic teach you how to apply advanced analytics in ways that reshape the enterprise and improve outcomes. Read more.

Real-time systems with Spark Streaming and Kafka

9:00am–5:00pm Tuesday, 09/11/2018

Location: 1A 15/16 Level: Intermediate

Jesse Anderson (Big Data Institute)

Average rating:

(1.00, 1 rating)

To handle real-time big data, you need to solve two difficult problems: how do you ingest that much data and how will you process that much data? Jesse Anderson explores the latest real-time frameworks and explains how to choose the right one for your company. Read more.

Apache Spark programming

9:00am–5:00pm Tuesday, 09/11/2018

Location: 1A 17

Kenneth Jones (Databricks, Inc.)

Ken Jones walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs. Read more.

Hands-on data science with Python

9:00am–5:00pm Tuesday, 09/11/2018

Location: 1E 17

Zachary Glassman (The Data Incubator)

Zachary Glassman leads a hands-on dive into building intelligent business applications using machine learning, walking you through all the steps of developing a machine learning pipeline. You'll explore data cleaning, feature engineering, model building and evaluation, and deployment and extend these models into two applications using a real-world dataset. Read more.

Architecting a data platform for enterprise use

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1A 06/07 Level: Intermediate

Secondary topics: Data Platforms

Mark Madsen (Teradata), Todd Walter (Archimedata)

Average rating:

(3.50, 10 ratings)

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that is not subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure. Read more.

Findata Day

9:00am–5:00pm Tuesday, 09/11/2018

Location: 1A 08

Alistair Croll (Solve For Interesting), Robert Passarella (Alpha Features), Amro Alkhatib (National Health Insurance Company-Daman), Mridul Mishra (Fidelity Investments), Patrick Angeles (Cloudera), James Psota (Panjiva), Andreas Kohlmaier (Munich Re), Paul Lashmet (Arcadia Data), Nick Curcuru (Mastercard), Robin Way (Corios), Theresa Johnson (Airbnb), Jane Tran (Unqork), Swatee Singh (American Express)

From analyzing risk and detecting fraud to predicting payments and improving customer experience, take a deep dive into the ways data technologies are transforming the financial industry. Read more.

Building a large-scale machine learning application using Amazon SageMaker and Spark

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1A 10 Level: Intermediate

David Arpin (Amazon Web Services)

Average rating:

(2.80, 10 ratings)

David Arpin walks you through building a machine learning application, from data manipulation to algorithm training to deployment to a real-time prediction endpoint, using Spark and Amazon SageMaker. Read more.

Managing data science in the enterprise

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1A 12/14 Level: Non-technical

Secondary topics: Machine Learning in the enterprise

Joshua Poduska (Domino Data Lab), Patrick Harrison (S&P Global)

Average rating:

(4.29, 7 ratings)

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders deliver measurable impact on an increasing share of an enterprise’s KPIs. Joshua Poduska and Patrick Harrison detail how leading organizations have taken a holistic approach to people, process, and technology to build a sustainable competitive advantage. Read more.

Deep learning methods for natural language processing

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1A 21/22 Level: Intermediate

Secondary topics: Deep Learning, Text and Language processing and analysis

Garrett Hoffman (StockTwits)

Average rating:

(4.75, 4 ratings)

Garrett Hoffman walks you through deep learning methods for natural language processing and natural language understanding tasks, using a live example in Python and TensorFlow with StockTwits data. Methods include word2vec, recurrent neural networks and variants (LSTM, GRU), and convolutional neural networks. Read more.

Practical techniques for interpreting machine learning models

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1A 23/24 Level: Intermediate

Secondary topics: Ethics and Privacy, Health and Medicine

Patrick Hall (bnh.ai | H2O.ai), Avni Wadhwa (H2O.ai), Mark Chan (H2O.ai)

Average rating:

(4.50, 4 ratings)

Transparency, auditability, and stability are crucial for business adoption and human acceptance of complex machine learning models. Patrick Hall, Avni Wadhwa, and Mark Chan share practical and productizable approaches for explaining, testing, and visualizing machine learning models using open source, Python-friendly tools such as GraphViz, H2O, and XGBoost. Read more.

Model serving and management at scale using open source tools

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1E 06 Level: Intermediate

Secondary topics: Model lifecycle management

Dan Crankshaw (UC Berkeley RISELab)

Average rating:

(5.00, 1 rating)

Dan Crankshaw offers an overview of the current challenges in deploying machine learning applications into production and the current state of prediction serving infrastructure. He then leads a deep dive into the Clipper serving system and shows you how to get started. Read more.

Stream processing with Kafka and KSQL

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1E 07/08 Level: Intermediate

Tim Berglund (Confluent)

Average rating:

(4.33, 3 ratings)

Tim Berglund leads this solid introduction to Apache Kafka as a streaming data platform. You'll cover the internal architecture, APIs, and platform components like Kafka Connect and Kafka Streams, then finish with an exercise processing streaming data using KSQL, the new SQL-like declarative stream processing language for Kafka. Read more.

Making interactive browser-based visualizations easy in Python

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1E 09 Level: Intermediate

James Bednar (Anaconda)

Average rating:

(4.60, 5 ratings)

Python lets you solve data science problems by stitching together packages from the Python ecosystem, but it can be difficult to assemble the right tools to solve real-world problems. James Bednar walks you through using the 15+ packages covered by the new PyViz.org initiative to make it simple to build interactive plots and dashboards, even for large, streaming, and highly multidimensional data. Read more.

Data Case Studies

9:00am–5:00pm Tuesday, 09/11/2018

Location: 1E 10

Paco Nathan (derwen.ai), Katharina Warzel (EveryMundo), Mike Berger (Mount Sinai Health System), Sam Helmich (Deere & Company), Stephanie Fischer (datanizing GmbH), Maryam Jahanshahi (TapRecruit), Greg Quist (SmartCover Systems), Ann Nguyen (Whole Whale), Steve Otto (Navistar), Jennifer Lim (Cerner), S Anand (Gramener), Ian Brooks (Cloudera)

Hear practical insights from household brands and global companies: the challenges they tackled, approaches they took, and the benefits—and drawbacks—of their solutions. Read more.

Getting ready for GDPR: Securing and governing hybrid, cloud, and on-premises big data deployments, step by step

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1E 11 Level: Intermediate

Secondary topics: Data preparation, governance and privacy, Ethics and Privacy

Mark Donsky (Okera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera), Ifigeneia Derekli (Cloudera), Camila Hiskey (Cloudera)

Average rating:

(4.50, 2 ratings)

New regulations such as GDPR are driving new compliance, governance, and security challenges for big data. Infosec and security groups must ensure a consistently secured and governed environment across multiple workloads. Mark Donsky, Syed Rafice, Mubashir Kazia, Ifigeneia Derekli, and Camila Hiskey share hands-on best practices for meeting these challenges, with special attention paid to GDPR. Read more.

Designing modern streaming data applications

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1E 12/13 Level: Intermediate

Secondary topics: Data Platforms

Arun Kejariwal (Independent), Karthik Ramasamy (Streamlio)

Average rating:

(3.12, 8 ratings)

Arun Kejariwal and Karthik Ramasamy lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, covering messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. They also share case studies from IoT, gaming, and healthcare, along with their experience operating these systems at internet scale. Read more.

Learning machine learning using astronomy datasets

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1E 14 Level: Beginner

Viviana Acquaviva (CUNY New York City College of Technology)

Average rating:

(4.75, 4 ratings)

Using interesting, diverse publicly available datasets and actual problems in astronomy research, Viviana Acquaviva leads an intermediate tutorial on machine learning. You'll learn how to customize algorithms and evaluation metrics required by scientific applications and discover best practices for choosing, developing, and evaluating machine learning algorithms in "real-world" datasets. Read more.

Deep learning-based search and recommendation systems using TensorFlow

9:00am–12:30pm Tuesday, 09/11/2018

Location: 1E 15/16 Level: Intermediate

Secondary topics: Deep Learning, Recommendation Systems

Vijay Agneeswaran (Walmart Labs), Abhishek Kumar (Publicis Sapient)

Average rating:

(4.40, 5 ratings)

Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a deep learning recommender system based on intent prediction, drawing on a real-world implementation for an ecommerce client. Read more.

10:30am

10:30am–11:00am Tuesday, 09/11/2018

Location: 1A & 1E Halls

Morning Break (30m)

12:30pm

12:30pm–1:30pm Tuesday, 09/11/2018

Location: 3A

Lunch (1h)

1:30pm

Architecting a next-generation data platform

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1A 06/07 Level: Advanced

Secondary topics: Data Platforms

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Average rating:

(3.12, 8 ratings)

Using Customer 360 and the internet of things as examples, Jonathan Seidman and Ted Malaska explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, including components like Kafka, Flink, Kudu, Spark Streaming, and Spark SQL and modern storage engines to enable new forms of data processing and analytics. Read more.

Data science with Unix power tools

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1A 10 Level: Intermediate

Jeroen Janssens (Data Science Workshops)

Average rating:

(3.00, 3 ratings)

The Unix command line remains an amazing environment for efficiently performing tedious but essential data science tasks. By combining small, powerful command-line tools, you can quickly scrub, explore, and model your data as well as hack together prototypes. Join Jeroen Janssens for a hands-on workshop based on his book Data Science at the Command Line. Read more.

Recurrent neural networks for time series analysis

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1A 12/14 Level: Intermediate

Secondary topics: Deep Learning, Temporal data and time-series analytics

Bruno Gonçalves (Data For Science)

Average rating:

(3.14, 7 ratings)

Time series are everywhere around us. Understanding them requires taking into account the sequence of values seen in previous steps and even long-term temporal correlations. Join Bruno Gonçalves to learn how to use recurrent neural networks to model and forecast time series and discover the advantages and disadvantages of recurrent neural networks with respect to more traditional approaches. Read more.

Natural language understanding at scale with Spark NLP

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1A 21/22 Level: Intermediate

Secondary topics: Text and Language processing and analysis

David Talby (Pacific AI), Claudiu Branzan (Accenture), Alex Thomas (John Snow Labs)

Average rating:

(3.00, 7 ratings)

David Talby, Claudiu Branzan, and Alex Thomas lead a hands-on tutorial for scalable NLP using the highly performant, highly scalable open source Spark NLP library. You’ll spend about half your time coding as you work through four sections, each with an end-to-end working codebase that you can change and improve. Read more.

Hands-on Kafka streaming microservices with Akka Streams and Kafka Streams

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1A 23/24 Level: Intermediate

Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)

Average rating:

(3.67, 3 ratings)

Dean Wampler and Boris Lublinsky walk you through building streaming apps as microservices using Akka Streams and Kafka Streams. Dean and Boris discuss the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead. You'll also discover a few ML model serving ideas along the way. Read more.

Apache Metron: Open source cybersecurity at scale

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1E 06 Level: Intermediate

Carolyn Duby (Cloudera)

Carolyn Duby shows you how to find the cybersecurity threat needle in your event haystack using Apache Metron: a real-time, horizontally scalable open source platform. After this interactive overview of the platform's major features, you'll be ready to analyze your own haystack back at the office. Read more.

Leveraging Spark and deep learning frameworks to understand data at scale

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1E 07/08 Level: Intermediate

Secondary topics: Deep Learning

Vartika Singh (Cloudera), Alan Silva (Cloudera), Alex Bleakley (Cloudera), Steven Totman (Cloudera), Mirko Kämpf (Cloudera), Syed Nasar (Cloudera)

Average rating:

(1.00, 1 rating)

Vartika Singh, Alan Silva, Alex Bleakley, Steven Totman, Mirko Kämpf, and Syed Nasar outline approaches for preprocessing, training, inference, and deployment across datasets (time series, audio, video, text, etc.) that leverage Spark, its extended ecosystem of libraries, and deep learning frameworks. Read more.

From training to serving: Deploying TensorFlow models with Kubernetes

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1E 09 Level: Intermediate

Secondary topics: Model lifecycle management

Brian Foo (Google), Holden Karau (Independent), Jay Smith (Google)

Average rating:

(2.00, 7 ratings)

TensorFlow and Keras are popular libraries for training deep models due to hardware accelerator support. Brian Foo, Jay Smith, and Holden Karau explain how to bring deep learning models from training to serving in a cloud production environment. You'll learn how to unit-test, export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes. Read more.

How to be fair: A tutorial for beginners

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1E 11 Level: Intermediate

Secondary topics: Ethics and Privacy

Aileen Nielsen (Skillman Consulting)

Average rating:

(4.00, 4 ratings)

There is mounting evidence that the widespread deployment of machine learning and artificial intelligence in business and government applications is reproducing or even amplifying existing prejudices and social inequalities. Aileen Nielsen demonstrates how to identify and avoid bias and other unfairness in your analyses. Read more.

Building your first big data application on AWS

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1E 12/13 Level: Intermediate

Jorge Lopez (Amazon Web Services), Radhika Ravirala (Amazon Web Services), Paul Sears (Amazon Web Services), Bruno Faria (Amazon Web Services)

Average rating:

(2.86, 7 ratings)

Want to learn how to use Amazon's big data web services to launch your first big data application in the cloud? Jorge Lopez, Radhika Ravirala, Paul Sears, and Bruno Faria walk you through building a big data application using a combination of open source technologies and AWS managed services. Read more.

Running multidisciplinary big data workloads in the cloud

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1E 14 Level: Intermediate

Sudhanshu Arora (Cloudera), Stefan Salandy (Cloudera), Suraj Acharya (Cloudera), Brandon Freeman (Cloudera), Jason Wang (Cloudera), Shravan Pabba (Cloudera)

Attend this tutorial to learn how to successfully run a data analytics pipeline in the cloud, integrate data engineering and data analytic workflows, and apply best practices for cloud-based analytics pipelines. Along the way, you'll see how to share metadata across workloads in a big data PaaS. Read more.

From theory to data product: Applying data science methods to effect business change

1:30pm–5:00pm Tuesday, 09/11/2018

Location: 1E 15/16 Level: Beginner

Secondary topics: Machine Learning in the enterprise

Janet Forbes (T4G), Danielle Leighton (T4G), Lindsay Brin (T4G)

Average rating:

(2.67, 9 ratings)

Janet Forbes, Danielle Leighton, and Lindsay Brin lead a primer on crafting well-conceived data science projects that uncover valuable business insights. Using case studies and hands-on skills development, Janet, Danielle, and Lindsay walk you through essential techniques for effecting real business change. Read more.

3:00pm

3:00pm–3:30pm Tuesday, 09/11/2018

Location: 1A & 1E Halls

Afternoon Break (30m)

5:00pm

Opening Reception

5:00pm–6:30pm Tuesday, 09/11/2018

Location: 3B | Expo Hall

Enjoy delicious snacks and beverages with fellow Strata attendees, speakers, and sponsors at the Opening Reception, happening immediately after tutorials on Tuesday. Read more.

Wednesday, 09/12/2018

7:30am

7:30am–8:45am Wednesday, 09/12/2018

Location: 3E Foyer

Morning Coffee (1h 15m)

8:00am

Speed Networking

8:00am–8:30am Wednesday, 09/12/2018

Location: Crystal Palace

Gather before keynotes on Wednesday morning to enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with other attendees. Read more.

8:50am

Wednesday keynotes

8:50am–9:00am Wednesday, 09/12/2018

Location: 3E

Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

Average rating:

(2.88, 8 ratings)

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the first day of keynotes. Read more.

9:00am

The future of data warehousing

9:00am–9:15am Wednesday, 09/12/2018

Location: 3E

Anupam Singh (Cloudera), Brian Coyne (PNC)

Average rating:

(3.24, 17 ratings)

Data volumes don’t translate to business value. What matters is your data platform’s ability to support unprecedented numbers of business users and use cases. Anupam Singh and Brian Coyne look at some of the challenges posed by data-hungry organizations and share new techniques to extract meaningful insights at the speed of today’s modern business. Read more.

9:15am

Managing risk in machine learning

9:15am–9:25am Wednesday, 09/12/2018

Location: 3E

Secondary topics: Ethics and Privacy

Ben Lorica (O'Reilly)

Average rating:

(3.92, 13 ratings)

As companies begin adopting machine learning, important considerations, including fairness, transparency, privacy, and security, need to be accounted for. Ben Lorica offers an overview of recent tools for building privacy-preserving and secure machine learning products and services. Read more.

9:25am

The answer to life, the universe, and everything: But can you get that into production? (sponsored by MapR)

9:25am–9:35am Wednesday, 09/12/2018

Location: 3E

Ted Dunning (MapR, now part of HPE)

Average rating:

(2.79, 19 ratings)

There’s real value in big data and more waiting when you add real-time, but to get the payoff, you need successful deployments of your AI and data-intensive applications. You need to be ready with your current applications in production but must have an architecture and infrastructure that are ready for the next ones as well. Ted Dunning explores how others have fared in this journey. Read more.

9:35am

Von Neumann to deep learning: Data revolutionizing the future

9:35am–9:50am Wednesday, 09/12/2018

Location: 3E

Secondary topics: Financial Services, Machine Learning in the enterprise

Jeffrey Wecker (Goldman Sachs)

Average rating:

(3.12, 26 ratings)

Jeffrey Wecker leads a deep dive on data in financial services, with perspectives on the evolving landscape of data science, the advent of alternative data, the importance of data centricity, and the future for machine learning and AI. Read more.

9:50am

AI, ML, and the IoT will destroy the data center and the cloud (just not in the way you think) (sponsored by Cisco)

9:50am–9:55am Wednesday, 09/12/2018

Location: 3E

DD Dasgupta (Cisco)

Average rating:

(3.60, 15 ratings)

DD Dasgupta explores the exciting development of the edge-cloud continuum, which is redefining business models and technology strategies while creating a vast array of new applications that will power the digital age. The continuum is also destroying what we know about the centralized data centers and cloud computing infrastructures that were so vital to the success of the previous computing eras. Read more.

9:55am

The Missing Piece

9:55am–10:15am Wednesday, 09/12/2018

Location: 3E

Cassie Kozyrkov (Google)

Average rating:

(4.67, 30 ratings)

Why do businesses fail at machine learning despite its tremendous potential and the excitement it generates? Is the answer always in data, algorithms, and infrastructure, or is there a subtler problem? Will things improve in the near future? Let's talk about some lessons learned at Google and what they mean for applied data science. Read more.

10:15am

Leveraging the best of the past to power a better future (sponsored by MemSQL)

10:15am–10:25am Wednesday, 09/12/2018

Location: 3E

Drew Paroski (MemSQL), Aatif Din (Fanatics)

Average rating:

(2.92, 13 ratings)

Today’s successful businesses utilize data better than their competitors; however, data sprawl and inefficient data infrastructure restrict what’s possible. Blending the best of the past with the software innovations of today will solve future data challenges. Drew Paroski shares how to develop modern database applications without sacrificing cost savings, data familiarity, and flexibility. Read more.

10:25am

The power of Ethereum

10:25am–10:45am Wednesday, 09/12/2018

Location: 3E

Secondary topics: Blockchain and decentralization, Financial Services

Joseph Lubin (Consensus Systems)

Average rating:

(3.00, 12 ratings)

Ethereum is a world computer on top of a peer-to-peer network that runs smart contracts: applications that run exactly as programmed without the possibility of censorship, fraud, or third-party interference. Until now, businesses had to build their systems on database technologies that resulted in siloed and redundant information in typically adversarial contexts. Read more.

10:50am

10:50am–11:20am Wednesday, 09/12/2018

Location: 3B | Expo Hall

Morning break sponsored by Cisco (30m)

11:20am

Data operations problems created by deep learning and how to fix them (sponsored by MapR)

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 01/02

Jim Scott (NVIDIA)

Drawing on his experience working with customers across many industries, including chemical sciences, healthcare, and oil and gas, Jim Scott details the major impediments to successful completion of deep learning projects and solutions while walking you through a customer use case. Read more.

Ubiquitous machine learning (sponsored by Cisco)

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 03

Chiang Yang (Cisco)

Data is the lifeblood of an enterprise, and it's being generated everywhere. To overcome the challenges of data gravity, data analytics, including machine learning, is best done where the data is located: ubiquitous machine learning. Han Yang explains how to overcome the challenges of machine learning everywhere. Read more.

Using modern database and open source tools to accelerate client service delivery (sponsored by MemSQL)

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 04/05

Petrus Smith (PwC)

Peet Smith explains how PwC is using modern database tools with a combination of open source technologies to automate and scale data ingestion and transformation to get data to engagement teams to help them streamline and accelerate client service delivery. Read more.

Machine learning for time series: What works and what doesn't

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 15/16 Level: Intermediate

Secondary topics: Deep Learning, Retail and e-commerce, Temporal data and time-series analytics

Mikio Braun (Zalando)

Average rating:

(4.86, 7 ratings)

Time series data has many applications in industry, from analyzing server metrics to monitoring IoT signals and outlier detection. Mikio Braun offers an overview of time series analysis with a focus on modern machine learning approaches and practical considerations, including recommendations for what works and what doesn’t, and industry use cases. Read more.

Guidebook to unwind the enterprise "data hairball" and get ready for AI (sponsored by IBM)

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1E 17

Tim Davis (IBM)

Average rating:

(3.33, 3 ratings)

Tim Davis discusses key pain points and solutions to problems many enterprises face with data in silos, poor-quality data that cannot always be trusted, and managing and making large volumes of data available to derive more accurate insights and machine learning models. Read more.

Semantic recommendations

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 06/07 Level: Beginner

Secondary topics: Deep Learning, Recommendation Systems

Shioulin Sam (Cloudera Fast Forward Labs)

Average rating:

(3.25, 4 ratings)

Recent advances in deep learning allow us to use the semantic content of items in recommendation systems, addressing a weakness of traditional methods. Shioulin Sam explores the limitations of classical approaches and explains how using the content of items can help solve common recommendation pitfalls, such as the cold start problem, and open up new product possibilities. Read more.

BlazeIt: An exploratory video analytics engine

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 08 Level: Advanced

Secondary topics: Media, Marketing, Advertising

Daniel Kang (Stanford University)

Average rating:

(4.00, 2 ratings)

Daniel Kang offers an overview of exploratory video analytics engine BlazeIt, which provides FrameQL, a declarative SQL-like language for querying video, and a query optimizer for executing these queries. You'll see how FrameQL can capture a large set of real-world queries ranging from aggregation to scrubbing and how BlazeIt can execute certain queries up to 2,000x faster than a naive approach. Read more.

Your 10 billion rides are arriving now: Scaling Apache Spark for data pipelines and intelligent systems at Uber

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 10 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines, Transportation and Logistics

Felix Cheung (Uber)

Average rating:

(4.60, 5 ratings)

Did you know that your Uber rides are powered by Apache Spark? Join Felix Cheung to learn how Uber is building its data platform with Apache Spark at enormous scale and discover the unique challenges the company faced and overcame. Read more.

Breaking the rules: End-stage renal disease prediction

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 12/14 Level: Beginner

Secondary topics: Health and Medicine

Olga Cuznetova (Optum), Manna Chang (Optum)

Average rating:

(3.33, 3 ratings)

Olga Cuznetova and Manna Chang demonstrate supervised and unsupervised learning methods to work with claims data and explain how the methods complement each other. The supervised method looks at CKD patients at risk of developing end-stage renal disease (ESRD), while the unsupervised approach looks at the classification of patients that tend to develop this disease faster than others. Read more.

Protecting sensitive data in huge datasets: Cloud tools you can use

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 21/22 Level: Intermediate

Secondary topics: Ethics and Privacy

Felipe Hoffa (Google), Damien Desfontaines (Google | ETH Zürich)

Average rating:

(4.00, 1 rating)

Before releasing a public dataset, practitioners need to thread the needle between utility and protection of individuals. Felipe Hoffa and Damien Desfontaines explore how to handle massive public datasets, taking you from theory to real life as they showcase newly available tools that help with PII detection and bring concepts like k-anonymity and l-diversity to the practical realm. Read more.

The future of ETL isn’t what it used to be.

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines

Gwen Shapira (Confluent)

Average rating:

(4.00, 4 ratings)

Gwen Shapira shares design and architecture patterns that are used to modernize data engineering. You'll learn how modern engineering organizations use Apache Kafka, microservices, and event streams to efficiently build data pipelines that are scalable, reliable, and built to evolve. Read more.

Commercial software in an increasingly open source ecosystem (sponsored by SAS)

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1E 06

Paul Kent (SAS)

Average rating:

(5.00, 1 rating)

Software is eating the world, and open source is eating the software. Most contemporary analytics shops use a lot of open source software in their analytics platform. So where does commercial software like SAS fit? Paul Kent explains how you can achieve the best of both worlds by combining your favorite open source software with the power of SAS analytics. Read more.

Processing fast data with Apache Spark: A tale of two APIs

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1E 07/08 Level: Intermediate

Gerard Maas (Lightbend)

Average rating:

(5.00, 1 rating)

Apache Spark has two streaming APIs: Spark Streaming and Structured Streaming. Gerard Maas offers a critical overview of their differences with regard to key aspects of a streaming application: API usability, dealing with time, dealing with state, machine learning capabilities, and more. You'll learn when to pick one over the other or combine both to implement resilient streaming pipelines. Read more.

DIY versus designer approaches to deploying data center infrastructure for machine learning and analytics

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1E 09 Level: Beginner

Secondary topics: Data Platforms

Cory Minton (Dell EMC), Colm Moynihan (Cloudera)

Average rating:

(5.00, 1 rating)

Cory Minton and Colm Moynihan explain how to choose the right deployment model for on-premises infrastructure to reduce risk, reduce costs, and be more nimble. Read more.

From data governance to AI governance: The CIO's new role

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1E 10/11 Level: Non-technical

Secondary topics: Data preparation, governance and privacy, Machine Learning in the enterprise

JF Gagne (Element AI)

Average rating:

(3.50, 4 ratings)

JF Gagne explains why the CIO is going to need a broader mandate in the company to better align their AI training and outcomes with business goals and compliance. This mandate should include an AI governance team that is well staffed and deeply established in the company, in order to catch biases that can develop from faulty goals or flawed data. Read more.

Agile for data science teams

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1E 12/13 Level: Intermediate

Secondary topics: Machine Learning in the enterprise

Jennifer Prendki (Figure Eight)

Average rating:

(4.38, 8 ratings)

Agile methodologies have been widely successful for software engineering teams but seem inappropriate for data science teams, because data science is part engineering, part research. Jennifer Prendki demonstrates how, with a minimum amount of tweaking, data science managers can adapt Agile techniques and establish best practices to make their teams more efficient. Read more.

Executive Briefing: GDPR—Getting your data ready for heavy, new EU privacy regulations

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1E 14 Level: Intermediate

Secondary topics: Data preparation, governance and privacy, Ethics and Privacy

Mark Donsky (Okera), Steven Ross (Cloudera)

In May 2018, the General Data Protection Regulation (GDPR) went into effect for firms doing business in the EU, but many companies still aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations. Read more.

Next-generation cybersecurity via data fusion, AI, and big data: Pragmatic lessons from the front lines in financial services

11:20am–12:00pm Wednesday, 09/12/2018

Location: Expo Hall Level: Non-technical

Secondary topics: Data Integration and Data Pipelines, Financial Services

Usama Fayyad (Open Insights & OODA Health, Inc.), Troels Oerting (WEF Global Cybersecurity Center)

Average rating:

(3.00, 1 rating)

Usama Fayyad and Troels Oerting share outcomes and lessons learned from building and deploying a global data fusion, incident analysis/visualization, and effective cybersecurity defense based on big data and AI at a major EU bank, in collaboration with several financial services institutions. Read more.

Deep learning: Assessing analytics project feasibility and requirements (sponsored by NVIDIA)

11:20am–12:00pm Wednesday, 09/12/2018

Location: 1 E15

Ward Eldred (NVIDIA)

Average rating:

(5.00, 2 ratings)

Ward Eldred offers an overview of the types of analytical problems that can be solved using deep learning and shares a set of heuristics that can be used to evaluate the feasibility of analytical AI projects. Read more.

12:00pm

Wednesday Topic Tables at Lunch

12:00pm–1:15pm Wednesday, 09/12/2018

Location: Expo Hall (Hall 3B)

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.

Better Together: Women in Big Data Luncheon (sponsored by SAP and Intel)

12:00pm–1:15pm Wednesday, 09/12/2018

Location: 3D 10/11

Average rating:

(3.20, 5 ratings)

If you’re looking to find like minds and make new professional connections, come to the women's networking lunch on Wednesday. Read more.

Wednesday Business Summit Lunch

12:00pm–1:15pm Wednesday, 09/12/2018

Location: 3D 09

Join fellow executives, business leaders, and strategists for a networking lunch on Wednesday for Strata Business Summit attendees and speakers. Read more.

1:15pm

Kubernetes plays Cupid for data scientists and IT (sponsored by MapR)

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 01/02

Skyler Thomas (MapR)

Average rating:

(5.00, 2 ratings)

In the past, there have been major challenges in quickly creating machine learning training environments and deploying trained models into production. Skyler Thomas details how Kubernetes helps data scientists and IT work in concert to speed model training and time-to-value. Read more.

Interactive business intelligence and OLAP on big data lakes using a Spark-native fast data mart (sponsored by Oracle + DataScience.com)

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 03

Srikanth Desikan (Oracle)

Average rating:

(5.00, 1 rating)

SparklineData is an in-memory distributed scale-out analytics platform built on Apache Spark to enable enterprises to query on data lakes directly with instant response times. Srikanth Desikan offers an overview of SparklineData and explains how it can enable new analytics use cases working on the most granular data directly on data lakes. Read more.

How the blurring of memory and storage is revolutionizing the data era (sponsored by Intel)

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 04/05

Arakere Ramesh (Intel), Bharath Yadla (Aerospike)

Persistent memory accelerates analytics, database, and storage workloads across a variety of use cases, bringing new levels of speed and efficiency to the data center and to in-memory computing. Arakere Ramesh and Bharath Yadla offer an overview of the newly announced Intel Optane data center persistent memory and share the exciting potential of this technology in analytics solutions. Read more.

Harnessing and customizing state-of-the-art recommendation solutions with OpenRec

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 15/16 Level: Intermediate

Secondary topics: Deep Learning, Media, Marketing, Advertising, Recommendation Systems, Retail and e-commerce

Longqi Yang (Cornell Tech, Cornell University)

State-of-the-art recommendation algorithms are increasingly complex and no longer one size fits all. Current monolithic development practice poses significant challenges to rapid, iterative, and systematic experimentation. Longqi Yang explains how to use OpenRec to easily customize state-of-the-art solutions for diverse scenarios. Read more.

Refactor your data warehouse with mobile analytics products (sponsored by Kyligence)

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1E 17

Zhi Zhu (China Construction Bank), Luke Han (Kyligence)

When China Construction Bank wanted to migrate 23,000+ reports to mobile, it chose Apache Kylin as the high-performance and high-concurrency platform to refactor its data warehouse architecture to serve 400K+ users. Zhi Zhu and Luke Han detail the necessary architecture and best practices for refactoring a data warehouse for mobile analytics. Read more.

Document vectors in the wild: Building a content recommendation system for Reuters.com

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 06/07 Level: Intermediate

Secondary topics: Media, Marketing, Advertising, Recommendation Systems, Text and Language processing and analysis

James Dreiss (Reuters)

Average rating:

(3.67, 3 ratings)

James Dreiss discusses the challenges in building a content recommendation system for one of the largest news sites in the world, Reuters.com. The particularities of the system include developing a scrolling newsfeed and the use of document vectors for semantic representation of content. Read more.

Why data scientists should love Linux containers

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 08 Level: Beginner

Secondary topics: Model lifecycle management

William Benton (Red Hat)

Average rating:

(5.00, 2 ratings)

Containers are a hot technology for application developers, but they also provide key benefits for data scientists. William Benton details the advantages of containers for data scientists and AI developers, focusing on high-level tools that will enable you to become more productive and collaborate more effectively. Read more.

The evolution of Netflix's S3 data warehouse

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 10 Level: Intermediate

Secondary topics: Data Platforms

Ryan Blue (Netflix), Daniel Weeks (Netflix)

Average rating:

(5.00, 3 ratings)

In the last few years, Netflix's data warehouse has grown to more than 100 PB in S3. Ryan Blue and Daniel Weeks share lessons learned, the tools Netflix currently uses and those it has retired, and the improvements it is rolling out, including Iceberg, a new table format for S3. Read more.

Correlation analysis on live data streams

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 12/14 Level: Intermediate

Secondary topics: Media, Marketing, Advertising, Temporal data and time-series analytics

Arun Kejariwal (Independent), Francois Orsini (MZ)

Average rating:

(4.00, 1 rating)

The rate of growth of data volume and velocity has been accelerating along with increases in the variety of data sources. This poses a significant challenge to extracting actionable insights in a timely fashion. Arun Kejariwal and Francois Orsini explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making. Read more.

A data marketplace case study with the blockchain and advanced multitenant Hadoop in a smart open data platform

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 21/22 Level: Intermediate

Secondary topics: Blockchain and decentralization, Data preparation, governance and privacy

Minh Chau Nguyen (ETRI), Heesun Won (ETRI)

Average rating:

(2.20, 5 ratings)

Minh Chau Nguyen and Heesun Won explain how to implement analytics services in data marketplace systems on a single Hadoop cluster across distributed data centers. The solution extends the overall architecture of the Hadoop ecosystem with the blockchain so that multiple tenants and authorized third parties can securely access data while still maintaining privacy, scalability, and reliability. Read more.

Lessons learned building a scalable and extendable data pipeline for Call of Duty

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines

Yaroslav Tkachenko (Activision)

Average rating:

(4.67, 3 ratings)

What's easier than building a data pipeline? You add a few Apache Kafka clusters and a way to ingest data, design a way to route your data streams, add a few stream processors and consumers, integrate with a data warehouse. . .wait, this looks like a lot of things. Join Yaroslav Tkachenko to learn best practices for building a data pipeline, drawn from his experience at Demonware/Activision. Read more.

Feet on the ground, head in the clouds (sponsored by AtScale)

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1E 06

Mark Stange-Tregear (Ebates)

Interested in how Ebates is using a hybrid on-premises and cloud implementation to scale out its centralized business intelligence and data hub? Mark Stange-Tregear shares the history, business context, and technical plan around Ebates’s hybrid Hadoop-AWS cloud approach. Read more.

Why and how to leverage the power and simplicity of SQL on Apache Flink

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1E 07/08 Level: Intermediate

Fabian Hueske (Ververica)

Average rating:

(5.00, 1 rating)

Fabian Hueske discusses why SQL is a great approach to unify batch and stream processing. He gives an update on Apache Flink's SQL support and shares some interesting use cases from large-scale production deployments. Finally, Fabian presents Flink's new query service that enables users and applications to submit streaming and batch SQL queries and retrieve low-latency updated results. Read more.

Data governance: A big job that's getting bigger

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1E 09 Level: Intermediate

Secondary topics: Data preparation, governance and privacy

Andrew Brust (Blue Badge Insights | ZDNet)

Average rating:

(4.50, 2 ratings)

Data governance has grown from a set of mostly data management-oriented technologies in the data warehouse era to encompass catalogs, glossaries, and more in the data lake era. Now new requirements are emerging, and new products are rising to meet the challenge. Andrew Brust tracks data governance's past and present and offers a glimpse of the future. Read more.

Data University: How Airbnb democratized data

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1E 10/11 Level: Non-technical

Secondary topics: Machine Learning in the enterprise, Retail and e-commerce

Erin Coffman (Airbnb)

Average rating:

(5.00, 7 ratings)

Airbnb has open-sourced many high-leverage data tools, including Airflow, Superset, and the Knowledge Repo, but adoption of these tools across the company was relatively low. Erin Coffman offers an overview of Data University, launched to make data more accessible and utilized in decision making at Airbnb. Read more.

Privacy by design: Building in data privacy and protection versus bolting it on later

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1E 12/13 Level: Advanced

Secondary topics: Data preparation, governance and privacy, Ethics and Privacy

Les McMonagle (BlueTalon)

Average rating:

(5.00, 2 ratings)

Privacy by design is a fundamentally important approach to achieving compliance with GDPR and other data privacy or data protection regulations. Les McMonagle outlines how organizations can save time and money while improving data security and regulatory compliance and dramatically reduce the risk of a data breach or expensive penalties for noncompliance. Read more.

Executive Briefing: Profit from AI and machine learning—The best practices for people and process

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1E 14 Level: Non-technical

Secondary topics: Machine Learning in the enterprise

Tony Baer (dbInsight), Florian Douetteau (DATAIKU)

Average rating:

(3.40, 5 ratings)

Tony Baer and Florian Douetteau share the results of research cosponsored by Ovum and Dataiku that surveyed a specially selected sample of chief data officers and data scientists on how to map roles and processes to make success with AI in the business repeatable. Read more.

A comparative analysis of the fundamentals of AWS and Azure

1:15pm–1:55pm Wednesday, 09/12/2018

Location: Expo Hall Level: Beginner

Jason Wang (Cloudera), Suraj Acharya (Cloudera), Tony Wu (Cloudera)

The largest infrastructure paradigm change of the 21st century is the shift to the cloud. Companies now face the difficult decision of which cloud to go with. This decision is not just financial and in many cases rests on the underlying infrastructure. Jason Wang, Suraj Acharya, and Tony Wu compare the relative strengths and weaknesses of AWS and Azure. Read more.

Simplifying AI infrastructure: Lessons in scaling a deep learning enterprise (sponsored by NVIDIA)

1:15pm–1:55pm Wednesday, 09/12/2018

Location: 1 E15

Darrin Johnson (NVIDIA)

While every enterprise is on a mission to infuse its business with deep learning, few know how to build the infrastructure to get them there. Darrin Johnson shares insights and best practices learned from NVIDIA's deep learning deployments around the globe that you can leverage to shorten deployment timeframes, improve developer productivity, and streamline operations. Read more.

2:05pm

A developer's guide to building AI applications (sponsored by Microsoft)

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 01/02

Anand Raman (Microsoft), Wee Hyong Tok (Microsoft)

Anand Raman and Wee Hyong Tok walk you through applying AI technologies in the cloud. You'll learn how to add prebuilt AI capabilities like object detection, face understanding, translation, and speech to applications, build cognitive search applications that understand deep content in images, text, and other data, use the Azure platform to accelerate machine learning, and more. Read more.

Preventing more fraud in less time with machine learning-driven data management (sponsored by Informatica)

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 03

Chris Wojdak (Symcor)

Average rating:

(4.67, 3 ratings)

Chris Wojdak explains how Symcor has transformed its big data architecture using Informatica’s comprehensive machine learning-based solutions for data integration, data quality, data cataloging, and data governance. Read more.

Governing your cloud-based enterprise data lake (sponsored by Zaloni)

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 04/05

Ben Sharma (Zaloni), Selwyn Collaco (TMX)

Average rating:

(5.00, 2 ratings)

Selwyn Collaco and Ben Sharma share insights from their real-world experience, discuss best practices for architecture, technology, data management, and governance to enable centralized data services, and explain how to leverage the Zaloni Data Platform (ZDP), an integrated self-service data platform, to operationalize the enterprise data lake. Read more.

Achieving personalization with LSTMs

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 15/16 Level: Intermediate

Secondary topics: Deep Learning, Recommendation Systems, Temporal data and time-series analytics, Transportation and Logistics

Ankit Jain (Uber)

Average rating:

(3.00, 3 ratings)

Personalization is a common theme in social networks and ecommerce businesses. Personalization at Uber involves an understanding of how each driver and rider is expected to behave on the platform. Ankit Jain explains how Uber employs deep learning using LSTMs and its huge database to understand and predict the behavior of each and every user on the platform. Read more.

Bringing together machine and human intelligence (sponsored by SAP)

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1E 17

Richard Mooney (SAP)

Intelligent enterprises—fueled by rapid advances in artificial intelligence (AI), machine learning (ML), and the internet of things (IoT)—promise significant business value. Richard Mooney explains how to achieve the game-changing outcomes of an intelligent enterprise, delivering value across business functions with the synergy of machine and human intelligence. Read more.

Diversification in recommender systems: Using topical variety to increase user satisfaction

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 06/07 Level: Intermediate

Secondary topics: Media, Marketing, Advertising, Recommendation Systems

Ahsan Ashraf (Pinterest)

Online recommender systems often rely heavily on user engagement features. This can cause a bias toward exploitation over exploration, overoptimizing on users' interests. Content diversification is important for user satisfaction, but measuring and evaluating impact is challenging. Ahsan Ashraf outlines techniques used at Pinterest that drove ~2–3% impression gains and a ~1% time-spent gain. Read more.

Bighead: Airbnb's end-to-end machine learning platform

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 08 Level: Beginner

Secondary topics: Data Platforms, Model lifecycle management, Retail and e-commerce

Atul Kale (Airbnb), Xiaohan Zeng (Airbnb)

Average rating:

(5.00, 3 ratings)

Atul Kale and Xiaohan Zeng offer an overview of Bighead, Airbnb's user-friendly and scalable end-to-end machine learning framework that powers Airbnb's data-driven products. Built on Python, Spark, and Kubernetes, Bighead integrates popular libraries like TensorFlow, XGBoost, and PyTorch and is designed to be used in modular pieces. Read more.

Building a recommendation engine

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 10 Level: Beginner

Sophie Watson (Red Hat)

Average rating:

(3.50, 6 ratings)

Recommender systems enhance user experience and business revenue every day. Sophie Watson demonstrates how to develop a robust recommendation engine using a microservice architecture. Read more.

Continuous machine learning over streaming data: The story continues.

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 12/14 Level: Intermediate

Secondary topics: Retail and e-commerce, Temporal data and time-series analytics

Roger Barga (Amazon Web Services), Sudipto Guha (Amazon Web Services), Kapil Chhabra (Amazon Web Services)

Average rating:

(5.00, 3 ratings)

Roger Barga, Sudipto Guha, and Kapil Chhabra explain how unsupervised learning with the robust random cut forest (RRCF) algorithm enables insights into streaming data and share new applications to impute missing values, forecast future values, detect hotspots, and perform classification tasks. They also demonstrate how to implement unsupervised learning over massive data streams. Read more.

Using the blockchain in the enterprise

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 21/22 Level: Non-technical

Secondary topics: Blockchain and decentralization, Financial Services

Jim Scott (NVIDIA)

Average rating:

(2.67, 3 ratings)

Jim Scott details relevant use cases for blockchain-based solutions across a variety of industries, focusing on a suggested architecture to achieve high-transaction-rate private blockchains and decentralized applications backed by a blockchain. Along the way, Jim compares public and private blockchain architectures. Read more.

Marmaray: A generic, scalable, and pluggable Hadoop data ingestion and dispersal framework

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines

Danny Chen (Uber Technologies), Omkar Joshi (Uber), Eric Sayle (Uber Technologies)

Average rating:

(3.80, 5 ratings)

Danny Chen, Omkar Joshi, and Eric Sayle offer an overview of Marmaray, a generic Hadoop ingestion and dispersal framework recently released to production at Uber. You'll learn how Marmaray can meet a team's data needs by ensuring that data can be reliably ingested into Hive or dispersed into online data stores and take a deep dive into the architecture to see how it all works. Read more.

A tale of two BI standards: Data warehouses and data lakes (sponsored by Arcadia Data)

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1E 06

Randy Lea (Arcadia Data)

Average rating:

(5.00, 1 rating)

The use of data lakes continues to grow, and the right business intelligence (BI) and analytics tools on data lakes are critical to data lake success. Randy Lea explains why existing BI tools work well for data warehouses but not data lakes and why every organization should have two BI standards: one for data warehouses and one for data lakes. Read more.

Building Fabric Answers using Apache Heron

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1E 07/08 Level: Beginner

Karthik Ramasamy (Streamlio), Andrew Jorgensen (Google)

Average rating:

(4.00, 1 rating)

Streaming systems like Apache Heron are being used for an increasingly broad array of applications. Karthik Ramasamy and Andrew Jorgensen offer an overview of Fabric Answers, which provides real-time insights to mobile developers to improve their product experience at Google Fabric using Apache Heron. Read more.

What's the Hadoop-la about Kubernetes?

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1E 09 Level: Advanced

Anant Chintamaneni (BlueData), Nanda Vijaydev (BlueData)

Average rating:

(5.00, 1 rating)

Kubernetes (K8s)—the open source container orchestration system for modern big data workloads—is increasingly popular. While the promised land is a unified platform for cloud-native stateless and stateful data services, stateful, multiservice big data cluster orchestration brings unique challenges. Anant Chintamaneni and Nanda Vijaydev outline the considerations for big data services for K8s. Read more.

Realizing the true value in your data: Data-drivenness assessment

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1E 10/11 Level: Beginner

Lawrence Cowan (Cicero Group)

Average rating:

(3.00, 3 ratings)

Firms are struggling to leverage their data. Lawrence Cowan outlines a methodology for assessing four critical areas that firms must consider when looking to make the analytical leap: data strategy, data culture, data analysis and implementation, and data management and architecture. Read more.

An ethical foundation for the AI-driven future

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1E 12/13 Level: Beginner

Secondary topics: Ethics and Privacy

Harry Glaser (Periscope Data)

Average rating:

(5.00, 2 ratings)

What is the moral responsibility of a data team today? As AI and machine learning technologies become part of our everyday life and as data becomes accessible to everyone, CDOs and data teams are taking on a very important moral role as the conscience of the corporation. Harry Glaser highlights the risks companies will face if they don't empower data teams to lead the way for ethical data use. Read more.

Executive Briefing: Why machine-learned models crash and burn in production and what to do about it

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1E 14 Level: Intermediate

Secondary topics: Machine Learning in the enterprise, Model lifecycle management

David Talby (Pacific AI)

Average rating:

(4.40, 5 ratings)

Machine learning and data science systems often fail in production in unexpected ways. David Talby shares real-world case studies showing why this happens and explains what you can do about it, covering best practices and lessons learned from a decade of experience building and operating such systems at Fortune 500 companies across several industries. Read more.

MLflow: An open platform to simplify the machine learning lifecycle

2:05pm–2:45pm Wednesday, 09/12/2018

Location: Expo Hall

Secondary topics: Model lifecycle management

Mani Parkhe (Databricks), Andrew Chen (Databricks)

Successfully building and deploying a machine learning model is difficult to do once. Enabling other data scientists to reproduce your pipeline, compare the results of different versions, track what's running where, and redeploy and roll back updated models is much harder. Mani Parkhe and Andrew Chen offer an overview of MLflow, a new open source project from Databricks that simplifies this process. Read more.

Kubernetes on GPUs (sponsored by NVIDIA)

2:05pm–2:45pm Wednesday, 09/12/2018

Location: 1 E15

Michael Balint (NVIDIA)

Michael Balint explains how NVIDIA employs its own distribution of Kubernetes, in conjunction with DGX hardware, to make the most efficient use of GPU resources and scale its efforts across a cluster, allowing multiple users to run experiments and push their finished work to production. Read more.

2:55pm

Use of modern data environments in telecom (sponsored by MicroStrategy)

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 01/02

Sara Alavi (Bell Canada)

Bell Canada, Canada's largest communications company, leads the industry in providing world-class broadband communications services to consumers and business customers. Join Sara Alavi to learn how the network big data and AI team within Bell is using modern data environments and applying a startup mindset to transform traditional networks into insight-driven intelligent networks. Read more.

Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda)

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 03

Mathew Lodge (Anaconda)

Average rating:

(5.00, 1 rating)

The days of deploying Java code to Hadoop and Spark data lakes for data science and ML are numbered. Welcome to the future. Containers and Kubernetes make great language-agnostic distributed computing clusters: it's just as easy to deploy Python as it is Java. Mathew Lodge shows you how. Read more.

Speed, scale, smarts: GPU-powered analytics for the extreme data economy (sponsored by Kinetica)

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 04/05

Michael Mahoney (Kinetica)

Michael Mahoney demonstrates how to leverage the power of GPUs to converge streaming data analysis, location analysis, and streamlined machine learning with a single engine. Along the way, Michael shares real-world case studies on how Kinetica is used to solve complex data challenges. Read more.

A deep learning approach for precipitation nowcasting with RNN using Analytics Zoo on BigDL

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 15/16 Level: Intermediate

Secondary topics: Deep Learning, Temporal data and time-series analytics

Alex Heye (Cray), Ding Ding (Intel)

Precipitation nowcasting is used to predict the future rainfall intensity over a relatively short timeframe. The forecasting resolution and time accuracy required are much higher than for other traditional forecasting tasks. Alexander Heye and Ding Ding explain how to build a precipitation nowcasting system with recurrent neural networks using BigDL on Apache Spark. Read more.

The big data makeover: 10 months from ideation to enterprise-scale solution (sponsored by Infoworks)

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1E 17

Chris Stirrat (Eagle Investment Systems)

Average rating:

(3.00, 1 rating)

Eagle Investment Systems, a leading provider of financial services technology, is building a new Hadoop and cloud-based data management solution. Chris Stirrat explains how Eagle went from incubation to an enterprise-scale solution in just 10 months, using a Hadoop-based big data stack and multitenant architecture, transforming software creation, delivery, quality, technology, and culture. Read more.

Perverse incentives in metrics: Inequality in the like economy

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 06/07 Level: Intermediate

Secondary topics: Ethics and Privacy, Media, Marketing, Advertising, Recommendation Systems

Bonnie Barrilleaux (LinkedIn)

Average rating:

(4.50, 4 ratings)

As LinkedIn encouraged members to join conversations, it found itself in danger of creating a "rich get richer" economy in which a few creators got an increasing share of all feedback. Bonnie Barrilleaux explains why you must regularly reevaluate metrics to avoid perverse incentives—situations where efforts to increase the metric cause unintended negative side effects. Read more.

Solving the cold start problem: Data and model aggregation using differential privacy

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 08 Level: Beginner

Secondary topics: Ethics and Privacy

Chang Liu (Georgian Partners)

Average rating:

(5.00, 1 rating)

Chang Liu offers an overview of a common problem faced by many software companies, the cold-start problem, and explains how Georgian Partners has been successful at solving this problem by transferring knowledge from existing data through differentially private data aggregation. Read more.

Optimizing Apache Impala for a cloud-based data warehouse

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 10 Level: Intermediate

Greg Rahn (Cloudera)

Average rating:

(5.00, 1 rating)

Cloud object stores are becoming the bedrock of cloud data warehouses for modern data-driven enterprises, and it's become a necessity for data teams to have the ability to directly query data stored in S3 or ADLS. Greg Rahn and Mostafa Mokhtar discuss optimal end-to-end workflows and technical considerations for using Apache Impala over object stores for your cloud data warehouse. Read more.

50 reasons to learn the shell for doing data science

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 12/14 Level: Beginner

Jeroen Janssens (Data Science Workshops)

Average rating:

(1.50, 2 ratings)

"Anyone who does not have the command line at their beck and call is really missing something," tweeted Tim O'Reilly when Jeroen Janssens's Data Science at the Command Line was recently made available online for free. Join Jeroen to learn what you're missing out on if you're not applying the command line and many of its power tools to typical data science problems. Read more.

Zipline: Airbnb's data management platform for machine learning

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 21/22 Level: Intermediate

Secondary topics: Data Platforms, Retail and e-commerce

Varant Zanoyan (Airbnb)

Average rating:

(4.33, 6 ratings)

Zipline is Airbnb’s soon-to-be-open-sourced data management platform specifically designed for ML use cases. It has cut the time needed for feature generation from months to days and offers features to support end-to-end data management for machine learning. Varant Zanoyan covers Zipline's architecture and dives into how it solves ML-specific problems. Read more.

Real-time analytics and BI with data lakes and data warehouses using Kudu, HBase, Spark, and Kafka: Lessons learned

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines

Mauricio Aristizabal (Impact)

Average rating:

(2.67, 3 ratings)

Mauricio Aristizabal shares lessons learned from migrating Impact's traditional ETL platform to a real-time platform on Hadoop (leveraging the full Cloudera EDH stack). Mauricio also discusses the company's data lake in HBase, Spark Streaming jobs (with Spark SQL), using Kudu for "fast data" BI queries, and using Kafka's data bus for loose coupling between components. Read more.

Enabling predictive maintenance using automated IoT data pipelines (sponsored by BMC)

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1E 06

Basil Faruqui (BMC Software)

Average rating:

(2.00, 1 rating)

Basil Faruqui demonstrates how to simplify the automation and orchestration of an IoT-driven data pipeline in a cloud environment where machine learning algorithms predict failures. Read more.

Streaming big data in the cloud: What to consider and why

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1E 07/08 Level: Intermediate

Bill Chambers (Databricks)

Average rating:

(3.00, 1 rating)

Streaming big data is a rapidly growing field but currently involves a lot of operational complexity and expertise. Bill Chambers shares a decision-making framework for determining the best tools and technologies for successfully deploying and maintaining streaming data pipelines to solve business problems and offers an overview of Apache Spark’s Structured Streaming processing engine. Read more.

Clouds and containers: Case studies for big data

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1E 09 Level: Beginner

Paul Curtis (Weaveworks)

Average rating:

(5.00, 2 ratings)

Once the data has been captured, how can the cloud, containers, and a data fabric combine to build the infrastructure that provides business insights? Paul Curtis explores three customer deployments that leverage the best of private clouds and containers to provide a flexible big data environment. Read more.

From strategy to implementation: Putting data to work at USA for UNHCR

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1E 10/11 Level: Non-technical

Friederike Schuur (Cloudera), Rita Ko (USA for UNHCR)

Average rating:

(5.00, 1 rating)

Friederike Schuur and Rita Ko explain how the Hive (an internal group at USA for UNHCR) and Cloudera Fast Forward Labs transformed USA for UNHCR, enabling the agency to use data science and machine learning (DS/ML) to address the refugee crisis. Along the way, they cover the development and implementation of a DS/ML strategy, identify use cases and success metrics, and showcase the value of DS/ML. Read more.

Beyond explainability: Regulating machine learning in practice

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1E 12/13 Level: Non-technical

Secondary topics: Data preparation, governance and privacy, Ethics and Privacy

Andrew Burt (bnh.ai)

Average rating:

(5.00, 2 ratings)

Machine learning is becoming prevalent across industries, creating new types of risk. Managing this risk is quickly becoming the central challenge of major organizations, one that strains data science teams, legal personnel, and the C-suite alike. Andrew Burt shares lessons from past regulations focused on similar technology along with a proposal for new ways to manage risk in ML. Read more.

Executive Briefing: Managing successful data projects—Technology selection and team building

2:55pm–3:35pm Wednesday, 09/12/2018

Location: 1E 14 Level: Intermediate

Secondary topics: Machine Learning in the enterprise, Media, Marketing, Advertising

Ted Malaska (Capital One), Jonathan Seidman (Cloudera)

Average rating:

(4.00, 3 ratings)

Creating a successful big data practice in your organization presents new challenges in managing projects and teams. Ted Malaska and Jonathan Seidman share guidance and best practices to help technical leaders deliver successful projects from planning to implementation. Read more.

Performant time series data management and analytics with Postgres

2:55pm–3:35pm Wednesday, 09/12/2018

Location: Expo Hall Level: Intermediate

Michael Freedman (TimescaleDB)

Michael Freedman explains how to leverage Postgres for high-volume time series workloads using TimescaleDB, an open source time series database built as a Postgres plug-in. Michael covers the general architectural design principles and new time series data management features, including adaptive time partitioning and near-real-time continuous aggregations. Read more.

3:35pm

3:35pm–4:35pm Wednesday, 09/12/2018

Location: 3B | Expo Hall

Afternoon Break sponsored by Intel (1h)

4:35pm

TD Bank’s journey to turn its big data environment into a true data lake (sponsored by Talend)

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 01/02

Joseph (Joe) DosSantos (TD Bank)

Average rating:

(5.00, 3 ratings)

TD Bank’s data analytics team has undertaken a multiyear journey to modernize its data infrastructure for today and future needs. Joseph DosSantos explains how the team built a governed data lake foundation, enabling business users to leverage its big data environment to extract analytical insights while minimizing risks. Read more.

Accelerate big data analytics and AI with NetApp hybrid cloud architecture (sponsored by NetApp)

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 03

Karthikeyan Nagalingam (NetApp)

As the data authority for hybrid cloud for big data analytics and AI, NetApp understands the value of the access, management, and control of data. Karthikeyan Nagalingam discusses the NetApp Data Fabric, which provides a unified data management environment that spans edge devices, data centers, and multiple hyperscale clouds using ONTAP software, all-flash systems, ONTAP Select, and cloud volumes. Read more.

Keys to operationalize enterprise 360 (sponsored by Impetus)

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 04/05

Anand Raman (Impetus Technologies)

Average rating:

(1.00, 1 rating)

Is a single source of truth across the enterprise possible, or is it just an expensive myth? Anand Raman explains why you need a holistic decision framework that addresses multiple facets from platform to processes. Join in to explore EDW modernization strategies, self-service analytics, and interactive insights on big data and discover a process to get to a unified data model. Read more.

When Tiramisu meets online fashion retail

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 15/16 Level: Intermediate

Secondary topics: Deep Learning, Media, Marketing, Advertising, Retail and e-commerce

Patty Ryan (Microsoft), CY Yam (Microsoft), Elena Terenzi (Microsoft)

Average rating:

(5.00, 1 rating)

Large online fashion retailers must efficiently maintain catalogues of millions of items. Due to human error, it's not unusual that some items have duplicate entries. Since manually trawling such a large catalogue is next to impossible, how can you find these entries? Patty Ryan, CY Yam, and Elena Terenzi explain how they applied deep learning for image segmentation and background removal. Read more.

Hadoop-compatible filesystems: The limits of "compatible" (sponsored by WANdisco)

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1E 17

Paul Scott-Murphy (WANdisco)

Average rating:

(4.50, 2 ratings)

Every organization is considering its storage options, with an eye toward the cloud. Paul Scott-Murphy explores what makes different large-scale storage systems and services unique, their clear (and unexpected) differences, the options you have to use them, and the surprises you can expect along the way. Read more.

Anxiety at scale: How Investopedia used readership data to track market volatility

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 06/07 Level: Beginner

Secondary topics: Financial Services, Text and Language processing and analysis

Masha Westerlund (Investopedia)

Average rating:

(5.00, 2 ratings)

Businesses rely on user data to power their sites, products, and sales. Can we give back by sharing those insights with users? Masha Westerlund explains how Investopedia harnessed reader data to build an index that tracks market anxiety and moves with the VIX, a proprietary measure of market volatility. You'll see how thinking outside the box helps turn data into tools for users, not stakeholders. Read more.

Programming by input-output examples

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 08 Level: Intermediate

Sumit Gulwani (Microsoft)

Programming by input-output examples (PBE) is a new frontier in AI, set to revolutionize the programming experience for the masses. It can enable end users—99% of whom are nonprogrammers—to create small scripts and make data scientists 10–100x more productive for many data wrangling tasks. Sumit Gulwani leads a deep dive into this new programming paradigm and explores the science behind it. Read more.

Setting up a lightweight distributed caching layer using Apache Arrow

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 10 Level: Intermediate

Jacques Nadeau (Dremio)

Average rating:

(5.00, 1 rating)

Jacques Nadeau leads a deep dive into a new Apache-licensed lightweight distributed in-memory cache that allows multiple applications to consume Arrow directly using the Arrow RPC and IPC protocols. You'll explore the system design and deployment architecture—including the cache life cycle, update patterns, cache cohesion, and appropriate use cases—learn how it all works, and see it in action. Read more.

VC trends in machine learning and data science

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 12/14

Sarah Catanzaro (Amplify Partners), Rama Sekhar (Norwest Venture Partners), Zavain Dar (Lux Capital), Jonathan Lehr (Work-Bench), Crystal Huang (NEA)

In this panel discussion, venture capital investors explain how startups can accelerate enterprise adoption of machine learning and explore the new tech trends that will give rise to the next transformation in the big data landscape. Read more.

How to cost-effectively and reliably build infrastructure for machine learning

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 21/22 Level: Beginner

Secondary topics: Data Platforms

Osman Sarood (Mist Systems)

Average rating:

(2.00, 1 rating)

Mist consumes several terabytes of telemetry data daily from its globally deployed wireless access points, a significant portion of which is consumed by ML algorithms. Last year, Mist saw 10x infrastructure growth. Osman Sarood explains how Mist runs 75% of its production infrastructure, reliably, on AWS EC2 spot instances, which has brought its annual AWS cost from $3 million to $1 million. Read more.

Tracking data lineage at Stitch Fix

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1A 23/24 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines, Data preparation, governance and privacy

Neelesh Salian (Stitch Fix)

Average rating:

(1.33, 3 ratings)

Neelesh Srinivas Salian explains how Stitch Fix built a service to better understand the movement and evolution of data within the company's data warehouse, from the initial ingestion from outside sources through all of its ETLs. Neelesh covers why and how Stitch Fix built the service and details some use cases. Read more.

Best practices for migrating big data workloads to Amazon Web Services (sponsored by Amazon Web Services)

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1E 06

Bruno Faria (Amazon Web Services)

Bruno Faria explains how to identify the components and workflows in your current environment and shares best practices to migrate these workloads to AWS. Read more.

AppNexus's stream-based control system for automated buying of digital ads

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1E 07/08 Level: Intermediate

Brian Wu (AppNexus)

Average rating:

(5.00, 1 rating)

Automating the success of digital ad campaigns is complicated and comes with the risk of wasting the advertiser's budget or a trader's margin and time. Brian Wu describes the evolution of Inventory Discovery, a streaming control system of eligibility, prioritization, and real-time evaluation that helps digital advertisers hit their performance goals with AppNexus. Read more.

Using machine learning to drive intelligence at the edge

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1E 09 Level: Intermediate

Secondary topics: Model lifecycle management

Dave Shuman (Cloudera), Bryan Dean (Red Hat)

The focus on the IoT is turning increasingly to the edge, and the way to make the edge more intelligent is by building machine learning models in the cloud and pushing them back out to the edge. Dave Shuman and Bryan Dean explain how Cloudera and Red Hat executed this architecture at one of Europe's leading manufacturers and share a demo highlighting it. Read more.

The lure of "the one metric that matters"

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1E 10/11 Level: Beginner

Adil Aijaz (Split Software)

Average rating:

(5.00, 1 rating)

Many products, whether data driven or not, chase “the one metric that matters.” It may be engagement, revenue, or conversion, but the common theme is the pursuit of improvement in one metric. Product development teams should instead focus on designing metrics that measure their goals. Adil Aijaz shares an approach to designing metrics and discusses best practices and common pitfalls. Read more.

Rationalizing risk in AI and ML

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1E 12/13 Level: Non-technical

Secondary topics: Ethics and Privacy, Machine Learning in the enterprise

Kimberly Nevala (SAS)

Average rating:

(5.00, 1 rating)

Too often, the discussion of AI and ML includes an expectation—if not a requirement—for infallibility. But as we know, this expectation is not realistic. So what’s a company to do? While risk can’t be eliminated, it can be rationalized. Kimberly Nevala demonstrates how an unflinching risk assessment enables AI/ML adoption and deployment. Read more.

Executive Briefing: Most data-driven cultures aren’t

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1E 14 Level: Non-technical

Secondary topics: Machine Learning in the enterprise, Media, Marketing, Advertising

Cassie Kozyrkov (Google)

Average rating:

(4.30, 10 ratings)

Many organizations aren’t aware that they have a blind spot with respect to their lack of data effectiveness, and hiring experts doesn’t seem to help. Cassie Kozyrkov examines what it takes to build a truly data-driven organizational culture and highlights a vital yet often neglected job function: the data science manager. Read more.

Architectural principles for building trusted, real-time, distributed IoT systems

4:35pm–5:15pm Wednesday, 09/12/2018

Location: Expo Hall Level: Intermediate

Secondary topics: Blockchain and decentralization, Data Platforms

Dan Harple (Context Labs)

Dan Harple explains how distributed systems are being influenced by and are influencing operational, financial, and social impact requirements of a wide range of enterprises and how trust in these distributed systems is being challenged, elevated, and resolved by engineers and architects today. Read more.

GPU-accelerated analytics and machine learning ecosystems (Inception Showcase sponsored by NVIDIA)

4:35pm–5:15pm Wednesday, 09/12/2018

Location: 1E 15

Alen Capalik (FASTDATA.io), Jim McHugh (NVIDIA), SriSatish Ambati (H2O.ai), Tim Delisle (Datalogue)

Explore case studies from Datalogue, FASTDATA.io, and H2O.ai that demonstrate how GPU-accelerated analytics, machine learning, and ETL help companies overcome slow queries and tedious data preparation processes, dynamically correlate data, and enjoy automatic feature engineering. Read more.

5:25pm

How to avoid drowning in logs: Streaming 80 billion events and batch processing 40 TB/hour (sponsored by Pure Storage)

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 01/02

Ivan Jibaja (Pure Storage)

Average rating:

(5.00, 1 rating)

Pure Storage runs over 70,000 tests per day. Using Spark’s flexible computing platform, the company can write a single application for both streaming and batch jobs so the company's team of triage engineers can understand the state of the continuous integration pipeline. Ivan Jibaja discusses the use case for big data analytics technologies, the architecture of the solution, and lessons learned. Read more.

Data for posterity: Nobody licenses or builds data just to have it. (sponsored by Pitney Bowes)

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 03

Dan Adams (Pitney Bowes)

The role of data and the demand to get it right, coupled with competitive pressures to move faster, have dramatically increased. Companies now recognize data as an asset and need to manage it that way. Join Dan Adams for the insights you need to ensure that your data addresses current and future needs and that your organization is set up for success. Read more.

How Bell Canada increased the scale of BI exponentially with OLAP on big data (sponsored by Kyvos Insights)

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 04/05

Mark Huang (Bell Canada)

Like all telecommunication giants, Bell Canada relies on huge volumes of data to make accurate business decisions and deliver better services. Mark Huang discusses why Bell Canada chose Kyvos’s OLAP on big data technology to achieve multidimensional analytics and how it helped the company meet its growing business reporting demands. Read more.

Accelerating financial data science workflows with GPUs

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 15/16 Level: Intermediate

Secondary topics: Financial Services

Joshua Patterson (NVIDIA), Onur Yilmaz (NVIDIA)

GPUs have allowed financial firms to accelerate their computationally demanding workloads. Today, the bottleneck has moved completely to ETL. The GPU Open Analytics Initiative (GoAi) is helping accelerate ETL while keeping the entire workflow on GPUs. Joshua Patterson and Onur Yilmaz discuss several GPU-accelerated data science tools and libraries. Read more.

From data lakes to the data fabric: Our vision for digital strategy (sponsored by Cambridge Semantics)

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1E 17

Sam Chance (Cambridge Semantics), Partha Bhattachargee (Cambridge Semantics)

Ben Szekely shares a vision for digital innovation: The data fabric connects enterprise data for unprecedented access in an overlay fashion that does not disrupt current investments. Interconnected and reliable data drives business outcomes by automating scalable AI and ML efforts. Graph technology is the way forward to realize this future. Read more.

Network effects: Working with modern graph analytic systems

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 06/07 Level: Intermediate

Secondary topics: Financial Services

Zachary Hanif (Capital One)

Average rating:

(4.67, 3 ratings)

An understanding of graph-based analytical techniques can be extremely powerful when applied to modern practical problems, and modern frameworks and analytical techniques are making graph analysis methods viable for increasingly large, complex tasks. Zachary Hanif examines three prominent graph analytic methods, including graph convolutional networks, and applies them to concrete use cases. Read more.

From emotion analysis and topic extraction to narrative modeling

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 08 Level: Beginner

Secondary topics: Text and Language processing and analysis

Andreea Kremm (Netex Group), Mohammed Ibraaz Syed (UCLA)

Average rating:

(4.00, 2 ratings)

Narrative economics studies the impact of popular narratives and stories on economic fluctuations in the context of human interests and emotions. Andreea Kremm and Mohammed Ibraaz Syed describe the use of emotion analysis, entity relationship extraction, and topic modeling in modeling narratives from written human communication. Read more.

From flat files to deconstructed database: The evolution and future of the big data ecosystem

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 10 Level: Intermediate

Julien Le Dem (WeWork)

Average rating:

(5.00, 1 rating)

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. Read more.

A roadmap for open data science and AI for business: Panel discussion with State Street

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 12/14 Level: Non-technical

Bethann Noble (Cloudera), Daniel Huss (State Street), Abhishek Kodi (State Street)

Average rating:

(4.00, 1 rating)

Bethann Noble, Abhishek Kodi, and Daniel Huss share their experience and best practices for designing and executing on a roadmap for open data science and AI for business. Read more.

Apache Kafka and the four challenges of production machine learning systems

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 21/22 Level: Intermediate

Secondary topics: Model lifecycle management

Jay Kreps (Confluent)

Average rating:

(4.00, 2 ratings)

Machine learning has become mainstream, and suddenly businesses everywhere are looking to build systems that use it to optimize aspects of their products, processes, or customer experience. Jay Kreps explores some of the difficulties of building production machine learning systems and explains how Apache Kafka and stream processing can help. Read more.

Circuit breakers to safeguard for garbage in, garbage out

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1A 23/24 Level: Beginner

Secondary topics: Data Integration and Data Pipelines, Financial Services

Sandeep Uttamchandani (Intuit)

Do your analysts always trust the insights generated by your data platform? Ensuring that insights are reliable is critical for use cases in the financial sector. Sandeep Uttamchandani outlines a circuit breaker pattern developed for data pipelines, similar to the common design pattern used in service architectures, that detects and corrects problems to keep insights consistently reliable. Read more.

A tale of two BI standards: Data warehouses and data lakes (sponsored by Arcadia Data)

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1E 06

Randy Lea (Arcadia Data)

Hudi: Unifying storage and serving for batch and near-real-time analytics

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1E 07/08 Level: Beginner

Secondary topics: Data Integration and Data Pipelines

Nishith Agarwal (Uber), Balaji Varadarajan (Uber), Vinoth Chandar (Apache Hudi)

Uber has a real need to provide faster, fresher data to its data consumers and products, which are running hundreds of thousands of analytical queries every day. Nishith Agarwal, Balaji Varadarajan, and Vinoth Chandar share the design, architecture, and use cases of the second generation of Hudi, an analytical storage engine designed to serve such needs and beyond. Read more.

Introducing Iceberg: Tables designed for object stores

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1E 09 Level: Intermediate

Owen O'Malley (Cloudera), Ryan Blue (Netflix)

Average rating:

(4.33, 3 ratings)

Owen O'Malley and Ryan Blue offer an overview of Iceberg, a new open source project that defines a new table layout with properties specifically designed for cloud object stores, such as S3. It provides a common set of capabilities, such as partition pruning, schema evolution, and atomic addition, removal, or replacement of files, regardless of whether the data is stored in Avro, ORC, or Parquet. Read more.

Deploying machine learning models in the enterprise

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1E 10/11 Level: Intermediate

Secondary topics: Model lifecycle management

Diego Oppenheimer (Algorithmia)

Average rating:

(4.50, 2 ratings)

After big investments in collecting and cleaning data and building machine learning (ML) models, enterprises face big challenges in deploying models to production and managing a growing portfolio of ML models. Diego Oppenheimer covers the strategic and technical hurdles each company must overcome and the best practices developed while deploying over 4,000 ML models for 70,000 engineers. Read more.

If you thought politics was dirty, you should see the analytics behind it.

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1E 12/13 Level: Non-technical

Secondary topics: Media, Marketing, Advertising

John Thuma (Arcadia Data)

Average rating:

(5.00, 1 rating)

Forget about the fake news; data and analytics in politics are what drive elections. John Thuma shares ethical dilemmas he faced while proposing analytical solutions to the RNC and DNC. Not only did he help causes he disagreed with, but he also armed politicians with real-time data to manipulate voters. Read more.

Executive Briefing: Enhance your data lake with comprehensive data governance to improve adoption and meet compliance needs

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1E 14 Level: Intermediate

Secondary topics: Data preparation, governance and privacy

Sanjeev Mohan (Gartner)

Average rating:

(5.00, 1 rating)

If the last few years were spent proving the value of data lakes, the emphasis now is to monetize the big data architecture investments. The rallying cry is to onboard new workloads efficiently. But how do you do so if you don’t know what data is in the lake, the level of its quality, or the trustworthiness of models? Sanjeev Mohan explains why data governance is the linchpin to success. Read more.

Automating business processes with large-scale knowledge graphs

5:25pm–6:05pm Wednesday, 09/12/2018

Location: Expo Hall

Secondary topics: Machine Learning in the enterprise, Text and Language processing and analysis

Mike Tung (Diffbot)

Mike Tung offers an overview of available open source and commercial knowledge graphs and explains how consumer and business applications are already taking advantage of them to provide intelligent experiences and enhanced business efficiency. Mike then discusses what's coming in the future. Read more.

Accelerate AI with synthetic data using generative adversarial networks (GAN) (sponsored by NVIDIA)

5:25pm–6:05pm Wednesday, 09/12/2018

Location: 1E 15

Renee Yao (NVIDIA)

Average rating:

(5.00, 1 rating)

Renee Yao explains how generative adversarial networks (GANs) are successfully used to improve data generation and explores specific real-world examples where customers have deployed GANs to solve challenges in the healthcare, space, transportation, and retail industries. Read more.

6:05pm

Booth Crawl

6:05pm–7:05pm Wednesday, 09/12/2018

Location: Expo Hall

Make your way from booth to booth while you check out all the exhibitors in the Expo Hall on Wednesday after sessions end. Read more.

7:30pm

Data After Dark

7:30pm–10:30pm Wednesday, 09/12/2018

Location: TAO Downtown

Average rating:

(1.00, 1 rating)

Don't miss an exciting evening filled with cocktails, food, and entertainment at Data After Dark at Strata in New York. Read more.

Thursday, 09/13/2018

8:00am

8:00am–8:45am Thursday, 09/13/2018

Location: 3E Foyer

Morning Coffee (45m)

Speed Networking

8:00am–8:30am Thursday, 09/13/2018

Location: Crystal Palace

Gather before keynotes on Thursday morning to enjoy casual conversation while meeting fellow attendees. If one of your goals at Strata is to meet new people, this session will jumpstart your networking with other attendees. Read more.

8:50am

Thursday keynotes

8:50am–9:00am Thursday, 09/13/2018

Location: 3E

Ben Lorica (O'Reilly), Doug Cutting (Cloudera), Alistair Croll (Solve For Interesting)

Average rating:

(3.50, 2 ratings)

Program chairs Ben Lorica, Doug Cutting, and Alistair Croll welcome you to the second day of keynotes. Read more.

9:00am

Sound design and the future of experience

9:00am–9:15am Thursday, 09/13/2018

Location: 3E

Amber Case (MIT Media Lab)

Average rating:

(4.65, 20 ratings)

Amber Case outlines several methods that product designers and managers can use to improve everyday interactions through an understanding and application of sound design. Read more.

9:15am

Wait. . .pizza is a vegetable? Decoding regulations using machine learning (sponsored by IBM)

9:15am–9:20am Thursday, 09/13/2018

Location: 3E

Dinesh Nirmal (IBM)

Average rating:

(2.87, 15 ratings)

IBM Analytics’s Dinesh Nirmal solves school lunch and the struggle to keep ahead of regulations. With AI tech like deep learning and NLG, supplying meals to California’s kids leaps from enriching metadata for compliance to actionable insights for the business. Read more.

9:20am

Practical ML today and tomorrow

9:20am–9:30am Thursday, 09/13/2018

Location: 3E

Hilary Mason (Cloudera Fast Forward Labs)

Average rating:

(4.00, 11 ratings)

Machine learning and artificial intelligence are exciting technologies, but real value comes from marrying those capabilities with the right business problems. Hilary Mason explores the current state of these technologies, investigates what's coming next in applied machine learning, and explains how to identify and execute on the right business opportunities at the right time. Read more.

9:30am

Derive value from analytics and AI at scale (sponsored by Intel)

9:30am–9:35am Thursday, 09/13/2018

Location: 3E

马子雅 (Ziya Ma) (Intel)

Average rating:

(3.67, 9 ratings)

Data is the fuel for analytics and AI workloads, but the challenges in using it are constant. Ziya Ma discusses how recent innovations from Intel in high-capacity persistent memory and open source software are accelerating production-scale deployments, delivering breakthrough optimizations and faster insights to a wide range of opportunities in the digital enterprise. Read more.

9:35am

Quantifying forgiveness

9:35am–9:55am Thursday, 09/13/2018

Location: 3E

Julia Angwin (ProPublica)

Average rating:

(4.95, 21 ratings)

Algorithms are increasingly arbiters of forgiveness. Julia Angwin discusses what she has learned about forgiveness in her series of articles on algorithmic accountability and the lessons we all need to learn for the coming AI future. Read more.

9:55am

Smarter cities through Geotab with BigQuery ML and geospatial analytics (sponsored by Google Cloud)

9:55am–10:00am Thursday, 09/13/2018

Location: 3E

Chad W. Jennings (Google)

Average rating:

(3.45, 11 ratings)

Cities all over the world are using data and analytics to optimize infrastructure, but city planners are often held back by outdated data gathering methods and legacy analysis tools. Chad Jennings details how Geotab, a leader in IoT fleet logistics, brought BigQuery's unique machine learning and geospatial capabilities to its existing datasets to deliver a more capable solution to city planners. Read more.

10:05am

Brain-based human-machine interfaces: New developments, legal and ethical issues, and potential uses

10:05am–10:20am Thursday, 09/13/2018

Location: 3E

Secondary topics: Ethics and Privacy

Amanda Pustilnik (University of Maryland School of Law | Center for Law, Brain & Behavior, Mass. General Hospital)

Average rating:

(4.50, 12 ratings)

Have you ever dreamed you could read minds? Do telekinesis? Maybe fly a magic carpet by thought alone? Until now, these powers have existed only in the realm of imagination or, more recently, video, AR, and VR games. Join Amanda Pustilnik to learn how brain-based human-machine interfaces are beginning to offer these powers in near-commercially-viable forms. Read more.

10:20am

The data imperative (sponsored by Zaloni)

10:20am–10:25am Thursday, 09/13/2018

Location: 3E

Ben Sharma (Zaloni)

Average rating:

(3.00, 12 ratings)

Once, a company could live 60-70 years on the S&P 500. Now it averages 15 years. If companies were people, this would be an epidemic on par with the Black Plague. But the same things that dragged humanity out of that dark age can drag companies out of this one. Read more.

10:25am

Black box: How AI will amplify the best and worst of humanity

10:25am–10:45am Thursday, 09/13/2018

Location: 3E

Secondary topics: Ethics and Privacy

Jacob Ward (CNN | Al Jazeera | PBS)

Average rating:

(4.73, 15 ratings)

For most of us, our own mind is a black box—an all-powerful and utterly mysterious device that runs our lives for us, using rules and shortcuts of which we aren’t even aware. Jacob Ward reveals the relationship between the unconscious habits of our minds and the way that AI is poised to amplify them, alter them, maybe even reprogram them. Read more.

10:50am

10:50am–11:20am Thursday, 09/13/2018

Location: 3B | Expo Hall

Morning break sponsored by IBM (30m)

11:20am

Assumptions, constraints, and risks: How the wrong assumptions can jeopardize any model (sponsored by IBM)

11:20am–12:00pm Thursday, 09/13/2018

Location: 1A 01/02

Jennifer Shin (8 Path Solutions | NYU Stern | IBM)

Common wisdom dictates that we should never make assumptions, but assumptions are essential in the creation of statistical models. Jennifer Shin explores how assumptions fit into the creation of a statistical model, the pitfalls of applying a model to data without taking the underlying assumptions into account, and how to identify datasets where the model and its assumptions are applicable. Read more.

Democratizing deep learning with transfer learning

11:20am–12:00pm Thursday, 09/13/2018

Location: 1A 15/16 Level: Beginner

Secondary topics: Deep Learning

Lars Hulstaert (Microsoft)

Average rating:

(5.00, 1 rating)

Transfer learning allows data scientists to leverage insights from large labeled datasets. The general idea of transfer learning is to use knowledge learned from tasks for which a lot of labeled data is available in settings where little labeled data is available. Lars Hulstaert explains what transfer learning is and how it can boost your NLP or CV pipelines. Read more.

Applying petabyte-scale analytics and machine learning to billions of news reading sessions

11:20am–12:00pm Thursday, 09/13/2018

Location: 1A 06/07 Level: Intermediate

Secondary topics: Media, Marketing, Advertising, Text and Language processing and analysis

Andrew Montalenti (Parse.ly)

Average rating:

(5.00, 1 rating)

What can we learn from a one-billion-person live poll of the internet? Andrew Montalenti explains how Parse.ly has gathered a unique dataset of news reading sessions of billions of devices, peaking at over two million sessions per minute on thousands of high-traffic news and information websites, and how the company uses this data to unearth the secrets behind online content. Read more.

Predicting residential occupancy and hot water usage from high-frequency, multivector utilities data

11:20am–12:00pm Thursday, 09/13/2018

Location: 1A 08 Level: Intermediate

Secondary topics: Temporal data and time-series analytics

Cris Lowery (Baringa Partners), Marc Warner (ASI)

Average rating:

(4.00, 1 rating)

In EU households, heating and hot water alone account for 80% of energy usage. Cristobal Lowery and Marc Warner explain how future home energy management systems could improve their energy efficiency by predicting resident needs through utilities data, with a particular focus on the key data features, the need for data compression, and the data quality challenges. Read more.

TonY: Native support of TensorFlow on Hadoop

11:20am–12:00pm Thursday, 09/13/2018

Location: 1A 10 Level: Intermediate

Secondary topics: Data Platforms, Deep Learning

Jonathan Hung (LinkedIn), Keqiu Hu (LinkedIn), Zhe Zhang (LinkedIn)

Jonathan Hung, Keqiu Hu, and Zhe Zhang offer an overview of TensorFlow on YARN (TonY), a framework to natively run TensorFlow on Hadoop. TonY enables running TensorFlow distributed training as a new type of Hadoop application. Its native Hadoop connector, together with other features, aims to run TensorFlow jobs as reliably and flexibly as other first-class citizens on Hadoop. Read more.

The Vega project: Building an ecosystem of tools for interactive visualization

11:20am–12:00pm Thursday, 09/13/2018

Location: 1A 12/14 Level: Beginner

Jeffrey Heer (Trifacta | University of Washington)

Average rating:

(4.75, 4 ratings)

Jeffrey Heer offers an overview of Vega and Vega-Lite—high-level declarative languages for interactive visualization that support exploratory data analysis, communication, and the development of new visualization tools. Read more.

Understanding Spark tuning with auto-tuning; or, Magical spells to stop your pager going off at 2:00am

11:20am–12:00pm Thursday, 09/13/2018

Location: 1A 21/22 Level: Intermediate

Holden Karau (Independent), Rachel Warren (Salesforce Einstein), Anya Bida (Salesforce)

Average rating:

(4.00, 2 ratings)

Apache Spark is an amazing distributed system, but part of the bargain we've made with the infrastructure daemons involves providing the correct set of magic numbers (aka tuning) or our jobs may be eaten by Cthulhu. Holden Karau, Rachel Warren, and Anya Bida explore auto-tuning jobs using systems like Apache Beam, Mahout, and internal Spark ML jobs as workloads. Read more.

Progress for big data in Kubernetes

11:20am–12:00pm Thursday, 09/13/2018

Location: 1A 23/24 Level: Advanced

Ted Dunning (MapR, now part of HPE)

Average rating:

(4.00, 4 ratings)

Stateful containers are a well-known anti-pattern, but the standard solution—managing state in a separate storage tier—is costly and complex. Recent developments have changed things dramatically for the better. In particular, you can now manage a high-performance software-defined-storage tier entirely in Kubernetes. Ted Dunning describes what's new and how it makes big data easier on Kubernetes. Read more.

Augmented data engineering: Leveraging machine learning in data profiling and discovery (sponsored by Io-Tahoe)

11:20am–12:00pm Thursday, 09/13/2018

Location: 1E 06

Arun Murugan (GE Digital), Jeff Miller (GE)

Average rating:

(2.00, 2 ratings)

Arun Murugan and Jeff Miller detail how complex relationships are discovered and modeled to simplify analytics while keeping an Agile architecture for data acquisition. You’ll see how GE uses machine learning (powered by Io-Tahoe) in data discovery and profiling as part of the data engineering work to develop a standard data model essential to enterprise use cases. Read more.

Near-real-time anomaly detection at Lyft

11:20am–12:00pm Thursday, 09/13/2018

Location: 1E 07/08 Level: Beginner

Secondary topics: Temporal data and time-series analytics, Transportation and Logistics

Thomas Weise (Lyft), Mark Grover (Lyft)

Average rating:

(2.50, 2 ratings)

Thomas Weise and Mark Grover explain how Lyft uses its streaming platform to detect and respond to anomalous events, using data science tools for machine learning and a process that allows for fast and predictable deployment. Read more.

Data discovery and lineage: Integrating streaming data in the public cloud with on-prem, classic data stores, and heterogeneous schema types

11:20am–12:00pm Thursday, 09/13/2018

Location: 1E 09 Level: Advanced

Secondary topics: Data Integration and Data Pipelines, Data preparation, governance and privacy, Media, Marketing, Advertising

Barbara Eckman (Comcast)

Average rating:

(4.33, 6 ratings)

Comcast’s streaming data platform comprises ingest, transformation, and storage services in the public cloud, with Apache Atlas for data discovery and lineage. Barbara Eckman explains how Comcast recently integrated on-prem data sources, including traditional data warehouses and RDBMSs, which required its data governance strategy to include relational and JSON schemas in addition to Apache Avro. Read more.

The care and feeding of data scientists: Concrete tips for retaining your data science team

11:20am–12:00pm Thursday, 09/13/2018

Location: 1E 10/11 Level: Non-technical

Secondary topics: Machine Learning in the enterprise, Retail and e-commerce, Transportation and Logistics

Michelangelo D'Agostino (ShopRunner)

Average rating:

(4.75, 4 ratings)

Data scientists are hard to hire. But too often, companies struggle to find the right talent only to make avoidable mistakes that cause their best data scientists to leave. From org structure and leadership to tooling, infrastructure, and more, Michelangelo D'Agostino shares concrete (and inexpensive) tips for keeping your data scientists engaged, productive, and adding business value. Read more.

Data and privacy at scale at Wikipedia

11:20am–12:00pm Thursday, 09/13/2018

Location: 1E 12/13 Level: Beginner

Secondary topics: Ethics and Privacy

Nuria Ruiz (Wikimedia)

The Wikipedia community feels strongly that you shouldn’t have to provide personal information to participate in the free knowledge movement. Nuria Ruiz discusses the challenges that this strong privacy stance poses for the Wikimedia Foundation, including how it affects data collection, and details some creative workarounds that allow WMF to calculate metrics in a privacy-conscious way. Read more.

Executive Briefing: From Business to AI—The missing pieces in becoming "AI ready"

11:20am–12:00pm Thursday, 09/13/2018

Location: 1E 14 Level: Intermediate

Secondary topics: Machine Learning in the enterprise, Retail and e-commerce

Mikio Braun (Zalando)

Average rating:

(2.75, 4 ratings)

In order to become "AI ready," an organization not only has to provide the right technical infrastructure for data collection and processing but also must learn new skills. Mikio Braun highlights three pieces companies often miss when trying to become AI ready: making the connection between business problems and AI technology, implementing AI-driven development, and running AI-based projects. Read more.

Data at Netflix: See what’s next

11:20am–12:00pm Thursday, 09/13/2018

Location: Expo Hall Level: Intermediate

Secondary topics: Data Platforms

Michelle Ufford (Netflix)

Average rating:

(4.40, 5 ratings)

Michelle Ufford shares some of the cool things Netflix is doing with data and the big bets the company is making on data infrastructure, covering workflow orchestration, machine learning, interactive notebooks, centralized alerting, event-based processing, platform intelligence, and more. Read more.

Building the bridge from big data to ML, featuring Geotab (sponsored by Google Cloud)

11:20am–12:00pm Thursday, 09/13/2018

Location: 1A 03/04/05

Bob Bradley (Geotab), Chad W. Jennings (Google)

Average rating:

(4.50, 4 ratings)

If your company isn’t good at analytics, it’s not ready for AI. Bob Bradley and Chad W. Jennings explain how the right data strategy can set you up for success in machine learning and artificial intelligence—the new ground for gaining competitive edge and creating business value. You'll then see an in-depth demonstration of Google technology from smart cities innovator Geotab. Read more.

12:00pm

Thursday Topic Tables at Lunch

12:00pm–1:10pm Thursday, 09/13/2018

Location: Expo Hall (Hall 3B)

Topic Table discussions are a great way to informally network with people in similar industries or interested in the same topics. Read more.

Thursday Business Summit Lunch

12:00pm–1:10pm Thursday, 09/13/2018

Location: 3D 09

Average rating:

(5.00, 1 rating)

Join Strata Business Summit speakers and attendees for a networking lunch on Thursday. Read more.

1:10pm

Quick, reliable, and cost-effective ways to operationalize big data apps (sponsored by Unravel)

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1A 01/02

Shivnath Babu (Unravel Data Systems | Duke University), Madhusudan Tumma (TIAA)

Average rating:

(4.00, 1 rating)

Operationalizing big data apps in a quick, reliable, and cost-effective manner remains a daunting task. Shivnath Babu and Madhusudan Tumma outline common problems and their causes and share best practices to find and fix these problems quickly and prevent such problems from happening in the first place. Read more.

A high-performance system for deep learning inference and visual inspection

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1A 15/16 Level: Intermediate

Secondary topics: Data Platforms, Deep Learning

Moty Fania (Intel), Sergei Kom (Intel)

Average rating:

(5.00, 1 rating)

Moty Fania and Sergei Kom share their experience and lessons learned implementing an AI inference platform to enable internal visual inspection use cases. The platform is based on open source technologies and was designed for real-time, streaming, and online actuation. Read more.

Spark NLP in action: How SelectData uses AI to better understand home health patients

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1A 06/07 Level: Intermediate

Secondary topics: Health and Medicine, Text and Language processing and analysis

David Talby (Pacific AI), Alberto Andreotti (John Snow Labs), Stacy Ashworth (SelectData), Tawny Nichols (SelectData)

Average rating:

(3.00, 4 ratings)

David Talby, Alberto Andreotti, Stacy Ashworth, and Tawny Nichols outline a question-answering system for accurately extracting facts from free-text patient records and share best practices for training domain-specific deep learning NLP models. The solution is based on Spark NLP, an extension of Spark ML that provides state-of-the-art performance and accuracy for natural language understanding. Read more.

Scalable machine learning for data cleaning

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1A 08 Level: Non-technical

Secondary topics: Data preparation, governance and privacy

Ihab Ilyas (University of Waterloo)

Average rating:

(5.00, 2 ratings)

Machine learning tools promise to help solve data curation problems. While the principles are well understood, the engineering details in configuring and deploying ML techniques are the biggest hurdle. Ihab Ilyas explains why leveraging data semantics and domain-specific knowledge is key in delivering the optimizations necessary for truly scalable ML curation solutions. Read more.

Deep learning on YARN: Running distributed TensorFlow, MXNet, Caffe, and XGBoost on Hadoop clusters

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1A 10 Level: Intermediate

Secondary topics: Data Platforms, Deep Learning, Model lifecycle management

Wangda Tan (Cloudera)

Average rating:

(4.50, 2 ratings)

In order to train deep learning and machine learning models, you must leverage applications such as TensorFlow, MXNet, Caffe, and XGBoost. Wangda Tan discusses new features in Apache Hadoop 3.x to better support deep learning workloads and demonstrates how to run these applications on YARN. Read more.

Augmented reality: Going beyond plots in 3D

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1A 12/14 Level: Beginner

Secondary topics: Ethics and Privacy, Financial Services, Media, Marketing, Advertising

Bob Levy (Virtual Cove, Inc.)

Average rating:

(3.00, 1 rating)

Augmented reality opens a completely new lens on your data through which you see and accomplish amazing things. Bob Levy explains how to use simple Python scripts to leverage completely new plot types. You'll explore use cases revealing new insight into financial markets data as well as new ways of interacting with data that build trust in otherwise “black box” machine learning solutions. Read more.

A/B testing at Uber: How we built a BYOM (bring your own metrics) platform

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1A 21/22 Level: Intermediate

Secondary topics: Data Platforms, Transportation and Logistics

Milene Darnis (Uber)

Average rating:

(4.22, 9 ratings)

Every new launch at Uber is vetted via robust A/B testing. Given the pace at which Uber operates, the metrics needed to assess the impact of experiments constantly evolve. Milene Darnis explains how the team built a scalable and self-serve platform that lets users plug in any metric to analyze. Read more.

Case study: A Spark-based distributed simulation optimization architecture for portfolio optimization in retail banking

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1A 23/24 Level: Intermediate

Kaushik Deka (Novantas), Ted Gibson (Novantas)

Average rating:

(4.50, 2 ratings)

Kaushik Deka and Ted Gibson share a large-scale optimization architecture in Spark for a consumer product portfolio optimization use case in retail banking. The architecture combines a simulator that distributes computation of complex real-world scenarios and a constraint optimizer that uses business rules as constraints to meet growth targets. Read more.

From two weeks in Python to two hours in Pentaho: Building modern big data pipelines for machine learning (sponsored by Hitachi Vantara)

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1E 06

Dave Huh (Hitachi Vantara), Kevin Haas (Hitachi Vantara)

Data in most organizations today is massive, messy, and often found in silos. With so many sources to analyze, data engineers need to construct robust data pipelines using automation and minimize duplicate processes, as computation is costly for big data. David Huh shares strategies to construct data pipelines for machine learning, including one to reduce time to insight from weeks to hours. Read more.

A deep dive into Kafka controller

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1E 07/08 Level: Intermediate

Jun Rao (Confluent)

Average rating:

(4.00, 1 rating)

The controller is the brain of Apache Kafka and is responsible for maintaining the consistency of the replicas. Jun Rao outlines the main data flow in the controller, then describes recent improvements to the controller that allow it to handle certain edge cases correctly and increase its performance, which allows for more partitions in a Kafka cluster. Read more.

How Komatsu is improving mining efficiencies using the IoT and machine learning

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1E 09 Level: Non-technical

Secondary topics: Transportation and Logistics

Shawn Terry (Komatsu Mining Corp)

Average rating:

(4.50, 2 ratings)

Global heavy equipment manufacturer Komatsu is using IoT data to continuously monitor some of the largest mining equipment to ultimately improve mine performance and efficiencies. Shawn Terry details the company's data journey and explains how it is using advanced analytics and predictive modeling to drive insights on terabytes of IoT data from connected mining equipment. Read more.

Best practices for migrating big data workloads to Amazon Web Services (sponsored by Amazon Web Services)

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1E 10/11

Bruno Faria (Amazon Web Services)

Average rating:

(4.00, 1 rating)

Bruno Faria explains how to identify the components and workflows in your current environment and shares best practices to migrate these workloads to AWS. Read more.

Enacting Data Subject Access Rights for GDPR with data services and data management

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1E 12/13 Level: Intermediate

Secondary topics: Data preparation, governance and privacy, Ethics and Privacy

Jean-Michel Franco (Talend)

Average rating:

(3.50, 2 ratings)

GDPR is more than another regulation to be handled by your back office. Enacting the GDPR's Data Subject Access Rights (DSAR) requires practical actions. Jean-Michel Franco outlines the practical steps to deploy governed data services. Read more.

Executive Briefing: Analytics for executives—Building an approachable language to drive data science in your organization

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1E 14 Level: Non-technical

Secondary topics: Machine Learning in the enterprise, Transportation and Logistics

Brandy Freitas (Pitney Bowes)

Average rating:

(4.50, 6 ratings)

Data science is an approachable field given the right framing. Often, though, practitioners and executives are describing opportunities using completely different languages. Join Brandy Freitas to develop context and vocabulary around data science topics to help build a culture of data within your organization. Read more.

The state of Postgres

1:10pm–1:50pm Thursday, 09/13/2018

Location: Expo Hall Level: Beginner

Umur Cubukcu (Citus Data)

PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you’ll learn how PostgreSQL’s extension APIs are fueling innovations in relational databases. Read more.

On the road to digital transformation, AI is a team sport (sponsored by Oracle + DataScience.com)

1:10pm–1:50pm Thursday, 09/13/2018

Location: 1A 03/04/05

Ian Swanson (Oracle)

Ian Swanson explores why and how data scientists and line-of-business leaders must treat AI as a team sport and explains what tools are needed to deploy models and applications that truly inform decision making. Read more.

2:00pm

Getting the most out of advanced analytics with people (sponsored by Alteryx)

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1A 01/02

Patrick Nussbaumer (Alteryx)

There is a lot of buzz around data science and machine learning in the world today. Unfortunately, to truly innovate with data and advanced capabilities, organizations need to expand their focus beyond just a few specialists. Patrick Nussbaumer details how focusing on people can help improve analytic value and drive innovation. Read more.

Job recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1A 15/16 Level: Intermediate

Secondary topics: Deep Learning, Media, Marketing, Advertising

Guoqiong Song (Intel), Wenjing Zhan (Talroo), Jacob Eisinger (Talroo)

Can the talent industry make the job search/match more relevant and personalized for a candidate by leveraging deep learning techniques? Guoqiong Song, Wenjing Zhan, and Jacob Eisinger demonstrate how to leverage distributed deep learning framework BigDL on Apache Spark to predict a candidate’s probability of applying to specific jobs based on their résumé. Read more.

Big data at speed

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1A 06/07 Level: Intermediate

Secondary topics: Transportation and Logistics

Ted Malaska (Capital One), Mark Grover (Lyft)

Many details go into building a big data system for speed: determining an acceptable latency before data can be accessed, deciding where to store the data, solving multiregion problems, and even knowing just what data you have and where stream processing fits in. Mark Grover and Ted Malaska share challenges, best practices, and lessons learned doing big data processing and analytics at scale and at speed. Read more.

Let the machines learn to improve data quality

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1A 08 Level: Intermediate

Secondary topics: Data preparation, governance and privacy, Financial Services

Archana Anandakrishnan (American Express)

Average rating:

(3.20, 5 ratings)

Building accurate machine learning models hinges on the quality of the data. Errors and anomalies get in the way of data scientists doing their best work. Archana Anandakrishnan explains how American Express created an automated, scalable system for measurement and management of data quality. The methods are modular and adaptable to any domain where accurate decisions from ML models are critical. Read more.

Kubeflow explained: Portable machine learning on Kubernetes

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1A 10 Level: Intermediate

Secondary topics: Model lifecycle management

Michelle Casbon (Google)

Average rating:

(5.00, 2 ratings)

Michelle Casbon demonstrates how to build a machine learning application with Kubeflow. Kubeflow makes it easy for everyone to develop, deploy, and manage portable, scalable ML everywhere and supports the full lifecycle of an ML product, including iteration via Jupyter notebooks. Join Michelle to find out what Kubeflow currently supports and the long-term vision for the project. Read more.

Stories beat statistics: How to master the art and science of data storytelling

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1A 12/14 Level: Non-technical

Brent Dykes (Domo)

Average rating:

(4.78, 9 ratings)

Companies collect all kinds of data and use advanced tools and techniques to find insights, but they often fail in the last mile: communicating insights effectively to drive change. Brent Dykes discusses the power that stories wield over statistics and explores the art and science of data storytelling—an essential skill in today’s data economy. Read more.

Aetna's advanced analytics platform, Data Fabric

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1A 21/22 Level: Intermediate

Secondary topics: Data Platforms, Health and Medicine

Occhio Orsini (Aetna)

Occhio Orsini offers an overview of Aetna's Data Fabric platform. Join in to learn the needs and desires that led to the creation of the advanced analytics platform, explore the platform's architecture, technology, and capabilities, and understand the key technologies and capabilities that made it possible to build a hybrid solution across on-premises and cloud-hosted data centers. Read more.

Using big data to unlock the delivery of personalized, multilingual real-time chat services for global financial service organizations

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1A 23/24 Level: Beginner

Secondary topics: Data Platforms, Financial Services

Timothy Walpole (BJSS)

Financial service clients demand increased data-driven personalization, faster insight-based decisions, and multichannel real-time access. Tim Walpole details how organizations can deliver real-time, vendor-agnostic, personalized chat services and explores issues around security, privacy, legal sign-off, data compliance, and how the internet of things can be used as a delivery platform. Read more.

From analytic silos to analytic democratization: How (and why) companies make the shift (sponsored by Dataiku)

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1E 06

Deborah Reynolds (Pfizer), Kurt Muehmel (Dataiku)

Average rating:

(4.00, 2 ratings)

By creating a collaborative and interactive analytic environment, a forward-thinking company may harness the best capabilities of its business analysts and data scientists to answer the company’s most pressing business questions. Deborah Reynolds and Kurt Muehmel explain how large enterprises can successfully put data at the core of everyday business decisions. Read more.

High-performance messaging with Apache Pulsar

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1E 07/08 Level: Beginner

Karthik Ramasamy (Streamlio), Matteo Merli (Streamlio)

Average rating:

(4.50, 2 ratings)

Apache Pulsar is being used for an increasingly broad array of data ingestion tasks. When operating at scale, it's very important to ensure that the system can make use of all the available resources. Karthik Ramasamy and Matteo Merli share insights into the design decisions and the implementation techniques that allow Pulsar to achieve high performance with strong durability guarantees. Read more.

Using Alluxio as a fault-tolerant pluggable optimization component of JD.com's compute frameworks

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1E 09 Level: Beginner

Secondary topics: Data Platforms, Retail and e-commerce, Transportation and Logistics

Tao Huang (JD.com), Mang Zhang (JD.com), Bing Bai (JD.com)

Average rating:

(3.00, 1 rating)

Tao Huang, Mang Zhang, and Bing Bai explain how JD.com uses Alluxio to provide support for ad hoc and real-time stream computing, using Alluxio-compatible HDFS URLs and Alluxio as a pluggable optimization component. For example, one framework, JDPresto, has seen a 10x performance improvement on average. Read more.

Building it beautiful: Analyzing the effectiveness of platform products and marketing at scale

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1E 10/11 Level: Beginner

Josh Laurito (Squarespace)

Joshua Laurito explores systems Squarespace built for acquiring and enforcing consistency on obtained data and for inferring conclusions from a company’s marketing and product initiatives. Joshua discusses the intricacies of gathering and evaluating marketing and user data, from raising awareness to driving purchases, and shares results of previous analyses. Read more.

Digging for gold: Developing AI in healthcare against unstructured text data

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1E 12/13 Level: Non-technical

Secondary topics: Health and Medicine, Text and Language processing and analysis

Chiny Driscoll (MetiStream), Jawad Khan (Rush University Medical Center)

Average rating:

(4.00, 5 ratings)

Chiny Driscoll and Jawad Khan offer an overview of a solution by Cloudera and MetiStream that lets healthcare providers automate the extraction, processing, and analysis of clinical notes within an electronic health record in batch or real time, improving care, identifying errors, and recognizing efficiencies in billing and diagnoses. Read more.

Executive Briefing: What you need to know about fast data

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1E 14 Level: Non-technical

Dean Wampler (Anyscale)

Streaming data systems, so-called "fast data," promise accelerated access to information, leading to new innovations and competitive advantages. But they aren't just faster versions of big data. They force architecture changes to meet new demands for reliability and dynamic scalability, more like microservices. Dean Wampler shares what you need to know to exploit fast data successfully. Read more.

Building a high-performance model serving engine from scratch using Kubernetes, GPUs, Docker, Istio, and TensorFlow

2:00pm–2:40pm Thursday, 09/13/2018

Location: Expo Hall Level: Intermediate

Secondary topics: Model lifecycle management

Chris Fregly (Amazon Web Services)

Average rating:

(3.50, 2 ratings)

Chris Fregly details a full-featured, open source end-to-end TensorFlow model training and deployment system, using the latest advancements with Kubernetes, TensorFlow, and GPUs. Read more.

Redis for velocity and volume: Fast data ingest and probabilistic data structures (sponsored by Redis Labs)

2:00pm–2:40pm Thursday, 09/13/2018

Location: 1A 03/04/05

Kyle Davis (Redis Labs)

Average rating:

(5.00, 1 rating)

Kyle Davis explains how Redis can be used for ingesting high-velocity data from large-scale platforms and IoT data collections as well as for storing and querying data using probabilistic data structures that trade some precision for both higher speed and lower storage requirements. Along the way, Kyle shares examples and a demo of the solution. Read more.

2:30pm

2:30pm–3:30pm Thursday, 09/13/2018

Location: 3B | Expo Hall

Afternoon break sponsored by Google Cloud (1h)

3:30pm

Why the internet of things doesn’t exist but will still reshape your business

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1A 01/02 Level: Intermediate

Ajay Kulkarni (TimescaleDB)

Average rating:

(4.00, 2 ratings)

Ajay Kulkarni explores the underlying changes that are characterizing the next wave of computing and shares several ways in which individual businesses and overall industries will be transformed. Read more.

Classifying job execution using deep learning

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1A 15/16 Level: Advanced

Secondary topics: Deep Learning

Ash Munshi (Pepperdata)

Ash Munshi outlines a technique for labeling applications using runtime measurements of CPU, memory, and network I/O along with a deep neural network. This labeling groups the applications into buckets that have understandable characteristics, which can then be used to reason about the cluster and its performance. Read more.

Modeling time series in R

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1A 06/07 Level: Beginner

Secondary topics: Temporal data and time-series analytics

Jared Lander (Lander Analytics)

Average rating:

(5.00, 3 ratings)

Temporal data is being produced in ever-greater quantity, but fortunately our time series capabilities are keeping pace. Jared Lander explores techniques for modeling time series, from traditional methods such as ARMA to more modern tools such as Prophet and machine learning models like XGBoost and neural nets. Along the way, Jared shares theory and code for training these models. Read more.

InnerSource for reproducible and extensible business analysis

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1A 08 Level: Non-technical

Secondary topics: Financial Services

Emily Riederer (Capital One)

Emily Riederer explains how best practices from data science, open source, and open science can solve common business pain points. Using a case example from Capital One, Emily illustrates how designing empathetic analytical tools and fostering a vibrant InnerSource community are keys to developing reproducible and extensible business analysis. Read more.

Managing data chaos in the world of microservices

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1A 10 Level: Intermediate

Oleksii Kachaiev (Attendify)

Average rating:

(3.50, 2 ratings)

When we talk about microservices, we usually focus on the communication layer. In practice, data is the much harder and often overlooked problem. Splitting applications into independent units leads to increased complexity, such as structural and semantic changes, knowledge sharing, and data discovery. Join Oleksii Kachaiev to explore emerging technologies created to tackle these challenges. Read more.

Data visualization in mixed reality with Python

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1A 12/14 Level: Beginner

Anna Nicanorova (Annalect)

Average rating:

(3.00, 3 ratings)

Data visualization is supposed to be our map to information. However, contemporary charting techniques have a few shortcomings, including context reduction, hard numeric grasp, and perceptual dehumanization. Anna Nicanorova explains how augmented reality can solve these issues by presenting an intuitive and interactive environment for data exploration. Read more.

Self-service modern analytics on the GovCloud

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1A 21/22 Level: Intermediate

Ramesh Krishnan (Lockheed Martin), Steven Morgan (Lockheed Martin)

Average rating:

(4.00, 1 rating)

Lockheed Martin is a data-driven company with a massive variety and volume of data. To extract the most value from its information assets, the company is constantly exploring ways to enable effective self-service scenarios. Ramesh Krishnan and Steve Morgan discuss Lockheed Martin's journey into modern analytics and explore its analytics platform focused on leveraging AWS GovCloud. Read more.

Cassandra versus cloud databases

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1A 23/24 Level: Beginner

Jonathan Ellis (DataStax)

Average rating:

(4.50, 2 ratings)

Is open source Apache Cassandra still relevant in an era of hosted cloud databases? Jonathan Ellis discusses Cassandra’s strengths and weaknesses relative to Amazon DynamoDB, Microsoft CosmosDB, and Google Cloud Spanner. Read more.

The importance of experimental iteration: A data-centric approach to an AI project (sponsored by Globant)

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1E 06

Antonio Fragoso (Globant)

Average rating:

(1.00, 1 rating)

Antonio Fragoso explores the key aspects of implementing a natural language processing project within your organization and reveals the necessary steps for making it a success. Antonio focuses on how to leverage an iterative process that can pave the way toward building a successful product. Read more.

Machine learning for nonstationary streaming data using Structured Streaming and StreamDM

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1E 07/08 Level: Intermediate

Secondary topics: Temporal data and time-series analytics

Heitor Murilo Gomes (Télécom ParisTech), Albert Bifet (Télécom ParisTech)

The StreamDM library provides the largest collection of data stream mining algorithms for Spark. Heitor Murilo Gomes and Albert Bifet explain how to use StreamDM and Structured Streaming to develop, apply, and evaluate learning models specifically for nonstationary streams (i.e., those with concept drift). Read more.

Kafka at PayPal: Enabling 400 billion messages a day

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1E 09 Level: Intermediate

Secondary topics: Data Integration and Data Pipelines, Data Platforms, Financial Services

Kevin Lu (PayPal), Maulin Vasavada (PayPal), Na Yang (PayPal)

Average rating:

(4.00, 3 ratings)

PayPal is one of the biggest Kafka users in the industry; it manages and maintains over 40 production Kafka clusters in three geodistributed data centers and supports 400 billion Kafka messages a day. Kevin Lu, Maulin Vasavada, and Na Yang explore the management and monitoring PayPal applies to Kafka, from client-perceived statistics to configuration management, failover, and data loss auditing. Read more.

Scaling data infrastructure in the fashion world; or, “What is this? Business intelligence for ants?”

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1E 10/11 Level: Non-technical

Secondary topics: Data Platforms, Media, Marketing, Advertising, Retail and e-commerce

Francesco Mucio (Francescomuc.io)

Average rating:

(3.50, 2 ratings)

Francesco Mucio tells the story of how Zalando went from an old-school BI company to an AI-driven company built on a solid data platform. Along the way, he shares what Zalando learned in the process and the challenges that still lie ahead. Read more.

Balancing stakeholder interests in personal data governance technology

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1E 12/13 Level: Intermediate

Secondary topics: Data preparation, governance and privacy, Ethics and Privacy

LaVonne Reimer, JD (Lumenous)

GDPR asks us to rethink personal data systems, viewing UI/UX, consent management, and value-add data services through the eyes of the data subjects. LaVonne Reimer explains why the opportunity in the $150B credit and risk industry is to deploy data governance technologies that balance individuals' interest in controlling their own data with requirements for trusted data. Read more.

Executive Briefing: Best practices for human in the loop—The business case for active learning

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1E 14 Level: Non-technical

Paco Nathan (derwen.ai)

Average rating:

(3.00, 1 rating)

Deep learning works well when you have large labeled datasets, but not every team has those assets. Paco Nathan offers an overview of active learning, an ML variant that incorporates human-in-the-loop computing. Active learning focuses input from human experts, leveraging intelligence already in the system, and provides systematic ways to explore and exploit uncertainty in your data. Read more.

Stochastic field theory for time series

3:30pm–4:10pm Thursday, 09/13/2018

Location: 1A 03/04/05 Level: Intermediate

Secondary topics: Financial Services, Temporal data and time-series analytics

Revant Nayar (FMI Technologies LLC)

Average rating:

(1.50, 2 ratings)

Machine learning has so far underperformed in time series prediction because of slowness and overfitting, and classical methods are ineffective at capturing nonlinearity. Revant Nayar shares an alternative approach that is faster and more transparent and does not overfit. It can also pick up regime changes in the time series and systematically captures all the nonlinearity of a given dataset. Read more.

4:20pm

Assumptions, constraints, and risks: How the wrong assumptions can jeopardize any model (sponsored by IBM)

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1A 01/02

Jennifer Shin (8 Path Solutions | NYU Stern | IBM)

Deep learning on audio in Azure to detect sounds in real time

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1A 15/16 Level: Beginner

Secondary topics: Deep Learning

Swetha Machanavajhala (Microsoft), Xiaoyong Zhu (Microsoft)

Average rating:

(5.00, 3 ratings)

In this auditory world, the human brain processes and reacts effortlessly to a variety of sounds. While many of us take this for granted, there are over 360 million people in the world who are deaf or hard of hearing. Swetha Machanavajhala and Xiaoyong Zhu explain how to make the auditory world inclusive and meet the great demand in other sectors by applying deep learning on audio in Azure. Read more.

Analytics maturity: Industry trends and financial impacts

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1A 06/07 Level: Non-technical

Secondary topics: Machine Learning in the enterprise

Bill Franks (International Institute For Analytics)

Drawing on a recent study of the analytics maturity level of large enterprises by the International Institute for Analytics, Bill Franks discusses how maturity varies by industry, shares key steps organizations can take to move up the maturity scale, and explains how the research correlates analytics maturity with a wide range of success metrics, including financial and reputational measures. Read more.

Infrastructure for deploying machine learning to production in large financial institutions: Lessons learned and best practices

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1A 08 Level: Intermediate

Secondary topics: Financial Services, Model lifecycle management

Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)

Large financial institutions have many data science teams (e.g., those for fraud, credit risk, and marketing), each often using a diverse set of tools to build predictive models. There are many challenges involved in productionizing these predictive AI models. Harish Doddi and Jerry Xu share challenges and lessons learned deploying AI models to production in large financial institutions. Read more.

The move to a modern data platform in the cloud: Pitfalls to avoid and best practices to follow

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1A 10 Level: Intermediate

Amandeep Khurana (Okera)

Amandeep Khurana shares critical data management practices for easy and unified data access that meets security and regulatory compliance requirements, helping you avoid the pitfalls that could lead to complex, expensive architectures. Read more.

UX strategies for underperforming analytics services and data products

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1A 12/14 Level: Non-technical

Secondary topics: Machine Learning in the enterprise

Brian O'Neill (Designing for Analytics)

Average rating:

(5.00, 5 ratings)

Gartner says 85%+ of big data projects will fail, despite the fact that your company may have invested millions in engineering implementation. Why are customers and employees not engaging with these products and services? Brian O'Neill explains why a "people first, technology second" mission (a design strategy, in other words) enables the best UX and business outcomes possible. Read more.

Building turnkey recommendations for 5% of internet video

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1A 21/22 Level: Intermediate

Secondary topics: Deep Learning, Media, Marketing, Advertising, Recommendation Systems

Nir Yungster (JW Player), Kamil Sindi (JW Player)

JW Player—the world’s largest network-independent video platform, representing 5% of global internet video—provides on-demand recommendations as a service to thousands of media publishers. Nir Yungster and Kamil Sindi explain how the company is systematically improving model performance while navigating the many engineering challenges and unique needs of the diverse publishers it serves. Read more.

Best practices for developing an enterprise data hub to collect and analyze 1 TB of data a day from multiple services with Apache Kafka and Google Cloud Platform

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1A 23/24 Level: Beginner

Secondary topics: Data Integration and Data Pipelines

Kenji Hayashida (Recruit Lifestyle co., ltd.), Toru Sasaki (NTT DATA Corporation)

Average rating:

(4.50, 2 ratings)

Recruit Group and NTT DATA Corporation have developed a platform based on a data hub, utilizing Apache Kafka. This platform can handle around 1 TB/day of application logs generated by a number of services in Recruit Group. Kenji Hayashida and Toru Sasaki share best practices for and lessons learned about topics such as schema evolution and network architecture. Read more.

IoT edge processing with Apache NiFi, Apache MiNiFi, and multiple deep learning libraries

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1E 07/08 Level: Beginner

Timothy Spann (Cloudera)

Average rating:

(4.00, 2 ratings)

Timothy Spann leads a hands-on deep dive into using Apache MiNiFi with Apache MXNet and other deep learning libraries on edge devices. Read more.

TuneIn: How to get your jobs tuned while you are sleeping

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1E 09 Level: Intermediate

Manoj Kumar (LinkedIn), Pralabh Kumar (LinkedIn), Arpan Agrawal (LinkedIn)

Average rating:

(5.00, 1 rating)

Have you ever tuned a Spark or MR job? If the answer is yes, you already know how difficult it is to tune more than a hundred parameters to optimize the resources used. Manoj Kumar, Pralabh Kumar, and Arpan Agrawal offer an overview of TuneIn, an auto-tuning tool developed to minimize the resource usage of jobs. Experiments have shown up to a 50% reduction in resource usage. Read more.

Real-time machine intelligence in IndyCar and Tour de France

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1E 10/11 Level: Beginner

Secondary topics: Transportation and Logistics

Yasuyuki Kataoka (NTT Innovation Institute, Inc.)

Average rating:

(3.00, 4 ratings)

One of the challenges of sports data analytics is how to deliver machine intelligence that goes beyond a mere real-time monitoring tool. Yasuyuki Kataoka highlights various real-time machine learning models in both IndyCar and the Tour de France, sharing real-time data processing architectures, machine learning models, and demonstrations that deliver meaningful insights for players and fans. Read more.

A day in the life of a data scientist: How do we train our teams to get started with AI?

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1E 12/13 Level: Beginner

Secondary topics: Machine Learning in the enterprise

Francesca Lazzeri (Microsoft), Jaya Susan Mathew (Microsoft)

Average rating:

(2.67, 6 ratings)

With the growing buzz around data science, many professionals want to learn how to become a data scientist—the role Harvard Business Review called the "sexiest job of the 21st century." Francesca Lazzeri and Jaya Mathew explain what it takes to become a data scientist and how artificial intelligence solutions have started to reinvent businesses. Read more.

Conda, Docker, and Kubernetes: The cloud-native future of data science (sponsored by Anaconda)

4:20pm–5:00pm Thursday, 09/13/2018

Location: 1E 14

Mathew Lodge (Anaconda)

Average rating:

(5.00, 1 rating)

Sponsorship Opportunities

For exhibition and sponsorship opportunities, email strataconf@oreilly.com

Partner Opportunities

For information on trade opportunities with O'Reilly conferences, email partners@oreilly.com

Contact Us

View a complete list of Strata Data Conference contacts

©2018, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • confreg@oreilly.com
