Presented By O’Reilly and Cloudera
Make Data Work
March 5–6, 2018: Training
March 6–8, 2018: Tutorials & Conference
San Jose, CA

Speaker slides & video

Presentation slides will be made available after the session has concluded and the speaker has given us the files. Check back if you don't see the file you're looking for—it might be available later! (However, please note some speakers choose not to share their presentations.)

Kurt Brown (Netflix)
Kurt Brown explains how to get the most out of your data infrastructure with 20 principles and practices used at Netflix. Kurt covers each in detail and explores how they relate to the technologies used at Netflix, including S3, Spark, Presto, Druid, R, Python, and Jupyter.
Ronny Kohavi (Microsoft), Alex Deng (Microsoft), Somit Gupta (Microsoft), Paul Raff (Microsoft)
Controlled experiments such as A/B tests have revolutionized the way software is being developed, allowing real users to objectively evaluate new ideas. Ronny Kohavi, Alex Deng, Somit Gupta, and Paul Raff lead an introduction to A/B testing and share lessons learned from one of the largest A/B testing platforms on the planet, running at Microsoft, which executes over 10K experiments a year.
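The core computation behind evaluating an A/B test like those described above can be sketched as a two-proportion z-test on conversion rates. This is a minimal illustration with invented numbers, not Microsoft's platform.

```python
# A minimal sketch of the statistic behind an A/B test: a two-proportion
# z-test comparing conversion rates between control and treatment.
# All numbers below are made up for illustration.
import math

def two_proportion_z(success_a, total_a, success_b, total_b):
    """Return the z statistic for the difference in conversion rates."""
    p_a = success_a / total_a
    p_b = success_b / total_b
    p_pool = (success_a + success_b) / (total_a + total_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    return (p_b - p_a) / se

# Control: 1,000 of 20,000 users converted; treatment: 1,150 of 20,000.
z = two_proportion_z(1000, 20000, 1150, 20000)
print(round(z, 2))  # prints 3.33; |z| > 1.96 suggests significance at the 5% level
```

Real experimentation platforms layer much more on top (variance reduction, sequential testing, metric guardrails), but this is the statistical primitive underneath.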
Kevin Huiskes (Intel), Radhika Rangarajan (Intel)
Advanced analytics and AI workloads require a scalable and optimized architecture, from hardware and storage to software and applications. Kevin Huiskes and Radhika Rangarajan share best practices for accelerating analytics and AI and explain how businesses globally are leveraging Intel’s technology portfolio, along with optimized frameworks and libraries, to build AI workloads at scale.
Sergey Ermolin (Intel), Shivaram Venkataraman (Microsoft Research)
The BigDL framework scales deep learning for large datasets using Apache Spark. However, there is significant scheduling overhead from Spark when running BigDL at large scale. Shivaram Venkataraman and Sergey Ermolin outline a new parameter manager implementation that, along with coarse-grained scheduling, can provide significant speedups for deep learning models like Inception and VGG.
Kinnary Jangla (Pinterest)
Having trouble coordinating development of your production ML system between a team of developers? Microservices drifting and causing problems during debugging? Kinnary Jangla explains how Pinterest dockerized the services powering its home feed and how it impacted the engineering productivity of the company's ML teams while increasing uptime and ease of deployment.
Ajay Kumar Mothukuri (Sapient), Vijay Agneeswaran (Walmart Labs)
Ajay Mothukuri and Vijay Srinivas Agneeswaran explain how to use open source blockchain technologies such as Hyperledger to implement the European Union's General Data Protection Regulation (GDPR).
Or Herman-Saffar (Dell), Ran Taig (Dell EMC)
What if we could predict when and where crimes will be committed? Or Herman-Saffar and Ran Taig offer an overview of Crimes in Chicago, a publicly published dataset of reported incidents of crime that have occurred in Chicago since 2001. Or and Ran explain how to use this data to explore committed crimes to find interesting trends and make predictions for the future.
Adam Ahringer (Disney-ABC TV Digital Media)
Adam Ahringer explains how Disney-ABC TV leverages Amazon Kinesis and MemSQL to provide real-time insights based on user telemetry as well as the platform for traditional data warehousing activities.
Brooke Wenig (Databricks)
Brooke Wenig walks you through the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, and Spark’s streaming capabilities and machine learning APIs.
Debasish Ghosh (Lightbend)
Debasish Ghosh explores the role that approximation data structures play in processing streaming data. Typically, streams are unbounded in space and time, and processing has to be done online using sublinear space. Debasish covers the probabilistic bounds that these data structures offer and shows how they can be used to implement solutions for fast and streaming architectures.
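One of the sublinear-space structures a talk like this typically covers is the count-min sketch, which approximates per-item counts over an unbounded stream in a fixed-size table. The sketch below is an illustrative toy (width, depth, and the hash scheme are arbitrary choices), not the speaker's implementation.

```python
# A minimal count-min sketch: approximate per-item frequency counts over
# a stream using O(width * depth) space, regardless of stream length.
import hashlib

class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hash-derived column per row; md5 keeps this self-contained.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # Never undercounts: true count <= estimate, with a probabilistic
        # bound on the overcount that shrinks as width/depth grow.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for word in ["spark"] * 100 + ["flink"] * 3:
    cms.add(word)
print(cms.estimate("spark"))  # at least 100; usually exactly 100
```

The one-sided error bound (estimates never undercount) is exactly the kind of probabilistic guarantee these structures trade space for.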
Santosh Rao (NetApp)
Santosh Rao explores the architecture of a data pipeline from edge to core to cloud, across various data sources and processing engines, and explains how to build a solution architecture that enables businesses to maximize competitive differentiation by unifying data insights in compelling yet efficient ways.
Jiao(Jennie) Wang (Intel), Valentina Pedoia (UCSF), Berk Norman (UCSF), Yulia Tell (Intel)
Damage to the meniscus is a physically limiting injury that can lead to further medical complications. Automatically classifying this damage at the time of an MRI scan would allow quicker and more accurate diagnosis. Jennie Wang, Valentina Pedoia, Berk Norman, and Yulia Tell offer an overview of their classification system built with 3D convolutional neural networks using BigDL on Apache Spark.
Wayde Fleener (General Mills)
Decision makers are busy. Businesses can hire people to analyze data for them, but most companies are resource constrained and can’t hire a small army to look through all their data. Wayde Fleener explains how General Mills implemented automation to enable decision makers to quickly focus on the metrics that matter and cut through everything else that does not.
Divya Ramachandran (Captricity)
Divya Ramachandran explains how top insurance companies have used handwriting transcription powered by deep learning to achieve a more than 70% reduction in daily operational processing time, develop a best-in-industry predictive model for assessing mortality risk from decades of archived forms, and enable a smarter claims leakage review, which led to a 10x ROI in its first year.
Joseph Bradley (Databricks)
Joseph Bradley discusses common paths to productionizing Apache Spark MLlib models and shares engineering challenges and corresponding best practices. Along the way, Joseph covers several deployment scenarios, including batch scoring, Structured Streaming, and real-time low-latency serving.
Burcu Baran (LinkedIn), Wei Di (LinkedIn), Michael Li (LinkedIn), Chi-Yi Kuan (LinkedIn)
Burcu Baran, Wei Di, Michael Li, and Chi-Yi Kuan walk you through the big data analytics and data science lifecycle and share their experience and lessons learned leveraging advanced analytics and machine learning techniques such as predictive modeling to drive and grow business at LinkedIn.
Felix Gorodishter (GoDaddy)
GoDaddy ingests and analyzes over 100,000 data points per second. Felix Gorodishter discusses the company's big data journey from ingest to automation, how it is evolving its systems to scale to over 10 TB of new data per day, and how it uses tools like anomaly detection to produce valuable insights, such as the worth of a reminder email.
Mauro Damo (Dell EMC), Wei Lin (Dell EMC)
Image recognition classification of diseases will minimize the possibility of medical mistakes, improve patient treatment, and speed up patient diagnosis. Mauro Damo and Wei Lin offer an overview of an approach to identify bladder cancer in patients using unsupervised and supervised machine learning techniques on more than 5,000 magnetic resonance images from the Cancer Imaging Archive.
Katie Malone (Civis Analytics), Skipper Seabold (Civis Analytics)
A huge challenge for data science managers is determining priorities for their teams, which often have more good ideas than they have time. Katie Malone and Skipper Seabold share a framework that their large and diverse data science team uses to identify, discuss, select, and manage data science projects for a fast-moving startup.
Simon Hughes, Yuri Bykov
Simon Hughes and Yuri Bykov's team recently released several free career advisory tools for technology professionals, including a salary predictor, a tool that recommends the next skills to learn, and a career path explorer. Simon and Yuri offer an overview of the machine learning algorithms behind these tools and the technologies used to build, deploy, and monitor these solutions in production.
Ash Munshi (Pepperdata)
Ash Munshi shares techniques for labeling big data apps using runtime measurements of CPU, memory, I/O, and network and details a deep neural network to help operators understand the types of apps running on the cluster and better predict runtimes, tune resource utilization, and increase efficiency. These methods are new and represent a first approach to classifying apps based on multivariate time series.
Vlad A Ionescu (ShiftLeft), Fabian Yamaguchi (ShiftLeft)
Vlad Ionescu and Fabian Yamaguchi outline Code Property Graph (CPG), a unique approach that allows the functional elements of code to be represented in an interconnected graph of data and control flows, which enables semantic information about code to be stored scalably in distributed graph databases over the web while allowing it to be rapidly accessed.
Roger Barga (Amazon Web Services), Nina Mishra (Amazon Web Services), Sudipto Guha (Amazon Web Services), Ryan Nienhuis (Amazon Web Services)
Roger Barga, Nina Mishra, Sudipto Guha, and Ryan Nienhuis detail continuous machine learning algorithms that discover useful information in streaming data. They focus on explainable machine learning, including anomaly detection with attribution, the ability to reduce false positives through user feedback, and the detection of anomalies in directed graphs.
Nancy Lublin (Crisis Text Line)
Nancy Lublin shares insights from Crisis Text Line.
Tomer Kaftan (University of Washington)
Tomer Kaftan offers an overview of Cuttlefish, a lightweight framework prototyped in Apache Spark that helps developers adaptively improve the performance of their data processing applications by inserting a few library calls into their code. These calls construct tuning primitives that use reinforcement learning to adaptively modify execution as they observe application performance over time.
Jennifer Webb (SuprFanz)
Jennifer Webb explains how cloud-based marketing company SuprFanz uses data science techniques and graph theory with Neo4j to generate live event attendance from social media platforms, email, and SMS.
Alex Smola (Amazon)
Alex Smola shares lessons learned from Amazon SageMaker, an integrated framework for handling all stages of analysis. SageMaker uses open source components such as Jupyter, Docker containers, and Python, along with well-established deep learning frameworks such as Apache MXNet and TensorFlow, for an easy-to-learn workflow.
Marcin Pilarczyk (Ryanair)
Managing fuel at a company flying 120 million passengers yearly is not a trivial task. Marcin Pilarczyk explores the main aspects of fuel management of a modern airline and offers an overview of machine learning methods supporting long-term planning and daily decisions.
Mohamed AbdelHady (Microsoft), Zoran Dzunic (Microsoft)
Mohamed AbdelHady and Zoran Dzunic demonstrate how to build a domain-specific entity extraction system from unstructured text using deep learning. In the model, domain-specific word embedding vectors are trained on a Spark cluster using millions of PubMed abstracts and then used as features to train an LSTM recurrent neural network for entity extraction.
Abhishek Kumar (Publicis Sapient), Vijay Agneeswaran (Walmart Labs)
Abhishek Kumar and Vijay Srinivas Agneeswaran offer an introduction to deep learning-based recommendation and learning-to-rank systems using TensorFlow. You'll learn how to build a recommender system based on intent prediction using deep learning that is based on a real-world implementation for an ecommerce client.
Natalie Evans Harris (BrightHive)
Natalie Evans Harris explores the Community Principles on Ethical Data Practices (CPEDP), a community-driven code of ethics for data collection, sharing, and utilization that provides people in the data science community a standard set of easily digestible, recognizable principles for guiding their behaviors.
Ron Bodkin (Google), Brian Foo (Google)
TensorFlow and Keras are popular libraries for machine learning because of their support for deep learning and GPU deployment. Join Ron Bodkin and Brian Foo to learn how to execute these libraries in production with vision and recommendation models and how to export, package, deploy, optimize, serve, monitor, and test models using Docker and TensorFlow Serving in Kubernetes.
Andrea Pasqua (Uber), Anny Chen (Uber)
Time series forecasting and anomaly detection are of utmost importance at Uber. However, the scale of the problem, the need for speed, and the importance of accuracy make anomaly detection a challenging data science problem. Andrea Pasqua and Anny Chen explain how the use of recurrent neural networks is allowing Uber to meet this challenge.
Eric Colson (Stitch Fix)
While companies often use data science as a supportive function, the emergence of new business models has made it possible for some companies to differentiate via data science. Eric Colson explores what it means to differentiate by data science and explains why companies must now think very differently about the role and placement of data science in the organization.
Balasubramanian Narasimhan (Stanford University), John-Mark Agosta (Microsoft), Philip Lavori (Stanford University)
Clinical collaboration benefits from pooling data to train models from large datasets, but it's hampered by concerns about sharing data. Balasubramanian Narasimhan, John-Mark Agosta, and Philip Lavori outline a privacy-preserving alternative that creates statistical models equivalent to one from the entire dataset.
Dong Meng (MapR)
Deep learning model performance relies on underlying data. Dong Meng offers an overview of a converged data platform that serves as a data infrastructure, providing a distributed filesystem, key-value storage and streams, and Kubernetes as an orchestration layer to manage containers to train and deploy deep learning models using GPU clusters.
Matteo Merli (Streamlio)
Traditionally, messaging systems have offered at-least-once delivery semantics, leaving the task of implementing idempotent processing to the application developers. Matteo Merli explains how to add effectively once semantics to Apache Pulsar using a message deduplication layer that can ensure those stricter semantics with guaranteed accuracy and no performance penalty.
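The idea behind a deduplication layer like the one described above can be sketched in a few lines: the broker remembers the highest sequence ID accepted per producer and silently drops redelivered messages. This is an illustration of the concept only, not Pulsar's implementation; all names are hypothetical.

```python
# A toy dedup layer: producers attach monotonically increasing sequence
# IDs, and retries of an already-accepted ID are dropped, turning
# at-least-once delivery into effectively-once processing.
class DedupLayer:
    def __init__(self):
        self.last_seq = {}   # producer id -> highest sequence id accepted
        self.log = []        # messages actually applied

    def receive(self, producer_id, seq_id, payload):
        if seq_id <= self.last_seq.get(producer_id, -1):
            return False     # duplicate retry: drop it
        self.last_seq[producer_id] = seq_id
        self.log.append(payload)
        return True

layer = DedupLayer()
layer.receive("p1", 0, "a")
layer.receive("p1", 1, "b")
layer.receive("p1", 1, "b")  # producer retried after a timeout
print(layer.log)  # prints ['a', 'b'] -- the retry was deduplicated
```

The hard parts in a real system are making `last_seq` durable and consistent across broker failover, which is what the talk's "guaranteed accuracy" claim is about.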
Stephen O'Sullivan (Data Whisperers)
Stephen O'Sullivan takes you along the data science journey, from onboarding data (using a number of data/object stores) to understanding and choosing the right data format for the data assets to using query engines (and basic query tuning). You'll learn some new skills to help you be more productive and reduce contention with the data engineering team.
Mark Madsen (Teradata), Shant Hovsepian (Arcadia Data)
There are 70+ BI tools in the market and a dozen or more SQL- or OLAP-on-Hadoop open source projects. Mark Madsen and Shant Hovsepian outline the trade-offs between a number of architectures that provide self-service access to data and discuss the pros and cons of architectures, deployment strategies, and examples of BI on big data.
Mark Donsky (Okera), Steven Ross (Cloudera)
In May 2018, the General Data Protection Regulation (GDPR) goes into effect for firms doing business in the EU, but many companies aren't prepared for the strict regulation or fines for noncompliance (up to €20 million or 4% of global annual revenue). Mark Donsky and Steven Ross outline the capabilities your data environment needs to simplify compliance with GDPR and future regulations.
This talk provides a quick overview of graph theory and the graph algorithms available in Neo4j.
Katie Malone (Civis Analytics)
The 2012 Obama campaign ran the first personalized presidential campaign in history. The data team was made up of people from diverse backgrounds who embraced data science in service of the goal. Civis Analytics emerged from this team and today enables organizations to use the same methods outside politics. Katie Malone shares lessons learned from these experiences for building effective teams.
Tendu Yogurtcu (Syncsort)
Chefs must be able to trust the authenticity, quality, and origin of their ingredients; data analysts must be able to do the same of their data—and what happens to it along the way. Tendü Yoğurtçu explains how to seamlessly track the lineage and quality of your data—on and off the cluster, on-premises or in the cloud—to deliver meaningful insights and meet regulatory compliance requirements.
Mark Donsky (Okera), Andre Araujo (Cloudera), Syed Rafice (Cloudera), Mubashir Kazia (Cloudera)
New regulations are driving compliance, governance, and security challenges for big data, and infosec and security groups must ensure a consistently secured and governed environment across multiple workloads that span a variety of deployments. Mark Donsky, Andre Araujo, Syed Rafice, and Mubashir Kazia walk you through securing a Hadoop cluster, with special attention to GDPR.
Kimoon Kim (Pepperdata), Ilan Filonenko (Bloomberg LP)
There is growing interest in running Spark natively on Kubernetes, and Spark data is often stored in HDFS. Kimoon Kim and Ilan Filonenko explain how to make Spark on Kubernetes work seamlessly with HDFS by addressing challenges such as HDFS data locality and secure HDFS support.
Szehon Ho (Criteo), Pawel Szostek (Criteo)
Hive is the main data transformation tool at Criteo, and hundreds of analysts and thousands of automated jobs run Hive queries every day. Szehon Ho and Pawel Szostek discuss the evolution of Criteo's Hive platform from an error-prone add-on installed on some spare machines to a best-in-class installation capable of self-healing and automatically scaling to handle its growing load.
Jordan Hambleton (Cloudera), GuruDharmateja Medasani (Domino Data Lab)
When Kafka stream processing pipelines fail, they can leave users panicked about data loss when restarting their application. Jordan Hambleton and Guru Medasani explain how offset management provides users the ability to restore the state of the stream throughout its lifecycle, deal with unexpected failure, and improve accuracy of results.
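The offset-management pattern the abstract describes can be sketched schematically: commit offsets to durable storage only after processing, and on restart resume from the last committed offset. Kafka client details are deliberately elided here; the class and store below are hypothetical stand-ins that mimic the control flow.

```python
# A schematic of offset checkpointing: offsets are committed to a
# durable store after a batch is processed, so a restarted consumer
# resumes exactly where the previous run left off.
class CheckpointedConsumer:
    def __init__(self, store):
        self.store = store                       # durable offset store

    def starting_offset(self, topic, partition):
        return self.store.get((topic, partition), 0)

    def process_batch(self, topic, partition, messages):
        offset = self.starting_offset(topic, partition)
        results = []
        for msg in messages[offset:]:
            results.append(msg.upper())          # stand-in for real work
            offset += 1
        self.store[(topic, partition)] = offset  # commit after processing
        return results

store = {}
consumer = CheckpointedConsumer(store)
consumer.process_batch("events", 0, ["a", "b"])
# Simulate a crash and restart: a new consumer reuses the same store.
restarted = CheckpointedConsumer(store)
print(restarted.process_batch("events", 0, ["a", "b", "c"]))  # prints ['C']
```

Committing after processing gives at-least-once behavior on a crash mid-batch; committing offsets atomically with results is what upgrades this to exactly-once state restoration.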
Thomas Phelan (HPE BlueData)
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). However, TDE can be difficult to configure and manage—issues that are only compounded when running on Docker containers. Thomas Phelan explores these challenges and how to overcome them.
Juan Yu (Cloudera)
Apache Impala (incubating) is an exceptional, best-of-breed massively parallel processing SQL query engine that is a fundamental component of the big data software stack. Juan Yu demystifies the cost model the Impala planner uses and how Impala optimizes queries and explains how to identify performance bottlenecks through query plans and profiles and how to drive Impala to its full potential.
Pramit Choudhary
Pramit Choudhary explores the usefulness of a generative approach that applies Bayesian inference to generate human-interpretable decision sets in the form of "if...and...else" statements. These human-interpretable decision lists with high posterior probabilities might be the right way to balance model interpretability, performance, and computation.
Sergey Ermolin (Intel), Suqiang Song (Mastercard)
Sergey Ermolin and Suqiang Song demonstrate how to use Spark BigDL wide and deep and neural collaborative filtering (NCF) algorithms to predict a user’s probability of shopping at a particular offer merchant during a campaign period. Along the way, they compare the deep learning results with those obtained by MLlib’s alternating least squares (ALS) approach.
Ajey Gore (GO-JEK)
Ajey Gore details GO-JEK's evolution from a small bike-hailing startup to a technology-focused unicorn in the areas of transportation, lifestyle, payments, and social enterprise and explains how the company is focusing its attention beyond urban Indonesia to impact more than a million people across the country's rural areas.
In the era of device addiction, the attention economy, and surveillance capitalism, this initiative will enable new levels of organizational data transparency and user data control through the formation of a meta-community focused on existing and new open source projects throughout the modern data pipeline (streaming, queuing, transformation, integration, database, analytics, postprocessing, storage, networking).
Dean Wampler (Anyscale)
Dean Wampler compares and contrasts data processing with Akka Streams and Kafka Streams, microservice streaming applications based on Kafka. Dean discusses the strengths and weaknesses of each tool for particular design needs and contrasts them with Spark Streaming and Flink, so you'll know when to choose them instead.
Seth Stephens-Davidowitz (New York Times)
Seth Stephens-Davidowitz explains how to use Google searches to uncover behaviors or attitudes that may be hidden from traditional surveys, such as racism, sexuality, child abuse, and abortion.
Harish Doddi (Datatron), Jerry Xu (Datatron Technologies)
Deploying machine learning models and deep learning models in production is hard. Harish Doddi and Jerry Xu outline the enterprise data science lifecycle, covering how production model deployment flow works, challenges, best practices, and lessons learned. Along the way, they explain why monitoring models in production should be mandatory.
Chris Chapo (Gap Inc.)
Chris Chapo walks you through real-world examples of companies that are driving transformational change by leveraging data science and analytics, paying particular attention to established organizations where these capabilities are newer concepts.
Shenghu Yang (Lyft)
Lyft’s business has grown over 100x in the past four years. Shenghu Yang explains how Lyft’s data pipeline has evolved over the years to serve its ever-growing analytics use cases, migrating from the world's largest AWS Redshift clusters to Apache Hive and Presto to overcome hard limits on scalability and concurrency.
Manu Mukerji (8x8)
Acme Corporation is a global leader in commerce marketing. Manu Mukerji walks you through Acme Corporation's machine learning example for universal catalogs, explaining how the training and test sets are generated and annotated; how the model is pushed to production, automatically evaluated, and used; production issues that arise when applying ML at scale in production; lessons learned; and more.
Delip Rao (AI Foundation), Brian McMahan (Wells Fargo)
PyTorch is a recent deep learning framework from Facebook that is gaining massive momentum in the deep learning community. Its fundamentally flexible design makes building and debugging models straightforward, simple, and fun. Delip Rao and Brian McMahan walk you through PyTorch's capabilities and demonstrate how to use PyTorch to build deep learning models and apply them to real-world problems.
Emre Velipasaoglu (Lightbend)
Most machine learning algorithms are designed to work on stationary data, but real-life streaming data is rarely stationary. Models lose prediction accuracy over time if they are not retrained. Without model quality monitoring, retraining decisions are suboptimal and costly. Emre Velipasaoglu evaluates monitoring methods for applicability in modern fast data and streaming applications.
Ram Sriharsha (Databricks)
How do you scale geospatial analytics on big data? And while you're at it, can you make it easy to use while achieving state-of-the-art performance on a single node? Ram Sriharsha offers an overview of Magellan—a geospatial optimization engine that seamlessly integrates with Spark—and explains how it provides scalability and performance without sacrificing simplicity.
Li Fan (Pinterest)
Li Fan shares insights into how Pinterest improves products based on usage and explains how the company is using AI to predict what’s in an image, what a user wants, and what they’ll want next, answering subjective questions better than machines or humans alone could.
Edwina Lu (LinkedIn), Ye Zhou (LinkedIn), Min Shen (LinkedIn)
Spark applications need to be well tuned so that individual applications run quickly and reliably and cluster resources are efficiently utilized. Edwina Lu, Ye Zhou, and Min Shen outline a fast, reliable, and automated process used at LinkedIn for tuning Spark applications, enabling users to quickly identify and fix problems.
Karthik Ramasamy (Streamlio), Sanjeev Kulkarni (Streamlio), Sijie Guo (StreamNative), Arun Kejariwal (Independent)
Across diverse segments in industry, there has been a shift in focus from big data to fast data. Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them.
Henry Cai (Pinterest), Yi Yin (Pinterest)
With the rise of large-scale real-time computation, there is a growing need to link legacy MySQL systems with real-time platforms. Henry Cai and Yi Yin offer an overview of WaterMill, Pinterest's continuous DB ingestion system for streaming SQL data into near-real-time computation pipelines to support dynamic personalized recommendations and search indices.
Goodman Gu (Cogito)
Machine learning is a pivotal technology. However, bringing an ML application to life often requires overcoming bottlenecks not just in the model code but in operationalizing the end-to-end system itself. Goodman Gu shares a case study from a leading SaaS company that quickly and easily built, trained, optimized, and deployed an XGBoost churn prediction ML app at scale with Amazon SageMaker.
Francesca Lazzeri (Microsoft), Fidan Boylu Uz (Microsoft)
Francesca Lazzeri and Fidan Boylu Uz explain how to operationalize LSTM networks to predict the remaining useful life of aircraft engines. They use simulated aircraft sensor values to predict when an aircraft engine will fail in the future so that maintenance can be planned in advance.
Dinesh Nirmal (IBM)
Machine learning research and incubation projects are everywhere, but less common, and far more valuable, is the innovation unlocked once you bring machine learning out of research and into production. Dinesh Nirmal explains how real-world machine learning reveals assumptions embedded in business processes and in the models themselves that cause expensive and time-consuming misunderstandings.
Holden Karau (Independent), Rachel B Warren (Salesforce Einstein)
Holden Karau and Rachel Warren explore the state of the current big data ecosystem and explain how to best work with it in non-JVM languages. While much of the focus will be on Python + Spark, the talk will also include interesting anecdotes about how these lessons apply to other systems (including Kafka).
Ben Lorica (O'Reilly)
Ben Lorica shares emerging security best practices for business intelligence, machine learning, and mobile computing products and explores new tools, methods, and products that can help ease the way for companies interested in deploying secure and privacy-preserving analytics.
Anne Buff (SAS)
Emerging technologies such as the IoT, AI, and ML present businesses with enormous opportunities for innovation, but to maximize the potential of these technologies, businesses must radically shift their approach to governance. Anne Buff explains what it takes to shift the focus of governance from standards, conformity, and control to accountability, extensibility, and enablement.
Eugene Kirpichov (Google)
Apache Beam equips users with a novel programming model in which the classic batch/streaming data processing dichotomy is erased. Eugene Kirpichov details the modularity and composability advantages created by treating data ingestion as just another data processing task and walks you through building highly modular data ingestion APIs using the new Beam programming model primitive Splittable DoFn.
April Chen (Civis Analytics)
Which of your ad campaigns lead to the most sales? In the absence of A/B testing, marketers often turn to simple touch attribution models. April Chen details the shortcomings of these models and proposes a new approach that uses matching methods from causal inference to more accurately measure marketing effectiveness.
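Matching from causal inference, as referenced above, can be illustrated in miniature: pair each ad-exposed user with the most similar unexposed user on a confounder and compare outcomes. This is a toy nearest-neighbor sketch with invented data, far simpler than what the talk covers.

```python
# A toy matching estimator: for each exposed user, find the unexposed
# user with the closest confounding score (e.g., prior engagement) and
# average the outcome differences over the exposed group.
def matched_lift(exposed, unexposed):
    """exposed/unexposed: lists of (score, converted) tuples."""
    lift = 0.0
    for score, converted in exposed:
        # Nearest-neighbor match on the confounding score.
        _, control_converted = min(unexposed, key=lambda u: abs(u[0] - score))
        lift += converted - control_converted
    return lift / len(exposed)

exposed = [(0.9, 1), (0.5, 1), (0.2, 0)]      # saw the campaign
unexposed = [(0.85, 1), (0.45, 0), (0.25, 0)]  # did not
print(matched_lift(exposed, unexposed))  # average effect among the exposed
```

Unlike last-touch attribution, which would credit the campaign for every exposed conversion, matching nets out conversions that similar unexposed users would have made anyway.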
Mike Prorock
Mike Prorock offers an overview of a game-changing climate awareness solution that combines smart sensor technology, data transmission, and state-of-the-art visual analytics to transform the agricultural and turf management market. The solution enables growers to monitor areas of concern, providing immediate benefits to crop yield, supply costs, farm labor overhead, and water consumption.
Zhen Fan, Wei Ting Chen (Intel Corporation)
Zhen Fan and Wei Ting Chen explain how their company uses Spark on Kubernetes in a production environment and why it chose Spark on Kubernetes for its AI workloads. You'll learn how to run Spark with Kubernetes and gain an understanding of the advantages this provides.
Ian Cook (Cloudera)
The popular R package dplyr provides a consistent grammar for data manipulation that can abstract over diverse data sources. Ian Cook shows how you can use dplyr to query large-scale data using different processing engines including Spark and Impala. He demonstrates the R package sparklyr (from RStudio) and the new R package implyr (from Cloudera) and shares tips for making dplyr code portable.
Billy Liu (Kyligence)
As organizations look to scale their analytics capability, the need to grow beyond a traditional data warehouse becomes critical, and cloud-based solutions allow more flexibility while being more cost efficient. Billy Liu offers an overview of Kyligence Cloud, a managed Apache Kylin online service designed to speed up mission-critical analytics at web scale for big data.
Janelle Shane
On her blog, Janelle Shane posts the results of neural network experiments gone delightfully wrong. But machine learning mistakes can also be very embarrassing or even dangerous. Using silly datasets as examples, Janelle talks about some ways that algorithms fail.
Tim Berglund (Confluent)
Tim Berglund leads a basic architectural introduction to Kafka and walks you through using Kafka Streams and KSQL to process streaming data.
Dean Wampler (Anyscale), Boris Lublinsky (Lightbend)
Join Dean Wampler and Boris Lublinsky to learn how to build two microservice streaming applications based on Kafka using Akka Streams and Kafka Streams for data processing. You'll explore the strengths and weaknesses of each tool for particular design needs and contrast them with Spark Streaming and Flink, so you'll know when to choose them instead.
Fabian Hueske (data Artisans), Shuyi Chen (Uber)
Fabian Hueske and Shuyi Chen explore SQL's role in the world of streaming data and its implementation in Apache Flink and cover fundamental concepts, such as streaming semantics, event time, and incremental results. They also share their experience using Flink SQL in production at Uber, explaining how Uber leverages Flink SQL to solve its unique business challenges.
Valentin Bercovici (PencilDATA)
Valentin Bercovici explores the challenges in securing, maintaining, and repairing the dynamic, heterogeneous software supply chain for modern self-driving cars, from levels 0 to 5. Along the way, Valentin reviews implementation options, from centralized certificate authority-based architectures to decentralized blockchains networked over a fleet of cars.
Stephanie Beben (Booz Allen Hamilton)
How can you most effectively use machine intelligence to drive strategy? By merging it in the right way with the human ingenuity of leaders throughout your organization. Stephanie Beben shares insights from her work with pioneering companies, government agencies, and nonprofits that are successfully navigating this partnership by becoming “mathematical corporations.”
Jiangjie Qin (LinkedIn)
LinkedIn runs more than 1,800 Kafka brokers that deliver more than two trillion messages a day. Running Kafka at such a scale makes automated operations a necessity. Jiangjie Qin shares lessons learned from operating Kafka at scale with minimum human intervention.
Umur Cubukcu (Citus Data)
PostgreSQL is often regarded as the world’s most advanced open source database—and it’s on fire. Umur Cubukcu moves beyond the typical list of features in the next release to explore why so many new projects “just use Postgres” as their system of record (or system of engagement) at scale. Along the way, you'll learn how PostgreSQL's extension APIs are fueling innovations in relational databases.
Tobias Ternstrom (Microsoft)
The emergence of the cloud combined with open source software ushered in an explosive use of a broad range of technologies. Tobias Ternstrom explains why you should step back and attempt to objectively evaluate the problem you are trying to solve before choosing the tool to fix it.
Brian Karfunkel (Pinterest)
When software companies use A/B tests to evaluate product changes and fail to accurately estimate the long-term impact of such experiments, they risk optimizing for the users they have at the expense of the users they want to have. Brian Karfunkel explains how to estimate an experiment’s impact over time, thus mitigating this risk and giving full credit to experiments targeted at noncore users.
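The core of such an estimate — comparing the cohorts week by week rather than only at the end of the experiment — can be sketched in a few lines. The retention numbers below are invented for illustration; the talk describes Pinterest's actual methodology:

```python
def weekly_lift(control, treatment):
    """Week-by-week difference in retention rate between treatment
    and control cohorts, so an experiment whose benefit accrues
    slowly still gets credit for its long-term impact."""
    return [t - c for c, t in zip(control, treatment)]

# toy retention curves: the change helps mostly in later weeks,
# which an end-of-experiment snapshot alone would undercount
control   = [0.50, 0.40, 0.35]   # fraction of cohort active each week
treatment = [0.50, 0.44, 0.42]
lift = weekly_lift(control, treatment)
```

A single aggregate metric at week one would report zero lift here, while the per-week view shows the impact growing over time.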
Jeff Dean (Google)
The Google Brain team conducts research on difficult problems in artificial intelligence and builds large-scale computer systems for machine learning research, both of which have been applied to dozens of Google products. Jeff Dean highlights some of Google Brain's projects with an eye toward how they can be used to solve challenging problems.
Ann Nguyen (Whole Whale)
Power Poetry is the largest online platform for young poets, with over 350K users. Ann Nguyen explains how Power Poetry is extending the learning potential with machine learning and covers the technical elements of the Poetry Genome, a series of ML tools to analyze and break down similarity scores of the poems added to the site.
Siddharth Teotia (Dremio)
Query processing technology has rapidly evolved since the iconic C-Store paper was published in 2005, with a focus on designing query processing algorithms and data structures that efficiently utilize CPU and leverage the changing trends in hardware to deliver optimal performance. Siddharth Teotia outlines the different types of vectorized query processing in Dremio using Apache Arrow.
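The core idea of vectorized execution — operating on a column of values per call instead of one row per call — can be illustrated without Arrow at all. This toy comparison shows only the control-flow difference; the real wins come from tight loops over Arrow's contiguous columnar buffers, which compilers can turn into SIMD:

```python
def sum_row_at_a_time(rows, col):
    # Volcano-style iteration: one call and one value per row
    total = 0
    for row in rows:
        total += row[col]
    return total

def sum_vectorized(batches, col):
    # vectorized style: one call per batch, with a tight inner
    # loop over a contiguous column of values
    total = 0
    for batch in batches:
        total += sum(batch[col])
    return total

rows = [{"x": i} for i in range(10)]
batches = [{"x": list(range(5))}, {"x": list(range(5, 10))}]
```

Both produce the same answer; the batched form simply amortizes per-tuple overhead across a whole column chunk.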
Noah Gift (UC Davis)
Noah Gift uses data science and machine learning to explore NBA team valuation and attendance as well as individual player performance. Questions include: What drives the valuation of teams (attendance, the local real estate market, etc.)? Does winning bring more fans to games? Does salary correlate with social media performance?
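Questions like "does winning bring more fans to games?" reduce to correlation and regression. A minimal sketch with invented numbers (not data from the talk):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

wins       = [20, 35, 50, 60]               # hypothetical season win totals
attendance = [14000, 16000, 18500, 19000]   # hypothetical average attendance
r = pearson(wins, attendance)
```

A strong positive `r` on toy numbers like these is suggestive, not conclusive — which is exactly why the real analysis also controls for factors like the local real estate market.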
William Vambenepe (Google)
William Vambenepe explains how a pivot toward machine learning and artificial intelligence has created clearer separation among clouds than ever before. William walks you through an interesting use case of machine learning in action and discusses the central role AI will play in big data analysis moving forward.
Daniel Templeton (Cloudera), Andrew Wang (Cloudera)
Hadoop 3.0 has been years in the making, and now it's finally arriving. Andrew Wang and Daniel Templeton offer an overview of new features, including HDFS erasure coding, YARN Timeline Service v2, YARN federation, and much more, and discuss current release management status and community testing efforts dedicated to making Hadoop 3.0 the best Hadoop major release yet.
Baron Schwartz (VividCortex)
Anomaly detection is white hot in the monitoring industry, yet many don't really understand or care about it, while others keep repeating the same patterns. Why? And what can we do about it? Baron Schwartz explains how he arrived at a "post-anomaly detection" point of view.
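One of the patterns that gets reimplemented again and again is thresholding on a moving average. A minimal sketch of that baseline — an exponentially weighted mean and deviation, with all parameter values invented:

```python
def ewma_anomalies(series, alpha=0.3, threshold=5.0):
    """Flag points whose deviation from an exponentially weighted
    moving average exceeds `threshold` times the EWM deviation --
    the simple baseline many monitoring products rebuild."""
    mean, dev, flags = series[0], 0.0, []
    for x in series:
        delta = abs(x - mean)
        flags.append(dev > 0 and delta > threshold * dev)
        mean = alpha * x + (1 - alpha) * mean   # update after testing x
        dev = alpha * delta + (1 - alpha) * dev
    return flags

metrics = [10, 12, 10, 11, 12, 10, 11, 50]    # toy series with a spike at the end
flags = ewma_anomalies(metrics)
```

The baseline catches the obvious spike but, as the talk argues, tuning `alpha` and `threshold` per metric is exactly where this approach stops scaling.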
Patrick Harrison (S&P Global)
Word vector embeddings are everywhere, but relatively few understand how they produce their remarkable results. Patrick Harrison opens up the black box of a popular word embedding algorithm and walks you through how it works its magic. Patrick also covers core neural network concepts, including hidden layers, loss gradients, backpropagation, and more.
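The mechanics involved can be boiled down to a single training step. Below is a toy sketch of the update at the heart of skip-gram with negative sampling, word2vec's most common variant; the vector values and learning rate are invented:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def sgns_step(center, context, label, lr=0.5):
    """One SGD step of skip-gram with negative sampling: nudge both
    vectors so that sigmoid(center . context) moves toward `label`
    (1 = word pair observed in a context window, 0 = sampled negative)."""
    score = sum(c * x for c, x in zip(center, context))
    grad = sigmoid(score) - label  # dLoss/dScore for logistic loss
    new_center = [c - lr * grad * x for c, x in zip(center, context)]
    new_context = [x - lr * grad * c for c, x in zip(center, context)]
    return new_center, new_context

# repeated positive updates pull the two toy vectors together
v, u = [0.1, -0.2], [0.05, 0.1]
for _ in range(100):
    v, u = sgns_step(v, u, 1)
```

Run over billions of real word pairs, this same push-together/pull-apart dynamic is what leaves semantically similar words near each other in the embedding space.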
Thomas Miller (Northwestern University)
Sports analytics today is more than a matter of analyzing box scores and play-by-play statistics. With detailed on-field or on-court data from every game, sports teams face challenges in data management, data engineering, and analytics. Thomas Miller details the challenges faced by a Major League Baseball team as it sought competitive advantage through data science and deep learning.
Joe Dumoulin (Next IT)
AI is transformative for business, but it’s not magic; it’s data. Joe Dumoulin shares how Next IT's global enterprise customers have transformed their businesses with AI solutions and outlines how companies should build AI strategies, utilize data to develop and evolve conversational intelligence and business intents, and ultimately increase ROI.