Speaker Slides: Big data conference & machine learning training

Analytics Zoo: Distributed TensorFlow and Keras on Apache Spark

Jason Dai (Intel), Yuhao Yang (Intel), Jiao (Jennie) Wang (Intel), Guoqiong Song (Intel)

Download slides (PDF)

Jason Dai, Yuhao Yang, Jennie Wang, and Guoqiong Song explain how to build and productionize deep learning applications for big data with Analytics Zoo—a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline—using real-world use cases from JD.com, MLSListings, the World Bank, Baosight, and Midea/KUKA.

Analytics Zoo: Distributed TensorFlow in production on Apache Spark

Yuhao Yang (Intel), Jiao (Jennie) Wang (Intel)

Download slides (PDF)

Yuhao Yang and Jennie Wang demonstrate how to run distributed TensorFlow on Apache Spark with the open source software package Analytics Zoo. Compared to other solutions, Analytics Zoo is built for production environments and encourages more industry users to run deep learning applications in the big data ecosystem.

Apache Spark 2.4 and beyond

Xiao Li (Databricks), Wenchen Fan (Databricks)

Download slides (PDF)

Xiao Li and Wenchen Fan offer an overview of the major features and enhancements in Apache Spark 2.4 and give insight into upcoming releases. Then you'll get the chance to ask all your burning Spark questions.

Applications of mixed effects random forests

Sourav Dey (Manifold)

View slides

Clustered data is all around us. The best way to attack it? Mixed effect models. Sourav Dey explains how the mixed effects random forests (MERF) model and Python package marry the world of classical mixed effect modeling with modern machine learning algorithms and shows how they can be extended for use with other advanced modeling techniques like gradient boosting machines and deep learning.

Applied machine learning in finance

Chakri Cherukuri (Bloomberg LP)

Download slides (PPTX)

Quantitative finance is a rich field in finance where advanced mathematical and statistical techniques are employed by both sell-side and buy-side institutions. Chakri Cherukuri explains how machine learning and deep learning techniques are being used in quantitative finance and details how these models work under the hood.

Architecting a data platform for enterprise use

Mark Madsen (Teradata), Todd Walter (Archimedata)

View slides

Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.

Augmented OLAP for big data from on-premises to multicloud (sponsored by Kyligence)

Yang Li (Kyligence)

Download slides (PDF)

Augmenting data management and analytics platforms with artificial intelligence and machine learning is game changing for analysts, engineers, and other users. It enables companies to optimize their storage, speed, and spending. Yang Li details the Kyligence platform, which is evolving to the next level with augmented capabilities such as intelligent modeling, smart pushdowns, and more.

AutoML and interpretability in healthcare

Taposh DuttaRoy (Kaiser Permanente), Sabrina Dahlgren (Kaiser Permanente)

Download slides (ZIP)

The healthcare industry requires accuracy and highly interpretable models, but the data is usually plagued by missing information and incorrect values. Enter AutoML and auto-model interpretability. Taposh DuttaRoy and Sabrina Dahlgren discuss tools and strategies for AutoML and interpretability and explain how Kaiser Permanente (KP) uses them to shorten the time needed to develop and deploy highly interpretable models.

Better Together Diversity Networking Lunch (sponsored by Walmart Labs)

Download slides (PDF)

If you’d like to make new professional connections and hear ideas for supporting diversity in the tech community, come to the diversity and inclusion networking lunch on Wednesday.

Building high-performance text classifiers on a limited labeling budget

Robert Horton (Microsoft), Mario Inchiosa (Microsoft), Ali Zaidi (Microsoft)

Download slides (PPTX)

Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game.

Chatting with machines: Strange things 60 billion bot logs say about human nature

Lauren Kunze (Pandorabots)

Watch the keynote

Keynote with Lauren Kunze

Community and regional data sharing policy frameworks: Frontier stories

Mei Fung (People Centered Internet)

Download slides (PPTX)

Data sharing requires stakeholders and populations to come together to learn the benefits, risks, challenges, and known and unknown "unknowns." Data sharing policies and frameworks require increasing levels of trust, which takes time to build. Join Mei Fung for trailblazing stories from Solano County, California, and ASEAN (Southeast Asia), which offer important insights.

Cost-effective Presto on AWS with Spot nodes

Shubham Tagra (Qubole)

Download slides (PPTX)

Did you know you can run Presto on AWS at a tenth of the cost using AWS Spot nodes, with just a few architectural enhancements to Presto? Shubham Tagra explores the gaps in Presto architecture, explains how to use Spot nodes, covers enhancements, and showcases the improvements in terms of reliability and TCO achieved through them.

Cruise Control: Effortless management of Kafka clusters

Adem Efe Gencer (LinkedIn)

Download slides (PDF)

Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control.

Data processing at the speed of 100 Gbps using Apache Crail

Patrick Stuedi (IBM Research)

Download slides (PDF)

Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.

Data warehousing is not a use case (sponsored by Google Cloud)

Jordan Tigani (Google)

Watch the keynote

Modern data analysis requirements have fundamentally redefined what our expectations should be for data warehouses. Join Google BigQuery cocreator Jordan Tigani as he shares his vision for where he sees cloud-scale data analytics heading as well as what technology leaders should be considering as part of their data warehousing roadmap.

Database migrations don't have to be painful, but the road will be bumpy

Adrian Lungu (Adobe), Serban Teodorescu (Adobe)

Download slides (PPTX)

Adrian Lungu and Serban Teodorescu explain how—inspired by the green-blue deployment technique—the Adobe Audience Manager team developed an active-passive database migration procedure that allows them to test database clusters in production, minimizing the risks without compromising the innovation.

Deploying data science for national economic statistics

Jeff Chen (US Bureau of Economic Analysis)

Download slides (PDF)

Jeff Chen shares strategies for overcoming time series challenges at the intersection of macroeconomics and data science, drawing from machine learning research conducted at the Bureau of Economic Analysis aimed at improving its flagship product, the gross domestic product.

Dilated neural networks for time series forecasting

Chenhui Hu (Microsoft)

Download slides (PPTX)

Dilated neural networks are a class of recently developed neural networks that achieve promising results in time series forecasting. Chenhui Hu discusses representative network architectures of dilated neural networks and demonstrates their advantages in terms of training efficiency and forecast accuracy by applying them to solve sales forecasting and financial time series forecasting problems.

Efficient multi-armed bandit with Thompson sampling for applications with delayed feedback

Shradha Agrawal (Adobe)

Download slides (PPTX)

Decision making often struggles with the exploration-exploitation dilemma. Multi-armed bandits (MAB) are a popular reinforcement learning solution, but increasing the number of decision criteria leads to an exponential blowup in complexity, and observational delays don’t allow for optimal performance. Shradha Agrawal offers an overview of MABs and explains how to overcome the above challenges.

Executive Briefing: Big data in the era of heavy worldwide privacy regulations

Mark Donsky (Okera), Nikki Rouda (Amazon Web Services)

Download slides (1-PDF)

Download slides (2-PDF)

The implications of new privacy regulations for data management and analytics, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance.

Executive Briefing: Overview of data governance

Paco Nathan (derwen.ai)

View slides

Effective data governance is foundational for AI adoption in the enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, process, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa.

Executive Briefing: What it takes to use machine learning in fast data pipelines

Dean Wampler (Anyscale)

Download slides (PDF)

Your team is building machine learning capabilities. Dean Wampler demonstrates how to integrate these capabilities into streaming data pipelines so you can leverage the results quickly and update them as needed. He also covers challenges such as how to build long-running services that are very reliable and scalable and how to combine a spectrum of very different tools, from data science to operations.

Faster ML over joins of tables

Arun Kumar (University of California, San Diego)

Download slides (PDF)

Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.

Federated learning

Mike Lee Williams (Cloudera Fast Forward Labs)

Download slides (PDF)

Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. Mike Lee Williams discusses the algorithmic solutions and the product opportunities.

Flink SQL in action

Fabian Hueske (Ververica)

Download slides (PDF)

Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL.

Forecasting uncertainty at Airbnb

Theresa Johnson (Airbnb)

Watch the keynote

Airbnb uses AI and machine learning in many parts of its user-facing business. But it's also advancing the state of AI-powered internal tools. Theresa Johnson details the AI powering Airbnb's next-generation end-to-end metrics forecasting platform, which leverages machine learning, Bayesian inference, TensorFlow, Hadoop, and web technology.

Foundations for successful data projects

Jonathan Seidman (Cloudera), Ted Malaska (Capital One)

Download slides (PDF)

The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.

From flat files to deconstructed databases: The evolution and future of the big data ecosystem

Julien Le Dem (WeWork)

Download slides (PDF)

Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.

Hacking the vote: The neuropolitical universe

Elizabeth Svoboda (What Makes a Hero?)

Watch the keynote

Using biosensors and predictive analytics, political campaigns aim to decode your true desires—and influence your vote—without your knowledge. Elizabeth Svoboda explains how these tools work, who's using them, and what they mean for the future of free and fair elections.

Hands-on with Cloudera SDX: Setting up your own shared data experience

Santosh Kumar (Cloudera), Andre Araujo (Cloudera), Wim Stoop (Cloudera)

Download slides (PDF)

Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX.

How to determine the optimal anomaly detection method for your application

Jonathan Merriman (Verint Intelligent Self-Service), Cynthia Freeman (Verint Intelligent Self-Service)

Download slides (PDF)

Anomaly detection has many applications, such as tracking business KPIs or fraud spotting in credit card transactions. Unfortunately, there's no one best way to detect anomalies across a variety of domains. Jonathan Merriman and Cynthia Freeman introduce a framework to determine the best anomaly detection method for the application based on time series characteristics.

How to protect big data in a containerized environment

Thomas Phelan (HPE BlueData)

Download slides (PPT)

Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). But TDE is difficult to configure and manage—particularly when run in Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them.

How Zhaopin.com built its enterprise event bus using Apache Pulsar

Sijie Guo (StreamNative), Penghui Li (Zhaopin)

Download slides (PDF)

Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. Sijie Guo and Penghui Li discuss the event bus requirements for Zhaopin.com, one of China's biggest online recruitment services providers, and explain why the company chose Apache Pulsar.

Interpretable and resilient AI for financial services

Jari Koister (FICO)

View slides

Financial services are increasingly deploying AI services for a wide range of applications, such as identifying fraud and financial crimes. Such deployment requires models to be interpretable, explainable, and resilient to adversarial attacks; regulatory requirements prohibit black-box machine learning models. Jari Koister shares tools and infrastructure FICO has developed to support these needs.

Introduction to Flink via Flink SQL

Fabian Hueske (Ververica)

Download slides (PDF)

Fabian Hueske offers an overview of Apache Flink via the SQL interface, covering stream processing and Flink's various modes of use. Then you'll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API.

Likewar: How social media is changing the world…and how the world is changing social media

Peter Singer (New America)

Watch the keynote

Terrorists live-stream their attacks, “Twitter wars” sell music albums and produce real-world casualties, and viral misinformation alters not just the result of battles but the very fate of nations. The result is that war, tech, and politics have blurred into a new kind of battle space that plays out on our smartphones. P. W. Singer explains.

Live Aggregators: A scalable, cost-effective, and reliable way of aggregating billions of messages in real time

Osman Sarood (Mist Systems), Chunky Gupta (Mist Systems)

View slides

Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions due to running over AWS Spot Instances and having 70% CPU utilization.

Machine learning prediction of blood alcohol content: A digital signature of behavior

Kirstin Aschbacher (UCSF Cardiology)

Download slides (PDF)

Some people use digital devices to track their blood alcohol content (BAC). A BAC-tracking app that could anticipate when a person is likely to have a high BAC could offer coaching in a time of need. Kirstin Aschbacher shares a machine learning approach that predicts user BAC levels with good precision based on minimal information, thereby enabling targeted interventions.

Managing data science in the enterprise

Joshua Poduska (Domino Data Lab), Kimberly Shenk (NakedPoppy), Mac Steele (Domino)

Download slides (PDF)

The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders must deliver measurable impact on an increasing share of an enterprise's KPIs. Joshua Poduska, Kimberly Shenk, and Mac Steele explain how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage.

Masquerading malicious DNS traffic

David Rodriguez (Cisco Systems)

Download slides (PDF)

Malicious DNS traffic patterns are inconsistent and typically thwart anomaly detection. David Rodriguez explains how Cisco uses Apache Spark and Stripe’s Bayesian inference software, Rainier, to fit the underlying time series distribution for millions of domains and outlines techniques to identify artificial traffic volumes related to spam, malvertising, and botnets (masquerading traffic).

MLflow: An open platform to simplify the machine learning lifecycle

Corey Zumar (Databricks)

Download slides (PPTX)

Developing applications that leverage machine learning is difficult. Practitioners need to be able to reproduce their model development pipelines, as well as deploy models and monitor their health in production. Corey Zumar offers an overview of MLflow, which simplifies this process by managing, reproducing, and operationalizing machine learning through a suite of model tracking and deployment APIs.

New directions in record linkage

Yves Thibaudeau (US Census Bureau)

Download slides (PDF)

The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, there's been a lot of change in computing capabilities and new techniques, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications.

NLP from scratch: Solving the cold start problem for natural language processing

Michael Johnson (Lockheed Martin), Norris Heintzelman (Lockheed Martin)

Download slides (PPTX)

How do you train a machine learning model with no training data? Michael Johnson and Norris Heintzelman share their journey implementing multiple solutions to bootstrapping training data in the NLP domain, covering topics including weak supervision, building an active learning framework, and annotation adjudication for named-entity recognition.

Nutrition data science

Noah Gift (UC Davis), Michelle Davenport (Quantitative Nutrition)

Download slides (PPTX)

Noah Gift and Michelle Davenport explore exciting ideas in nutrition using data science; specifically, they analyze the detrimental relationship between sugar and longevity, obesity, and chronic diseases.

Online evaluation of machine learning models

Ted Dunning (MapR, now part of HPE)

Download slides (PDF)

Evaluating machine learning models is surprisingly hard, particularly because these systems interact in very subtle ways. Ted Dunning breaks the problem of evaluation apart into operational and functional evaluation, demonstrating how to do each without unnecessary pain and suffering. Along the way, he shares exciting visualization techniques that will help make differences strikingly apparent.

Presto: Tuning performance of SQL-on-anything analytics

Kamil Bajda-Pawlikowski (Starburst), Martin Traverso (Presto Software Foundation)

Download slides (PDF)

Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.

ROCKSET: The design and implementation of a data system for low-latency queries for search and analytics

Igor Canadi (Rockset), Dhruba Borthakur (Rockset)

Download slides (PDF)

Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called ROCKSET that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.

Scaling visualization for big data and analytics in the cloud

Jaipaul Agonus (FINRA), Daniel Monteiro do Carmo Rosa (FINRA)

View slides

Jaipaul Agonus and Daniel Monteiro do Carmo Rosa detail big data analytics and visualization practices and tools used by FINRA to support machine learning and other surveillance activities that the Market Regulation Department conducts in the AWS cloud.

Scoring your business in the AI matrix (sponsored by Dataiku)

Jed Dougherty (Dataiku)

Watch the keynote

One widely accepted definition of AI is that it means going beyond simple statistics to mimic human skills in perception, learning, interaction, and decision making. Jed Dougherty tightens up this definition by sharing examples on a matrix that breaks down the different parts of that definition and how they might manifest themselves in data science projects at different levels.

Serverless for data and AI

Avner Braverman (Binaris)

Download slides (PDF)

What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.

Serverless workflows for orchestrating hybrid cluster-based and serverless processing

Rustem Feyzkhanov (Instrumental)

Download slides (1-PDF)

View slides

Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture.

Sharing cancer genomic data from clinical sequencing using the blockchain

Benjamin Glicksberg (UCSF)

Download slides (PDF)

Sequencing cancer genomes has transformed how we diagnose and treat the deadliest disease in America: cancer. Benjamin Glicksberg explains how coupling cancer genomic data with treatment data through the blockchain will empower patients and citizen scientists to rapidly advance cancer research.

Solving the enterprise data dilemma (sponsored by erwin)

Adam Famularo (erwin, Inc.)

Download slides (PDF)

Adam Famularo showcases erwin's combination of data management and data governance to produce actionable insights. Erwin customer Nasdaq then shares a real-world use case. You'll learn how to answer tough data questions, how to maintain a metadata landscape, and how to use data management and governance to produce actionable insights.

Spark adaptive execution: Unleash the power of Spark SQL

Haifeng Chen (Intel)

Download slides (PPT)

Spark SQL is widely used, but it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. Haifeng Chen shares a Spark adaptive execution engine built to address these challenges. It can handle task parallelism, join conversion, and data skew dynamically during runtime, guaranteeing the best plan is chosen using runtime statistics.

Spark-PMoF: Accelerating big data analytics with Persistent Memory over Fabric

Yuan Zhou (Intel), Haodong Tang (Intel), Jian Zhang (Intel)

Download slides (PPTX)

Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMoF and explain how it improves Spark analytics performance.

Strata Data Awards: Winners Announced

Watch the keynote

The Strata Data Awards recognize the most innovative startups, leaders, and data science projects from Strata sponsors and exhibitors around the world. Join us during keynotes for the announcement of the winners.

Streamlining your data assets: A strategy for the journey to AI (sponsored by IBM)

Dinesh Nirmal (IBM)

Watch the keynote

The journey to AI begins with data and making intelligent use of it. Dinesh Nirmal shares a strategic framework for streamlining your data assets, a framework that takes into account the current state of your existing data structures, the new technologies driving enterprise, the complexities of business processes, and at the foundation, the elements required in an AI-fluent data platform.

Testing ad content with survey experiments

Patrick Miller (Civis Analytics)

Download slides (PDF)

Brands that test the content of ads before they are shown to an audience can avoid spending resources on the 11% of ads that cause backlash. Using a survey experiment to choose the best ad improves the effectiveness of marketing campaigns by 13% on average, and by up to 37% for particular demographics. Patrick Miller explores data collection and statistical methods for analysis and reporting.

The enterprise data cloud

Mike Olson (Cloudera)

Watch the keynote

Most enterprises want the same flexibility and convenience they get in the public cloud, no matter where their data lives or their applications run. We've reached the point that the "enterprise data cloud" must span the firewall and the services offered by hyperscale vendors. Mike Olson describes the key capabilities that such a system requires and why hybrid and multicloud is the future.

The ethics of analytics

Bill Franks (International Institute For Analytics)

Download slides (PDF)

Concerns are constantly being raised today about what data is appropriate to collect and how (or if) it should be analyzed. There are many ethical, privacy, and legal issues to consider, and no clear standards exist in many cases as to what is fair and what is foul. Bill Franks explores a variety of dilemmas and provides some guidance on how to approach them.

The future of the firm: Starting now

Josh Bersin (Bersin by Deloitte)

Download slides (PPTX)

Josh Bersin explains how firms are transforming for the digital era, covering the death of the traditional organizational hierarchy, new models of leadership and management, changes in the way people learn and progress, new models of pay, and the importance of trust and transparency as a central business value.

The journey to the data-driven enterprise from the edge to AI

Amy O'Connor (Cloudera)

Watch the keynote

Cloudera “drinks its own champagne” by running Cloudera on Cloudera. The company analyzes data from the edge and runs probabilistic models to tune its business processes with AI, from marketing, sales, and support to strategic planning. Amy O'Connor shares what Cloudera has learned from the edge to AI and explains how it's helping Cloudera and its customers get better at being data driven.

The journey toward a self-service data platform at Netflix

Kurt Brown (Netflix)

View slides

The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data tech (Spark and Flink), enabling services (federated metadata management), and machine learning support. But with power comes complexity. Kurt Brown explains how Netflix is working toward an easier, "self-service" data platform without sacrificing any enabling capabilities.

The magic behind your Lyft ride prices: A case study on machine learning and streaming

Rakesh Kumar (Lyft), Thomas Weise (Lyft)

Download slides (PDF)

Rakesh Kumar and Thomas Weise explore how Lyft dynamically prices its rides with a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability—allowing the pricing system to be more adaptable to real-world changes.

The new frontier: Marsh’s data voyage into the public cloud (sponsored by Impetus)

Stephen Dantu (Marsh)

Download slides (PPTX)

Stephen Dantu shares insurance broker Marsh’s pioneering journey into the public cloud and explains why this move was necessary to unleash new opportunities and future-proof the company.

User-based real-time product recommendations leveraging deep learning using Analytics Zoo on Apache Spark and BigDL

Luyang Wang (Restaurant Brands International), Jing (Nicole) Kong (Office Depot), Guoqiong Song (Intel), Maneesha Bhalla (Office Depot)

Download slides (PPT)

User-based real-time recommendation systems have become an important topic in ecommerce. Luyang Wang, Nicole Kong, Guoqiong Song, and Maneesha Bhalla demonstrate how to build deep learning algorithms using Analytics Zoo with BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations.

Using the full spectrum of data science to drive business decisions

Chi-Yi Kuan (LinkedIn), Tiger Zhang (LinkedIn), Xiaojing Dong (LinkedIn), Burcu Baran (LinkedIn), Emily Huang (LinkedIn)

Download slides (PDF)

Thanks to the rapid growth in data resources, business leaders now appreciate the importance (and the challenge) of mining information from data. Join in as a group of LinkedIn's data scientists share their experiences successfully leveraging emerging techniques to assist in intelligent decision making.

What the reproducibility problem means for your business

Stuart Buck (Arnold Ventures)

Download slides (PPTX)

Academic research has been plagued by a reproducibility crisis in fields ranging from medicine to psychology. Stuart Buck explains how to take precautions in your data analysis and experiments so as to avoid those reproducibility problems.

When self-service BI meets geospatial analysis

Kyungtaak Noh (SK Telecom)

View slides

In the analysis of the mobile world, everyone starts with the question, "Where?" SK Telecom is trying to meet these needs. Kyungtaak Noh explains how the company provides geospatial analysis by processing geospatial data through Druid with Lucene.

When SQL users run wild: Resource management features and techniques to tame Apache Impala

Tim Armstrong (Cloudera)

Download slides (PDF)

As the popularity and utilization of Apache Impala deployments increase, clusters often become victims of their own success when demand for resources exceeds the supply. Tim Armstrong dives into the latest resource management features in Impala to maintain high cluster availability and optimal performance and provides examples of how to configure them in your Impala deployment.
