Jason Dai, Yuhao Yang, Jennie Wang, and Guoqiong Song explain how to build and productionize deep learning applications for big data with Analytics Zoo—a unified analytics and AI platform that seamlessly unites Spark, TensorFlow, Keras, and BigDL programs into an integrated pipeline—using real-world use cases from JD.com, MLSListings, the World Bank, Baosight, and Midea/KUKA.
Yuhao Yang and Jennie Wang demonstrate how to run distributed TensorFlow on Apache Spark with the open source software package Analytics Zoo. Unlike other solutions, Analytics Zoo is built for production environments, making it easier for industry users to run deep learning applications within the big data ecosystem.
Xiao Li and Wenchen Fan offer an overview of the major features and enhancements in Apache Spark 2.4 and give insight into upcoming releases. Then you'll get the chance to ask all your burning Spark questions.
Clustered data is all around us. The best way to attack it? Mixed effect models. Sourav Dey explains how the mixed effects random forests (MERF) model and Python package marries the world of classical mixed effect modeling with modern machine learning algorithms and shows how it can be extended to be used with other advanced modeling techniques like gradient boosting machines and deep learning.
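The core MERF idea is an EM-style alternation between fitting the fixed-effect function and re-estimating per-cluster random effects. A minimal numpy sketch on made-up synthetic data, with a simple linear fit standing in for the random forest the actual MERF package uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic clustered data: y = 2*x (fixed effect) + b_c (cluster intercept) + noise
n_clusters, n_per = 5, 200
true_b = rng.normal(0, 3, n_clusters)             # per-cluster random intercepts
clusters = np.repeat(np.arange(n_clusters), n_per)
x = rng.uniform(-1, 1, n_clusters * n_per)
y = 2.0 * x + true_b[clusters] + rng.normal(0, 0.1, x.size)

# EM-style alternation (the MERF idea, with a linear model in place of the
# random forest): fit the fixed effect on residuals of the random effects,
# then re-estimate the per-cluster intercepts, and repeat.
b_hat = np.zeros(n_clusters)
for _ in range(10):
    resid = y - b_hat[clusters]
    slope = np.sum(x * resid) / np.sum(x * x)     # fixed-effect fit f(x) = slope * x
    fixed = slope * x
    for c in range(n_clusters):                   # random-effect update per cluster
        b_hat[c] = np.mean((y - fixed)[clusters == c])

print(round(slope, 2))                            # recovers the fixed effect, ~2.0
```

Because the fixed-effect step is just "fit any regressor on de-clustered residuals," the same loop accommodates gradient boosting machines or neural networks, which is the extension the talk describes.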
Quantitative finance is a rich field in finance where advanced mathematical and statistical techniques are employed by both sell-side and buy-side institutions. Chakri Cherukuri explains how machine learning and deep learning techniques are being used in quantitative finance and details how these models work under the hood.
Building a data lake involves more than installing Hadoop or putting data into AWS. The goal in most organizations is to build a multiuse data infrastructure that isn't subject to past constraints. Mark Madsen and Todd Walter explore design assumptions and principles and walk you through a reference architecture to use as you work to unify your analytics infrastructure.
Augmenting data management and analytics platforms with artificial intelligence and machine learning is game changing for analysts, engineers, and other users. It enables companies to optimize their storage, speed, and spending. Yang Li details the Kyligence platform, which is evolving to the next level with augmented capabilities such as intelligent modeling, smart pushdowns, and more.
The healthcare industry requires accuracy and highly interpretable models, but the data is usually plagued by missing information and incorrect values. Enter AutoML and auto-model interpretability. Taposh DuttaRoy and Sabrina Dahlgren discuss tools and strategies for AutoML and interpretability and explain how KP uses them to improve time to develop and deploy highly interpretable models.
If you’d like to make new professional connections and hear ideas for supporting diversity in the tech community, come to the diversity and inclusion networking lunch on Wednesday.
Robert Horton, Mario Inchiosa, and Ali Zaidi demonstrate how to use three cutting-edge machine learning techniques—transfer learning from pretrained language models, active learning to make more effective use of a limited labeling budget, and hyperparameter tuning to maximize model performance—to up your modeling game.
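One common active learning strategy of the kind the talk covers is uncertainty sampling: spend the labeling budget on the examples the current model is least sure about. A minimal sketch (function name and toy probabilities are illustrative, not from the session):

```python
import numpy as np

def uncertainty_sample(probs, k):
    """Pick the k unlabeled examples whose predicted class distribution
    has the highest entropy -- i.e., where the model is least certain."""
    probs = np.asarray(probs, dtype=float)
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-k:][::-1]        # indices, most uncertain first

# Three unlabeled examples: confident, near-uniform (max entropy), mildly unsure
probs = [[0.98, 0.01, 0.01],
         [0.34, 0.33, 0.33],
         [0.70, 0.20, 0.10]]
print(uncertainty_sample(probs, 2))              # → [1 2]
```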
Keynote with Lauren Kunze
Data sharing requires stakeholders and populations to come together to learn the benefits, risks, challenges, and known and unknown "unknowns." Data sharing policies and frameworks require increasing levels of trust, which takes time to build. Join Mei Fung for trail-blazing stories from Solano County, California, and ASEAN (SE Asia), which offer important insights.
Did you know you can run Presto in AWS at a tenth of the cost with AWS Spot nodes, with just a few architectural enhancements to Presto? Shubham Tagra explores the gaps in Presto architecture, explains how to use Spot nodes, covers enhancements, and showcases the improvements in terms of reliability and TCO achieved through them.
Adem Efe Gencer explains how LinkedIn alleviated the management overhead of large-scale Kafka clusters using Cruise Control.
Modern networking and storage technologies like RDMA or NVMe are finding their way into the data center. Patrick Stuedi offers an overview of Apache Crail (incubating), a new project that facilitates running data processing workloads (ML, SQL, etc.) on such hardware. Patrick explains what Crail does and how it benefits workloads based on TensorFlow or Spark.
Modern data analysis requirements have fundamentally redefined what our expectations should be for data warehouses. Join Google BigQuery cocreator Jordan Tigani as he shares his vision for where he sees cloud-scale data analytics heading as well as what technology leaders should be considering as part of their data warehousing roadmap.
Adrian Lungu and Serban Teodorescu explain how—inspired by the blue-green deployment technique—the Adobe Audience Manager team developed an active-passive database migration procedure that allows them to test database clusters in production, minimizing the risks without compromising on innovation.
Jeff Chen shares strategies for overcoming time series challenges at the intersection of macroeconomics and data science, drawing from machine learning research conducted at the Bureau of Economic Analysis aimed at improving its flagship product, the gross domestic product (GDP).
Dilated neural networks are a class of recently developed neural networks that achieve promising results in time series forecasting. Chenhui Hu discusses representative network architectures of dilated neural networks and demonstrates their advantages in terms of training efficiency and forecast accuracy by applying them to solve sales forecasting and financial time series forecasting problems.
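The training-efficiency advantage comes from how quickly stacked dilated convolutions grow their receptive field. A small sketch of the arithmetic (WaveNet-style doubling dilations are an assumption here, not a detail from the session):

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of 1-D dilated causal convolutions:
    each layer with dilation d sees (kernel_size - 1) * d extra past steps."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations (1, 2, 4, ..., 128) with kernel size 2: eight cheap
# layers cover 256 time steps, versus 255 undilated layers for the same span.
dilations = [2 ** i for i in range(8)]
print(receptive_field(2, dilations))   # → 256
```

Exponentially growing coverage with linearly many layers is what makes these networks attractive for long-horizon sales and financial forecasting.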
Decision making often struggles with the exploration-exploitation dilemma. Multi-armed bandits (MAB) are a popular reinforcement learning solution, but increasing the number of decision criteria leads to an exponential blowup in complexity, and observational delays don’t allow for optimal performance. Shradha Agrawal offers an overview of MABs and explains how to overcome the above challenges.
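The exploration-exploitation trade-off at the heart of MABs can be seen in the classic epsilon-greedy strategy. A self-contained toy sketch (epsilon-greedy is one standard baseline, not necessarily the algorithm the speaker presents):

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, steps=5000, seed=42):
    """Epsilon-greedy multi-armed bandit: explore a random arm with
    probability epsilon, otherwise exploit the best estimate so far."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                              # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])     # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0      # Bernoulli reward
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]    # running mean
    return counts, estimates

counts, estimates = epsilon_greedy([0.2, 0.5, 0.8])
print(counts.index(max(counts)))   # → 2 (the best arm is pulled most often)
```

Each additional decision criterion multiplies the number of "arms," which is the exponential blowup the talk addresses; delayed observations further slow the running-mean updates above.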
The implications of new privacy regulations for data management and analytics, such as the General Data Protection Regulation (GDPR) and the upcoming California Consumer Privacy Act (CCPA), can seem complex. Mark Donsky and Nikki Rouda highlight aspects of the rules and outline the approaches that will assist with compliance.
Effective data governance is foundational for AI adoption in enterprise, but it's an almost overwhelming topic. Paco Nathan offers an overview of its history, themes, tools, process, standards, and more. Join in to learn what impact machine learning has on data governance and vice versa.
Your team is building machine learning capabilities. Dean Wampler demonstrates how to integrate these capabilities in streaming data pipelines so you can leverage the results quickly and update them as needed and covers challenges such as how to build long-running services that are very reliable and scalable and how to combine a spectrum of very different tools, from data science to operations.
Arun Kumar details recent techniques to accelerate ML over data that is the output of joins of multiple tables. Using ideas from query optimization and learning theory, Arun demonstrates how to avoid joins before ML to reduce runtimes and memory and storage footprints. Along the way, he explores open source software prototypes and sample ML code in both R and Python.
Imagine building a model whose training data is collected on edge devices such as cell phones or sensors. Each device collects data unlike any other, and the data cannot leave the device because of privacy concerns or unreliable network access. This challenging situation is known as federated learning. Mike Lee Williams discusses the algorithmic solutions and the product opportunities.
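The best-known algorithmic solution is federated averaging (FedAvg): devices train locally and send only model weights to a server, which aggregates them. A minimal numpy sketch of the aggregation step (toy weights and sizes are illustrative):

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg aggregation: average client model weights, weighted by the
    number of training examples on each device; raw data never leaves it."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three devices with different amounts of local data send only their
# locally trained weight vectors to the server.
weights = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [100, 100, 200]
print(federated_average(weights, sizes))   # → [3.5 4.5]
```

In a full round, the server would broadcast this average back to the devices for the next epoch of local training.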
Processing streaming data with SQL is becoming increasingly popular. Fabian Hueske explains why SQL queries on streams should have the same semantics as SQL queries on static data. He then shares a selection of common use cases and demonstrates how easily they can be addressed with Flink SQL.
Airbnb uses AI and machine learning in many parts of its user-facing business. But it's also advancing the state of AI-powered internal tools. Theresa Johnson details the AI powering Airbnb's next-generation end-to-end metrics forecasting platform, which leverages machine learning, Bayesian inference, TensorFlow, Hadoop, and web technology.
The enterprise data management space has changed dramatically in recent years, and this has led to new challenges for organizations in creating successful data practices. Jonathan Seidman and Ted Malaska share guidance and best practices from planning to implementation based on years of experience working with companies to deliver successful data projects.
Big data infrastructure has evolved from flat files in a distributed filesystem to an efficient ecosystem to a fully deconstructed and open source database with reusable components. Julien Le Dem discusses the key open source components of the big data ecosystem and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem.
Using biosensors and predictive analytics, political campaigns aim to decode your true desires—and influence your vote—without your knowledge. Elizabeth Svoboda explains how these tools work, who's using them, and what they mean for the future of free and fair elections.
Cloudera SDX provides unified metadata control, simplifies administration, and maintains context and data lineage across storage services, workloads, and operating environments. Santosh Kumar, Andre Araujo, and Wim Stoop offer an overview of SDX before diving deep into the moving parts and guiding you through setting it up. You'll leave with the skills to set up your own SDX.
Anomaly detection has many applications, such as tracking business KPIs or fraud spotting in credit card transactions. Unfortunately, there's no one best way to detect anomalies across a variety of domains. Jonathan Merriman and Cynthia Freeman introduce a framework to determine the best anomaly detection method for the application based on time series characteristics.
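A rolling z-score detector is a simple baseline against which such a framework would compare alternatives; it works well on stable series but fails on seasonal or trending ones, which is exactly why method selection matters. A self-contained sketch on made-up KPI data:

```python
import numpy as np

def zscore_anomalies(series, window=20, threshold=3.0):
    """Flag points more than `threshold` rolling standard deviations from
    the rolling mean -- a simple baseline; the right detector depends on
    the series' characteristics (seasonality, trend, noise level)."""
    series = np.asarray(series, dtype=float)
    flags = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(series[i] - mu) > threshold * sigma:
            flags.append(i)
    return flags

kpi = [100.0, 101.0, 99.0] * 27      # a stable, oscillating business KPI
kpi[50] += 15.0                      # inject one anomalous spike
print(zscore_anomalies(kpi))         # → [50]
```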
Recent headline-grabbing data breaches demonstrate that protecting data is essential for every enterprise. The best-of-breed approach for big data is HDFS configured with Transparent Data Encryption (TDE). But TDE is difficult to configure and manage—particularly when run in Docker containers. Thomas Phelan discusses these challenges and explains how to overcome them.
Using a messaging system to build an event bus is very common. However, certain use cases demand a messaging system with a certain set of features. Sijie Guo and Penghui Li discuss the event bus requirements for Zhaopin.com, one of China's biggest online recruitment services providers, and explain why the company chose Apache Pulsar.
Financial services are increasingly deploying AI services for a wide range of applications, such as identifying fraud and financial crimes. Such deployment requires models to be interpretable, explainable, and resilient to adversarial attacks—regulatory requirements prohibit black-box machine learning models. Jari Koister shares tools and infrastructure he has developed to support these needs.
Fabian Hueske offers an overview of Apache Flink via the SQL interface, covering stream processing and Flink's various modes of use. Then you'll use Flink to run SQL queries on data streams and contrast this with the Flink DataStream API.
Terrorists live-stream their attacks, “Twitter wars” sell music albums and produce real-world casualties, and viral misinformation alters not just the result of battles but the very fate of nations. The result is that war, tech, and politics have blurred into a new kind of battle space that plays out on our smartphones. P. W. Singer explains.
Osman Sarood and Chunky Gupta discuss Mist’s real-time data pipeline, focusing on Live Aggregators (LA)—a highly reliable and scalable in-house real-time aggregation system that can autoscale for sudden changes in load. LA is 80% cheaper than competing streaming solutions due to running over AWS Spot Instances and having 70% CPU utilization.
Some people use digital devices to track their blood alcohol content (BAC). A BAC-tracking app that could anticipate when a person is likely to have a high BAC could offer coaching in a time of need. Kirstin Aschbacher shares a machine learning approach that predicts user BAC levels with good precision based on minimal information, thereby enabling targeted interventions.
The honeymoon era of data science is ending, and accountability is coming. Successful data science leaders must deliver measurable impact on an increasing share of an enterprise's KPIs. Joshua Poduska, Kimberly Shenk, and Mac Steele explain how leading organizations take a holistic approach to people, process, and technology to build a sustainable competitive advantage.
Malicious DNS traffic patterns are inconsistent and typically thwart anomaly detection. David Rodriguez explains how Cisco uses Apache Spark and Stripe’s Bayesian inference software, Rainier, to fit the underlying time series distribution for millions of domains and outlines techniques to identify artificial traffic volumes related to spam, malvertising, and botnets (masquerading traffic).
Developing applications that leverage machine learning is difficult. Practitioners need to be able to reproduce their model development pipelines, as well as deploy models and monitor their health in production. Corey Zumar offers an overview of MLflow, which simplifies this process by managing, reproducing, and operationalizing machine learning through a suite of model tracking and deployment APIs.
The US Census Bureau has been involved in record linkage projects for over 40 years. In that time, there's been a lot of change in computing capabilities and new techniques, and the Census Bureau is reviewing an inventory of linkage methodologies. Yves Thibaudeau describes the progress made so far in identifying specific record linkage techniques for specific applications.
How do you train a machine learning model with no training data? Michael Johnson and Norris Heintzelman share their journey implementing multiple solutions to bootstrapping training data in the NLP domain, covering topics including weak supervision, building an active learning framework, and annotation adjudication for named-entity recognition.
Noah Gift and Michelle Davenport explore exciting ideas in nutrition using data science; specifically, they analyze the detrimental relationship between sugar and longevity, obesity, and chronic diseases.
Evaluating machine learning models is surprisingly hard, particularly because these systems interact in very subtle ways. Ted Dunning breaks the problem of evaluation apart into operational and function evaluation, demonstrating how to do each without unnecessary pain and suffering. Along the way, he shares exciting visualization techniques that will help make differences strikingly apparent.
Kamil Bajda-Pawlikowski and Martin Traverso explore Presto's recently introduced cost-based optimizer, which must account for heterogeneous inputs with differing and often incomplete data statistics, and detail use cases for Presto across several industries. They also share recent Presto advancements, such as geospatial analytics at scale, and the project roadmap going forward.
Most existing big data systems prefer sequential scans for processing queries. Igor Canadi and Dhruba Borthakur challenge this view, offering an overview of converged indexing: a single system called Rockset that builds inverted, columnar, and document indices. Converged indexing is economically feasible due to the elasticity of cloud resources and write-optimized storage engines.
Jaipaul Agonus and Daniel Monteiro do Carmo Rosa detail big data analytics and visualization practices and tools used by FINRA to support machine learning and other surveillance activities that the Market Regulation Department conducts in the AWS cloud.
One widely accepted definition of AI is that it means going beyond simple statistics to mimic human skills in perception, learning, interaction, and decision making. Jed Dougherty tightens up this definition by sharing examples on a matrix that breaks down the different parts of that definition and how they might manifest themselves in data science projects at different levels.
What is serverless, and how can it be utilized for data analysis and AI? Avner Braverman outlines the benefits and limitations of serverless with respect to data transformation (ETL), AI inference and training, and real-time streaming. This is a technical talk, so expect demos and code.
Serverless implementation of core processing is quickly becoming a production-ready solution. However, companies with existing processing pipelines may find it hard to go completely serverless. Serverless workflows unite the serverless and cluster worlds, with the benefits of both approaches. Rustem Feyzkhanov demonstrates how serverless workflows change your perception of software architecture.
Sequencing cancer genomes has transformed how we diagnose and treat cancer, one of the deadliest diseases in America. Benjamin Glicksberg explains how coupling cancer genomic data with treatment data through the blockchain will empower patients and citizen scientists to rapidly advance cancer research.
Adam Famularo showcases erwin's combination of data management and data governance to produce actionable insights. Erwin customer Nasdaq then shares a real-world use case. You'll learn how to answer tough data questions, maintain a metadata landscape, and put data management and governance to work in your own organization.
Spark SQL is widely used, but it still suffers from stability and performance challenges in highly dynamic environments with large-scale data. Haifeng Chen shares a Spark adaptive execution engine built to address these challenges. It can handle task parallelism, join conversion, and data skew dynamically during runtime, using runtime statistics to choose better execution plans.
Yuan Zhou, Haodong Tang, and Jian Zhang offer an overview of Spark-PMOF and explain how it improves Spark analytics performance.
The Strata Data Awards recognize the most innovative startups, leaders, and data science projects from Strata sponsors and exhibitors around the world. Join us during keynotes for the announcement of the winners.
The journey to AI begins with data and making intelligent use of it. Dinesh Nirmal shares a strategic framework for streamlining your data assets, a framework that takes into account the current state of your existing data structures, the new technologies driving enterprise, the complexities of business processes, and at the foundation, the elements required in an AI-fluent data platform.
Brands that test the content of ads before they are shown to an audience can avoid spending resources on the 11% of ads that cause backlash. Using a survey experiment to choose the best ad typically improves effectiveness of marketing campaigns by 13% on average, and up to 37% for particular demographics. Patrick Miller explores data collection and statistical methods for analysis and reporting.
Most enterprises want the same flexibility and convenience they get in the public cloud, no matter where their data lives or their applications run. We've reached the point that the "enterprise data cloud" must span the firewall and the services offered by hyperscale vendors. Mike Olson describes the key capabilities that such a system requires and why hybrid and multicloud is the future.
Concerns are constantly being raised today about what data is appropriate to collect and how (or if) it should be analyzed. There are many ethical, privacy, and legal issues to consider, and no clear standards exist in many cases as to what is fair and what is foul. Bill Franks explores a variety of dilemmas and provides some guidance on how to approach them.
Josh Bersin explains how firms are transforming for the digital era, covering the death of the traditional organizational hierarchy, new models of leadership and management, changes in the way people learn and progress, new models of pay, and the importance of trust and transparency as a central business value.
Cloudera “drinks its own champagne”—running Cloudera on Cloudera. The company analyzes data from the edge and runs probabilistic models to tune its business processes with AI, from marketing, sales, and support to strategic planning. Amy O'Connor shares what Cloudera has learned from the edge to AI and explains how it's helping Cloudera and its customers get better at being data-driven.
The Netflix data platform is a massive-scale, cloud-only suite of tools and technologies. It includes big data tech (Spark and Flink), enabling services (federated metadata management), and machine learning support. But with power comes complexity. Kurt Brown explains how Netflix is working toward an easier, "self-service" data platform without sacrificing any enabling capabilities.
Rakesh Kumar and Thomas Weise explore how Lyft dynamically prices its rides with a combination of various data sources, ML models, and streaming infrastructure for low latency, reliability, and scalability—allowing the pricing system to be more adaptable to real-world changes.
Stephen Dantu shares insurance broker Marsh’s pioneering journey into the public cloud and explains why this move was necessary to unleash new opportunities and future-proof the company.
User-based real-time recommendation systems have become an important topic in ecommerce. Lu Wang, Nicole Kong, Guoqiong Song, and Maneesha Bhalla demonstrate how to build deep learning algorithms using Analytics Zoo with BigDL on Apache Spark and create an end-to-end system to serve real-time product recommendations.
Thanks to the rapid growth in data resources, business leaders now appreciate the importance (and the challenge) of mining information from data. Join in as a group of LinkedIn's data scientists share their experiences successfully leveraging emerging techniques to assist in intelligent decision making.
Academic research has been plagued by a reproducibility crisis in fields ranging from medicine to psychology. Stuart Buck explains how to take precautions in your data analysis and experiments so as to avoid those reproducibility problems.
In the analysis of the mobile world, everyone starts with the question, "Where?" SK Telecom aims to meet this need. Kyungtaak Noh explains how the company provides geospatial analysis by processing geospatial data through Druid with Lucene.
As the popularity and utilization of Apache Impala deployments increases, clusters often become victims of their own success when demand for resources exceeds the supply. Tim Armstrong dives into the latest resource management features in Impala to maintain high cluster availability and optimal performance and provides examples of how to configure them in your Impala deployment.