Mahesh Goud shares success stories using Ticketmaster's large-scale contextual bandit platform for SEM, which determines the optimal keyword bids under evolving keyword contexts to meet different business requirements, and explores Ticketmaster's streaming pipeline, consisting of Storm, Kafka, HBase, the ELK Stack, and Spring Boot.
Mark Donsky, André Araujo, Michael Yoder, and Manish Ahluwalia walk you through securing a Hadoop cluster. You’ll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.
Kevin Mao explores the value of and challenges associated with collecting raw security event data from disparate corners of enterprise infrastructure and transforming them into high-quality intelligence that can be used to forecast, detect, and mitigate cybersecurity threats.
Jamie Grier outlines the latest important features in Apache Flink and walks you through building a working demo to show these features off. Topics include queryable state, dynamic scaling, streaming SQL, very large state support, and whatever is the latest and greatest in March 2017.
Apache Kylin, which started as a big data OLAP engine, is reaching its v2.0. Yang Li explains how, armed with snowflake schema support, a full SQL interface, spark cubing, and the ability to consume real-time streaming data, Apache Kylin is closing the gap to becoming a real-time data warehouse.
Daphne Koller explains how Coursera is using large-scale data processing and machine learning in online education. Building on Coursera's wealth of online learning data, Daphne discusses the role of automation in scaling access to education that is personalized and efficient at connecting people with skills and knowledge throughout their lives.
Joseph Blue and Carol Mcdonald walk you through a reference application that processes ECG data encoding using HL7 with a modern anomaly detector, demonstrating how combining visualization and alerting enables healthcare professionals to improve outcomes and reduce costs and sharing lessons learned from their experience dealing with real data in real medical situations.
Using Entity 360 as an example, Jonathan Seidman, Ted Malaska, Mark Grover, and Gwen Shapira explain how to architect a modern, real-time big data platform leveraging recent advancements in the open source software world, using components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics.
Eric Richardson explains how ACS used Hadoop, HBase, Spark, Kafka, and Solr to create a hybrid cloud enterprise data hub that scales without drama and drives adoption by ease of use, covering the architecture, technologies used, the challenges faced and defeated, and problems yet to solve.
Launched in late 2015, Astellas's enterprise data lake project is taking the company on a data governance journey. Kishore Papineni offers an overview of the project, providing insights into some of the business pain points and key drivers, how it has led to organizational change, and the best practices associated with Astellas's new data governance process.
Data helps us understand our market in new and novel ways. In today's world, sifting through the noise in modern journalism means navigating enormous amounts of data, news, and tweets. Tom Reilly and Khalid Al-Kofahi explain how Thomson Reuters is leveraging big data and machine learning to chase down leads, verify sources, and determine what's newsworthy.
Sean Suchter and Shekhar Gupta describe the use of very fine-grained performance data from many Hadoop clusters to build a model predicting excessive swapping events.
GoDaddy ingests and analyzes 100,000 EPS of logs, metrics, and events each day. Felix Gorodishter shares GoDaddy's big data journey and explains how the company makes sense of 10+-TB-per-day growth for operational insights of its cloud leveraging Kafka, Hadoop, Spark, Pig, Hive, Cassandra, and Elasticsearch.
Big data needs governance. Governance empowers data scientists to find, trust, and use data on their own, yet it can be overwhelming to know where to start—especially if your big data environment spans beyond your enterprise to the cloud. Mark Donsky and Sudhanshu Arora share a step-by-step approach to kick-start your big data governance initiatives.
Evangelos Simoudis explores how data generated in and around increasingly autonomous vehicles and by on-demand mobility services will enable the development of new transportation experiences and solutions for a diverse set of industries and government types.
Vijay Narayanan takes you on an inspiring journey exploring how the cloud, data, and artificial intelligence are powering and accelerating the genomic revolution—saving and changing lives in the process.
When building your data stack, architecture could be your biggest challenge—yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin when assembling a scalable data architecture? Ben Sharma shares real-world lessons and best practices to get you started.
Eric Anderson and Shyam Konda explain how the IT team at Beachbody—the makers of P90X and CIZE—successfully ingested all their enterprise data into Amazon S3 and delivered self-service access in less than six months with Talend.
Warren Reed (US Treasury’s Office of Financial Research)
Warren Reed explains how he and his team at the US Treasury’s Office of Financial Research leverage data visualization techniques to build interactive data products for risk measurement and monitoring.
Apache DistributedLog (incubating) is a low-latency, high-throughput replicated log service. Sijie Guo shares how Twitter has used DistributedLog as the real-time data foundation in production for years, supporting services like distributed databases, pub-sub messaging, and real-time stream computing and delivering more than 1.5 trillion (17 PB) events per day.
The cloud is becoming pervasive, but it isn’t always full of rainbows. Defining a strategy that works for your company or for your use cases is critical to ensuring success. Jim Scott explores different use cases that may be best run in the cloud versus on-premises, points out opportunities to optimize cost and operational benefits, and explains how to get the data moved between locations.
Artificial intelligence will accelerate both cancer research and the development of autonomous vehicles. Jason Waxman explains why the ultimate potential of AI will be realized through its societal benefits and positive impact on our world. Collaboration between industry, government, and academia are required to drive this societal innovation and deliver the scale and promise of AI to everyone.
Many iterative machine-learning algorithms can only operate efficiently when a large matrix of training data fits in the main memory. Frederick Reiss and Arvind Surve offer an overview of compressed linear algebra, a technique for compressing training data and performing key operations in the compressed domain that lets you build models over big data with small machines.
Data to the rescue. Desi Matel-Anderson offers an immersive deep dive into the world of the Field Innovation Team, who routinely find themselves on the frontier of disasters working closely with data to save lives, at times while risking their own.
Dirk Jungnickel explains how Dubai-based telco leader du leverages big data to create smart cities and enable location-based data monetization, covering business objectives and outcomes and addressing technical and analytical challenges.
Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging than on traditional distributed systems. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Organizations need a model to measure how effectively they are using data and analytics. Once they know where they are and where they need to go, they then need a framework to determine the economic value of their data. William Schmarzo explores techniques for getting business users to “think like a data scientist” so they can assist in identifying data that makes the best performance predictors.
Big data and data science have great potential for accelerating business, but how do you reconcile the business opportunity with the sea of possible technologies? Data should serve the strategic imperatives of a business—those aspirations that will define an organization’s future vision. Scott Kurth and Edd Wilder-James explain how to create a modern data strategy that powers data-driven business.
Docker makes it easy to bundle an application with its dependencies and provide full isolation, and YARN now supports Docker as an execution engine for submitted applications. Daniel Templeton explains how YARN's Docker support works, why you'd want to use it, and when you shouldn't.
It is no surprise that reducing operational IT expenditures and increasing software capabilities is a top priority for large enterprises. Given its advantages, open source software has proliferated across the globe. Ron Bodkin explains how Teradata drives open source adoption inside enterprises using open source data management and AI techniques leveraged across the analytical ecosystem.
The IoT is driven by outcomes delivered by applications, but to gain operational efficiency, many organizations are looking toward a horizontal platform for delivering and supporting a number of applications. Teresa Tung explores how to choose and implement a platform—and deal with the fact that the platform is horizontal and application outcomes are vertical.
Twitter processes billions of events per day the instant the data is generated using Heron, an open source streaming engine tailored for large-scale environments. Bill Graham, Avrilia Floratau, and Ashvin Agrawal explore the techniques Heron uses to elastically scale resources in order to handle highly varying loads without sacrificing real-time performance or user experience.
Uber relies on making data-driven decisions at every level, and most of these decisions can benefit from faster data processing. Vinoth Chandar and Prasanna Rajaperumal introduce Hoodie, a newly open sourced system at Uber that adds new incremental processing primitives to existing Hadoop technologies to provide near-real-time data at 10x reduced cost.
The aviation industry is facing a huge pressure in costs as well as a profound disruption in marketing and service. With ticket revenues dropping, increasing customer loyalty is key. Andreas Ribbrock explains how Lufthansa German Airlines uses data science and data-driven decision making to create the next level of digital customer experience along the full customer journey.
Although deep learning has proved to be very powerful, few results are reported on its application to business-focused problems. Feng Zhu and Val Fontama explore how Microsoft built a deep learning-based churn predictive model and demonstrate how to explain the predictions using LIME—a novel algorithm published in KDD 2016—to make the black box models more transparent and accessible.
Over the course of just six years, Pinterest has helped over 100 million pinners discover and collect over 75+ billion ideas to plan their everyday lives. Romit Jadhwani walks you through the different phases of this hypergrowth journey and explores the focuses, thought processes, and decisions of Pinterest’s data team as they scaled and enabled this growth.
Just like any six-year-old, Apache Spark does not always do its job and can be hard to understand. Yin Huai looks at the top causes of job failures customers encountered in production and examines ways to mitigate such problems by modifying Spark. He also shares a methodology for improving resilience: a combination of monitoring and debugging techniques for users.
Dwai Lahiri explains how to leverage private cloud infrastructure to successfully build Hadoop clusters and outlines dos, don'ts, and gotchas for running Hadoop on private clouds.
Lloyd Palum explores the importance of identifying the target business value in an IIoT application—a prerequisite to justifying a return on technology investment—and explains how to deliver that value using the concept of a “digital twin.”
Maya Shankar (White House Office of Science & Technology Policy)
Maya Shankar discusses the motivation for and impact of the White House Social and Behavioral Sciences Team and shares lessons learned building a startup within the federal government.
Pokémon GO was one of the fastest-growing games of all time, becoming a worldwide phenomenon in a matter of days. In conversation with Beau Cronin, Phil Keslin, CTO of Niantic, explains how the engineering team prepared for—and just barely survived—the experience.
Real-world data is incomplete and imperfect. The right way to handle it is with Bayesian inference. Michael Williams demonstrates how probabilistic programming languages hide the gory details of this elegant but potentially tricky approach, making a powerful statistical method easy and enabling rapid iteration and new kinds of data-driven products.
Estimating the growth rate of tumors is a very important but very expensive and time-consuming part of diagnosing and treating breast cancer. Michael Dusenberry and Frederick Reiss describe how to use deep learning with Apache Spark and Apache SystemML to automate this critical image classification task.
Rob Craft shares some of the ways machine learning is being used inside of Google, explores cloud-based neural networks, and discusses some customer use cases.
Which is more important: the model or the data? Dinesh Nirmal explains how your data can help you build the right cognitive systems to learn about, reason with, and engage with your customers.
Rodrigo Fontecilla explains how many of the largest airlines use different classes of machine-learning algorithms to create robust and reusable predictive models to provide a holistic view of operations and provide business value.
Eric Frenkiel explains how to use real-time data as a vehicle for operationalizing machine-learning models by leveraging MemSQL, exploring advanced tools, including TensorFlow, Apache Spark, and Apache Kafka, and compelling use cases demonstrating the power of machine learning to effect positive change.
Not all data science problems are big data problems. Lots of small and medium product companies want to start their journey to become data driven. Nischal HP and Raghotham Sripadraj share their experience building data science platforms for various enterprises, with an emphasis on making the right architecture choices and using distributed and fault-tolerant tools.
Let’s stop talking about bad robots and start talking about what makes a robot good. A good or ethical robot must be carefully designed. Andra Keay outlines five principles of good robot design and discusses the implications of implicit bias in our robots.
Dustin Cote and Ryan Pridgeon share their experience troubleshooting Apache Kafka in production environments and discuss how to avoid pitfalls like message loss or performance degradation in your environment.
Cesar Berho and Alan Ross offer an overview of open source project Apache Spot (incubating), which delivers next-generation cybersecurity analytics architecture through unsupervised learning using machine-learning techniques at cloud scale for anomaly detection.
Security will always be very important in the world of big data, but the choices today mostly start with Kerberos. Does that mean setting up security is always going to be painful? What if your company standardizes on other security alternatives? What if you want to have the freedom to decide what security type to support? Yuliya Feldman and Bill ODonnell discuss your options.
Chao Zhong offers an overview of a new predictive model for customer lifetime value (LTV) in a cloud-computing business. This model is also the first known application of the Fader RFM approach to a cloud business—a Bayesian approach that predicts a customer's LTV with a symmetric absolute percentage error (SAPE) of only 3% on an out-of-time testing dataset.
Teradata joined the Presto community in 2015 and is now a leading contributor to this open source SQL engine, originally created by Facebook. Join Kamil Bajda-Pawlikowski to learn about Presto, Teradata's recent enhancements in query performance, security integrations, and ANSI SQL coverage, and its roadmap for 2017 and beyond.
James Bradbury offers an overview of PyTorch, a brand-new deep learning framework from developers at Facebook AI Research that's intended to be faster, easier, and more flexible than alternatives like TensorFlow. James makes the case for PyTorch, focusing on the library's advantages for natural language processing and reinforcement learning.
Keynote with Michael I. Jordan
Sridhar Alla and Shekhar Agrawal explain how Comcast built the largest Kudu cluster in the world (scaling to PBs of storage) and explore the new kinds of analytics being performed there, including real-time processing of 1 trillion events and joining multiple reference datasets on demand.
Jagane Sundar shares a strongly consistent replication service for replicating between cloud object stores, HDFS, NFS, and other S3- and Hadoop-compatible filesystems.
Food production and preparation have always been labor and capital intensive, but with the internet of things, low-cost sensors, cloud-computing ubiquity, and big data analysis, farmers and chefs are being replaced with connected, big data robots—not just in the field but also in your kitchen. Tim Gasper explores the tech stack, data science techniques, and use cases driving this revolution.
Many hospitals combine early warning systems with rapid response teams (RRT) to detect patient decline and respond with elevated care. Predictive models can minimize RRT events by identifying at-risk patients, but modeling is difficult because events are rare and features are varied. Emily Spahn explores the creation of one such patient-risk model and shares lessons learned along the way.
David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, and Elasticsearch; data science components include spaCy, custom annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.
Data analysts, data scientists, and data engineers are already working on teams delivering insight and analysis, but how do you get the team to support experimentation and insight delivery without ending up in an IT versus data engineer versus data scientist war? Christopher Bergh and Gil Benghiat present the seven shocking steps to get these groups of people working together.
Data warehouses are critical in driving business decisions—with SQL dominantly used to build ETL pipelines. While the technology has shifted from using RDBMS-centric data warehouses to data pipelines based on Hadoop and MPP databases, engineering and quality processes have not kept pace. Avinash Padmanabhan highlights the changes that Intuit's team made to improve processes and data quality.
Sparklyr makes it easy and practical to analyze big data with R—you can filter and aggregate Spark DataFrames to bring data into R for analysis and visualization and use R to orchestrate distributed machine learning in Spark using Spark ML and H2O SparkingWater. Edgar Ruiz walks you through these features and demonstrates how to use sparklyr to create R functions that access the full Spark API.
Gwen Shapira and Bob Lehmann share their experience and patterns building a cross-data-center streaming data platform for Monsanto. Learn how to facilitate your move to the cloud while "keeping the lights on" for legacy applications. In addition to integrating private and cloud data centers, you'll discover how to establish a solid foundation for a transition from batch to stream processing.
In 2016, digital advertising overtook TV in spend, requiring companies to cut through the noise to reach their audience. Manny Puentes explains how Rebel AI decides which ads to serve across devices and how it delivers multidimension reporting in milliseconds.
Ted Dunning offers an overview of tensor computing—covering, in practical terms, the high-level principles behind tensor computing systems—and explains how it can be put to good use in a variety of settings beyond training deep neural networks (the most common use case).
Tony Xing offers an overview of Microsoft's common anomaly detection platform, an API service built internally to provide product teams the flexibility to plug in any anomaly detection algorithms to fit their own signal types.
When there is a strong signal in a large dataset, many machine-learning algorithms will find it. On the other hand, when the effect is weak and the data is large, there are many ways to discover an effect that is in fact nothing more than noise. Robert Grossman shares best practices so that you will not be accused of p-hacking.
Recently, the volume of data collected from farmers' fields via sensors, rovers, drones, in-cabin technologies, and other sources has forced Monsanto to rethink its geospatial processing capabilities. Naghman Waheed and Martin Mendez-Costabel explain how Monsanto built a scalable geospatial platform using cloud and open source technologies.
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, such as RDMA, SSDs, and nonvolatile memory.
Data is powering a machine-learning renaissance. Understanding our data helps save lives, secure our personal and business information, and engage our customers with better relevance. However, as Mike Olson explains, without big data and a platform to manage big data, machine learning and artificial intelligence just don’t work.
The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.
The goal of RCSA's Scialog conferences is to foster collaboration between scientists with different specialties and approaches, and, working with Datascope, the company has been doing so in a quantitative way for the last six years. Brian Lange discusses how Datasope and RCSA arrived at the problem, the design choices made in the survey and optimization, and how the results were visualized.
How do we know that an advertisement or promotion truly drives incremental revenue? Michelangelo D'Agostino and Bill Lattner share their experience developing machine-learning techniques for predicting treatment responsiveness from randomized controlled experiments and explore the use of these “persuasion” models at scale in politics, social good, and marketing.
Transport for London (TfL) and its partners have been working together on broader integration projects focused on getting the most efficient use out of road networks and public transport. Roland Major explains how TfL brings together a wide range of data from multiple disconnected systems for operational purposes while also making more of them open and available, all in real time.
Marcel Kornacker and Mostafa Mokhtar help simplify the process of making good SQL-on-Hadoop decisions and cover top performance optimizations for Apache Impala (incubating), from schema design and memory optimization to query tuning.
The internet of things is turning the internet upside down, and the effects are causing all kinds of problems. We have to answer questions about how to have data where we want it and computation where we need it—and we have to coordinate and control all of this while maintaining visibility and security. Ted Dunning shares solutions for this problem from across multiple industries and businesses.
Peng Du and Randy Wei offer an overview of Uber’s data science workbench, which provides a central platform for data scientists to perform interactive data analysis through notebooks, share and collaborate on scripts, and publish results to dashboards and is seamlessly integrated with other Uber services, providing convenient features such as task scheduling, model publishing, and job monitoring.
Unbounded, out-of-order, global-scale data is now the norm. Even for the same computation, each use case entails its own balance between completeness, latency, and cost. Kenneth Knowles shows how Apache Beam gives you control over this balance in a unified programming model that is portable to any Beam runner, including Apache Spark, Apache Flink, and Google Cloud Dataflow.
Wee Hyong Tok and Danielle Dean explain how the global, trusted, and hybrid Microsoft platform can enable you to do intelligence at scale, describing real-life applications where big data, the cloud, and AI are making a difference and how this is accelerating the digital transformation for these organizations at a lighting pace.
Join in to learn how to do scalable, end-to-end data science in R on single machines as well as on Spark clusters. You'll be assigned an individual Spark cluster with all contents preloaded and software installed and use it to gain experience building, operationalizing, and consuming machine-learning models using distributed functions in R.
Justin Murray outlines the benefits of virtualizing Hadoop and Spark, covering the main architectural approaches at a technical level and demonstrating how the core Hadoop architecture maps into virtual machines and how those relate to physical servers. You'll gain a set of design approaches and best practices to make your application infrastructure fit well with the virtualization layer.
Seemingly harmless choices in visualization design and content selection can distort your data and lead to false conclusions. Aneesh Karve presents a quantitative framework for identifying and overcoming distortions by applying recent research in algebraic visualization.
Life doesn’t happen in batches. Being able to work with data from continuous events as data streams is a better fit to the way life happens, but doing so presents some challenges. Ellen Friedman examines the advantages and issues involved in working with streaming data, takes a look at emerging technologies for streaming, and describes best practices for this style of work.
Zillow pioneered providing access to unprecedented information about the housing market. Long gone are the days when you needed an agent to get comparables and prior sale and listing data. And with more data, data science has enabled more use cases. Jasjeet Thind explains how Zillow uses Spark and machine learning to transform real estate.