Leading companies that are getting the most out of their data are not focusing on queries and data lakes; they are actively integrating analytics into their operations. Jack Norris reviews three customer case studies in ad/media, financial services, and healthcare to show how a focus on real-time data streams can transform the development, deployment, and future agility of applications.
Many Hadoop clusters lack even basic security controls. Michael Yoder, Ben Spivey, Mark Donsky, and Mubashir Kazia walk you through securing a Hadoop cluster. You'll start with a cluster with no security and then add security features related to authentication, authorization, encryption of data at rest, encryption of data in transit, and complete data governance.
FINRA ingests over 50 billion records of stock market trading data daily into multipetabyte databases. Janaki Parameswaran and Kishore Ramachandran explain how FINRA technology integrates data feeds from disparate systems to provide analytics and visuals for regulating equities, options, and fixed-income markets.
Shirshanka Das and Yael Garten describe how LinkedIn redesigned its data analytics ecosystem in the face of a significant product rewrite, covering the infrastructure changes, such as client-side activity tracking, a unified reporting platform, and data virtualization techniques to simplify migration, that enable LinkedIn to roll out future product innovations with minimal downstream impact.
Li Li and Hao Hao detail the architecture of Apache Sentry + RecordService for Hadoop in the cloud, which provides unified, fine-grained authorization via role- and attribute-based access control, and explain how to use the two together to protect sensitive data in a multitenant cloud across the Hadoop ecosystem.
Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications.
Siva Raghupathy demonstrates how to use Hadoop innovations in conjunction with Amazon Web Services (cloud) innovations.
While other industries have embraced the digital era, healthcare is still playing catch-up. Kaiser Permanente has been a leader in healthcare technology and first started using computing to improve healthcare results in the 1960s. Taposh Roy, Rajiv Synghal, and Sabrina Dahlgren offer an overview of Kaiser’s big data strategy and explain how other organizations can adopt similar strategies.
The IoT is fundamentally transforming industries and reconfiguring the technology landscape, but enterprises still face challenges in realizing the value from this next wave of information and opportunity. Cheryl Wiebe explores how leading companies harness the IoT by putting IoT data in context, fostering collaboration between IT and OT, and enabling a new breed of scalable analytics.
Praveen Murugesan explains how Uber leverages Hadoop and Spark as the cornerstones of its data infrastructure. Praveen details the current data architecture at Uber and outlines some of the unique challenges with data processing Uber faced as well as its approach to solving some key issues in order to continue to power Uber's real-time marketplace.
Drawing on his experiences across 150+ production deployments, Neelesh Srinivas Salian focuses on five common issues observed in a cluster environment setup with Apache Spark (Core, Streaming, and SQL) to help you improve the usability and supportability of Apache Spark and avoid such issues in future deployments.
At Strata + Hadoop World 2012, Amy O'Connor and her daughter Danielle Dean shared how they learned and built data science skills at Nokia. This year, Amy and Danielle explore how the landscape in the world of data science has changed in the past four years and explain how to be successful deriving value from data today.
Data has long stopped being structured and flat, but the results of our analysis are still rendered as flat bar charts and scatter plots. We live in a 3D world, and we need to be able to enable data interaction from all perspectives. Robert Thomas offers an overview of Immersive Visualization—integrated with notebooks and powered by Spark—which helps bring insights to life.
When building your data stack, the architecture could be your biggest challenge. Yet it could also be the best predictor for success. With so many elements to consider and no proven playbook, where do you begin to assemble best practices for a scalable data architecture? Ben Sharma offers lessons learned from the field to get you started.
The need to quickly acquire, process, prepare, store, and analyze data has never been greater. The need for performance crosses the big data ecosystem too—from the edge to the server to the analytics software, speed matters. Raghunath Nambiar shares a few use cases that have had significant organizational impact where performance was key.
Narasimhan Sampath and Avinash Ramineni share how Choice Hotels International used Spark Streaming, Kafka, Spark, and Spark SQL to create an advanced analytics platform that enables business users to be self-reliant by accessing the data they need from a variety of sources to generate customer insights and property dashboards and enable data-driven decisions with minimal IT engagement.
Ben Sharma uses popular cloud-based use cases to explore how to effectively and safely leverage big data in the cloud to achieve business goals. Now is the time to get the jump on this trend before your competition gets the upper hand.
The power of artificial intelligence and advanced analytics emerges from the ability to analyze and compute large, disparate datasets from varied devices and locations, such as predictive medicine and automated cars, at lightning-fast speed. Martin Hall explains why collaboration and openness are the key elements driving innovation in AI.
Will machine learning give us better eyesight? Join Joseph Sirosh for a surprising story about how machine learning, population data, and the cloud are coming together to fundamentally reimagine eye care in one of the world’s most populous countries, India.
Whether we're talking about spam emails, merging records, or investigating clusters, there are many times when having a measure of how alike things are makes them easier to work with (e.g., with unstructured data that isn't incorporated into your data models). Melissa Santos offers a practical approach to creating a distance metric and validating with business owners that it provides value.
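As a rough illustration of what a distance metric over unstructured records can look like, the sketch below uses Jaccard distance on word tokens. This is an assumption chosen for simplicity, not necessarily the approach presented in the session; the function names and sample strings are hypothetical.

```python
def tokens(text):
    """Lowercase a string and split it into a set of word tokens."""
    return set(text.lower().split())


def jaccard_distance(a, b):
    """Return 1 - |A & B| / |A | B| for the token sets of two strings.

    0.0 means identical token sets; 1.0 means no tokens in common.
    """
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # two empty records are treated as identical
    return 1.0 - len(ta & tb) / len(ta | tb)


# Hypothetical records: two share most of their tokens, one is unrelated.
same = jaccard_distance("Acme Corp Seattle", "acme corp seattle")
near = jaccard_distance("Acme Corp Seattle", "Acme Corp Portland")
far = jaccard_distance("Acme Corp Seattle", "Globex Springfield")
print(same, near, far)
```

A metric like this is easy to validate with business owners: show them pairs ranked by distance and ask whether the ordering matches their sense of "alike."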
The road to a data-driven business is paved with hard-won lessons, painful mistakes, and clever insights. We're introducing a new Tutorial Day track packed with case studies, where you can hear from practitioners across a wide range of industries.
Jeff Carpenter describes how data modeling can be a key enabler of microservice architectures for transactional and analytics systems, including service identification, schema design, and event streaming.
Uma Raghavan explains why you're about to see companies whose business models depend on using their customers' data, like Facebook, Google, and many others, scramble to keep up with the flood of new and evolving laws on data privacy.
Data science has always been a focus at eHarmony, but recently more business units have needed data-driven models. Jonathan Morra introduces Aloha, an open source project that allows the modeling group to quickly deploy type-safe accurate models to production, and explores how eHarmony creates models with Apache Spark and how it uses them.
Keynote by DJ Patil and Lynn Overmann
During election season, we’re tasked with considering the next four years and comparing platforms across candidates. What’s good for the country is good for your data. Consider what the next four years will look like for your organization. How will you lower costs and deliver innovation? Jack Norris reviews the requirements for a winning data platform, such as speed, scale, and agility.
Swisscom, the leading mobile service provider in Switzerland, also provides data-driven intelligence through the analysis of its mobile network. Its Mobility Insights team works to help administrators understand the flow of people through their location of interest. François Garillot explores the platform, tooling, and choices that help achieve this service and some challenges the team has faced.
Geospatial analysis can provide deep insights into many datasets. Unfortunately the key tools to unlocking these insights—geospatial statistics, machine learning, and meaningful cartography—remain inaccessible to nontechnical audiences. Stuart Lynn and Andy Eschbacher explore the design challenges in making these tools accessible and integrated in an intuitive location intelligence platform.
There’s been much discussion on open source versus commercial; CIOs and CTOs are increasingly interested in solutions that blend the benefits of both worlds. Ron Bodkin explains how Teradata drives open source adoption inside enterprises through a range of initiatives: direct contributions to open source projects, building orchestration software, and providing technical expertise.
Adam Bordelon and Mohit Soni demonstrate how projects like Apache Myriad (incubating) can install Hadoop on Mesosphere DC/OS alongside other data center-scale applications, enabling efficient resource sharing and isolation across a variety of distributed applications and breaking down silos.
In the realm of predictive maintenance, the event of interest is an equipment failure. In real scenarios, this is usually a rare event. Unless the data collection has been taking place over a long period of time, the data will have very few of these events or, in the worst case, none at all. Danielle Dean and Shaheen Gauher discuss the various ways of building and evaluating models for such data.
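To see why rare failure events complicate model evaluation, consider the minimal sketch below. The data is synthetic and the two "models" are hypothetical stand-ins, not anything from the session; the point is only that accuracy can look excellent while the model never detects a failure.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = failure)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Synthetic data: 5 failures among 1,000 observations.
y_true = [1] * 5 + [0] * 995

# Hypothetical model A never predicts a failure.
always_healthy = evaluate(y_true, [0] * 1000)

# Hypothetical model B catches 4 of 5 failures at the cost of 10 false alarms.
detector = evaluate(y_true, [1] * 4 + [0] * 1 + [1] * 10 + [0] * 985)

print(always_healthy)
print(detector)
```

Model A scores higher on accuracy than model B even though it is useless for maintenance, which is why precision and recall (or similar metrics) are usually preferred for rare-event data.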
Modern cars produce data. Lots of data. And Formula 1 cars produce more than their fair share. Ted Dunning presents a demo of how data streaming can be applied to the analytics problems posed by modern motorsports. Although he won't be bringing Formula 1 cars to the talk, Ted demonstrates a physics-based simulator to analyze realistic data from simulated cars.
Picking the best data format depends on what kind of data you have and how you plan to use it. Owen O'Malley outlines the performance differences between formats in different use cases and offers an overview of the advantages and disadvantages of each to help you improve the performance of your applications.
Data should be something you can see, feel, hear, taste, and touch. Drawing on real-world examples, Cameron Turner, Brad Sarsfield, Hanna Kang-Brown, and Evan Macmillan cover the emerging field of sensory data visualization, including data sonification, and explain where it's headed in the future.
Gary Marcus explores the gap between what machines do well and what people do well and what needs to happen before machines can match the flexibility and power of human cognition.
Chad W. Jennings demonstrates the power of BigQuery through an exciting demo and announces several new features that will make BigQuery a better home for your enterprise big data workloads.
Bas Geerdink offers an overview of the evolution that the Hadoop ecosystem has taken at ING. Since 2013, ING has invested heavily in a central data lake and data management practice. Bas shares historical lessons and best practices for enterprises that are incorporating Hadoop into their infrastructure landscape.
Jonathan Seidman, Gwen Shapira, Mark Grover, and Ted Malaska demonstrate how to architect a modern, real-time big data platform and explain how to leverage components like Kafka, Impala, Kudu, Spark Streaming, and Spark SQL with Hadoop to enable new forms of data processing and analytics such as real-time ETL, change data capture, and machine learning.
Cloudera CEO Tom Reilly and James Powell, global CTO of Nielsen, discuss the dynamics of Hadoop in the cloud, what to consider at the start of the journey, and how to implement a solution that delivers flexibility while meeting key enterprise requirements.
Kaushik Deka and Phil Jarymiszyn discuss the benefits of a Spark-based feature store, a library of reusable features that allows data scientists to solve business problems across the enterprise. Kaushik and Phil outline three challenges they faced—semantic data integration within a data lake, high-performance feature engineering, and metadata governance—and explain how they overcame them.
The need to find efficiencies in healthcare is becoming paramount as our society and the global population continue to grow and live longer. Navdeep Alam shares his experience and reviews current and emerging technologies in the marketplace that handle working with unbounded, de-identified patient datasets in the billions of rows in an efficient and scalable way.
Yaron Haviv explains how to design real-time IoT and FSI applications, leveraging Spark with advanced data frame acceleration. Yaron then presents a detailed, practical use case, diving deep into the architectural paradigm shift that makes the powerful processing of millions of events both efficient and simple to program.
When Hollywood portrays artificial intelligence, it's either a demon or a savior. But the reality is that AI is far more likely to be an extension of ourselves. Strata program chair Alistair Croll looks at the sometimes surprising ways that machine learning is insinuating itself into our everyday lives.
Many initiatives for running applications inside containers have been scoped to run on a single host. Using Docker containers for large-scale environments poses new challenges, especially for big data applications like Hadoop. Thomas Phelan shares lessons learned and some tips and tricks on how to Dockerize your big data applications in a reliable, scalable, and high-performance environment.
Many areas of applied machine learning require models optimized for rare occurrences, such as class imbalances, and users actively attempting to subvert the system (adversaries). Brendan Herger offers an overview of multiple published techniques that specifically attempt to address these issues and discusses lessons learned by the Data Innovation Lab at Capital One.
Though visualization is used in data science to understand the shape of the data, it's not widely used for statistical models, which are evaluated based on numerical summaries. Amit Kapoor explores model visualization, which aids in understanding the shape of the model, the impact of parameters and input data on the model, the fit of the model, and where it can be improved.
Data, your most precious commodity, is increasing at an alarming rate. At the same time, an emerging business imperative has made this data a component of your deepest insights, allowing you to focus on your business outcomes. Patricia Florissi explains why the recent formation of Dell EMC ensures that your analytics capabilities will be stronger than ever.
You need to think about financial data differently to solve the most pressing challenges for your organization. But how do you get to data-driven finance and risk using unstructured data? Michelle Bonat explains how unstructured data (words and text) can be used to solve critical challenges in finance and unlock opportunities.
Netflix is exploring new avenues for data processing where traditional approaches fail to scale. Ryan Blue explains how Netflix is building on Parquet to enhance its 40+ petabyte warehouse, combining Parquet's features with Presto and Spark to boost ETL and interactive queries. Information about tuning Parquet is hard to find. Ryan shares what he's learned, creating the missing guide you need.
Sridhar Alla and Kiran Muglurmath explain how real-time analytics on Comcast Xfinity set-top boxes (STBs) help drive several customer-facing and internal data-science-oriented applications and how Comcast uses Kudu to fill the gaps in batch and real-time storage and computation needs, allowing Comcast to process the high-speed data without the elaborate solutions needed until now.
Hadoop and its ecosystem have changed analytics profoundly. Paul Kent offers an overview of SAS's participation in open platforms and introduces SAS Viya, a new unified and open analytics architecture that lets you scale analytics in the cloud and code as you choose.
With Apache Kafka 0.9, the community has introduced a number of features to make data streams secure. Jun Rao explains the motivation for making these changes, discusses the design of Kafka security, and demonstrates how to secure a Kafka cluster. Jun also covers common pitfalls in securing Kafka and talks about ongoing security work.
David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, Titan, and Elasticsearch; data science components include custom UIMA annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.
Data scientists use statistics to reach meaningful conclusions about data. Unfortunately, statistical tools are often misapplied, resulting in errors that cost both time and money. Deborah Berebichez presents examples of egregious misuses of statistics in business, technology, science, and the media and outlines the simple steps that can reduce the chance of being fooled by statistics.
Moty Fania shares Intel’s IT experience implementing an on-premises IoT platform for internal use cases. The platform was designed as a multitenant platform with built-in analytical capabilities and based on open source big data technologies and containers. Moty highlights the lessons learned from this journey with a thorough review of the platform’s architecture.
In pursuit of speed, big data is evolving toward columnar execution. The solid foundation laid by Arrow and Parquet for a shared columnar representation across the ecosystem promises a great future. Julien Le Dem and Jacques Nadeau discuss the future of columnar and the hardware trends it takes advantage of, like RDMA, SSDs, and nonvolatile memory.
The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.
Since their inception, big data solutions have been best known for their ability to master the complexity of the volume, variety, and velocity of data. But as we enter the era of data democratization, there’s a new set of concerns to consider. Mike Olson discusses the new dynamics of big data and how a renewed approach focused on where, who, and why can lead to cutting-edge solutions.
How are users meant to interpret the influence of big data and personalization in their targeted experiences? What signals do we have to show us how our data is used, how it improves or constrains our experience? Sara Watson explains that in order to develop normative opinions to shape policy and practice, users need means to guide their experience—the personalization spectrum.
The Panama Papers investigation revealed the offshore holdings and connections of dozens of politicians and prominent public figures around the world and led to high-profile resignations, police raids, and official investigations. Almost 500 journalists had to sift through 2.6 terabytes of data—the biggest leak in the history of journalism. Mar Cabra explains how technology made it all possible.
Healthcare, a $3 trillion industry, is ripe for disruption through data science. However, there are many challenges in the journey to make healthcare a truly transparent, consumer-centric, data-driven industry. Sriram Vishwanath shares some myths and facts about data science's impact on healthcare.
Triggers specify when a stage of computation should emit output. With a small language of primitive conditions, triggers provide the flexibility to tailor a streaming pipeline to a variety of use cases and data sources. Kenneth Knowles delves into the details of language- and runner-independent semantics for triggers in Apache Beam and explores real-world implementations in Google Cloud Dataflow.
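The idea of a trigger can be illustrated with a toy, pure-Python sketch. It does not use the Apache Beam API; `AfterCount` and `run_stage` are hypothetical names, and the stage simply sums its input, emitting a partial result each time the trigger fires and then discarding what it buffered.

```python
class AfterCount:
    """Hypothetical trigger that fires once `n` elements have been buffered."""

    def __init__(self, n):
        self.n = n

    def should_fire(self, buffered):
        return len(buffered) >= self.n


def run_stage(elements, trigger):
    """Sum incoming elements, emitting a partial result whenever the trigger fires."""
    buffered, panes = [], []
    for element in elements:
        buffered.append(element)
        if trigger.should_fire(buffered):
            panes.append(sum(buffered))
            buffered = []  # discard fired elements before buffering more
    if buffered:
        panes.append(sum(buffered))  # flush the remainder when input ends
    return panes


panes = run_stage([1, 2, 3, 4, 5, 6, 7], AfterCount(3))
print(panes)  # [6, 15, 7]
```

Swapping in a different trigger object changes when output is emitted without touching the computation itself, which is the separation of concerns the abstract describes.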
Spark's efficiency and speed can help reduce the TCO of existing clusters. This is because Spark's performance advantages allow it to complete processing in drastically shorter batch windows with higher performance per dollar. Raj Krishnamurthy offers a detailed walk-through of an alternating least squares-based matrix factorization workload able to improve runtimes by a factor of 2.22.
Jonathon Whitton details how PRGX is using Talend and Cloudera to load two million annual client flat files into a Hadoop cluster and perform recovery audit services in order to help clients detect, find, and fix leakage in their procurement and payment processes.
Our ability to extract meaning from unstructured text data has not kept pace with our ability to produce and store it, but recent breakthroughs in recurrent neural networks are allowing us to make exciting progress in computer understanding of language. Building on these new ideas, Michael Williams explores three ways to summarize text and presents prototype products for each approach.
Susan Woodward discusses venture outcomes: what fraction make lots of money, what fraction just barely return capital, and what fraction fail completely. Susan uses updated figures on the fraction of entrepreneurs who succeed, including some interesting details on female founders of venture companies.
Tim Williamson and Emil Eifrem explain how organizations can use graph databases to operationalize insights from big data, drawing on the real-life example of Monsanto’s use of graph databases to conduct real-time graph analysis of the company’s data to transform the business in ways that were previously impossible.
VoltDB promises full ACID with strong serializability in a fault-tolerant, distributed SQL platform, as well as higher throughput than other systems that promise much less. But why should users believe this? John Hugg discusses VoltDB's internal testing and support processes, its work with Kyle Kingsbury on the VoltDB Jepsen testing project, and where VoltDB will continue to improve.
Watermarks are a system for measuring progress and completeness in out-of-order streaming systems and are utilized to emit correct results in a timely manner. Given the trend toward out-of-order processing in existing streaming systems, watermarks are an increasingly important tool when designing streaming pipelines. Slava Chernyak explains watermarks and explores real-world applications.
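A toy sketch can make the watermark idea concrete. It assumes a simple heuristic watermark (the largest event time seen so far minus a fixed allowed delay) and fixed-size windows; real streaming systems compute watermarks in more sophisticated ways, and `windowed_counts` is a hypothetical helper rather than any library's API.

```python
from collections import defaultdict


def windowed_counts(events, window_size, max_delay):
    """Count events per fixed window, emitting a window once the watermark passes its end.

    `events` holds event timestamps in arrival order, which may differ from
    event-time order. Returns (emitted windows, late events, still-open windows).
    """
    open_windows = defaultdict(int)
    emitted, late = [], []
    watermark = float("-inf")
    for ts in events:
        start = (ts // window_size) * window_size
        if start + window_size <= watermark:
            late.append(ts)  # the window was already declared complete
            continue
        open_windows[start] += 1
        # Heuristic: assume no event arrives more than `max_delay` behind the newest one.
        watermark = max(watermark, ts - max_delay)
        ready = sorted(w for w in open_windows if w + window_size <= watermark)
        for w in ready:
            emitted.append((w, open_windows.pop(w)))
    return emitted, late, dict(open_windows)


# Out-of-order arrivals: 3 arrives after 12, and 8 arrives after 27.
emitted, late, still_open = windowed_counts(
    [1, 4, 12, 3, 15, 27, 8], window_size=10, max_delay=5
)
print(emitted)     # [(0, 3), (10, 2)]
print(late)        # [8]
print(still_open)  # {20: 1}
```

The event at time 3 is still counted because the watermark had not yet passed its window, while the event at time 8 arrives too late. Raising `max_delay` trades latency for completeness.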
In 1853, Britain’s workshops built 90 new gunboats for the Royal Navy in just 90 days—an astonishing feat of engineering made possible by industrial standardization. Snowplow's Alexander Dean argues that data-sophisticated corporations need a new standardization of their own, in the form of schema registries like Confluent Schema Registry or Snowplow’s own Iglu.
Which suppliers are most likely to have delivery or quality issues? Does service, product placement, or price make the biggest difference in customer sentiment? Text data from sources like email and social media can give answers. Mark Turner explains how to see the associations between any two variables in text data by combining text analytics and the bipartite graph visualization technique.
You may have successfully made the transition from single machines and one-off solutions to large, distributed stream infrastructures in your data center. But what if one data center is not enough? Ewen Cheslack-Postava explores resilient multi-data-center architecture with Apache Kafka, sharing best practices for data replication and mirroring as well as disaster scenarios and failure handling.
Daniel Mintz dives into case studies from three companies—ThredUp, Twilio, and Warby Parker—that use data to generate sustainable competitive advantages in their industries.
The self-service YP Analytics application allows advertisers to understand their digital presence and ROI. Richard Langlois explains how Yellow Pages used this expertise for an internal use case that delivers real-time analytics and data exploration with Tableau, using OLAP on Hadoop enabled by a stack that includes HDFS, Parquet, Hive, Impala, and AtScale.