Talk about data is all sales and marketing; the reality of data work is pain. Most data projects fail and are horrible experiences to work on. Phil Harvey explains that data is just too hard—the world needs to talk about the real challenges so that we can start tackling them and deliver data projects that work. This is DataOps; there will be tears before bedtime.
Federated analytics, a new approach to analyzing big data, supports unprecedented collaboration across large distributed datasets that contain proprietary and/or protected information. Gilad Olswang explains how Intel harnesses the power of federated analytics in the Collaborative Cancer Cloud project.
As Intuit evolved QuickBooks, Payroll, Payments, and other product offerings into a SaaS business and an open cloud platform, it quickly became apparent that business analytics could no longer be treated as an afterthought but had to be part of the platform architecture as a first-class concern. Calum Murray outlines key design considerations when architecting analytics into your SaaS platform.
Apache Eagle is an open source monitoring solution to instantly identify access to sensitive data, recognize malicious activities, and take action. Arun Karthick Manoharan, Edward Zhang, and Chaitali Gupta explain how Eagle helps secure a Hadoop cluster using policy-based and machine-learning user-profile-based detection and alerting.
The cybersecurity landscape is quickly changing, and Apache Hadoop is becoming the analytics and data management platform of choice for cybersecurity practitioners. Tom Reilly explains why organizations are turning toward the open source ecosystem to break down traditional cybersecurity analytics and data constraints in order to detect a new breed of sophisticated attacks.
With the rise of deep learning, natural language understanding techniques are becoming more effective and are not as reliant on costly annotated data. This leads to an explosion of possibilities of what businesses can do with language. Alyona Medelyan explains what the newest NLU tools can achieve today and presents their common use cases.
Raghunath Nambiar reviews the big data landscape, reflects on big data lessons learned in the enterprise over the last few years, and explores how organizations keep their big data environments manageable through simplified deployment, administration, monitoring, and reporting no matter how much the environment scales.
Turning big data into tangible business value can be a struggle even for highly skilled data scientists. Zoltan Prekopcsak outlines the best practices that make life easier, simplify the process, and implement results faster.
There are (too?) many options for BI on Hadoop. Some are great at exploration, some are great at OLAP, some are fast, and some are flexible. Understanding the options and how they work with Hadoop systems is a key challenge for many organizations. Tomer Shiran provides a survey of the main options, both traditional (Tableau, Qlik, etc.) and new (Platfora, Datameer, etc.).
Google is no stranger to big data, pioneering several big data technologies grown and tested internally—including MapReduce, BigTable, and most recently Dataflow and TensorFlow, as well as one of the most heavily used tools at Google, BigQuery—and making them available to everyone. Jordan Tigani shares what big data means for Google and announces several new BigQuery features.
Spark deployments have been growing for the past year, and the massive amount of data analyzed and processed through the framework continues to push the boundaries of the engine. Drawing on his experiences across 150+ production deployments, Neelesh Srinivas Salian explores common issues observed in cluster environments running Apache Spark.
Cat Drew (UK Policy Lab and Government Data Science Partnership)
Cat Drew explains how the UK's Policy Lab and GDS data teams are bringing more of a data, digital, and design approach to policy making, showcasing some of the Policy Lab projects that have used ethnography and data science to create fresh insight to change the way we think about policy problems.
Ruhollah Farchtchi explores best practices for building systems that support ad hoc queries over real-time data and offers an overview of Kudu, a new storage layer for Hadoop that is specifically designed for use cases that require fast analytics on rapidly changing data with a simultaneous combination of sequential and random reads and writes.
The traditional data warehouse of the 1990s was quaintly called the “single source of truth.” Joe Hellerstein explains why today we take a far more relativistic view: the meaning of data depends on the context in which it is used.
Data has more potential value when it can be shared. In order to monetize data, it must first be made shareable: shareable data is an asset that can be sold, traded, or used to create new data marketplaces. Mona Vernon outlines a framework to structure thinking about data shareability and monetization and explores these new business opportunities.
Piotr Mirowski looks under the hood of recurrent neural networks and explains how they can be applied to speech recognition, machine translation, sentence completion, and image captioning.
Many businesses have undertaken big data projects, but for every successful project, there are dozens that have failed or stagnated. Seb Darrington explores the reasons why such projects hit obstacles, typical challenges, and how to overcome them along your own big data journey.
Mark Donsky and Chang She explore canonical case studies that demonstrate how leading banks, healthcare, and pharmaceutical organizations are tackling Hadoop governance challenges head-on. You'll learn how to ensure data doesn't get lost, help users find and trust the data they need, and protect yourself against a data breach—all at Hadoop scale.
Stefanie Posavec recently completed a year-long drawing project with Giorgia Lupi called Dear Data, where each week they manually gathered and drew their data on a postcard to send to the other. Stefanie discusses the professional insights she gained spending a year on such an intensive personal data project.
Currently, multitenancy in Hadoop is limited to organizations running separate Hadoop clusters, and the secure sharing of resources is achieved using virtualization or containers. Jim Dowling describes how HopsWorks enables organizations to securely share a single Hadoop cluster using projects and a new metadata layer that enables protection domains while still allowing data sharing.
Using data science to help make better business decisions in the film industry.
The generalized low-rank model is a new machine-learning approach for reconstructing missing values and identifying important features in heterogeneous data. Through a series of examples, Jo-fai Chow demonstrates how to fit low-rank models in a parallelized framework and how to use these models to make better predictions.
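As a hedged sketch of the core idea, a low-rank factorization fit only to the observed entries of a matrix can reconstruct a missing value. This is a minimal rank-1 gradient-descent illustration under invented data, not H2O's generalized low-rank model API:

```python
import numpy as np

# Sketch: recovering a missing entry with a rank-1 factorization, a
# minimal instance of a low-rank model (illustrative; not the GLRM API).
rng = np.random.default_rng(0)
A = np.outer([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # exactly rank 1
mask = np.ones_like(A, dtype=bool)
mask[1, 2] = False                                # hide one entry (A[1, 2] = 12)

# Factors X (m x k) and Y (n x k), fit by gradient descent on observed entries.
k = 1
X = rng.normal(size=(3, k))
Y = rng.normal(size=(3, k))
for _ in range(3000):
    R = (X @ Y.T - A) * mask                      # residual on observed entries only
    X, Y = X - 0.01 * (R @ Y), Y - 0.01 * (R.T @ X)

print((X @ Y.T)[1, 2])   # should be close to the held-out value, 12
```

Because the observed entries pin down both factors, the reconstruction at the masked position agrees with the hidden value; the same mechanism scales to heterogeneous data with appropriate loss functions.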
Megan Price demonstrates how machine-learning methods help us determine what we know, and what we don't, about the ongoing conflict in Syria. Megan then explains why these methods can be crucial to better understand patterns of violence, enabling better policy decisions, resource allocation, and ultimately, accountability and justice.
Exactly-once semantics is a highly desirable property for streaming analytics. Ideally, all applications process events once and never twice, but making such guarantees in general either induces significant overhead or introduces other inconveniences, such as stalling. Flavio Junqueira explores what's possible and reasonable for streaming analytics to achieve when targeting exactly-once semantics.
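As a rough illustration of one common strategy (not necessarily the one the talk covers), effectively-once results can be built from at-least-once delivery plus deduplication, keying each event by a unique id:

```python
# Sketch: effectively-once processing via dedup ids, assuming at-least-once
# delivery may redeliver events. Illustrative only; a real system would
# persist `seen` transactionally alongside the state.
def process(events, state=None, seen=None):
    """Apply each event once, even if delivery retries duplicate it."""
    state = {} if state is None else state
    seen = set() if seen is None else seen
    for event_id, key, amount in events:
        if event_id in seen:          # duplicate from a retry: skip it
            continue
        seen.add(event_id)
        state[key] = state.get(key, 0) + amount
    return state

# At-least-once delivery redelivers event 2 after a simulated failure.
stream = [(1, "a", 10), (2, "b", 5), (2, "b", 5), (3, "a", 1)]
print(process(stream))   # each event counted once: {'a': 11, 'b': 5}
```

The overhead the abstract alludes to lives in that `seen` set: tracking and persisting processed ids atomically with state is exactly the cost that makes general exactly-once guarantees expensive.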
Cloudera’s Mike Olson is joined by Manuel Martin Marquez, a senior research and data scientist at CERN, to discuss Hadoop's role in the modern data strategy and how CERN is using Hadoop and information discovery to help drive operational efficiency for the Large Hadron Collider.
The advent of next-generation DNA sequencing technologies is revolutionizing life sciences research by routinely generating extremely large datasets. Tom White explains how big data tools developed to handle large-scale Internet data (like Hadoop) help scientists effectively manage this new scale of data and also enable addressing a host of questions that were previously out of reach.
The metaphors used online have always borrowed heavily from the offline world, but as our online and offline worlds converge, the biggest opportunities for innovative experiences will come from blending them intentionally. Kate O’Neill examines how the meaning and understanding of place relates to identity, culture, and intent and how we can shape our audiences' experiences more meaningfully.
The famous Oracle at Delphi had a secret: its prophecies were interpreted by temple guides, using a very early version of ethnographic research. With today’s near-blind faith in the predictive power of big data, it’s time to take a lesson from the ancient Greeks.
With the analytic and predictive power of big data comes the responsibility to respect and protect individual privacy. As citizens, we should hold organizations to account; as data practitioners, we must find intelligent ways to analyze data without violating privacy. Jason McFall discusses privacy risks and surveys leading privacy-preserving analysis techniques.
The IoT combined with big data analytics enables organizations to track new patterns and signals and to bring together data that was previously not only challenging to integrate but also prohibitively expensive to use. Frank Saeuberlich and Eliano Marques explain why data management, data integration, and multigenre analytics are foundational to driving business value from IoT initiatives.
Piotr Niedźwiedź explores how deepsense.io created the world’s best deep learning model for identifying individual right whales using aerial photography for the NOAA (National Oceanic and Atmospheric Administration) and explains what happened when the solution was covered by news media around the globe.
Which venues have similar visiting patterns? How can we detect when a user is on vacation? Can we predict which venues will be favorited by users by examining their friends' preferences? Natalino Busa explains how these predictive analytics tasks can be accomplished by using Spark SQL, Spark ML, and just a few lines of Scala code.
Value at risk (VaR) is a widely used risk measure. Because VaR is not simply additive, reporting VaR at any aggregate level poses unique challenges: traditional database aggregation functions don't work. Deenar Toraskar explains how Hive complex data types and user-defined functions can be used very effectively to provide simple, fast, and flexible VaR aggregation.
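The non-additivity claim can be shown with a toy historical-simulation VaR (the `historical_var` function and the desk data are invented for this sketch, not from the talk). Keeping the full P&L vectors, as the Hive complex-types approach does, is what makes correct aggregation possible:

```python
# Sketch: why VaR cannot be aggregated by summing per-desk VaR numbers.
def historical_var(pnl, alpha=0.95):
    """Historical-simulation VaR: the loss at the (1 - alpha) quantile."""
    losses = sorted(pnl)                      # worst outcomes first
    idx = int((1 - alpha) * len(losses))      # e.g. 5th percentile for 95% VaR
    return -losses[idx]

# Daily P&L vectors for two hypothetical trading desks (20 scenarios each);
# desk B happens to be a perfect hedge of desk A.
desk_a = [-9, -4, -2, -1, 0, 1, 1, 2, 2, 3, 3, 3, 4, 4, 5, 5, 6, 6, 7, 8]
desk_b = [9, 4, 2, 1, 0, -1, -1, -2, -2, -3, -3, -3, -4, -4, -5, -5, -6, -6, -7, -8]

# Wrong: summing per-desk VaR ignores diversification between desks.
naive = historical_var(desk_a) + historical_var(desk_b)

# Right: keep the full P&L vectors, sum them scenario by scenario,
# then take the quantile of the combined vector.
combined = [a + b for a, b in zip(desk_a, desk_b)]
correct = historical_var(combined)

print(naive, correct)   # 11 vs. 0: the hedged portfolio carries no risk
```

Storing each position's scenario vector in a Hive array-typed column and summing element-wise in a UDF before taking the quantile follows the same pattern at scale.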
Moty Fania shares Intel’s IT experience implementing an on-premises IoT platform for internal use cases. The platform was based on open source big data technologies and containers and was designed as a multitenant platform with built-in analytical capabilities. Moty highlights the key lessons learned from this journey and offers a thorough review of the platform’s architecture.
TensorFlow is an open source software library for numerical computation with a focus on machine learning. Its flexible architecture makes it great for research and production deployment. Sherry Moore offers a high-level introduction to TensorFlow and explains how to use it to train machine-learning models to make your next application smarter.
Data is no longer a by-product of business transactions; now, data is the business. Franz Aman explains how data lakes can put the power of big data into the hands of every business person, sharing the inside scoop on how he turned marketing into a new kind of revenue-generation machine and interviewing an Informatica customer about how data lakes have innovated and transformed their business.
David Selby shares some of the challenges he has faced coercing meaning from data and explains why he is particularly enthusiastic about the latest technological developments in the data science field.
The news media in recent months has been full of dire warnings about the risk that AI poses to the human race. Should we be concerned? If so, what can we do about it? While some in the mainstream AI community dismiss these concerns, Stuart Russell argues that a fundamental reorientation of the field is required.
Martin Willcox shares the lessons he's learned from successful Teradata IoT projects about how to manage and leverage sensor data and explains why data management, data integration, and multigenre analytics are foundational to driving business value from IoT initiatives.
Drawing on important real-world use cases, Kenneth Knowles delves into the details of the language- and runner-independent semantics developed for triggers in Apache Beam, demonstrating how the semantics support these use cases as well as the full variability of streaming systems. Kenneth then describes some of the particular implementations of those semantics in Google Cloud Dataflow.
Watermarks are a system for measuring progress and completeness in out-of-order stream processing systems and are used to emit correct results in a timely way. Given the trend toward out-of-order processing in current streaming systems, understanding watermarks is an increasingly important skill. Slava Chernyak explains watermarks and demonstrates how to apply them using real-world cases.
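As a toy illustration (not Dataflow's actual mechanism), a watermark can be approximated as the maximum event timestamp seen so far minus a bound on out-of-orderness; a tumbling window's result is emitted once the watermark passes the window's end:

```python
# Sketch: a heuristic event-time watermark. Real systems derive watermarks
# from source progress; this toy assumes a fixed out-of-orderness bound.
MAX_DELAY = 2          # bound on how late events may arrive
WINDOW = 5             # tumbling window size in event-time units

windows = {}           # window start -> running count of events
watermark = float("-inf")
fired = []

events = [1, 2, 4, 3, 7, 6, 8, 12]   # event timestamps, out of order

for ts in events:
    start = (ts // WINDOW) * WINDOW
    windows[start] = windows.get(start, 0) + 1
    watermark = max(watermark, ts - MAX_DELAY)
    # Emit any window whose end the watermark has passed: under the
    # lateness assumption, its result is now complete.
    for s in sorted(w for w in windows if w + WINDOW <= watermark):
        fired.append((s, windows.pop(s)))

print(fired)   # [(0, 4), (5, 3)]: each window fires once, with all its events
```

Note that the out-of-order events 3 and 6 still land in the correct windows, because neither window fires until the watermark says no earlier timestamps remain.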
In the last few years, auto makers and others have introduced devices to connect cars to the Internet and gather data about the vehicles’ activity, and auto insurers and local governments are just starting to require these devices. Charles Givre gives an overview of the security risks as well as the potential privacy invasions associated with this unique type of data collection.
Can our real-time distributed data systems help predict whether high-resolution audio is the future of digital music? What about content curation? Paul Shannon and Alan Hannaway explore the future of music services through data and explain why 7digital believes well-curated, high-resolution listening experiences are the future of digital music services.
Hadoop is used to run large-scale jobs over hundreds of machines. Considering the complexity of Hadoop jobs, it's no wonder that Hadoop jobs running slower than expected remains a perennial source of grief for developers. Bikas Saha draws on his experience debugging and analyzing Hadoop jobs to describe the approaches and tools that can solve this difficult problem.