Amir Hajian, Khaled Ammar, and Alex Constandache offer an approach to mining a large dataset to predict the electability of hypothetical candidates in the US presidential election race. The approach applies machine learning, natural language processing, and deep learning on an infrastructure, including Spark and Elasticsearch, that serves as the backbone of the mobile game White House Run.
Leading companies that are getting the most out of their data are not focusing on queries and data lakes; they are actively integrating analytics into their operations. Jack Norris reviews three customer case studies in ad/media, financial services, and healthcare to show how a focus on real-time data streams can transform the development, deployment, and future agility of applications.
Himanshu Gupta explains why Yahoo has been increasingly investing in interactive analytics and how it leverages Druid to power a variety of internal- and external-facing data applications.
BBC Worldwide has a vast catalogue of content. David Boyle explains how data helps the BBC determine which countries a new show is best suited for—and which short-form content will be most engaging in promoting those shows—as he shares successes, failures, and frustrations from the BBC's latest work using predictive analytics, building a content genome, quant research, and social media monitoring.
Predicting which stories will become popular is an invaluable tool for newsrooms. Eui-Hong Han and Shuguang Wang explain how the Washington Post predicts which stories on its site will be popular with readers and share the challenges they faced in developing the tool, along with metrics showing how they refined it to increase accuracy.
Clustering algorithms produce vectors of information, which are almost surely difficult to interpret. These are then laboriously translated by data scientists into insights for influencing product and executive decisions. June Andrews offers an overview of a human-in-the-loop method used at Pinterest and LinkedIn that has led to fast, accurate, and pertinent human-readable insights.
Rick McFarland explains how the Hearst Corporation uses big data and analytics tools like Spark and Kinesis to stream click data in real time from its 300+ websites worldwide. This streaming process feeds an editorial tool called Buzzing@Hearst, which provides instant feedback to authors on what is trending across the Hearst network.
Sridhar Alla and Kiran Muglurmath explain how real-time analytics on Comcast Xfinity set-top boxes (STBs) help drive several customer-facing and internal data-science-oriented applications. They also describe how Comcast uses Kudu to fill the gaps in its batch and real-time storage and computation needs, allowing it to process high-speed data without the elaborate solutions needed until now.
How can the value of a patent be quantified? Josh Lemaitre explores how Thomson Reuters Labs approached this problem by applying machine learning to the patent corpus in an effort to predict which patents are most likely to be enforced via litigation. Josh covers infrastructure, methods, challenges, and opportunities for future research.
Sabre operates stringent service-level agreements with each of its customers. Madhuri Kollu explains how, in the event of an incident, Sabre consolidates legacy data with data derived from its new ServiceNow platform to get an accurate picture of the SLAs and provide business managers with the information they need to understand the impact.
The Netflix data platform is constantly evolving, but fundamentally it's an all-cloud platform at a massive scale (40+ PB and over 700 billion new events per day) focused on empowering developers. Kurt Brown dives into the current technology landscape at Netflix and offers some thoughts on what the future holds.
The Panama Papers investigation revealed the offshore holdings and connections of dozens of politicians and prominent public figures around the world and led to high-profile resignations, police raids, and official investigations. Almost 500 journalists had to sift through 2.6 terabytes of data—the biggest leak in the history of journalism. Mar Cabra explains how technology made it all possible.
American politics is adrift in a sea of polls. This year, that sea is deeper than ever before—and darker. Data science is upending the public opinion industry. But to what end? In a brief, illustrated history of the field, Jill Lepore demonstrates how pollsters rose to prominence by claiming that measuring public opinion is good for democracy and asks, "But what if it’s bad?"
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Karthik Ramasamy offers an overview of the end-to-end real-time stack Twitter designed in order to meet this challenge, consisting of DistributedLog (the distributed and replicated messaging system) and Heron (the streaming system for real-time computation).
In the last year, many publishers have begun moving more content off their own websites and onto Facebook Instant Articles, Accelerated Mobile Pages (AMP), and other new platforms promising improvements in exposure and performance. Joshua Laurito explores Gawker’s experience working with these new partners, sharing the advantages gained as well as the unexpected costs and complexities incurred.