14–17 Oct 2019

Audience projection of target consumers over multiple domains: A NER and Bayesian approach

16:00–16:40 Wednesday, 16 October 2019
Location: Buckingham Room - Palace Suite
Average rating: 4.00 (2 ratings)

Who is this presentation for?

  • Data scientists, analysts, marketers, and machine learning practitioners




Traditional market research generally relies on questionnaires or other forms of explicit feedback, put directly to an ad hoc panel that in aggregate is representative of a larger group of people. The goal is to generalize their habits, perceptions, and opinions on a given subject to understand the needs and interests of the greater consumer population. Unfortunately, those traditional approaches are often invasive, nonscalable, and biased. As such, these methodologies must be viewed as incomplete and only narrowly representative. Indirect approaches based on sparse and implicit consumer feedback (e.g., social network interactions, web browsing, or online purchases) are more scalable, more authentic, and better suited to real-time consumer insights. The rise of data availability, together with algorithm advancements and AI capabilities, will lead the next generation of market research methodologies.

Although those sources of implicit consumer feedback provide relevant and detailed pictures of the population, they individually provide only a limited set of observable behaviors. Unlike custom surveys, implicit observations are incomplete and don’t provide enough evidence on negative signals: what consumers are not interested in. A segment of the population having a high volume of interaction with a given brand may have a high affinity with it, but nothing can be said about unobserved interactions with other entities. Techniques based on user-generated content (e.g., reviews or customer care complaints) could provide negative feedback but are strongly influenced by the author’s immediate emotional state and are often too personal to be generalizable. Each implicit feedback domain provides a detailed but very narrow view that may lead to incomplete and nonactionable insights.

Gianmario Spacagna proposes the novel approach of audience projection by leveraging named entity recognition (NER) techniques to match related brands and Bayesian inference to transfer knowledge from the source domain. The challenge for the entity recognition algorithm, and in particular natural language processing techniques, is to measure the degree of similarity of two brands based only on extracted entities. Entity-based similarity, as opposed to text-based similarity, captures more realistic patterns and behaviors of the population. The entity-based similarity can be adapted to map the set of source brands to all destination brands and significantly improve the accuracy of the baseline method.
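As a rough illustration of entity-based matching, the sketch below scores two brands by the Jaccard overlap of their entity sets and maps each source brand to its closest destination brand. The abstract does not specify the similarity measure used in the talk, so Jaccard is an assumption, and the brand names, entity sets, and function names are all hypothetical.

```python
from typing import Dict, Set, Tuple


def entity_jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap between two entity sets (0.0 when both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def match_brands(
    source: Dict[str, Set[str]], destination: Dict[str, Set[str]]
) -> Dict[str, Tuple[str, float]]:
    """Map each source brand to its most similar destination brand."""
    matches = {}
    for s_name, s_entities in source.items():
        best = max(
            destination,
            key=lambda d: entity_jaccard(s_entities, destination[d]),
        )
        matches[s_name] = (best, entity_jaccard(s_entities, destination[best]))
    return matches


# Hypothetical entity sets, as if produced by an upstream NER step.
source_brands = {"AcmeRun (social page)": {"acme", "running", "marathon"}}
destination_brands = {
    "AcmeRun shoes": {"acme", "running", "shoes"},
    "Zed Cola": {"zed", "cola", "soda"},
}

matches = match_brands(source_brands, destination_brands)
print(matches)
```

In practice the entity sets would come from an NER model run over each brand's associated text, and a weighted overlap might serve better than plain set overlap; both choices are left open here.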

The classifier probability functions are derived from a binomial distribution based on the assumption that a target always shows consistent market penetration distributions of the entities in common. That is, the percentage of consumers interested in a particular entity reached by the target is preserved in both source and destination domains. This way, we can estimate the probability of the user belonging to the target using the source distribution of market penetrations as model evidence and the source target size as prior probability.
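To make the Bayesian step concrete, here is a minimal sketch under simplifying assumptions that are not taken from the talk: each common entity is treated as an independent Bernoulli observation (a binomial with a single trial), the market penetrations measured in the source domain serve as likelihoods, and the source target size serves as the prior. The function name, parameter names, and numbers are hypothetical. Note that this sketch also treats the absence of an interaction as evidence, which is a simplification given the weak negative signals of implicit feedback.

```python
import math
from typing import Dict, Set


def target_posterior(
    user_entities: Set[str],
    pen_target: Dict[str, float],  # penetration of each entity inside the target
    pen_rest: Dict[str, float],    # penetration among everyone else
    prior: float,                  # source target size / source population size
    eps: float = 1e-9,
) -> float:
    """Posterior probability that a destination user belongs to the target."""
    log_t = math.log(prior)
    log_r = math.log(1.0 - prior)
    for entity, p_t in pen_target.items():
        # Clip to avoid log(0) when a penetration is exactly 0 or 1.
        p_t = min(max(p_t, eps), 1.0 - eps)
        p_r = min(max(pen_rest[entity], eps), 1.0 - eps)
        if entity in user_entities:
            log_t += math.log(p_t)
            log_r += math.log(p_r)
        else:
            log_t += math.log(1.0 - p_t)
            log_r += math.log(1.0 - p_r)
    # Normalize in log space for numerical stability.
    m = max(log_t, log_r)
    num = math.exp(log_t - m)
    return num / (num + math.exp(log_r - m))


# Hypothetical penetrations estimated in the source domain.
pen_target = {"brand_a": 0.8, "brand_b": 0.1}
pen_rest = {"brand_a": 0.2, "brand_b": 0.1}

p_fan = target_posterior({"brand_a"}, pen_target, pen_rest, prior=0.2)
p_none = target_posterior(set(), pen_target, pen_rest, prior=0.2)
print(p_fan, p_none)
```

With these made-up numbers, a user who interacts with `brand_a` moves from a 20% prior to an even chance of belonging to the target, while a user with no observed interactions drops below the prior.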

One of the greatest challenges in market research is the ability to merge different sources of consumers’ interests into an augmented view that connects all the dots across multiple domains. The task of audience projection is to define a target audience as a subset of the population in a source domain and to project this target onto a set of users in a destination dataset. The problem is modeled as a binary classification: for each user in the destination dataset, predict the probability that they belong to the projected target.

Merging multiple data sources is generally conducted by “fusing” users based on unique keys, such as personal identifiers. When dealing with anonymized datasets, the absence of those identifiers is worked around with fuzzy look-alike record linkage. That is, users of a central dataset are linked to the most similar user in each of the other datasets based on shared attributes. Even though many algorithms can find the best matches between two or more sets of users, those data-centric architectures present many limitations when the datasets are heterogeneous and differ strongly in size and density, and when the number of sources to merge increases.
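For reference, the user-level look-alike linkage described above can be sketched as a nearest-neighbor search: every user in the central dataset is linked to the most similar user in another dataset. Cosine similarity over a shared attribute vector is an assumption made for illustration, and the user IDs and vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_lookalikes(
    central: Dict[str, List[float]], other: Dict[str, List[float]]
) -> Dict[str, Tuple[str, float]]:
    """Link each central user to the most similar user in the other dataset."""
    links = {}
    for c_id, c_vec in central.items():
        best = max(other, key=lambda o_id: cosine(c_vec, other[o_id]))
        links[c_id] = (best, cosine(c_vec, other[best]))
    return links


# Hypothetical users described by the same three shared attributes.
central_users = {"c1": [1.0, 0.0, 1.0], "c2": [0.0, 1.0, 0.0]}
other_users = {"o1": [0.0, 1.0, 0.0], "o2": [1.0, 0.0, 0.9]}

links = link_lookalikes(central_users, other_users)
print(links)
```

This brute-force version compares every pair of users, which hints at why such linkage becomes costly as datasets grow in size and number.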

Fusion algorithms at item level are often preferred to user-level linkages. In other words, even if the two datasets represent completely different types of observations, you can more easily identify matches of common entities (e.g., interacting with a brand’s social media page could be associated with purchasing the brand’s products). Even so, cross-domain adaptation algorithms that compute item similarity from textual descriptions are not suitable for representing real consumer patterns. In content-based similarity, two competitors producing similar products are, by definition, very similar, but this does not necessarily mean they share the same consumer base.

Prerequisite knowledge

  • A basic understanding of market research, consumer insights, statistics, and natural language processing techniques

What you'll learn

  • Discover a proof of concept built on a synthetically generated dataset and a real, anonymized set of panelist data projecting social affinities to product consumption
  • Hear lessons learned about data structures for dealing with incomplete and sparse sources of data, generalizing learned patterns from a few individuals to the rest of the population, evaluating accuracy by applying a self-reconstruction testing technique on the same domain, measuring and adjusting biases by comparing with external benchmarks, and ensuring fairness by correctly representing different demographic groups
  • Learn how the adoption of machine learning in the marketing industry can open new, unexplored opportunities for the whole research community

Gianmario Spacagna


Gianmario Spacagna is the chief scientist and head of AI at Helixa. His team’s mission is building the next generation of behavior algorithms and models of human decision making with careful attention to their potential and effects on society. His experience covers a diverse portfolio of machine learning algorithms and data products across different industries. Previously, he worked as a data scientist in IoT automotive (Pirelli Cyber Technology), retail and business banking (Barclays Analytics Centre of Excellence), threat intelligence (Cisco Talos), predictive marketing (AgilOne), plus some occasional freelancing. He’s a coauthor of the book Python Deep Learning, contributor to the “Professional Manifesto for Data Science,” and founder of the Data Science Milan community. Gianmario holds a master’s degree in telematics (Polytechnic of Turin) and software engineering of distributed systems (KTH of Stockholm). After having spent half of his career abroad, he now lives in Milan. His favorite hobbies include home cooking, hiking, and exploring the surrounding nature on his motorcycle.
