Presented By
O’Reilly + Cloudera
Make Data Work
March 25-28, 2019
San Francisco, CA

Infinite segmentation: Scalable mutual information ranking on real-world graphs

Ken Johnston (Microsoft), Ankit Srivastava (Microsoft)
11:50am-12:30pm Thursday, March 28, 2019
Average rating: 4.50 (2 ratings)

Who is this presentation for?

  • Data scientists, business intelligence analysts, and those in marketing and sales



Prerequisite knowledge

  • A basic understanding of probability and random variables (useful but not required)

What you'll learn

  • Explore a scalable implementation of mutual information ranking


Random variables can share information with each other. The amount of information they share can be measured as mutual information, expressed in shannons (bits). Typically, mutual information is applied to binary classes to determine whether a given feature (an app, URL, or IP address) is indicative of class A versus class B (e.g., commercial versus consumer devices or teacher versus student devices). However, there’s no easy way to scale this binary implementation to multiple classes, which can number in the millions (for example, one class per business).
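As a rough illustration of the binary case described above, the following sketch computes the mutual information, in bits, between a binary feature (present or absent on a device) and a binary class from a 2x2 table of device counts. The function name and the count parameters are hypothetical and chosen only for this example; it is not the speakers' implementation.

```python
import math


def mutual_information_bits(n11, n10, n01, n00):
    """Mutual information (bits) between a binary feature and a binary class.

    Hypothetical count layout, assumed for this sketch:
      n11: class A devices that have the feature
      n10: class B devices that have the feature
      n01: class A devices that lack the feature
      n00: class B devices that lack the feature
    """
    total = n11 + n10 + n01 + n00
    if total == 0:
        return 0.0
    # Each cell: (joint count, feature marginal count, class marginal count).
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, feature_margin, class_margin in cells:
        if joint > 0:  # empty cells contribute nothing (0 * log 0 is taken as 0)
            mi += (joint / total) * math.log2(
                joint * total / (feature_margin * class_margin)
            )
    return mi
```

A feature that appears on every class A device and on no class B device (with equally sized classes) shares one full bit with the class label, while a feature spread evenly across both classes shares none.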

Ken Johnston and Ankit Srivastava explain how to use mutual information at scale across massive data sources to help filter out noise and share critical insights with new cohorts of users, businesses, and networks. They explain how they built a distributed implementation that can calculate mutual information entropy gain per feature, per partition. They can now scale to millions of partitions and billions of features (apps used across devices, URLs accessed within browsers) within minutes.

A technique every data scientist will find handy, the implementation has helped with numerous business problems by extracting signals from large-scale graphs and has applications in projects across a company:

  • Teacher versus student versus admin detection

  • Identifying developer devices

  • Detecting distinguishing apps used by IT Pros versus non-IT Pros

  • Detecting distinguishing apps per business (used heavily within an organization compared to other businesses)

  • Detecting distinguishing URLs per business and country

  • Ranking businesses that have distinguishing use of Microsoft apps

  • Linking IP addresses to businesses

  • Building an automated commercial newsfeed that ranks changes in business metrics month over month

  • Identifying hoppy networks where devices come and go (guest WiFi in coffee shops)

  • Using a seed search query to expand the feature list indicative of a segment
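One way to picture the "per feature, per partition" calculation is a one-versus-rest comparison: for each partition (for example, a business), score each feature by the mutual information between "device belongs to this partition" and "device has this feature". The single-machine sketch below is written under that assumption; the abstract does not describe the actual distributed design, and the function names, input dictionaries, and sample identifiers here are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def mi_bits(n11, n10, n01, n00):
    """Mutual information (bits) from a 2x2 table of counts."""
    total = n11 + n10 + n01 + n00
    if total == 0:
        return 0.0
    cells = [
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    return sum(
        (joint / total) * math.log2(joint * total / (row * col))
        for joint, row, col in cells
        if joint > 0
    )


def rank_features_per_partition(device_partition, device_features):
    """Rank features within each partition by one-vs-rest mutual information.

    Assumed inputs (hypothetical schema):
      device_partition: dict mapping device id -> partition id
      device_features:  dict mapping device id -> iterable of features
    """
    total = len(device_partition)
    partition_size = Counter(device_partition.values())
    feature_total = Counter()     # feature -> devices having it, overall
    joint = defaultdict(Counter)  # partition -> feature -> devices having it

    for device, partition in device_partition.items():
        for feature in set(device_features.get(device, ())):
            feature_total[feature] += 1
            joint[partition][feature] += 1

    rankings = {}
    for partition, counts in joint.items():
        scored = []
        for feature, n11 in counts.items():
            n10 = feature_total[feature] - n11    # has feature, other partitions
            n01 = partition_size[partition] - n11  # in partition, lacks feature
            n00 = total - n11 - n10 - n01          # neither
            scored.append((feature, mi_bits(n11, n10, n01, n00)))
        # Highest score first; ties broken by feature name for stable output.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        rankings[partition] = scored
    return rankings


# Usage with made-up data: "crm" is used only inside "contoso".
partitions = {"d1": "contoso", "d2": "contoso", "d3": "fabrikam", "d4": "fabrikam"}
features = {
    "d1": ["crm", "browser"],
    "d2": ["crm", "browser"],
    "d3": ["browser", "cad"],
    "d4": ["browser"],
}
rankings = rank_features_per_partition(partitions, features)
```

In this toy data, a feature used by every device ("browser") scores zero everywhere, while a feature concentrated in one partition rises to the top of that partition's list. Because the counting step only needs per-partition and global tallies, it is the kind of aggregation that could plausibly be spread across workers, though that mapping is an assumption here. Note that the sketch only scores features that occur at least once in a partition.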

The implementation is easy to port to Azure, and because it has a very generic schema, it’s agnostic to the data.

Ken Johnston


Ken Johnston is the principal data science manager for the Microsoft 360 Business Intelligence Group (M360 BIG). In his time at Microsoft, Ken has shipped many products, including Commerce Server, Office 365, Bing Local and Segments, and Windows, and for two and a half years, he was the director of test excellence. A frequent keynote presenter, trainer, blogger, and author, Ken is a coauthor of How We Test Software at Microsoft and contributing author to Experiences of Test Automation: Case Studies of Software Test Automation. He holds an MBA from the University of Washington. Check out his blog posts on data science management on LinkedIn.

Ankit Srivastava


Ankit Srivastava is a senior data scientist on the core data science team for the Azure Cloud + AI Platform Division at Microsoft, where he focuses on commercial and education segment data science projects within the company. Previously, he was a developer on the data integration and insights team. He has built several production-scale ML enrichments that are leveraged for sales compensation and senior leadership team metrics.