Mar 15–18, 2020

Efficient feature engineering from digital identifiers for online fraud detection

Haopei Wang (DataVisor)
1:45pm–2:25pm Tuesday, March 17, 2020
Location: LL20C
Secondary topics: Security and Privacy

Who is this presentation for?

Data scientists or analysts




High-quality feature engineering is key to the success of many machine learning systems. This is especially true for online fraud detection, which relies on human experts with deep knowledge of fraudulent activities and business logic. Haopei Wang details the design and implementation of a system that automatically extracts fraud-related features from digital identifiers commonly collected by online services, including IP subnets, user-agent strings, email domains, and device types. DataVisor’s features work because fraud attacks are supported by a fraud-as-a-service underground economy, which reuses (i.e., resells or rents out) the same attack infrastructure for different types of malicious activities.

DataVisor’s feature extraction algorithm analyzes how normal and fraudulent users interact with common digital identifiers across global online services spanning multiple industry verticals. The features are created from several templates that capture different types of interactions with digital identifiers, such as frequency, volume, timing, and similarity with other identifiers. Because these digital identifiers are commonly collected by online services across the financial, social, ecommerce, and mobile gaming industries, the features are agnostic to specific application semantics and can be applied to previously unseen datasets.
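To make the idea of feature templates concrete, here is a minimal Python sketch, not DataVisor's implementation. It assumes a hypothetical event format of (user ID, identifier type, identifier value, Unix timestamp) and applies three illustrative templates per identifier: frequency (event count), volume (distinct users), and timing (median gap between consecutive events). The function name, field names, and sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median

# Hypothetical event records:
# (user_id, identifier_type, identifier_value, unix_timestamp)
events = [
    ("u1", "ip_subnet", "203.0.113.0/24", 100),
    ("u2", "ip_subnet", "203.0.113.0/24", 101),
    ("u3", "ip_subnet", "203.0.113.0/24", 103),
    ("u4", "ip_subnet", "198.51.100.0/24", 100),
    ("u4", "ip_subnet", "198.51.100.0/24", 4000),
    ("u1", "email_domain", "example.com", 100),
]


def extract_identifier_features(events):
    """Apply the same templates to every (type, value) identifier."""
    grouped = defaultdict(list)
    for user_id, id_type, id_value, ts in events:
        grouped[(id_type, id_value)].append((ts, user_id))

    features = {}
    for key, rows in grouped.items():
        rows.sort()  # order by timestamp so gaps are between consecutive events
        timestamps = [ts for ts, _ in rows]
        gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
        features[key] = {
            # Frequency template: how often the identifier appears.
            "event_count": len(rows),
            # Volume template: how many distinct users share it.
            "distinct_users": len({user for _, user in rows}),
            # Timing template: typical spacing between events (None if only one event).
            "median_gap_seconds": median(gaps) if gaps else None,
        }
    return features


features = extract_identifier_features(events)
```

Because the templates depend only on the identifier and not on any application-specific field, the same function could in principle be run on events from an unrelated service, which is the property the abstract describes as being agnostic to application semantics.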

You’ll learn about DataVisor’s approach to addressing key challenges in this feature extraction system. To efficiently calculate features for billions of daily events in real time, the company not only uses a NoSQL database for data storage but also customized an open source version of Cassandra to distribute feature computation across the nodes. To create “generic” but useful features, it draws on results from an unsupervised fraud detection algorithm, which combines clustering and graph analysis techniques to discover fraudulent accounts from unlabeled data. The resulting features combine fraud knowledge with statistical user behavior information.
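The abstract does not describe the unsupervised algorithm in detail, so the following Python sketch only illustrates the general pattern of combining graph analysis with identifier-level statistics. It substitutes a simple stand-in technique: accounts are linked when they share at least two identifiers, connected components are found with union-find, and components of three or more accounts are flagged. Each identifier then gets a feature equal to the fraction of its accounts that were flagged. The thresholds, function names, and sample data are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (account, identifier) observations with no fraud labels.
observations = [
    ("a1", "ip:203.0.113.0/24"), ("a1", "ua:bot-client/1.0"),
    ("a2", "ip:203.0.113.0/24"), ("a2", "ua:bot-client/1.0"),
    ("a3", "ip:203.0.113.0/24"), ("a3", "ua:bot-client/1.0"),
    ("a4", "ip:203.0.113.0/24"), ("a4", "ua:browser/2.0"),
    ("a5", "ip:198.51.100.0/24"), ("a5", "ua:browser/2.0"),
]


def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def flag_clustered_accounts(observations, min_shared=2, min_cluster_size=3):
    accounts_by_id = defaultdict(set)
    for account, identifier in observations:
        accounts_by_id[identifier].add(account)

    # Count how many identifiers each pair of accounts has in common.
    shared = Counter()
    for accounts in accounts_by_id.values():
        for pair in combinations(sorted(accounts), 2):
            shared[pair] += 1

    # Link sufficiently similar accounts and merge them into components.
    parent = {account: account for account, _ in observations}
    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(parent, a)] = find(parent, b)

    clusters = defaultdict(set)
    for account in parent:
        clusters[find(parent, account)].add(account)

    # Assumption: large, tightly linked groups are treated as suspicious.
    flagged = set()
    for members in clusters.values():
        if len(members) >= min_cluster_size:
            flagged |= members
    return flagged, accounts_by_id


def flagged_ratio_by_identifier(observations):
    """Identifier feature: share of its accounts that fall in a flagged cluster."""
    flagged, accounts_by_id = flag_clustered_accounts(observations)
    return {
        identifier: len(accounts & flagged) / len(accounts)
        for identifier, accounts in accounts_by_id.items()
    }


flagged, _ = flag_clustered_accounts(observations)
ratios = flagged_ratio_by_identifier(observations)
```

In this toy data, the shared IP subnet is used by three flagged accounts and one unflagged account, so its ratio reflects both the clustering result (fraud knowledge) and a simple usage statistic. The pairwise counting is quadratic per identifier and is only suitable for a small illustration.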

Haopei demonstrates the effectiveness of features generated from one dataset applied to another independent dataset when used to train a supervised machine learning model. As examples, he presents the results in two scenarios: fraud detection and identifying “good” users for market segmentation.

Prerequisite knowledge

  • General knowledge of machine learning technologies
  • Experience working with large-scale datasets and systems

What you'll learn

  • Understand the motivation and high-level design of DataVisor's global intelligence network (GIN)
  • Discover the experiences and lessons learned from the implementation and deployment of GIN
  • Learn how GIN can improve the detection performance of supervised machine learning systems

Haopei Wang


Haopei Wang is a research scientist at DataVisor. Previously, he earned his PhD from the Department of Computer Science and Engineering at Texas A&M University. His research interests include big data security and system security.
