In many prediction problems, the most informative data attributes are identities of objects, as they can be directly associated with the historical statistics collected for each object. For example, in click probability prediction for online advertising, key objects are the anonymous user, the ad, and the context (e.g., a query or a webpage). Their identities may be captured by such attributes as the user’s anonymous ID, a machine’s IP, an identifier for the ad or advertiser, and the text or hash for the query or webpage URL. While attribute values can be encoded as one-of-many (one-hot) binary features, this results in very high-dimensional representation. When training with terabytes of data, this practically limits one to use linear models. Although using attribute combinations can mitigate the paucity of linear representation, the resulting high-dimensional parameters remain difficult to interpret or monitor.
In this talk, we will describe a general technique that provides a state-of-the-art alternative for representing high-cardinality data in a flexible, intuitive representation. Learning with counts can be summarized as aggregating sufficient statistics for conditional probability distributions for various attributes and combinations – or, concretely, utilizing tables of per-class counts for each unique value or combination. Such methods as back-off and min-count sketches provide effective tools for compressing the tables, and their aggregation is a basic map-reduce operation allowing the use of commodity clusters. At training and prediction time, counts along with corresponding log-odds provide highly informative low-dimensional features. When used with such non-linear learners as boosted trees or deep neural networks, state-of-the-art accuracy is achieved for large-scale problems in online advertising and fraud detection, which explains the technique’s popularity in many of the industry’s revenue-critical pipelines.
Despite its success with practitioners, the technique has not been formally analyzed or even described in machine learning literature. We will provide its formal generalization, and describe how it is manifested in several examples from real-life industrial machine learning systems. The formal view opens the door to developing a number of extensions of the technique, and the talk will describe two such methods. One utilizes differential privacy analysis to prevent label leakage, allowing training downstream classifiers on same data as was used for counts. Another relies on a variant of hashing to provide an effective extension to time-series domains, enabling easy computation of per-time-interval counts.
Misha Bilenko is the principal researcher leading the Machine Learning Algorithms team in the Cloud+Enterprise division of Microsoft. Before that, he worked for seven years in the Machine Learning Group at Microsoft Research, where he collaborated with a number of product groups on applied ML algorithms, systems, and tools. Misha joined Microsoft in 2006 after receiving his Ph.D. in computer science from the University of Texas at Austin. He co-edited Scaling Up Machine Learning, published by Cambridge University Press, and his work has received best paper awards from KDD and SIGIR. His research interests include parallel and distributed learning algorithms, accuracy debugging methods, and learnable similarity functions.
Comments on this page are now closed.
©2015, O'Reilly Media, Inc. • (800) 889-8969 or (707) 827-7019 • Monday-Friday 7:30am-5pm PT • All trademarks and registered trademarks appearing on oreilly.com are the property of their respective owners. • email@example.com
Apache Hadoop, Hadoop, Apache Spark, Spark, and Apache are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries, and are used with permission. The Apache Software Foundation has no affiliation with and does not endorse, or review the materials provided at this event, which is managed by O'Reilly Media and/or Cloudera.