Fraud detection is a classic adversarial analytics challenge: as soon as an automated system learns to stop one scheme, fraudsters move on to attack another way. Each scheme requires different signals (i.e., features) to catch, is relatively rare (one in millions for finance or e-commerce), and may take months to investigate per case (in healthcare or tax, for example), all of which makes quality training data scarce.
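To make the rarity point concrete, here is a small worked calculation using Bayes' rule. The one-in-a-million prevalence comes from the text above; the detector's recall and false positive rate are hypothetical numbers chosen only for illustration, not figures from the talk.

```python
def alert_precision(prevalence, recall, false_positive_rate):
    """Share of raised alerts that are real fraud, by Bayes' rule."""
    true_alerts = prevalence * recall
    false_alerts = (1 - prevalence) * false_positive_rate
    return true_alerts / (true_alerts + false_alerts)


# Prevalence is taken from the text; the other two values are assumptions.
prevalence = 1e-6            # one fraudulent case per million
recall = 0.99                # hypothetical: detector catches 99% of fraud
false_positive_rate = 0.001  # hypothetical: 0.1% of legitimate cases flagged

precision = alert_precision(prevalence, recall, false_positive_rate)
print(f"Share of alerts that are real fraud: {precision:.4%}")
print(f"Roughly one real case per {round(1 / precision)} alerts")
```

Under these assumed rates, only about one alert in a thousand points at real fraud, which is one reason confirmed, labeled cases accumulate so slowly.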
This talk will cover, via live demo and code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll be looking for fraud signals in public email datasets, using IPython and popular open-source libraries (scikit-learn, statsmodels, NLTK, etc.) for data science and Apache Spark as the compute engine for scalable parallel processing.
We will iteratively build a machine-learned hybrid model that combines features from different data sources and algorithmic approaches to catch diverse aspects of suspect behavior.
This talk assumes a basic understanding of these data science tools, so we can focus on their applicability to this use case and on how they complement each other.
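As a rough picture of what combining features from different sources can look like, the sketch below merges a text-derived signal and two metadata-derived signals from an email into one feature vector and fits a small logistic regression on it. Everything specific here is an assumption made for illustration: the suspect-term list, the feature names, the thresholds, the toy emails, and the hand-rolled trainer, which stands in for a library such as scikit-learn so the example needs only the standard library. It is not the speakers' model.

```python
import math

# Hypothetical term list; a real system would learn or curate this.
SUSPECT_TERMS = {"urgent", "wire", "password", "offshore"}


def text_features(body):
    """Text-based signal: share of words that are suspect terms."""
    words = [w.strip(".,!?") for w in body.lower().split()]
    hits = sum(w in SUSPECT_TERMS for w in words)
    return {"suspect_term_ratio": hits / max(len(words), 1)}


def metadata_features(email):
    """Metadata-based signals: recipient count and sending hour."""
    return {
        "many_recipients": 1.0 if len(email["to"]) > 5 else 0.0,
        "off_hours": 1.0 if email["hour"] < 6 or email["hour"] >= 22 else 0.0,
    }


def hybrid_features(email):
    """Merge features from both sources into one vector."""
    feats = {}
    feats.update(text_features(email["body"]))
    feats.update(metadata_features(email))
    return feats


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Plain stochastic gradient descent on the logistic loss."""
    weights = {name: 0.0 for name in rows[0]}
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            err = sigmoid(bias + sum(weights[n] * x[n] for n in weights)) - y
            bias -= lr * err
            for n in weights:
                weights[n] -= lr * err * x[n]
    return weights, bias


def score(model, email):
    """Estimated probability that an email is suspect."""
    weights, bias = model
    feats = hybrid_features(email)
    return sigmoid(bias + sum(weights[n] * feats[n] for n in weights))


# Toy labeled emails (1 = suspect, 0 = benign), invented for this sketch.
training = [
    ({"body": "Urgent wire transfer needed, send password", "to": ["a"], "hour": 23}, 1),
    ({"body": "Lunch at noon tomorrow?", "to": ["a", "b"], "hour": 11}, 0),
    ({"body": "Offshore account password attached, urgent", "to": list("abcdefgh"), "hour": 3}, 1),
    ({"body": "Quarterly report attached for review", "to": list("abcdefg"), "hour": 14}, 0),
]
model = train_logistic([hybrid_features(e) for e, _ in training],
                       [label for _, label in training])

suspect = {"body": "Please wire funds urgent", "to": ["a"], "hour": 2}
benign = {"body": "See you at the meeting", "to": ["a", "b"], "hour": 10}
print(score(model, suspect), score(model, benign))
```

Keeping each feature source in its own function means a new source can be added by writing one more extractor and merging it in `hybrid_features`, without touching the training or scoring code.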
Apache Spark is used to run these models at scale – in batch mode for model training and with Spark Streaming for production use. We’ll discuss the data model, computation, and feedback workflows, as well as some tools and libraries built on top of the open-source components to enable faster experimentation, optimization, and productization of the models.
David Talby is Atigeo’s senior vice president of engineering, leading the R&D, product management, and operations teams. David has extensive experience in building and operating web-scale analytics and business platforms, as well as building world-class, agile, distributed teams. Previously he was with Microsoft’s Bing group, where he led business operations for Bing Shopping in the US and Europe, and earlier he worked at Amazon in both Seattle and the UK, where he built and ran distributed teams that helped scale Amazon’s financial systems. David holds a PhD in computer science along with two master’s degrees, in computer science and business administration.
Claudiu Branzan is a senior engineering lead at Atigeo, leading a team of data scientists and software engineers who tackle complex challenges in machine learning, data mining, information retrieval, and statistics. Claudiu has over 10 years of real-world data science experience across industries including finance, healthcare, legal, mobile, and retail. He has co-authored multiple patents, and holds a master’s degree in industrial intelligent systems from the Polytechnic University of Timișoara.