Presented By O'Reilly and Cloudera
Make Data Work
5–7 May, 2015 • London, UK

Hunting criminals with hybrid analytics, semi-supervised learning, and agent feedback

David Talby (Atigeo), Claudiu Branzan (G2 Web Services)
13:45–14:05 Thursday, 7/05/2015
Data Science
Location: King's Suite - Balmoral
Average rating: 3.71 out of 5 (7 ratings)
Slides: 1-PPTX

Prerequisite Knowledge

Familiarity with Python, machine learning, and the popular open-source data science libraries for Python. Familiarity with Spark helps for the discussion of scaling and streaming.

Description

Fraud detection is a classic adversarial analytics challenge: as soon as an automated system learns to stop one scheme, fraudsters move on to attack another way. Each scheme requires looking for different signals (i.e., features); each is relatively rare (one in millions for finance or e-commerce); and a single case may take months to investigate (in healthcare or tax, for example). Together, these factors make quality training data scarce.

This talk will cover, via live demo and code walk-through, the key lessons we’ve learned while building such real-world software systems over the past few years. We’ll be looking for fraud signals in public email datasets, using IPython and popular open-source libraries (scikit-learn, statsmodels, NLTK, etc.) for data science and Apache Spark as the compute engine for scalable parallel processing.

We will iteratively build a machine-learned hybrid model that combines features from different data sources and algorithmic approaches to catch diverse aspects of suspect behavior:

  • Natural language processing: finding keywords in relevant context within unstructured text
  • Statistical NLP: sentiment analysis via supervised machine learning
  • Time series analysis: understanding daily/weekly cycles and changes in habitual behavior
  • Graph analysis: finding actions outside the usual or expected network of people
  • Heuristic rules: finding suspect actions based on past schemes or external datasets
  • Topic modeling: highlighting use of keywords outside an expected context
  • Anomaly detection: fully unsupervised ranking of unusual behavior

This talk assumes basic understanding of these data science tools, so we can focus on their applicability for this use case and on how they complement each other.
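
As a rough illustration of how signals from several of the approaches listed above might feed one hybrid feature vector, here is a minimal sketch in plain Python. It is not the speakers' implementation: the watch list, the 07:00-19:00 working-hours window, the toy messages, and all function names are assumptions made for this example, and the summed z-score stands in for whatever anomaly detector a real system would use.

```python
import re
import statistics

# Hypothetical watch list; a real system would curate or learn these terms.
SUSPECT_TERMS = {"offshore", "shred", "backdate"}


def keyword_hits(text):
    """Count watch-list terms in a message (a crude keyword signal)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for t in tokens if t in SUSPECT_TERMS)


def off_hours(hour):
    """Return 1 if sent outside 07:00-19:00, else 0 (a crude time signal)."""
    return 0 if 7 <= hour < 19 else 1


def new_contact(sender, recipient, known_contacts):
    """Return 1 if the recipient is outside the sender's usual network."""
    return 0 if recipient in known_contacts.get(sender, set()) else 1


def featurize(msg, known_contacts):
    """Combine the three signals into one feature vector per message."""
    return [
        keyword_hits(msg["text"]),
        off_hours(msg["hour"]),
        new_contact(msg["sender"], msg["recipient"], known_contacts),
    ]


def anomaly_scores(rows):
    """Sum of absolute z-scores per row: an unsupervised unusualness ranking."""
    columns = list(zip(*rows))
    means = [statistics.mean(c) for c in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stdevs = [statistics.pstdev(c) or 1.0 for c in columns]
    return [
        sum(abs(v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Toy data, invented for this sketch.
known_contacts = {"ann": {"bob", "cy"}, "bob": {"ann"}}
messages = [
    {"sender": "ann", "recipient": "bob", "hour": 10, "text": "Lunch at noon?"},
    {"sender": "bob", "recipient": "ann", "hour": 14, "text": "Quarterly report attached."},
    {"sender": "ann", "recipient": "cy", "hour": 9, "text": "Meeting moved to Friday."},
    {"sender": "ann", "recipient": "zed", "hour": 2,
     "text": "Shred the files and backdate the offshore invoice."},
]

features = [featurize(m, known_contacts) for m in messages]
scores = anomaly_scores(features)
# Message indices, most unusual first.
ranked = sorted(range(len(messages)), key=lambda i: scores[i], reverse=True)
```

Here the last message triggers all three signals, so it ranks first. Keeping each signal in its own small function is one way to let new feature families be added without touching the ranking step.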

Apache Spark is used to run these models at scale – in batch mode for model training and with Spark Streaming for production use. We’ll discuss the data model, computation, and feedback workflows, as well as some tools and libraries built on top of the open-source components to enable faster experimentation, optimization, and productization of the models.
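
The feedback workflow mentioned above could look roughly like the following loop: score a batch, send the highest-scoring items to human agents, and fold their verdicts back into the model. This is only a sketch under stated assumptions. It uses plain Python rather than Spark, a linear scorer with a perceptron-style update stands in for the actual models, and the function names, threshold, and learning rate are all hypothetical.

```python
def score(weights, x):
    """Linear score: weighted sum of the feature values."""
    return sum(w * v for w, v in zip(weights, x))


def review_queue(weights, batch, k):
    """Indices of the k highest-scoring items, which would go to human agents."""
    order = sorted(
        range(len(batch)),
        key=lambda i: score(weights, batch[i]),
        reverse=True,
    )
    return order[:k]


def apply_feedback(weights, x, is_fraud, threshold=1.0, rate=0.5):
    """Perceptron-style update: change weights only when the agent disagrees."""
    predicted = score(weights, x) >= threshold
    if predicted == is_fraud:
        return weights
    sign = 1.0 if is_fraud else -1.0
    return [w + sign * rate * v for w, v in zip(weights, x)]


# Toy batch of feature vectors, invented for this sketch.
weights = [1.0, 1.0, 1.0]
batch = [[0, 0, 0], [2, 1, 0], [0, 1, 1]]

queue = review_queue(weights, batch, k=2)

# Assumed agent verdicts: item 1 is confirmed fraud, item 2 is cleared.
verdicts = {1: True, 2: False}
for i in queue:
    weights = apply_feedback(weights, batch[i], verdicts[i])
```

In this toy run the confirmed case leaves the weights unchanged, while the cleared case lowers the weights of the features that caused the false alarm.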

David Talby

Atigeo

David Talby is Atigeo’s senior vice president of engineering, leading the R&D, product management, and operations teams. David has extensive experience in building and operating web-scale analytics and business platforms, as well as building world-class, agile, distributed teams. Previously he was with Microsoft’s Bing group where he led business operations for Bing Shopping in the US and Europe, and earlier he worked at Amazon both in Seattle and the UK, where he built and ran distributed teams which helped scale Amazon’s financial systems. David holds a PhD in computer science along with two master’s degrees, in computer science and business administration.

Claudiu Branzan

G2 Web Services

Claudiu Branzan is a senior engineering lead at Atigeo, leading a team of data scientists and software engineers who tackle complex challenges in machine learning, data mining, information retrieval, and statistics. Claudiu has over 10 years of real-world data science experience across industries including finance, healthcare, legal, mobile, and retail. He has co-authored multiple patents, and holds a master’s degree in industrial intelligent systems from the Polytechnic University of Timișoara.

Comments on this page are now closed.

Comments

Puneet Lakhanpal
3/06/2015 2:50 BST

Hi Claudiu,

Thank you for introducing machine learning and NLP. Any chance you can upload the IPython notebooks?

Thanks,
Puneet

Claudiu Branzan
19/05/2015 14:53 BST

Please find the slides we presented here: SlideShare

We will make the notebooks available pretty soon as well and will post them here.

louis v
18/05/2015 12:04 BST

That was a great session. Could you point us to the slides or the IPython notebook?

Claudiu Branzan
7/05/2015 16:09 BST

I hope you guys enjoyed the session. If so, don’t be shy and rate it well … this encourages us to do more of these and hopefully keep you entertained in future sessions as well. If you feel we should’ve done something different, let us know … we always incorporate feedback in our models ;)