Cost-based statistical methods for fraud detection. Prediction of never-paying customers considering individual risk

2D t-SNE representation of the dataset

Abstract

Telecommunication providers not only offer services but increasingly finance consumer devices. Credit scoring and fraud detection for new account applications have gained importance because standard credit approval processes fall short for new customers, for whom only scarce information is available in internal systems. Modern machine learning algorithms, however, can still infer intricate patterns from the data and can thus classify customers efficiently. Cost-sensitive methodologies can increase the savings further. In this thesis, we develop a cost matrix that allows the individual risk of accepting a new customer to be evaluated and therefore helps to optimally prevent new account subscription fraud.

Publication
Cost-based statistical methods for fraud detection, MSc Thesis

Executive summary

Many devices are lost because the standard credit check process focuses on detecting defaults but falls short at detecting fraudulent customers, or customers who never pay a single bill, as only scarce information is available about them. Machine learning offers great possibilities for making business processes smarter. Introducing the notion of cost and savings into the machine learning model helps to better evaluate the individual risk of accepting a single customer. We found that:

  • machine-learned fraud predictors can offer huge savings compared to the classical credit scoring process
  • what the machine learning algorithm has to achieve and optimize should be set out clearly at the beginning
  • before starting a data science project, a data engineering project is required to build an appropriate data pipeline that offers timely and quality-controlled access to the data
  • IT resources should be dedicated to integrating the data science findings into the business processes
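
To make the notion of cost and savings more concrete, the following is a minimal sketch of an example-dependent cost matrix. It is not the cost matrix developed in the thesis: the cost assumptions (a lost device for an accepted never-payer, a forgone margin for a rejected good customer, zero cost otherwise), the field names, and the toy numbers are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    device_value: float       # assumed loss if a never-paying customer is accepted
    expected_margin: float    # assumed profit forgone if a good customer is rejected
    fraud_probability: float  # score from some fraud model, in [0, 1]
    is_fraud: bool            # true label, known only in hindsight


def should_reject(a: Applicant) -> bool:
    """Reject when the expected loss of accepting exceeds that of rejecting."""
    expected_loss_if_accepted = a.fraud_probability * a.device_value
    expected_loss_if_rejected = (1.0 - a.fraud_probability) * a.expected_margin
    return expected_loss_if_accepted > expected_loss_if_rejected


def realized_cost(a: Applicant, rejected: bool) -> float:
    """Look up the cell of the (hypothetical) per-customer cost matrix."""
    if rejected:
        # Correctly blocked fraud costs nothing; a blocked good customer costs the margin.
        return 0.0 if a.is_fraud else a.expected_margin
    # An accepted never-payer costs the device; an accepted good customer costs nothing.
    return a.device_value if a.is_fraud else 0.0


def savings(applicants: list[Applicant]) -> float:
    """Relative cost reduction versus the baseline of accepting every applicant."""
    baseline = sum(realized_cost(a, rejected=False) for a in applicants)
    model = sum(realized_cost(a, should_reject(a)) for a in applicants)
    return 1.0 - model / baseline if baseline > 0 else 0.0


# Toy data, invented for illustration only.
applicants = [
    Applicant(800.0, 100.0, 0.60, True),
    Applicant(800.0, 100.0, 0.05, False),
    Applicant(200.0, 150.0, 0.30, False),
    Applicant(1000.0, 50.0, 0.10, False),
]

if __name__ == "__main__":
    print(f"savings vs. accept-all baseline: {savings(applicants):.4f}")
```

Because the costs depend on the individual applicant, the same fraud probability can lead to different decisions: in the toy data, a 10% score is enough to reject an applicant with an expensive device and a small margin, while a 30% score on a cheap device is still accepted.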
Georg Heiler
Researcher & data scientist

My research interests include the analytics of large geospatial, temporal, and network data.