Thesis of Yvan Lucas


Subject:
Credit Card Fraud Detection using Machine Learning with Integration of Contextual Knowledge

Summary:

In recent years, credit and debit card usage has increased significantly. However, a non-negligible share of credit card transactions is fraudulent, and billions of euros are stolen every year worldwide. In Belgium alone, the volume of credit card transactions reached 16 billion euros in 2017, including 140 million euros of illegitimate transactions.

Credit card fraud detection presents several characteristics that make it a challenging task. First, the feature set describing a credit card transaction usually ignores detailed sequential information, which has been shown to be very relevant for detecting fraudulent credit card transactions. Second, purchase behaviours and fraudster strategies may change over time, making a learnt fraud detection decision function irrelevant if it is not updated. This phenomenon, named dataset shift (a change in the joint distribution p(x,y)), may prevent fraud detection systems from performing well. We conducted an exploratory analysis to quantify the day-by-day dataset shift and identified calendar-related time periods that show different properties. Third, credit card transaction data suffer from a strong class imbalance (fewer than 1% of the transactions are fraudulent), which needs to be addressed either from the classifier perspective or from the data perspective.
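As an illustration of the two perspectives on class imbalance mentioned above, the following minimal Python sketch shows one classifier-side option (class weights) and one data-side option (random undersampling of genuine transactions). It is not the procedure used in the thesis: the function names, the `ratio` parameter and the toy labels are assumptions made for this example, and the weighting formula is the common "balanced" heuristic n_samples / (n_classes * class_count).

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Classifier perspective: weight each class by
    n_samples / (n_classes * class_count), so rare classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def undersample_majority(labels, ratio=1.0, seed=0):
    """Data perspective: keep every fraud (label 1) and a random subset of
    genuine transactions (label 0), about `ratio` genuine per fraud.
    Returns the sorted indices of the transactions to keep."""
    fraud = [i for i, y in enumerate(labels) if y == 1]
    genuine = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(genuine), int(ratio * len(fraud)))
    kept = random.Random(seed).sample(genuine, n_keep)
    return sorted(fraud + kept)


# Toy labels (hypothetical): 1% of the transactions are fraudulent.
labels = [0] * 99 + [1]
weights = balanced_class_weights(labels)
kept_indices = undersample_majority(labels, ratio=4.0)
print(weights)
print(kept_indices)
```

Weights of this kind can usually be passed to a classifier's loss function, whereas the kept indices define a smaller, less imbalanced training set.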

Solutions for integrating sequential information in the feature set exist in the literature. The predominant one consists in creating a set of features which are descriptive statistics obtained by aggregating the sequences of transactions of the card-holders (sum of amounts, count of transactions, etc.). We used this method as a benchmark feature engineering method for credit card fraud detection. However, this feature engineering strategy raised several research questions. First, we assumed that these descriptive statistics cannot fully describe the sequential properties of fraudulent and genuine patterns, and that modelling the sequences of transactions could be beneficial for fraud detection. Moreover, the creation of these aggregated features is guided by expert knowledge, whereas sequence modelling could be automated thanks to the class labels available for past transactions. Finally, the aggregated features are point estimates that may be complemented by a multi-perspective univariate description of the transaction context (especially from the point of view of the seller).
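The aggregation strategy described above can be sketched as follows. This is a minimal illustration, not the benchmark implementation used in the thesis: the field names (`card_id`, `time`, `amount`), the 24-hour window and the toy transactions are assumptions. For each transaction, the sketch computes the count and the summed amount of the same card-holder's earlier transactions that fall inside the window.

```python
from collections import defaultdict, deque


def add_aggregates(transactions, window_hours=24.0):
    """Add per-card-holder aggregates over a sliding time window.

    `transactions` is a list of dicts with the hypothetical keys
    'card_id', 'time' (in hours) and 'amount', sorted by 'time'.
    """
    history = defaultdict(deque)  # card_id -> deque of (time, amount)
    enriched = []
    for tx in transactions:
        past = history[tx["card_id"]]
        # Drop earlier transactions that fell out of the window.
        while past and tx["time"] - past[0][0] > window_hours:
            past.popleft()
        enriched.append({
            **tx,
            "count_prev": len(past),
            "sum_prev": sum(amount for _, amount in past),
        })
        # The current transaction only becomes history afterwards.
        past.append((tx["time"], tx["amount"]))
    return enriched


# Toy data (hypothetical).
transactions = [
    {"card_id": "A", "time": 0.0, "amount": 10.0},
    {"card_id": "A", "time": 5.0, "amount": 20.0},
    {"card_id": "B", "time": 6.0, "amount": 5.0},
    {"card_id": "A", "time": 28.0, "amount": 7.0},
]
enriched = add_aggregates(transactions)
for row in enriched:
    print(row)
```

Such aggregates summarise a window with a handful of numbers and discard the order of the transactions inside it, which is the limitation the research questions above point to.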

We proposed a multi-perspective automated feature engineering strategy based on hidden Markov models (HMMs) in order to incorporate a broad spectrum of sequential information in the transaction feature sets. Specifically, we model the genuine and fraudulent behaviours of the merchants and the card-holders according to two univariate features: the timing and the amount of the transactions. Moreover, the HMM-based features are created in a supervised way and therefore reduce the need for expert knowledge when building the fraud detection system. In the end, our multi-perspective HMM-based approach offers automated feature engineering that models temporal correlations so as to complement, and possibly supplement, transaction aggregation strategies in order to improve the effectiveness of the classification task.
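One plausible way to turn an HMM into a feature, sketched below, is to score the recent sequence of a card-holder (or merchant) with its likelihood under a "genuine" model and under a "fraudulent" model. The summary does not state how the features are computed, so this is an assumption. The sketch uses the scaled forward algorithm on discretised amounts. All parameters are hand-set, hypothetical values chosen for illustration; in the supervised setting described above they would instead be estimated from labelled past sequences. Read literally, the description suggests up to eight such models (genuine or fraudulent, merchant or card-holder, timing or amount); only one pair is shown here.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Scaled forward algorithm: log P(obs | HMM) for discrete emissions.

    start[s]    : probability of starting in hidden state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][o]  : probability of observing symbol o in state s
    Assumes every observed symbol has nonzero probability under the model.
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    log_lik = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[r] * trans[r][s] for r in range(n_states))
                * emit[s][obs[t]]
                for s in range(n_states)
            ]
        scale = sum(alpha)  # rescale to avoid numerical underflow
        log_lik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_lik


# Hypothetical models over amount bins 0 = low, 1 = medium, 2 = high.
GENUINE = dict(start=[0.8, 0.2],
               trans=[[0.9, 0.1], [0.5, 0.5]],
               emit=[[0.7, 0.25, 0.05], [0.3, 0.5, 0.2]])
FRAUD = dict(start=[0.5, 0.5],
             trans=[[0.5, 0.5], [0.2, 0.8]],
             emit=[[0.2, 0.3, 0.5], [0.05, 0.15, 0.8]])


def hmm_features(obs):
    """Two features for one sequence: its log-likelihood under each model."""
    return {
        "loglik_genuine": forward_log_likelihood(obs, **GENUINE),
        "loglik_fraud": forward_log_likelihood(obs, **FRAUD),
    }


high_amounts = hmm_features([2, 2, 2])
low_amounts = hmm_features([0, 0, 0])
print(high_amounts)
print(low_amounts)
```

With these toy parameters, a run of high amounts is more likely under the fraud model and a run of low amounts is more likely under the genuine model. The resulting values can be appended to a transaction's feature vector alongside aggregated features.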

Experiments conducted on a large real-world credit card transaction dataset (46 million transactions from Belgian card-holders between March and May 2015) have shown that the proposed HMM-based feature engineering increases the detection of fraudulent transactions when combined with the state-of-the-art expert-based feature engineering strategy for credit card fraud detection.

To conclude, this work leads to a better understanding of what can be considered contextual knowledge for a credit card fraud detection task and of how to include it in the classification task in order to improve fraud detection. The proposed method can be extended to any supervised task on sequential data.


Advisor: Sylvie Calabretto
Co-advisors: Léa Laporte, Pierre-Edouard Portier

Defense date: Wednesday, December 4, 2019

Jury:
Prof. Soulé-Dupuy Chantal, Professor, University of Toulouse, Reviewer
Prof. Gaussier Eric, Professor, Grenoble Alps University, Reviewer
Prof. Lux Mathias, Professor, Alpen-Adria Universität, Examiner
Dr. Gianini Gabriele, Associate Professor, University of Milan, Examiner
Prof. Calabretto Sylvie, Professor, INSA Lyon, Co-supervisor
Prof. Granitzer Michael, Professor, Universität Passau, Co-supervisor
Dr. Portier Pierre-Edouard, Associate Professor, INSA Lyon, Co-advisor
Dr. Laporte Léa, Associate Professor, INSA Lyon, Co-advisor