Leveraging Machine Learning in Late Repayment Prediction

Li-Da Gan
May 2, 2021 · 11 min read

Foreword

Loan defaulter prediction is one of the classic problems solved with machine learning. In one of Nanyang Poly's specialist diploma projects, my teammates and I tried to solve a related credit risk problem — late repayment. Instead of using Python, the most common ML tool, we used RapidMiner, a data science software platform, for data preparation, training and testing.

Background

Since the global financial crisis, there has been a constant focus on risk management in banks. Machine learning (ML) has an increasing influence on business applications, with many solutions already implemented and many more being explored. The use of machine learning in credit risk management has been a hot topic in the banking industry for some time, from deciding how much a bank should lend to a customer, to detecting transaction fraud, to improving compliance and reducing risk.

Problem

NYP Bank, a fictional local Singaporean bank, provides housing loans to customers in exchange for the promise of repayment with interest. That means the bank only makes a profit (interest) if the customer pays off the loan. There are two main risks for the NYP Bank housing loan department. The first is the risk of default. NYP Bank has processes in place to insure its housing loan accounts receivable balance, which helps to minimize the value of bad debt written off. The second is the risk of late repayment, which this insurance does not guard against.

Business Objectives

The NYP Bank housing loan department would like to expand on the idea of how machine learning can improve risk management. They decided to use supervised machine learning (ML) algorithms to uncover a range of independent variables from NYP Bank's housing loan data and determine their relationship with the likelihood of late repayment, and to develop an ML-powered application that assesses the riskiness of a new housing loan application by predicting whether the customer/applicant will repay the loan by its due date. For each identified potential late repayer, the housing loan department might conduct a more stringent assessment of the individual's creditworthiness and, if needed, add a customized 'late payment policy' to the loan agreement.

The NYP Bank housing loan department, the project stakeholders, expects the new application to capture 70% of true late repayers.

Work Accomplished

The project tasks were planned based on CRISP-DM (Cross-Industry Standard Process for Data Mining). The CRISP-DM methodology is an idealized sequence of events; in practice the tasks can be performed in a different order, and it is often necessary to backtrack to previous tasks and repeat certain actions. For example, the project team repeated data preparation tasks during the modeling stage.

Choosing Machine Learning technique

The prediction of the likelihood of late repayment is a classification problem. Support Vector Machines (SVM) and Decision Trees (DT) are among the most popular techniques for classification tasks. The project team decided to use RapidMiner's LibSVM learner in the modeling for the reasons below (a minimal code sketch follows the list):

  • LibSVM is a popular open source machine learning library supporting classification and regression. It has been reused in many machine learning toolkits, including RapidMiner and scikit-learn, and in programming languages such as Java, MATLAB, R and Python. The team has been considering integrating the LibSVM libraries into the development of new web-based risk assessment applications.
  • LibSVM is capable of solving both linear and nonlinear SVMs. A recent study also shows that if the RBF kernel is used, there is no need to consider the linear kernel, which suited us given our limited understanding of the data patterns at the initial stage.
  • Combined with the k-fold cross validation method, it is fast and efficient to optimize the parameters of a LibSVM model.
  • LibSVM supports internal multiclass learning, should multiclass labels be required in a future application enhancement.
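For readers more familiar with Python, a minimal sketch of the same idea is shown below using scikit-learn's SVC, which wraps LibSVM, with an RBF kernel. The synthetic data, feature count and parameter values are placeholders for illustration only, not NYP Bank's prepared dataset or the team's RapidMiner process.

```python
# Minimal sketch: training an RBF-kernel SVM; scikit-learn's SVC wraps LibSVM.
# The synthetic data stands in for the prepared housing-loan features and is
# NOT the project's actual dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Placeholder data: 1,484 records, 7 numeric features, binary label
# (1 = late repayer, 0 = good repayment history), ~59/41 class split.
X, y = make_classification(n_samples=1484, n_features=7,
                           weights=[0.59, 0.41], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# The RBF kernel covers both linear and non-linear decision boundaries.
model = SVC(kernel="rbf", C=1.0, gamma="scale")
model.fit(X_train, y_train)
print(f"Test accuracy: {model.score(X_test, y_test):.2%}")
```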

Data Understanding

The project team had collected and analyzed the available data across multiple business functions in NYP Bank. The data covered:

  • 1,484 housing loan records and their loan payment and late fee history between January 2008 and December 2016 — 613 (41%) housing loans incurred 1–8 late fees and 871 (59%) loans maintained a good repayment history.
  • Customer demographic data, including basic information such as gender, age, marital status and education level, as well as employment status, occupation, industry and annual income.
  • Other financial products with NYP Bank, for example auto loans and their payment and default history, credit cards and their transaction history, and deposit accounts and their transaction history.

However, the project team was unable to identify any recurring income or loan repayments based on the deposit account transaction types, so all deposit-related data were excluded during data preparation.

Data Preparation

During data preparation, referring to NYP Bank's housing loan application criteria and procedures, the project team used the expert knowledge method to perform the initial feature selection. The following 19 attributes were chosen or derived across NYP Bank's housing loan, customer master, customer employment, auto loan and credit card data:

  • Likelihood of late repayment, used as the label, derived from the existence of any late fee record for a housing loan.
  • Loan principal amount
  • Interest rate
  • Loan period
  • Estimated Monthly Repayment, derived with the PMT function, with principal, interest rate and number of payments as input arguments (a small sketch of this calculation follows the list).
  • Property purchase price
  • Property type
  • Age
  • Gender
  • Marital status
  • Resident status
  • Number of Dependents
  • Education Level
  • Years of working on loan start date, derived as loan start date minus job start date.
  • Monthly income, derived as annual income / 12 months.
  • Credit card average payments, derived as the average monthly aggregated transaction amount from credit card transaction data.
  • Auto loan payments
  • Defaulted, derived by checking the existence of a defaulted loan for the customer.
  • Debt-to-income ratio, derived by dividing the sum of monthly loan payments by monthly income.
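To illustrate how the Estimated Monthly Repayment attribute above can be derived, here is a minimal sketch of the standard annuity (PMT) formula in Python; the loan figures in the example are hypothetical and not taken from the project data.

```python
# Minimal sketch of the annuity (PMT) formula behind Estimated Monthly
# Repayment. The example loan figures are hypothetical.
def monthly_repayment(principal: float, annual_rate: float, years: int) -> float:
    """Fixed monthly payment for a fully amortizing loan."""
    n = years * 12            # total number of monthly payments
    r = annual_rate / 12      # monthly interest rate
    if r == 0:                # zero-interest edge case
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)

# Hypothetical example: S$500,000 loan at 2.6% p.a. over 25 years.
print(f"{monthly_repayment(500_000, 0.026, 25):.2f}")
```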

Besides feature selection, the following tasks were performed to filter and transform the data:

  • Removed personally identifiable information, such as names and account IDs, from the datasets to protect privacy.
  • Inspected unique categories in all categorical variables.
  • Addressed noisy data, such as job start date later than loan start date (the project team set years of working to at least 1), age < 17 or age > 75 on the loan start date (the project team decided to remove the age attribute), and loan period and loan-to-value criteria that did not meet current housing rules (the project team decided not to apply the rules as doing so would reduce the amount of available data).
  • Used the box plot method to detect outliers; for example, 340 outliers were identified in the loan principal amount. Some of these outliers were removed during the data preparation stage, but the project team added the records back into the datasets during the modelling stage to improve model accuracy.
  • Encoded all categorical variables, as LibSVM supports only numerical data.
  • Used min-max normalization to scale all the continuous variables into the range 0–5 (initially the scale was 0–10). This helped to improve the performance of the machine learning algorithm and prevented it from being biased toward particular variables (a minimal sketch of the encoding and scaling appears after this list).
  • Combined data from multiple sources to form the datasets and dropped the duplicates resulting from the data joining process.
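The categorical encoding and min-max scaling steps above can be sketched as follows in Python; the column names and values are placeholders rather than the project's actual schema.

```python
# Minimal sketch: encode categorical variables and min-max scale continuous
# variables into the 0-5 range. Column names and values are placeholders.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    "marital_status": ["single", "married", "married", "divorced"],
    "loan_principal": [350_000.0, 500_000.0, 275_000.0, 620_000.0],
    "monthly_income": [4_500.0, 7_200.0, 3_800.0, 9_100.0],
})

# Integer-encode categorical variables (LibSVM accepts only numeric input).
df["marital_status"] = df["marital_status"].astype("category").cat.codes

# Scale continuous variables to 0-5 so no single attribute dominates.
continuous = ["loan_principal", "monthly_income"]
df[continuous] = MinMaxScaler(feature_range=(0, 5)).fit_transform(df[continuous])
print(df)
```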

Modelling

RapidMiner Design View

The goal of model training was to predict true late repayers correctly as often as possible, with minimal compromise in the acceptance of good loan applications. The project team used a confusion matrix to measure the overall performance of the model against the KPI — capture 70% of true late repayers.

To improve the performance of the SVM model, dimensionality reduction, training/test data split selection, parameter tuning, and trial-and-error techniques were applied:

Dimensionality reduction

A correlation matrix was built over the initially selected attributes, and the correlation coefficient (r value) between each pair of attributes was calculated. Attributes with r < 0.02 were removed as they likely carried little useful information. At the same time, pairs of attributes with a high correlation coefficient were filtered down to only one of the pair, as the two were likely to carry very similar information.

After applying the correlation matrix, the 19 attributes were reduced to 8, including the label attribute (the selected attributes are highlighted in red boxes in the correlation matrix figure).
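A hedged sketch of this filtering logic is shown below. The |r| < 0.02 rule comes from the text, while the "late_repayment" label name and the 0.9 cut-off used to decide that a pair of attributes is "highly correlated" are assumptions made only for illustration.

```python
# Sketch of correlation-based feature filtering. The |r| < 0.02 threshold
# mirrors the text; the 0.9 cut-off and the DataFrame layout (numeric
# features plus a "late_repayment" label column) are assumptions.
import pandas as pd

def filter_by_correlation(df, label="late_repayment", low=0.02, high=0.9):
    corr = df.corr().abs()
    # Drop attributes with almost no relationship to the label.
    candidates = [c for c in df.columns
                  if c != label and corr.loc[c, label] >= low]
    # For each highly correlated pair, keep only the first attribute seen.
    selected = []
    for c in candidates:
        if all(corr.loc[c, kept] < high for kept in selected):
            selected.append(c)
    return selected
```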

Training and test data split

The 80/20 and 70/30 data splits provided similar accuracy, while the other ratios performed poorly.

Parameter tuning

The project team used the Optimize Parameters operator combined with the k-fold Cross Validation operator in RapidMiner to find the optimal values for the following key parameters of the LibSVM model, in order to achieve the best overall prediction accuracy:

— C, the regularization value, which controls the trade-off between a wide margin for the separating hyperplane and the penalty for misclassified training examples.

— Gamma, which defines how far the influence of a single training example reaches.

With 5-fold cross validation, the parameters were optimized to achieve close to 70% accuracy — SVM.gamma = 0.8416241606222868 (range: 0.0 ~ +∞), SVM.C = 6876.302257326854 (range: 0.0 ~ +∞).
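Outside RapidMiner, an equivalent search can be sketched with scikit-learn's GridSearchCV, which also wraps 5-fold cross validation; the grid values and synthetic data below are placeholders, not the operator settings or data actually used.

```python
# Sketch: tuning C and gamma with 5-fold cross validation, the scikit-learn
# counterpart of RapidMiner's Optimize Parameters + Cross Validation operators.
# The grid values and synthetic data are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1484, n_features=7, random_state=42)

param_grid = {
    "C": [1, 10, 100, 1_000, 10_000],     # penalty for misclassification
    "gamma": [0.01, 0.1, 0.5, 1.0, 5.0],  # reach of each training example
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, f"CV accuracy {search.best_score_:.2%}")
```

In the project itself, this search was performed inside RapidMiner rather than in Python.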

Trial and error method

While tweaking the model through trial and error, the following adjustments were made along the way:

  • Adjusted numeric value scaling from 0–10 to 0–5 to reduce data dimensionality.
  • Eventually removed the debt-to-income ratio and selected Total Monthly Income and Total Monthly Expense/Loan Repayment instead, since both attributes individually had higher r values than the derived ratio.
  • “Years of working on loan start date” was missing from the prepared dataset. After a discussion with teammates to align the dataset, it was calculated and added, and it turned out to have a high r value.
  • Removed 200+ noisy records and outliers during data preparation but, during the modelling stage, decided to fix the data mistakes and add them back into the datasets. I assume this helped because having more training data points in either class aided the classification, and because the outlier removal relied on a possibly flawed assumption of normally distributed data in our business case (the box plot rule behind the outlier detection is sketched below).
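For reference, the box plot rule used to flag those outliers during data preparation is essentially the 1.5 × IQR whisker rule, sketched below on hypothetical loan principal values.

```python
# Sketch of box-plot (1.5 x IQR) outlier detection. The loan principal
# values below are hypothetical placeholders.
import pandas as pd

def iqr_outliers(values: pd.Series) -> pd.Series:
    """Return a boolean mask marking values outside the box-plot whiskers."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return (values < lower) | (values > upper)

loan_principal = pd.Series([280_000, 350_000, 410_000, 520_000, 2_500_000])
print(iqr_outliers(loan_principal))   # only the 2,500,000 record is flagged
```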

Evaluation

Applying the SVM model with an 80/20 training/test data split, the trained model achieved an overall accuracy of 70.09% in training and 70.71% in testing, which indicated no sign of overfitting or underfitting.

Training Performance
Testing Performance

During testing, the trained SVM model was able to capture close to 66% of the True Positive results (cases with a high likelihood of late repayment or even a defaulted payment), with a 26% impact (False Positives) on the acceptance of good loan applications, which might result in lost business opportunities or an unsatisfactory customer experience.
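These figures come from the model's confusion matrix. The sketch below reproduces the arithmetic from four hypothetical cell counts, chosen only to be roughly consistent with the reported test percentages; they are not the project's actual confusion matrix.

```python
# Sketch: deriving the reported metrics from a confusion matrix.
# The cell counts are hypothetical, chosen only to be roughly consistent
# with the reported test figures.
tp, fn = 80, 42    # true late repayers: caught vs missed
fp, tn = 45, 130   # good applicants: wrongly flagged vs correctly passed

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_late = tp / (tp + fn)        # share of true late repayers captured
precision_late = tp / (tp + fp)     # share of flagged applicants truly late
false_pos_rate = fp / (fp + tn)     # share of good applicants wrongly flagged

print(f"accuracy {accuracy:.1%}, recall {recall_late:.1%}, "
      f"precision {precision_late:.1%}, FPR {false_pos_rate:.1%}")
```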

The project team also compared it with the trained Decision Tree (DT) model built by team B, which had an overall accuracy of 61.08% in training and 65.35% in testing.

Decision Tree model’s training performance
Decision Tree model’s testing performance

The DT model had a higher recall rate (78%) in identifying True Positive cases in testing, but it had a lower precision rate and a greater chance (a much lower True Negative recall rate) of misjudging a good loan applicant.

Conclusion

The project team proposes that the stakeholders adopt the SVM model, which has the higher overall accuracy. Its True Positive recall rate is not far from the 70% target, and the project team believes it can be improved with more quality data.

While iteratively improving the SVM model by adding more quality labeled data, fixing current data quality issues and performing regular parameter optimization, the project team also suggests that the stakeholders streamline the current housing loan application review processes to reduce waiting time, and offer better loan conditions to compensate the genuinely good customers who are impacted by the model's prediction errors.

Reflection

The project has given me experience with the CRISP-DM methodology, from business and data understanding, through data preparation, to modeling and evaluation. I was a novice at machine learning. While handling the data and training the model, a series of trials and errors helped to deepen my knowledge and understanding of various data preparation and modeling techniques.

However, I also didn’t have a clear answer for some questions raised during the process especially when the machine learning algorithms are like a black box to me, for examples:

The debt-to-income ratio was derived by dividing total monthly loan repayment by total monthly income, so why did it have a lower correlation coefficient than both total monthly loan repayment and total monthly income? I assume that potential outliers and noisy data in both attributes might have had a compounding effect when deriving the ratio. It needs more data analysis.

How do we determine a balanced scaling range: 0–5, 0–10, or 1–5? I assume that, depending on the data, coarser binning might either help to reduce data dimensionality or lead to underfitting.

What is a good accuracy percentage for an SVM model? I believe this comes back to the underfitting and overfitting issue again.

I have realized the great potential of machine learning and have been exploring opportunities to apply the skills and knowledge learned in my area of work. For example, I shared my Business Analytics Essential assignment with my director and proposed the potential application of machine learning, combined with Statistical Process Control (SPC), to production quality prediction. I was then tasked to be part of an AWS Computer Vision project and to explore the potential to automate work-in-progress (WIP) component image analysis with machine learning.

However, I have also realized the challenges of an imperfect real-world environment, such as a lack of quality labeled data and a lack of efficient data collection mechanisms. For example, in the AWS Computer Vision project, we found that the existing electronic microscopes are unable to take high-resolution images for data extraction, and there are not enough defective parts (labeled data) to train the machine learning model. Moreover, every data collection process carries a cost, such as acquiring new data collection mechanisms, re-engineering business processes to break down database silos, and setting up various data points.

Unfortunately, machine learning models are data-hungry and their performance relies heavily on the size of the training data available. A machine learning-powered application cannot prove its usefulness in the short term until the right quantity and quality of data are in place.


Li-Da Gan

IT Solution (ERP MFG & MES) Architect exploring the potential of Machine Learning to improve manufacturing processes. linkedin.com/in/ganlida