EDA · Finance · Risk Analysis

Lending Business Case Study: What the Data Says About Loan Defaults

An exploratory data analysis (EDA) of 2007–2011 lending data that identifies the factors driving loan defaults: amount-to-income ratio, revolving utilisation, derogatory records, and loan purpose.


The lending business sits at the intersection of two risks: reject a creditworthy applicant and lose revenue; approve a defaulter and absorb the loss. Good credit risk analysis is about finding the signals that separate the two.

This case study works through 2007–2011 lending data to identify the factors most predictive of loan defaults — a 14.16% base rate that varies dramatically depending on how you slice it.


The Two-Sided Risk Problem

Every rejected loan application is a lost customer. Every approved default is a direct financial loss. The goal isn’t to minimise one — it’s to find the optimal cutoff that minimises expected loss across both.
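
One way to make the trade-off concrete is to add up the cost of both error types at each candidate cutoff on a labelled historical sample and keep the cheapest. The sketch below is illustrative only: the scores, the loss per default, and the profit forgone per rejected good borrower are hypothetical placeholders, not figures from the dataset.

```python
# Hypothetical applicants: (estimated default probability, actually defaulted?)
applicants = [
    (0.05, False), (0.10, False), (0.15, False), (0.20, True),
    (0.30, False), (0.40, True), (0.60, True), (0.80, True),
]

LOSS_PER_DEFAULT = 10_000      # assumed loss when an approved loan defaults
PROFIT_PER_GOOD_LOAN = 2_000   # assumed revenue lost when a good applicant is rejected


def total_loss(cutoff):
    """Cost of approving everyone scored below `cutoff` and rejecting the rest."""
    loss = 0
    for score, defaulted in applicants:
        approved = score < cutoff
        if approved and defaulted:
            loss += LOSS_PER_DEFAULT        # approved a defaulter
        elif not approved and not defaulted:
            loss += PROFIT_PER_GOOD_LOAN    # rejected a good borrower
    return loss


cutoffs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.7, 0.9]
best_cutoff = min(cutoffs, key=total_loss)
```

With these made-up numbers the cheapest cutoff is the one that rejects a single good borrower rather than approving the first defaulter. The answer depends entirely on the assumed cost ratio, which is why both costs need to be estimated before a cutoff is chosen.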

Exploratory data analysis is the first step: before building a model, understand which variables actually discriminate between defaulters and payers.


Key Risk Signals

1. Amount-to-Income Ratio

Loans where the amount-to-income ratio exceeds 25% showed significantly elevated default rates. That makes the ratio a natural first filter: borrowers who are already over-extended are a high-risk cohort whatever their credit score.
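
A minimal sketch of how the split could be checked, assuming each loan record carries a loan amount, an annual income, and a charged-off flag. The field names and the six sample rows are hypothetical and only stand in for the real dataset.

```python
# Hypothetical loan records; field names are assumptions, not the dataset's schema.
loans = [
    {"loan_amount": 5_000,  "annual_income": 60_000, "charged_off": False},
    {"loan_amount": 12_000, "annual_income": 40_000, "charged_off": True},
    {"loan_amount": 20_000, "annual_income": 50_000, "charged_off": True},
    {"loan_amount": 8_000,  "annual_income": 80_000, "charged_off": False},
    {"loan_amount": 15_000, "annual_income": 45_000, "charged_off": False},
    {"loan_amount": 3_000,  "annual_income": 30_000, "charged_off": False},
]

RATIO_THRESHOLD = 0.25  # the 25% cut-off discussed above


def amount_to_income(row):
    """Loan amount as a fraction of annual income."""
    return row["loan_amount"] / row["annual_income"]


def default_rate(rows):
    """Share of rows that were charged off; None for an empty group."""
    rows = list(rows)
    if not rows:
        return None
    return sum(r["charged_off"] for r in rows) / len(rows)


over = [r for r in loans if amount_to_income(r) > RATIO_THRESHOLD]
under = [r for r in loans if amount_to_income(r) <= RATIO_THRESHOLD]

rate_over = default_rate(over)
rate_under = default_rate(under)
```

Comparing `rate_over` with `rate_under` on the full dataset is what would support or undermine the 25% threshold.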

2. Revolving Line Utilisation

Borrowers with revolving utilisation above 75% combined with high-value loans represent a compounding risk. High utilisation signals that existing credit is already stretched — adding more debt rarely improves the situation.
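
The combined condition can be written as a simple flag. This is a sketch under assumptions: the article does not say what counts as a high-value loan, so `LARGE_LOAN_THRESHOLD` below is a hypothetical placeholder that would need to be set from the data.

```python
UTILISATION_THRESHOLD = 75.0    # percent, from the text above
LARGE_LOAN_THRESHOLD = 20_000   # assumed; "high-value" is not defined in the article


def is_compounding_risk(revolving_utilisation_pct, loan_amount):
    """True only when both conditions hold: stretched credit and a large loan."""
    return (revolving_utilisation_pct > UTILISATION_THRESHOLD
            and loan_amount >= LARGE_LOAN_THRESHOLD)
```

Requiring both conditions keeps the flag narrow: a heavily utilised borrower asking for a small loan, or a lightly utilised borrower asking for a large one, is not caught by this rule.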

3. Derogatory Records

Previous derogatory marks on a borrower’s record were the strongest single predictor of future default. This is consistent with the broader credit literature: past behaviour is the best predictor of future behaviour.
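
One common way to express the strength of a binary signal like this is relative risk: the default rate among flagged borrowers divided by the rate among everyone else. The sketch below assumes a boolean `has_derogatory` field (a hypothetical name) and uses made-up rows, so the resulting ratio says nothing about the real data.

```python
def relative_risk(records, flag_key, outcome_key="charged_off"):
    """Default rate of flagged rows divided by default rate of unflagged rows."""
    flagged = [r for r in records if r[flag_key]]
    unflagged = [r for r in records if not r[flag_key]]
    if not flagged or not unflagged:
        raise ValueError("need at least one row in each group")
    rate_flagged = sum(r[outcome_key] for r in flagged) / len(flagged)
    rate_unflagged = sum(r[outcome_key] for r in unflagged) / len(unflagged)
    if rate_unflagged == 0:
        raise ValueError("unflagged group has no defaults; ratio undefined")
    return rate_flagged / rate_unflagged


# Hypothetical rows; has_derogatory would come from a derogatory-record count > 0.
sample = [
    {"has_derogatory": True,  "charged_off": True},
    {"has_derogatory": True,  "charged_off": False},
    {"has_derogatory": False, "charged_off": True},
    {"has_derogatory": False, "charged_off": False},
    {"has_derogatory": False, "charged_off": False},
    {"has_derogatory": False, "charged_off": False},
]

risk_ratio = relative_risk(sample, "has_derogatory")
```

Computing the same ratio for each candidate signal gives a like-for-like way to rank them.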

4. Loan Purpose

Small business loans emerged as the top-defaulting category and should be approved with caution. The volatility of small business revenue makes repayment schedules difficult to maintain, especially during the 2007–2011 period, which overlapped with the financial crisis.
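
Ranking categories by default rate is a grouped count. The sketch below returns the loan count alongside each rate, since a rate computed on a handful of loans deserves less weight. The purpose labels and rows are hypothetical.

```python
from collections import defaultdict


def default_rate_by(records, key, outcome_key="charged_off"):
    """Return {category: (default_rate, n_loans)}, highest default rate first."""
    counts = defaultdict(lambda: [0, 0])  # category -> [defaults, total]
    for r in records:
        counts[r[key]][0] += int(r[outcome_key])
        counts[r[key]][1] += 1
    rates = {cat: (d / n, n) for cat, (d, n) in counts.items()}
    return dict(sorted(rates.items(), key=lambda item: item[1][0], reverse=True))


# Hypothetical rows; purpose labels are illustrative.
sample = [
    {"purpose": "small_business", "charged_off": True},
    {"purpose": "small_business", "charged_off": True},
    {"purpose": "small_business", "charged_off": False},
    {"purpose": "debt_consolidation", "charged_off": True},
    {"purpose": "debt_consolidation", "charged_off": False},
    {"purpose": "debt_consolidation", "charged_off": False},
    {"purpose": "debt_consolidation", "charged_off": False},
    {"purpose": "car", "charged_off": False},
    {"purpose": "car", "charged_off": False},
]

by_purpose = default_rate_by(sample, "purpose")
```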


Methodology

The analysis used four levels of examination:

  • Univariate analysis — distribution of each variable independently
  • Bivariate analysis — relationship between each variable and default status
  • Trivariate analysis — interaction effects between variables
  • Correlation mapping — identifying multicollinearity in the feature set
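
The correlation-mapping step can be sketched without any libraries: compute the Pearson coefficient for every pair of numeric columns and list the pairs whose absolute correlation exceeds a threshold. The 0.8 threshold, the column names, and the values below are assumptions for illustration, not the article's settings.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.8):
    """List (name_a, name_b, r) for column pairs with |r| >= threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) >= threshold:
                pairs.append((a, b, r))
    return pairs


# Hypothetical numeric features.
features = {
    "loan_amount": [5_000, 10_000, 15_000, 20_000, 25_000],
    "installment": [160, 330, 480, 650, 800],
    "dti": [12.0, 25.0, 8.0, 19.0, 15.0],
}

flagged = correlated_pairs(features)
```

In this toy example only loan amount and installment are flagged. Pairs like that carry nearly the same information, so a later model would typically keep one of the two.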

Charged-off loans (defaults) showed consistently higher average loan amounts, interest rates, and debt-to-income ratios than fully-paid loans. The higher interest rates suggest that pricing tracks risk in aggregate, but loan-level risk assessment can still improve.


Geographic Flag

Wyoming (WY) showed unusually high default amounts requiring further investigation. Geographic concentration risk is often overlooked in retail lending — local economic shocks can cluster defaults in ways that aggregate models miss.
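
A first step in that investigation could be to summarise charged-off amounts per state together with the number of loans behind each figure, so that it is clear whether an outlier rests on enough observations to be meaningful. The sketch below uses hypothetical field names and rows, and the `min_charged_off` default is an arbitrary assumption.

```python
from collections import defaultdict
from statistics import mean


def charged_off_by_state(records, min_charged_off=30):
    """Summarise charged-off loan amounts per state and mark thin samples."""
    amounts = defaultdict(list)
    for r in records:
        if r["charged_off"]:
            amounts[r["state"]].append(r["loan_amount"])
    return {
        state: {
            "n_charged_off": len(values),
            "mean_amount": mean(values),
            "small_sample": len(values) < min_charged_off,
        }
        for state, values in amounts.items()
    }


# Hypothetical rows for illustration only.
sample = [
    {"state": "WY", "loan_amount": 20_000, "charged_off": True},
    {"state": "WY", "loan_amount": 24_000, "charged_off": True},
    {"state": "CA", "loan_amount": 8_000, "charged_off": True},
    {"state": "CA", "loan_amount": 10_000, "charged_off": True},
    {"state": "CA", "loan_amount": 12_000, "charged_off": True},
    {"state": "CA", "loan_amount": 9_000, "charged_off": False},
]

state_summary = charged_off_by_state(sample, min_charged_off=3)
```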


What This Means for Credit Risk

The findings suggest a practical risk tiering framework:

Signal                                   | Action
-----------------------------------------|----------------------------------------------
Amount/income > 25%                      | Decline or require collateral
Revolving utilisation > 75% + large loan | Decline or reduce loan amount
Prior derogatory record                  | Escalate for manual review
Small business purpose                   | Apply higher interest rate + tighter terms
WY geography                             | Flag for geographic concentration monitoring
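
The framework can be read as a set of independent rules, each mapping one signal to one action. The sketch below encodes it that way. The `Application` fields, the `"small_business"` label, and the `LARGE_LOAN` cut-off are hypothetical, since the article does not define them.

```python
from dataclasses import dataclass


@dataclass
class Application:
    # All field names are illustrative assumptions.
    loan_amount: float
    annual_income: float
    revolving_utilisation_pct: float
    has_derogatory_record: bool
    purpose: str
    state: str


LARGE_LOAN = 20_000  # assumed; "large loan" is not defined in the article


def recommended_actions(app):
    """Return every action from the tiering framework that applies to `app`."""
    # Treat a missing or zero income as an unbounded ratio rather than dividing by zero.
    ratio = app.loan_amount / app.annual_income if app.annual_income > 0 else float("inf")
    actions = []
    if ratio > 0.25:
        actions.append("Decline or require collateral")
    if app.revolving_utilisation_pct > 75 and app.loan_amount >= LARGE_LOAN:
        actions.append("Decline or reduce loan amount")
    if app.has_derogatory_record:
        actions.append("Escalate for manual review")
    if app.purpose == "small_business":
        actions.append("Apply higher interest rate + tighter terms")
    if app.state == "WY":
        actions.append("Flag for geographic concentration monitoring")
    return actions


example = Application(25_000, 60_000, 80.0, False, "small_business", "CA")
example_actions = recommended_actions(example)
```

Returning a list rather than a single verdict leaves the question of how to combine overlapping actions to the credit policy, which the framework itself does not specify.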

The 14.16% Default Rate in Context

A 14.16% default rate sounds high — but it’s an average. The high-risk cohorts described above had materially higher rates, while low-risk borrowers were well below the mean. The value of good credit risk analysis isn’t changing the average — it’s separating the distribution.
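
The point follows from the fact that a portfolio default rate is a weighted average of cohort rates. Writing $w_k$ for the share of loans in cohort $k$ and $r_k$ for its default rate gives the first line below. The second line is a purely hypothetical two-cohort split (20% of loans in a high-risk cohort defaulting at 30%), not a figure from the dataset; it only shows what the remaining loans would have to default at for the total to equal 14.16%.

```latex
\bar{r} = \sum_k w_k \, r_k, \qquad \sum_k w_k = 1

0.1416 = 0.20 \times 0.30 + 0.80 \times r_{\text{low}}
\;\Rightarrow\;
r_{\text{low}} = \frac{0.1416 - 0.06}{0.80} = 0.102
```

Under that assumed split, four out of five borrowers would sit at roughly 10%, well below the headline rate.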

Original analysis published on Medium.