💰 Loan Approval Prediction Using Machine Learning

MAIN TOOL

R Programming Language

Secondary tool

Tableau

INDUSTRY

Housing

📚 About the Project

This project focuses on predicting loan approval using a variety of machine learning models, including Logistic RegressionDecision TreesRandom Forests, and Gradient Boosting. The analysis is based on the Loan Dataset, comprising 614 records and 13 attributes related to loan applicants. The aim is to automate and optimize the loan approval process while maintaining high accuracy and consistency.


🔑 Key Features

 

  1. Data Preprocessing:

    • Addressed missing values using k-Nearest Neighbor (kNN) imputation.
    • Feature transformations like scaling and log transformations for better model performance.
  2. Modeling Techniques:

    • Logistic Regression with feature selection.
    • Advanced tree-based models like Random Forests and XGBoost.
    • Support Vector Machines with SMOTE to handle class imbalance.
  3. Evaluation Metrics:

    • Accuracy, sensitivity, specificity, and ROC-AUC for model comparison.
    • Confusion matrices and variable importance plots for interpretability.
  4. Insights and Business Impact:

    • Identified key predictors influencing loan approval, such as Credit HistoryLoan Amount, and Marital Status.
    • Recommended ensemble methods for robust and scalable loan evaluation systems.

📊 Key Insights

 

  • Best Performing Model: Random Forest achieved the highest accuracy with a robust balance of sensitivity and specificity.
  • Critical Predictors:
    • Credit History: The strongest determinant of loan approval.
    • Loan Amount: Higher amounts are associated with reduced approval rates.
    • Property Area: Urban and semiurban properties showed distinct approval patterns.
  • Model Comparison:
    • Logistic Regression and Decision Trees provided interpretable results.
    • Ensemble methods (Random Forest and XGBoost) demonstrated superior performance in accuracy and robustness.

📜 Full Report

 

For a detailed analysis, including methodology, visualizations, and results, refer to the complete project report:
📄 Loan Approval Prediction Report


📂 Project Structure

 

.
├── Data/
│   ├── Loan_Dataset.csv
├── Scripts/
│   ├── Loan_Approval_Prediction.R
├── Reports/
│   ├── Loan_Approval_Prediction.pdf
├── README.md
 

🤝 Connect with Me

 

Feel free to reach out for feedback, questions, or collaboration opportunities:
LinkedInDr. Syed Faizan


Author: Syed Faizan
Master’s Student in Data Analytics and Machine Learning

R Code and the Report of the Analysis

#---------------------------------------------------------#
#Loan Approval Prediction - Syed Faizan #
# #
#---------------------------------------------------------#
# Starting with a clean environment
rm(list=ls())
# Clearing the Console
cat("\014") # Clears the console
# Clearing scientific notation
options(scipen = 999)
# Loading the necessary packages for Data cleaning and Analysis
# Installing libraries if not already installed
# Loading libraries
library(rpart, quietly = TRUE)
library(caret, quietly = TRUE)
library(randomForest, quietly = TRUE)
library(ipred, quietly = TRUE)
library(xgboost, quietly = TRUE)
library(MASS, quietly = TRUE)
library(rpart.plot, quietly = TRUE)
library(rattle)
library(readr)
# Reading the data set as a data frame
loanfinal <- read_csv("Loanfinal.csv")
# Checking the structure of the data
str(loanfinal)
# Checking for missing values
sum(is.na(loanfinal))
library(DataExplorer)
plot_missing(loanfinal)
# Imputing missing values using KNN
# Load the VIM library
library(VIM)
# Perform k-NN imputation
loanfinal_imputed <- kNN(loanfinal, k = 5) # k = 5 is a commonly used default
# Checking the structure after imputation
str(loanfinal_imputed)
# The kNN function adds additional columns with the suffix "_imp" to indicate
# which values were imputed. If you want to remove these indicators:
loanfinal_imputed <- loanfinal_imputed[, !grepl("_imp", names(loanfinal_imputed))]
sum(is.na(loanfinal_imputed))
loanfinal <- loanfinal_imputed
# Dropping any unnecessary variables if applicable
# e.g., loanfinal$some_column <- NULL
# Analyzing the 'Loan_Status' variable
table(loanfinal$Loan_Status)
# Feature Engineering
hist(loanfinal$ApplicantIncome,
xlab = 'Applicant Income',
ylab= 'Frequency',
main = 'Histogram of Applicant Income',
col = 'red')
# Try square root transformation
AppIncomesqrt <- sqrt(loanfinal$ApplicantIncome)
hist(AppIncomesqrt,
xlab = 'Square Root of Applicant Income',
ylab= 'Frequency',
main = 'Histogram of Applicant Income',
col = 'blue')
# Apply transformation
loanfinal$ApplicantIncome <- sqrt(loanfinal$ApplicantIncome)
#Apply similar transformations to other numeric variables after visualization
# Coapplicant Income
hist(loanfinal$CoapplicantIncome,
xlab = 'Coapplicant Income',
ylab = 'Frequency',
main = 'Histogram of Coapplicant Income',
col = 'green',
breaks = 50)
# Trying square root transformation
CoappIncomesqrt <- sqrt(loanfinal$CoapplicantIncome)
hist(CoappIncomesqrt,
xlab = 'Square Root of Coapplicant Income',
ylab = 'Frequency',
main = 'Histogram of Square Root of Coapplicant Income',
col = 'purple',
breaks = 50)
# Add a small constant to avoid log(0)
CoappIncome_log <- log(loanfinal$CoapplicantIncome + 1)
# Plot the histogram after log transformation
hist(CoappIncome_log,
xlab = 'Log of Coapplicant Income',
ylab = 'Frequency',
main = 'Histogram of Log Transformed Coapplicant Income',
col = 'lightblue',
breaks = 50)
sum(loanfinal$CoapplicantIncome == 0) #There are 273 zeros in this column
# I have decided to thus turn it into categorical column with 'yes' or 'no' as levels
loanfinal$CoapplicantIncome <- ifelse(loanfinal$CoapplicantIncome != 0, 'yes', 'no')
table(loanfinal$CoapplicantIncome) # 273 No and 341 Yes
loanfinal$CoapplicantIncome <- factor(loanfinal$CoapplicantIncome, levels = c('yes','no')) #factorize co-applicant income
# confirm factorization and the levels
class(loanfinal$CoapplicantIncome)
levels(loanfinal$CoapplicantIncome)
# Convert all character columns to factors
loanfinal[sapply(loanfinal, is.character)] <- lapply(loanfinal[sapply(loanfinal, is.character)], as.factor)
str(loanfinal)
# visualization
library(ggplot2)
ggplot(loanfinal, aes(x = CoapplicantIncome, fill = factor(Loan_Status))) +
geom_histogram(color = "red", alpha = 0.7, stat="count", bins = 30) +
scale_fill_manual(values = c("Y" = "blue", "N" = "orange")) + # Using actual levels from Loan_Status
labs(x = "Coapplicant Income", fill = "Loan Status") +
ggtitle("Coapplicant Income by Loan Status") +
theme_bw()
# LoanAmount
hist(loanfinal$LoanAmount,
xlab = 'Loan Amount',
ylab = 'Frequency',
main = 'Histogram of Loan Amount',
col = 'orange',
breaks = 50)
# Try square root transformation
LoanAmountsqrt <- sqrt(loanfinal$LoanAmount)
hist(LoanAmountsqrt,
xlab = 'Square Root of Loan Amount',
ylab = 'Frequency',
main = 'Histogram of Square Root of Loan Amount',
col = 'cyan',
breaks = 50)
# Apply transformation
loanfinal$LoanAmount <- sqrt(loanfinal$LoanAmount)
# Loan_Amount_Term
hist(loanfinal$Loan_Amount_Term,
xlab = 'Loan Amount Term',
ylab = 'Frequency',
main = 'Histogram of Loan Amount Term',
col = 'pink',
breaks = 50)
# Since distribution appears to be discrete no transformation is contemplated
# Data splicing
set.seed(12345)
train <- sample(1:nrow(loanfinal), size = ceiling(0.80 * nrow(loanfinal)), replace = FALSE)
loan_train <- loanfinal[train, ]
loan_test <- loanfinal[-train, ]
loan_test$Loan_Status <- factor(loan_test$Loan_Status, levels = c('Y', 'N')) #confirming the levels in target variable
loan_train$Loan_Status <- factor(loan_train$Loan_Status, levels = c('Y', 'N'))
# Confirming the structure of the train data frame
# Classification modelling
# Logistic regression
# Fit logistic regression model using all variables as predictors and Loan_Status as the response
glm.fits <- glm(Loan_Status ~ Gender + Married + Dependents + Education + Self_Employed + ApplicantIncome +
CoapplicantIncome + LoanAmount + Loan_Amount_Term + Credit_History + Property_Area,
data = loan_train,
family = binomial)
# Summary of the model
summary(glm.fits)
# Coefficients of the model
coef(glm.fits)
# p-values of the coefficients
summary(glm.fits)$coef[, 4]
# Visual confirmation of loan status by Marital Status
ggplot(loan_train, aes(x = Married, fill = factor(Loan_Status))) +
geom_histogram(color = "red", alpha = 0.7, stat="count", bins = 30) +
scale_fill_manual(values = c("Y" = "blue", "N" = "orange")) + # Using actual levels from Loan_Status
labs(x = "Married", fill = "Loan Status") +
ggtitle("Marital Status by Loan Status") +
theme_bw()
table(loan_train$Married, loan_train$Loan_Status) # [102, 62,
# 238, 90]
# I retrain the model using only the significant variables
# Checking model fit on training data
glm.fits_significant <- glm(Loan_Status ~ Married + Credit_History + Property_Area,
data = loan_train,
family = binomial)
# Summary of the updated model
summary(glm.fits_significant)
# Predict probabilities for the model with significant variables
glm.probs_significant <- predict(glm.fits_significant, type = "response")
# Convert probabilities to class predictions (using 0.5 threshold)
glm.pred_significant <- ifelse(glm.probs_significant > 0.5, "Y", "N")
# Convert predictions and actual Loan_Status to factors
glm.pred_significant <- factor(glm.pred_significant, levels = c("Y", "N"))
loan_train$Loan_Status <- factor(loan_train$Loan_Status, levels = c("Y", "N"))
# Confusion matrix using caret package for the model with significant variables
conf_matrix_significant <- confusionMatrix(glm.pred_significant, loan_train$Loan_Status)
# Print confusion matrix and performance metrics
print(conf_matrix_significant)
# Test data model fitting
# Predict probabilities on the test set using the model with significant variables
glm.probs_significant_test <- predict(glm.fits_significant, newdata = loan_test, type = "response")
# Convert probabilities to class predictions (using 0.5 threshold) on the test set
glm.pred_significant_test <- ifelse(glm.probs_significant_test > 0.5, "Y", "N")
# Convert predictions and actual Loan_Status to factors for comparison
glm.pred_significant_test <- factor(glm.pred_significant_test, levels = c("Y", "N"))
loan_test$Loan_Status <- factor(loan_test$Loan_Status, levels = c("Y", "N"))
# Confusion matrix using caret package for the test data
conf_matrix_significant_test <- confusionMatrix(glm.pred_significant_test, loan_test$Loan_Status)
# Print confusion matrix and performance metrics for the test set
print(conf_matrix_significant_test)
# ROC Curve
# Load necessary libraries
library(pROC) # For AUC and ROC calculations
library(ggplot2) # For plotting
# Predict probabilities on the test set using the model with significant variables
glm.probs_significant_test <- predict(glm.fits_significant, newdata = loan_test, type = "response")
# Create ROC object
roc_obj <- roc(loan_test$Loan_Status, glm.probs_significant_test, levels = c("Y", "N"))
# Print AUC value
auc_value <- auc(roc_obj)
print(paste("AUC:", auc_value))
# Plot the ROC curve
ggroc(roc_obj) +
ggtitle(paste("ROC Curve (AUC =", round(auc_value, 2), ")")) +
theme_bw()
# Linear Discriminant Analysis
# Load necessary packages
library(MASS) # For LDA
library(caret) # For confusionMatrix
# In LDA and QDA only numerical variables can be used as a common covariance is assumed by LDA.
# Extracting numeric columns from loan_train
numeric_cols_train <- sapply(loan_train, is.numeric)
train_numeric <- loan_train[, numeric_cols_train]
# Extracting numeric columns from loan_test
numeric_cols_test <- sapply(loan_test, is.numeric)
test_numeric <- loan_test[, numeric_cols_test]
# Checking the structure of the extracted numeric columns
str(train_numeric)
str(test_numeric)
# Add Loan_Status to the numeric data
train_numeric$Loan_Status <- loan_train$Loan_Status
test_numeric$Loan_Status <- loan_test$Loan_Status
# Now, fit the LDA model on training data using the numeric columns
lda_model <- lda(Loan_Status ~ ., data = train_numeric)
summary(lda_model)
# Predict on the test data using numeric columns
pred_lda <- predict(lda_model, test_numeric)$class
# Create a confusion matrix (predictions first, actuals second)
conf_matrix_lda <- confusionMatrix(pred_lda, test_numeric$Loan_Status)
# Print the confusion matrix and performance metrics for LDA
print(conf_matrix_lda)
# Quadratic discriminant Analysis
# Load necessary packages
library(MASS) # For QDA
library(caret) # For confusionMatrix
# Fit QDA model on training data using numeric columns
qda_model <- qda(Loan_Status ~ ., data = train_numeric)
summary(qda_model)
# Predict on the test data using numeric columns
pred_qda <- predict(qda_model, test_numeric)$class
# Create a confusion matrix (predictions first, actual observations second)
conf_matrix_qda <- confusionMatrix(pred_qda, test_numeric$Loan_Status)
# Print the confusion matrix and performance metrics for QDA
print(conf_matrix_qda)
# Tree Based modelling
# Classification Tree ( model using rpart)
tree <- rpart(Loan_Status ~ ., data = loan_train, method = "class")
rpart.plot(tree, nn = TRUE)
print(tree)
summary(tree)
# Best Complexity Parameter (Pruning the tree)
cp.optim <- tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"]
tree.pruned <- prune(tree, cp = cp.optim)
rpart.plot(tree.pruned, nn = TRUE)
summary(tree.pruned)
print(tree.pruned)
# Testing the model with Pruned Tree
pred_tree <- predict(object = tree.pruned, loan_test, type = "class")
conf_matrix_tree <- confusionMatrix(table(loan_test$Loan_Status, pred_tree))
print(conf_matrix_tree)
# Bagging and Random Forests
library(randomForest)
ncol_train <- ncol(loan_train) - 1
print(ncol_train) # 11 variables excepting the target variable Loan_Status.
# Bagging: Random Forest with mtry = total number of predictors
set.seed(1987)
bag.loan <- randomForest(Loan_Status ~ ., data = loan_train, mtry = 11, importance = TRUE)
# Print the bagging model details
print(bag.loan)
varImpPlot(bag.loan, main = "Variable Importance plot of bagging Tree model")
summary(bag.loan)
# Predict on the test set using Bagging model
bag_pred <- predict(bag.loan, newdata = loan_test)
# Confusion matrix for Bagging model predictions
conf_matrix_bag <- confusionMatrix(bag_pred, loan_test$Loan_Status)
print(conf_matrix_bag)
# Random Forest: Random Forest with mtry = subset of predictors (p/3 is default but
# shall use approximation of sqrt(p) that is near sqrt(11) which is 4)
set.seed(314)
rf.loan <- randomForest(Loan_Status ~ ., data = loan_train, mtry = 4, importance = TRUE)
# Print the random forest model details
print(rf.loan)
varImpPlot(rf.loan, main = "Variable Importance plot of Random Tree model")
summary(rf.loan)
# Predict on the test set using Random Forest model
rf_pred <- predict(rf.loan, newdata = loan_test)
# Confusion matrix for Random Forest model predictions
conf_matrix_rf <- confusionMatrix(rf_pred, loan_test$Loan_Status)
print(conf_matrix_rf)
# Boosting
# Load necessary libraries
# Load necessary libraries
install.packages("xgboost")
install.packages("caret")
library(xgboost)
library(caret)
# Create new data sets loan_train1 and loan_test1 to avoid overwriting original data sets
loan_train1 <- loan_train
loan_test1 <- loan_test
# Convert Loan_Status to a numeric binary (0 = N, 1 = Y) in the new data sets
loan_train1$Loan_Status <- ifelse(loan_train1$Loan_Status == "Y", 1, 0)
loan_test1$Loan_Status <- ifelse(loan_test1$Loan_Status == "Y", 1, 0)
# Convert categorical variables to dummy variables using model.matrix()
# Remove the intercept column generated by model.matrix() with `[, -1]`
train_matrix <- model.matrix(Loan_Status ~ . - 1, data = loan_train1)
test_matrix <- model.matrix(Loan_Status ~ . - 1, data = loan_test1)
# Check the resulting train_matrix and test_matrix
str(train_matrix)
str(test_matrix)
# Convert data to D Matrix format for xgboost
dtrain <- xgb.DMatrix(data = train_matrix, label = train_label)
dtest <- xgb.DMatrix(data = test_matrix, label = test_label)
# Set parameters for xgboost (for binary classification)
params <- list(
objective = "binary:logistic", # For binary classification
eval_metric = "error", # Use classification error as the evaluation metric
max_depth = 4, # Maximum depth of trees
eta = 0.1, # Learning rate (shrinkage)
nthread = 2 # Number of threads for parallel computing
)
# Train the xgboost model
set.seed(234)
xgb_model <- xgboost(
data = dtrain,
params = params, # Parameters are provided here
nrounds = 100, # Number of boosting rounds (trees)
verbose = 0 # Turn off printing of boosting iterations
)
summary(xgb_model)
# Predict on the test data
xgb_pred_prob <- predict(xgb_model, dtest)
# Convert predicted probabilities to binary class predictions using 0.5 as threshold
xgb_pred <- ifelse(xgb_pred_prob > 0.5, 1, 0)
# Confusion matrix using caret package (reconverting numeric predictions back to factors for evaluation)
xgb_pred_factor <- factor(xgb_pred, levels = c(0, 1), labels = c("Y", "N"))
test_label_factor <- factor(test_label, levels = c(0, 1), labels = c("Y", "N"))
# Print confusion matrix and performance metrics
conf_matrix_xgb <- confusionMatrix(xgb_pred_factor, test_label_factor)
print(conf_matrix_xgb)
# Advanced EDA using PCA and Clustering
# Creating a Data frame with only numerical variables
# Identify the numeric columns in the loan final data frame
numeric_cols_final <- sapply(loanfinal, is.numeric)
# Extract only the numeric columns into a new data frame
final_numeric <- loanfinal[, numeric_cols_final]
# View the first few rows of the numeric data frame
head(final_numeric)
# Implementing a Principal Component Analysis Model
names(final_numeric)
apply(final_numeric, 2, mean)
apply(final_numeric, 2, var)
pr.out=prcomp(final_numeric, scale=TRUE)
names(pr.out)
print(pr.out)
summary(pr.out)
par(mfrow = c(1, 1))
plot(summary(pr.out)$importance[3,], ylab = "Cumulative Proportion of Variance Explained",
xlab = "PC1 PC2 PC3 ", col= "red",pch= c(21, 22, 23, 24), cex = c(1,2,3,3), type = "b", bg = "yellow")
title(main = "Cumulative Proportion of Variance Explained by the 3 Principal Components")
pr.out$rotation=-pr.out$rotation
pr.out$x=-pr.out$x
biplot(pr.out, scale=0)
pr.out$sdev
pr.var=pr.out$sdev^2
pr.var
pve=pr.var/sum(pr.var)
pve
par(mfrow = c(1, 2))
plot(pve, xlab="Principal Component", ylab="Proportion of Variance Explained", ylim=c(0,1), col= "purple",pch= c(21, 22, 23, 24), cex = c(1,2,3,3), type = "b", bg = "orange")
plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1),col= "red",pch= c(21, 22, 23, 24), cex = c(1,2,3,3), type = "b", bg = "blue")
# Hierarchical Clustering
# Checking Correlation
cor_matrix <- cor(final_numeric)
library(ggcorrplot)
ggcorrplot(cor_matrix, lab = TRUE)
# Hierarchical clustering on different linkage methods
hc.complete <- hclust(dist(final_numeric), method = "complete")
hc.average <- hclust(dist(final_numeric), method = "average")
hc.single <- hclust(dist(final_numeric), method = "single")
# Set up a plotting area with 3 columns for subplots
par(mfrow = c(1, 3))
# Plot the dendrograms for different linkage methods
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = .9)
plot(hc.average, main = "Average Linkage", xlab = "", sub = "", cex = .9)
plot(hc.single, main = "Single Linkage", xlab = "", sub = "", cex = .9)
# Combining Hierarchical Clustering with PCA
hc.out <- hclust(dist(pr.out$x[, 1:3]))
plot(hc.out, main = "Hierarchical Clustering on Three Principal Components", xlab = "Observations")
clusters <- cutree(hc.out,2 )
table(clusters, loanfinal$Loan_Status) # clusters N Y 1 188 412
# 2 4 10
#SVM
# Load necessary libraries
library(e1071)
library(ROCR)
library(caret)
library(caTools)
library(smotefamily)
# Apply SMOTE to balance the training data
set.seed(123) # For reproducibility
train_numeric_smote <- SMOTE(train_numeric[, -ncol(train_numeric)], # Exclude the target variable for SMOTE
train_numeric$Loan_Status, # Target variable
K = 5)
# The SMOTE function from smotefamily package returns a list with 'data' that includes both features and the class
train_smote <- train_numeric_smote$data
# Ensure Loan_Status is a factor in the new dataset
train_smote$Loan_Status <- as.factor(train_smote$class)
# Drop the extra 'class' column (which is just a copy of Loan_Status)
train_smote$class <- NULL
# Check class distribution after SMOTE
table(train_smote$Loan_Status)
# Train an SVM model using the SMOTE-balanced data
svmfit_smote <- svm(Loan_Status ~ ., data = train_smote, kernel = "radial", cost = 100, gamma = 4)
summary(svmfit_smote)
tune.smote.svm <- tune(svm, Loan_Status ~ ., data = train_smote, kernel = "radial",
ranges = list(cost = c(0.1, 1, 10, 100, 1000),
gamma = c(0.5, 1, 2, 3, 4)))
summary(tune.smote.svm)
# Tuned SVM modelwith best parameters
svmfit_smote <- tune.smote.svm$best.model # since accuracy is low this may be due to overfitting
# I shall try hyper parameter tuning
# Ensure that Loan_Status in test_numeric is a factor
test_numeric$Loan_Status <- as.factor(test_numeric$Loan_Status)
# Use the trained SVM model (svmfit_smote) to make predictions on the test set
ypred_test <- predict(svmfit_smote, test_numeric)
# Create a confusion matrix to evaluate the predictions
confusion_matrix_test <- table(Predicted = ypred_test, Actual = test_numeric$Loan_Status)
print(confusion_matrix_test)
# If you want to compute other metrics like accuracy, precision, recall, etc., you can use the caret package
library(caret)
# Compute confusion matrix and other performance metrics
conf_matrix_svm <- confusionMatrix(ypred_test, test_numeric$Loan_Status)
# Print the confusion matrix and performance metrics
print(conf_matrix_svm)
# Visualizing SVM on PCA
# Load necessary libraries
library(e1071)
library(ggplot2)
library(caret) # For evaluation metrics like confusion matrix
# Step 1: Apply PCA on the training data (excluding the target variable)
pca_model <- prcomp(train_smote[, -ncol(train_smote)], center = TRUE, scale. = TRUE)
# Extract the first two principal components from the training data
train_pca <- data.frame(PC1 = pca_model$x[, 1], PC2 = pca_model$x[, 2], Loan_Status = train_smote$Loan_Status)
# Step 2: Tune SVM model on the PCA-transformed training data
set.seed(123) # For reproducibility
tune.out <- tune(svm, Loan_Status ~ ., data = train_pca, kernel = "radial",
ranges = list(cost = c(0.1, 1, 10, 100),
gamma = c(0.01, 0.1, 0.5, 1)))
# Get the best model from the tuning process
best_model <- tune.out$best.model
# Step 3: Apply PCA on the test set (transform test data using the same PCA model)
test_pca <- data.frame(PC1 = predict(pca_model, test_numeric[, -ncol(test_numeric)])[, 1],
PC2 = predict(pca_model, test_numeric[, -ncol(test_numeric)])[, 2],
Loan_Status = test_numeric$Loan_Status)
# Step 4: Test the best model on the PCA-transformed test data
pred_test <- predict(best_model, test_pca)
# Confusion matrix to evaluate performance on the test set
confusion_matrix_test <- confusionMatrix(pred_test, test_pca$Loan_Status)
print(confusion_matrix_test)
# Step 5: Plot the decision boundary using the best model
# Create a grid of points for decision boundary visualization
x_min <- min(train_pca$PC1) - 1
x_max <- max(train_pca$PC1) + 1
y_min <- min(train_pca$PC2) - 1
y_max <- max(train_pca$PC2) + 1
# Create a grid of values
grid <- expand.grid(PC1 = seq(x_min, x_max, length.out = 200),
PC2 = seq(y_min, y_max, length.out = 200))
# Predict on the grid using the best model
grid$Loan_Status <- predict(best_model, grid)
grid$Loan_Status <- as.numeric(grid$Loan_Status) # Convert factor to numeric
# Step 6: Plot the decision boundary
ggplot() +
geom_point(data = train_pca, aes(x = PC1, y = PC2, color = Loan_Status), size = 2) +
geom_tile(data = grid, aes(x = PC1, y = PC2, fill = Loan_Status), alpha = 0.3) + # Better visualization
labs(title = "SVM Decision Boundary on PCA-Transformed Data",
x = "Principal Component 1",
y = "Principal Component 2") +
theme_minimal() +
scale_color_manual(values = c("blue", "red")) +
scale_fill_gradient(low = "lightblue", high = "yellow")
# Print out the best parameters found during tuning
print(tune.out$best.parameters)
# Comparison of all the models
# Create a function to extract metrics from a confusion matrix
extract_metrics <- function(conf_matrix) {
accuracy <- conf_matrix$overall['Accuracy']
sensitivity <- conf_matrix$byClass['Sensitivity']
specificity <- conf_matrix$byClass['Specificity']
positive_predictive_value <- conf_matrix$byClass['Pos Pred Value'] # Precision
return(c(accuracy, sensitivity, specificity, positive_predictive_value))
}
# Extracting metrics for each model
metrics_logistic <- extract_metrics(conf_matrix_significant_test)
metrics_lda <- extract_metrics(conf_matrix_lda)
metrics_qda <- extract_metrics(conf_matrix_qda)
metrics_tree <- extract_metrics(conf_matrix_tree)
metrics_bagging <- extract_metrics(conf_matrix_bag)
metrics_rf <- extract_metrics(conf_matrix_rf)
metrics_xgb <- extract_metrics(conf_matrix_xgb)
metrics_svm <- extract_metrics(conf_matrix_svm)
# Combine all metrics into a data frame
model_comparison <- data.frame(
Model = c("Logistic Regression", "LDA", "QDA", "Decision Tree", "Bagging", "Random Forest", "XGBoost", "SVM"),
Accuracy = c(metrics_logistic[1], metrics_lda[1], metrics_qda[1], metrics_tree[1], metrics_bagging[1], metrics_rf[1], metrics_xgb[1],metrics_svm[1] ),
Sensitivity = c(metrics_logistic[2], metrics_lda[2], metrics_qda[2], metrics_tree[2], metrics_bagging[2], metrics_rf[2], metrics_xgb[2], metrics_svm[2]),
Specificity = c(metrics_logistic[3], metrics_lda[3], metrics_qda[3], metrics_tree[3], metrics_bagging[3], metrics_rf[3], metrics_xgb[3], metrics_svm[3]),
Positive_Predictive_Value = c(metrics_logistic[4], metrics_lda[4], metrics_qda[4], metrics_tree[4], metrics_bagging[4], metrics_rf[4], metrics_xgb[4], metrics_svm[4])
)
# Rank models by Accuracy
model_comparison <- model_comparison[order(-model_comparison$Accuracy), ]
# Display the results
print(model_comparison)
# This completes the Project
view raw LoanApproval.R hosted with ❤ by GitHub