🎓 College Type Prediction Using Logistic Regression

MAIN TOOL

R Programming Language

Technique

Logistic Regression

INDUSTRY

Education

📚 About the Project

This project uses Logistic Regression to classify colleges into Public or Private categories based on the College dataset from the ISLR package. The dataset provides information about U.S. colleges, including tuition fees, student demographics, and graduation rates.

By employing advanced modeling techniques, this project showcases the ability to predict college types accurately, identify significant predictors, and explore the educational sector in depth.


🔑 Key Features

 

  1. Data Preprocessing:

    • Comprehensive data cleaning, including handling of outliers and feature encoding.
    • Correlation analysis to identify meaningful predictors.
  2. Logistic Regression Modeling:

    • Multiple iterations of logistic regression models.
    • Feature selection using Likelihood Ratio Tests (LRT) and Akaike Information Criterion (AIC).
  3. Evaluation Metrics:

    • Confusion matrices, ROC curves, and AUC values for model evaluation.
  4. Business Impact:

    • Insights into predictors like Out-of-State TuitionAlumni Donation Percentage, and Enrollment that distinguish public from private institutions.

📊 Key Insights

 

  • Best Predictors:

    • Out-of-State Tuition: Higher tuition is strongly associated with private institutions.
    • Percentage of Alumni Donating: Higher alumni donation rates correlate with private institutions.
    • Enrollment: Larger enrollment numbers favor private classification.
  • Model Accuracy:

    • Training Data: 93.93% accuracy, AUC = 0.9746.
    • Testing Data: 93.53% accuracy, AUC = 0.9747.
  • Confusion Matrix:

    • Balanced high sensitivity (85.71%) and specificity (96.15%) in the test dataset.

📜 Full Report

 

For a detailed analysis, including methodology, visualizations, and insights, refer to the complete project report:
📄 College Type Prediction Report


📂 Project Structure

 

.
├── Data/
│   ├── College_Dataset.csv
├── Scripts/
│   ├── College_Type_Prediction.R
├── Reports/
│   ├── College_Type_Prediction.pdf
├── README.md
 

🤝 Connect with Me

 

Feel free to reach out for feedback, questions, or collaboration opportunities:
LinkedInDr. Syed Faizan


Author: Syed Faizan
Master’s Student in Data Analytics and Machine Learning

R Code and the Report of the Analysis

#---------------------------------------------------------#
# Syed Faizan #
# College Type Prediction #
# #
# #
# #
# #
#---------------------------------------------------------#
# Starting with a clean environment
rm(list=ls())
# Clearing the Console
cat("\014") # Clears the console
# Removing Scientific Notation
options(scipen = 999)
# Loading the packages utilized for Data cleaning and Data Analysis
library(tidyverse)
library(grid)
library(gridExtra)
library(dplyr)
library(ggplot2)
library(ISLR)
library(caret)
library(vtable)
library(dlookr)
library(DataExplorer)
library(psych)
library(pROC)
library(rms)
# Import the data set and perform Exploratory Data Analysis by
# using descriptive statistics and plots to describe the data set.
cl <- College
# Rough overview of the data set
summary(cl)
names(cl)
View(cl)
# Descriptive statistics table using automated package
st(cl)
# Univariate Analysis
# Histograms of numeric variables
plot_histogram(cl)
# Box plots of numeric variables by College Type
Apps_Private <- cl$Apps[cl$Private == "Yes"]
Apps_Public <- cl$Apps[cl$Private == "No"]
summary(Apps_Public)
boxplot(Apps_Private, Apps_Public, col = "red",
main = "Box Plot of Applications received by College Type (excluding one outlier of 48094 Applications received)",
xlab = " Private vs public College Type", ylim = c(0, 21000))
# Automating boxp lots by college type using a 'for loop'
numeric_vars <- sapply(cl, is.numeric)
numeric_vars["Private"] <- FALSE # remove 'Private' if it's mistakenly considered numeric
# Iterate over each numeric variable and create a box plot
par(mfrow=c(4, 4)) # Adjust the grid dimensions according to the number of variables
for (var in names(numeric_vars[numeric_vars])) {
Private_Var <- cl[[var]][cl$Private == "Yes"]
Public_Var <- cl[[var]][cl$Private == "No"]
boxplot(Private_Var, Public_Var, col = c("blue", "green"),
main = paste("Box Plot of", var, "by College Type"),
xlab = "College Type", ylab = var,
names = c("Private", "Public"))
}
# Scatterplots
attach(cl)
# scatterplot 1
qplot(x = Apps, y = Accept, color = Private, shape = Private, geom = 'point') +
scale_color_manual(values = c("red", "blue")) + # Colors for points
scale_shape_manual(values = c(16, 17)) + # ylim to exclude lone outlier of 48094 applications
ylim(c(0, 20000)) + xlim(c(0,23000))
# scatterplot 2
qplot(x = F.Undergrad, y = Outstate, color = Private, shape = Private, geom = 'point') +
scale_color_manual(values = c("red", "blue")) + # Colors for points
scale_shape_manual(values = c(16, 17))
#Scatterplot 3
qplot(x = perc.alumni, y = Outstate, color = Private, shape = Private, geom = 'point') +
scale_color_manual(values = c("green", "magenta")) + # Colors for points
scale_shape_manual(values = c(16, 17))
#Scatterplot 4
qplot(x = Books, y = Room.Board, color = Private, shape = Private, geom = 'point') +
scale_color_manual(values = c("yellow", "purple")) + # Colors for points
scale_shape_manual(values = c(16, 17))
detach(cl)
#Pair plot of the variables in the data set
pairs(cl)
# Feature Engineering
# Firstly, I am creating two new columns - 1. One for college names from row names and other
# 2. To numerically encode College Types with Public = 0 and Private = 1.
cl$collegebinary <- ifelse(cl$Private == "No", 0, 1)
cl$collegenames <- rownames(cl)
# Verify these changes
head(cl)
# Secondly, I factorize and re-level 'Private' to suit my preferences
# Converting 'Private' to a factor
cl$Private <- factor(cl$Private)
# Set 'No', that is Public colleges, as the reference level
cl$Private <- relevel(cl$Private, ref = "No")
# Verify the changes
str(cl$Private)
# Correlation analysis to identify potential explanatory variables in the glm model.
cl_numeric <- cl[ , c(2:19)]
cor_matrix <- cor(cl_numeric)
print(cor_matrix)
# In order to ensure that the plot is not too busy I choose only those variables with a pearsons correlation
# more than .4 in absolute value.
cl_numeric2 <- cl_numeric[ ,c(1,2,3, 6,7,8,14,15,18)]
cor_matrix2 <- cor(cl_numeric2)
print(cor_matrix2)
plot_correlation(cor_matrix2)
# Visualizing the correlation matrix
library(ggcorrplot)
ggcorrplot(cor_matrix2, method = "square", type = "lower",
lab = TRUE, lab_size = 4,
colors = c("red", "white", "blue"),
title = "Correlation matrix of College Data",
tl.cex = 10)
# Splitting the data into a train and test set.
partition <- createDataPartition(cl$Private, p = 0.7, list = FALSE, times = 1)
# Creating the training data set
cl_train <- cl[partition, ]
# Create the testing data set
cl_test <- cl[-partition, ]
# Examining the two Data sets
summary(cl_train)
summary(cl_test)
# Using the glm() function in the ‘stats’ package
# to fit a logistic regression model to the training set using at least two predictors.
# making sure "Private" is a factor
str(cl_train$Private)
# Fitting a logistic regression model using glm()
logistic_model_glm1 <- glm(Private ~ Apps + Accept + Enroll + F.Undergrad + P.Undergrad + Outstate + S.F.Ratio + perc.alumni,
data = cl_train, family = "binomial")
summary(logistic_model_glm1)
# Removing 'Accept' from the model to create model 2
logistic_model_glm2 <- glm(Private ~ Apps + Enroll + F.Undergrad + P.Undergrad + Outstate + S.F.Ratio + perc.alumni,
data = cl_train, family = "binomial")
summary(logistic_model_glm2)
# Compare the nested models with a likelihood ratio test
lrt_result1 <- anova(logistic_model_glm2, logistic_model_glm1, test = "Chisq")
# Print the LRT result
print(lrt_result1) # Second model is better
# Further refinement of the model by removing P.Undergrad
logistic_model_glm3 <- glm(Private ~ Apps + Enroll + F.Undergrad + Outstate + S.F.Ratio + perc.alumni,
data = cl_train, family = "binomial")
summary(logistic_model_glm3)
# Comparing the nested models with a likelihood ratio test
lrt_result2 <- anova(logistic_model_glm3, logistic_model_glm2, test = "Chisq")
# Printing the second LRT result
print(lrt_result2) # Third model is better
# Refining lastly by removing S.F.Ratio
logistic_model_glm4 <- glm(Private ~ Apps + Enroll + F.Undergrad + Outstate + perc.alumni,
data = cl_train, family = "binomial")
summary(logistic_model_glm4)
# Comparing the nested models with a likelihood ratio test
lrt_result3 <- anova(logistic_model_glm4, logistic_model_glm3, test = "Chisq")
# Printing the second LRT result
print(lrt_result3) # Fourth model is better
# Using rms package for detailed summary of the final model
lrm_model4 <- lrm(Private ~ Apps + Enroll + F.Undergrad + Outstate + perc.alumni,
data = cl_train, x = TRUE, y = TRUE )
lrm_model4
# Creating a confusion matrix and report the results of your model for the train set.
# Interpret and discuss the confusion matrix.
# Load the required package
library(caret)
# Predict probabilities for the training set
predicted_probabilities <- predict(logistic_model_glm4, newdata = cl_train, type = "response")
# Convert probabilities to binary classification using a threshold of 0.5
predicted_classes <- ifelse(predicted_probabilities > 0.5, "Yes", "No")
# Actual classes
actual_classes <- cl_train$Private
# Create a confusion matrix
confusion_matrix <- table(Predicted = predicted_classes, Actual = actual_classes)
# Print the confusion matrix
print(confusion_matrix)
# Calculate and print confusion matrix metrics using the caret package
confusion_matrix_metrics <- confusionMatrix(confusion_matrix)
print(confusion_matrix_metrics)
# Assignment Question 6. Create a confusion matrix and report the results of your model for the test set.
predicted_probabilities_test <- predict(logistic_model_glm4, newdata = cl_test, type = "response")
predicted_classes_test <- ifelse(predicted_probabilities_test > 0.5, "Yes", "No")
actual_classes_test <- cl_test$Private
confusion_matrix_test <- table(Predicted = predicted_classes_test, Actual = actual_classes_test)
print(confusion_matrix_test)
confusion_matrix_metrics_test <- confusionMatrix(confusion_matrix_test)
print(confusion_matrix_metrics_test)
# Model on the test set
logistic_model_glm4_test <- glm(Private ~ Apps + Enroll + F.Undergrad + Outstate + perc.alumni,
data = cl_test, family = "binomial")
summary(logistic_model_glm4_test)
# Model using rms package
lrm_model4_test <- lrm(Private ~ Apps + Enroll + F.Undergrad + Outstate + perc.alumni,
data = cl_test, x = TRUE, y = TRUE )
lrm_model4_test
# Assignment Question 7
# Plot and interpret the ROC curve.
library(pROC)
# ROC curve for the training data
roc_train <- roc(actual_classes, predicted_probabilities)
plot(roc_train, col = "green", main = "ROC for Training Data")
# ROC curve for the testing data
roc_test <- roc(actual_classes_test, predicted_probabilities_test)
plot(roc_test, add = FALSE, col = "red", main = "ROC for Testing Data")
# Assignment Question 8
# Calculate and interpret the AUC.
# Add AUC to the plot
auc(roc_train)
# Add AUC to the plot
auc(roc_test)
# END OF PROJECT