Problem Statement¶

Business Context¶

Thera Bank recently saw a steep decline in the number of credit card users. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.

Customers leaving its credit card services would lead the bank to losses, so the bank wants to analyze customer data and identify, or PREDICT, the customers who are likely to leave its credit card services and their reasons for doing so, so that the bank can improve in those areas (the reasons for leaving).

You, as a Data Scientist at Thera Bank, need to come up with a classification model that will help the bank improve its services so that customers are not tempted to give up their credit cards.

Data Description¶

  • 1 CLIENTNUM: Client number. Unique identifier for the customer holding the account
  • 2 Attrition_Flag: ← (Target Feature) Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer" (Category)
  • 3 Customer_Age: Age in Years
  • 4 Gender: Gender of the account holder
  • 5 Dependent_count: Number of dependents
  • 6 Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College (refers to a college student), Post-Graduate, Doctorate (Category)
  • 7 Marital_Status: Marital Status of the account holder
  • 8 Income_Category: Annual Income Category of the account holder (Categories)
  • 9 Card_Category: Type of Card
  • 10 Months_on_book: Period of relationship with the bank (in months)
  • 11 Total_Relationship_Count: Total no. of products held by the customer
  • 12 Months_Inactive_12_mon: No. of months inactive in the last 12 months
  • 13 Contacts_Count_12_mon: No. of Contacts in the last 12 months
  • 14 **Credit_Limit**: Credit Limit on the Credit Card
  • 15 **Total_Revolving_Bal**: Total Revolving Balance on the Credit Card
  • 16 **Avg_Open_To_Buy**: Open to Buy Credit Line (Average of last 12 months)
  • 17 Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
  • 18 Total_Trans_Amt: Total Transaction Amount (Last 12 months)
  • 19 Total_Trans_Ct: Total Transaction Count (Last 12 months)
  • 20 Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
  • 21 **Avg_Utilization_Ratio**: Average Card Utilization Ratio
What Is a (Total) Revolving Balance? (#15)
  • If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance. (See #15)
What is the Average Open to Buy? (#16)
  • 'Open to Buy' means the amount left on your credit card to use. This column (See #16) represents the average of this value over the last 12 months.
What is the Average Utilization Ratio? (#21)
  • The Avg_Utilization_Ratio (See #21) represents how much of the available credit the customer has used (approximately Total_Revolving_Bal / Credit_Limit). This is useful for calculating credit scores.
Relation b/w Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:
  • ( Avg_Open_To_Buy (#16) / Credit_Limit (#14) ) + Avg_Utilization_Ratio (#21) = 1
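
As a quick sanity check, this relation can be verified directly on the raw data. A minimal sketch, assuming the BankChurners.csv file used later in this notebook is available in the working directory:

import pandas as pd

raw = pd.read_csv("BankChurners.csv")

# Open-to-buy share of the limit plus the utilization ratio should come out at ~1
check = raw["Avg_Open_To_Buy"] / raw["Credit_Limit"] + raw["Avg_Utilization_Ratio"]
print(check.describe())  # values should cluster tightly around 1.0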

Please read the instructions carefully before starting the project.¶

This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.

  • Blanks '___' are provided in the notebook that need to be filled with appropriate code to get the correct result. Every '___' blank is accompanied by a comment that briefly describes what needs to be filled in.
  • Identify the task to be performed correctly, and only then proceed to write the required code.
  • Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
  • Please run the code cells sequentially from the beginning to avoid unnecessary errors.
  • Add the results/observations derived from the analysis (wherever mentioned) to the presentation and submit it.

[Image: Art_Zaragoza_Strategy.jpg (overview of the model-building strategy)]

Strategy for Building the Models, Data Splits, Performance Evaluation, Model Selection, Model Tuning, and Feature Analysis¶

  1. Data Preprocessing

    • Data Cleaning: Handle missing values, outliers, and inconsistent data types.
    • Feature Engineering: Create new features or transform existing ones to improve model performance.
    • Encoding Categorical Variables: Convert categorical variables into numerical ones using techniques like one-hot encoding or label encoding (see the sketch below).
    • Scaling and Normalization: Standardize or normalize features to bring them to a common scale, especially for algorithms sensitive to feature magnitudes (e.g., logistic regression, SVM).
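
A minimal sketch of the encoding and scaling step, assuming the cleaned DataFrame is named data as in the rest of this notebook; the notebook's own fill-in cells may handle this differently:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical columns (drop_first avoids one redundant dummy per category)
cat_cols = ["Gender", "Education_Level", "Marital_Status", "Income_Category", "Card_Category"]
encoded = pd.get_dummies(data, columns=cat_cols, drop_first=True)

# Standardize an illustrative subset of numeric columns to zero mean and unit variance
num_cols = ["Customer_Age", "Credit_Limit", "Total_Trans_Amt"]
encoded[num_cols] = StandardScaler().fit_transform(encoded[num_cols])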

  2. Data Splitting

    • Initial Split (Train/Temporary): Split the dataset into training (80%) and temporary (20%) sets using train_test_split().
    • Further Splitting of the Temporary Set: Split the temporary set into test (75%) and validation (25%) sets to create three sets:
    • Training Set: Used to train the model.
    • Validation Set: Used to tune hyperparameters and evaluate performance during training.
    • Test Set: Used for the final evaluation of the model’s performance on unseen data.

  3. Handling Imbalanced Data

    • Undersampling: Reduce the majority class to balance it with the minority class. This prevents bias toward the majority class but risks losing valuable information.
    • Oversampling (e.g., SMOTE): Increase the minority class by generating synthetic samples to balance the dataset. This helps ensure the model learns from both classes equally (see the sketch below).
    • Choose between oversampling, undersampling, or ensemble methods based on the dataset’s class distribution and model performance.
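
A minimal sketch of both resampling options, assuming X_train and y_train are the fully numeric, encoded training split from step 2 (SMOTE requires numeric features) and that resampling is applied to the training data only:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Oversample the minority (attrited) class with synthetic samples
sm = SMOTE(random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

# Alternatively, undersample the majority (existing) class
rus = RandomUnderSampler(random_state=1)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)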

  4. Model Training

    • Train different models on the training data (X_train, y_train), as sketched below.
    • Popular models include:
    • Decision Tree: Good for interpretability, but prone to overfitting.
    • Random Forest: Reduces overfitting by averaging multiple decision trees.
    • AdaBoost: Focuses on improving misclassified points iteratively.
    • Gradient Boosting: Builds models sequentially to minimize errors.
    • XGBoost: Optimized gradient boosting implementation for high performance.
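
A minimal training-loop sketch, assuming the prepared X_train and y_train from the previous steps; hyperparameters are left at their defaults here:

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=1),
    "Random Forest": RandomForestClassifier(random_state=1),
    "AdaBoost": AdaBoostClassifier(random_state=1),
    "Gradient Boosting": GradientBoostingClassifier(random_state=1),
    "XGBoost": XGBClassifier(random_state=1, eval_metric="logloss"),
}

# Fit every candidate model on the same training split
for name, model in models.items():
    model.fit(X_train, y_train)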

  5. Model Evaluation on Validation Set

    • Performance Metrics:
    • Accuracy: The proportion of correct predictions.
    • Recall: The proportion of actual positives correctly identified (important for imbalanced data).
    • Precision: The proportion of positive predictions that were correct.
    • F1 Score: Harmonic mean of precision and recall (useful for imbalanced data).
    • ROC-AUC: Evaluates the model’s ability to distinguish between classes at various thresholds.
    • Evaluate models on the validation set to ensure they generalize well and are not overfitting (see the sketch below).
    • Use validation set scores (e.g., recall) to compare models and select the best-performing one.
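
A minimal sketch of computing these metrics on the validation split, assuming a fitted classifier named model (for example, the last one fitted in the loop sketched in step 4) and the X_val, y_val split from step 2:

from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score

y_val_pred = model.predict(X_val)               # hard class predictions
y_val_proba = model.predict_proba(X_val)[:, 1]  # probability of the attrited class (1)

print("Accuracy :", accuracy_score(y_val, y_val_pred))
print("Recall   :", recall_score(y_val, y_val_pred))
print("Precision:", precision_score(y_val, y_val_pred))
print("F1 score :", f1_score(y_val, y_val_pred))
print("ROC-AUC  :", roc_auc_score(y_val, y_val_proba))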

  6. Model Selection

    • Compare different models based on validation set performance metrics like recall, precision, F1-score, and ROC-AUC.
    • Consider models that handle imbalanced data well if the dataset is imbalanced.
    • Choose the model that performs best on the validation set (highest recall or F1-score) without overfitting.

  7. Hyperparameter Tuning

    • Perform grid search or random search on the selected model to find the optimal hyperparameters (see the sketch below).
    • Tune hyperparameters like:
    • Maximum depth of trees for decision tree models.
    • Learning rate for boosting models.
    • Number of estimators (trees) for ensemble models.
    • Cross-validation: Use techniques like k-fold cross-validation during hyperparameter tuning to ensure the model generalizes well across different data splits.
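
A minimal randomized-search sketch with stratified k-fold cross-validation; GradientBoostingClassifier and the parameter values below are illustrative placeholders, not the notebook's prescribed choices:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

param_grid = {
    "n_estimators": [100, 200, 300],
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [2, 3, 4],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=1),
    param_distributions=param_grid,
    n_iter=10,
    scoring="recall",  # recall on the attrited class is the score being optimized
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=1),
    random_state=1,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)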

  8. Final Model Evaluation on Test Set

    • Once the best model and hyperparameters are selected, evaluate the model on the test set to assess its performance on unseen data.
    • This evaluation simulates how the model will perform in the real world.
    • Report the performance metrics (accuracy, recall, precision, F1-score, ROC-AUC) on the test set.

  9. Feature Importance Analysis

    • Use feature importance methods to identify which features have the most influence on the model’s predictions (see the sketch below):
    • For tree-based models like random forests, you can directly access the feature importances.
    • Use SHAP (SHapley Additive exPlanations) or LIME (Local Interpretable Model-agnostic Explanations) for more interpretability.
    • Drop less important features if needed to improve model interpretability or speed.
    • Analyze correlations between features and remove highly correlated features if they provide redundant information.
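
A minimal sketch for a tree-based model, assuming the tuned search object from step 7 and that X_train is a DataFrame with named columns:

import pandas as pd
import matplotlib.pyplot as plt

# Any fitted tree-based estimator exposes feature_importances_
best_model = search.best_estimator_
importances = pd.Series(best_model.feature_importances_, index=X_train.columns)

importances.sort_values().plot(kind="barh", figsize=(10, 8))
plt.title("Feature importances of the tuned model")
plt.show()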

  10. Considerations for Model Improvement

    • Bias-Variance Tradeoff: Assess whether the model is overfitting (high variance) or underfitting (high bias).
    • Ensemble Models: Combine the strengths of multiple models to reduce bias and variance (e.g., stacking, bagging, boosting).
    • Cost of Misclassification: If false positives or false negatives are costly, adjust the classification threshold or use weighted metrics to penalize misclassification (see the sketch below).
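
A minimal threshold-adjustment sketch, assuming the tuned best_model from the previous sketch and the validation split; lowering the threshold below 0.5 trades precision for recall on the attrited class:

from sklearn.metrics import precision_score, recall_score

y_val_proba = best_model.predict_proba(X_val)[:, 1]

for threshold in (0.5, 0.4, 0.3):
    y_pred_t = (y_val_proba >= threshold).astype(int)
    print(f"threshold={threshold}: recall={recall_score(y_val, y_pred_t):.3f}, "
          f"precision={precision_score(y_val, y_pred_t):.3f}")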

Sequence Summary

1.  Data Preprocessing
2.  Data Splitting (Train/Validation/Test)
3.  Handling Imbalanced Data (Over/Under Sampling)
4.  Model Training on Training Data
5.  Model Evaluation on Validation Set (Select Best Model)
6.  Model Selection Based on Validation Scores
7.  Hyperparameter Tuning (Cross-Validation)
8.  Final Model Evaluation on Test Set
9.  Feature Importance and Model Interpretation
10. Addressing Bias-Variance and Misclassification

Importing necessary libraries¶

In [ ]:
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
!pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.12.0 xgboost==2.0.3 -q --user
!pip install --upgrade -q threadpoolctl

Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.

In [ ]:
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np

# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# To tune model, get different metric scores, and split data
from sklearn import metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder

# To impute missing values
from sklearn.impute import SimpleImputer

# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notations for a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    BaggingClassifier,
)
from xgboost import XGBClassifier

# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
In [ ]:
from google.colab import drive
drive.mount('/content/drive') # Load Drive
Mounted at /content/drive

Loading the dataset¶

In [ ]:
churn = pd.read_csv("BankChurners.csv") # Read the original dataset into the churn dataframe (kept untouched as the original)
In [ ]:
data = churn.copy() # Copy churn dataframe into data

Data Overview¶

  • Observations
  • Sanity Checks

The initial steps to get an overview of any dataset are to:

  • observe the first few rows of the dataset, to check whether the dataset has been loaded properly or not
  • get information about the number of rows and columns in the dataset
  • find out the data types of the columns to ensure that data is stored in the preferred format and the value of each property is as expected.
  • check the statistical summary of the dataset to get an overview of the numerical columns of the data

Summary:

  1. Observe Dataset Loaded Properly
  2. Rows and Columns Count
  3. Examine Data Type Formats
  4. Examine Statistical Summary (Numerical vs Categorical)
  5. Check for Duplicate Values
  6. Check for Missing Values

(1) Displaying the first few rows of the dataset - Examine Whether it Loaded Properly¶

In [ ]:
data.head(5) # View top 5 rows of the data
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
0 768805383 Existing Customer 45 M 3 High School Married $60K - $80K Blue 39 5 1 3 12691.000 777 11914.000 1.335 1144 42 1.625 0.061
1 818770008 Existing Customer 49 F 5 Graduate Single Less than $40K Blue 44 6 1 2 8256.000 864 7392.000 1.541 1291 33 3.714 0.105
2 713982108 Existing Customer 51 M 3 Graduate Married $80K - $120K Blue 36 4 1 0 3418.000 0 3418.000 2.594 1887 20 2.333 0.000
3 769911858 Existing Customer 40 F 4 High School NaN Less than $40K Blue 34 3 4 1 3313.000 2517 796.000 1.405 1171 20 2.333 0.760
4 709106358 Existing Customer 40 M 3 Uneducated Married $60K - $80K Blue 21 5 1 0 4716.000 0 4716.000 2.175 816 28 2.500 0.000
In [ ]:
data.tail(5) # View last 5 rows of the data
Out[ ]:
CLIENTNUM Attrition_Flag Customer_Age Gender Dependent_count Education_Level Marital_Status Income_Category Card_Category Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio
10122 772366833 Existing Customer 50 M 2 Graduate Single $40K - $60K Blue 40 3 2 3 4003.000 1851 2152.000 0.703 15476 117 0.857 0.462
10123 710638233 Attrited Customer 41 M 2 NaN Divorced $40K - $60K Blue 25 4 2 3 4277.000 2186 2091.000 0.804 8764 69 0.683 0.511
10124 716506083 Attrited Customer 44 F 1 High School Married Less than $40K Blue 36 5 3 4 5409.000 0 5409.000 0.819 10291 60 0.818 0.000
10125 717406983 Attrited Customer 30 M 2 Graduate NaN $40K - $60K Blue 36 4 3 3 5281.000 0 5281.000 0.535 8395 62 0.722 0.000
10126 714337233 Attrited Customer 43 F 2 Graduate Married Less than $40K Silver 25 6 2 4 10388.000 1961 8427.000 0.703 10294 61 0.649 0.189

(2) Checking the shape of the dataset - Examine Rows and Columns Count¶

In [ ]:
# Checking the number of rows and columns in the training data
churn.shape # Number of rows and columns
print("For the training dataset (# rows, # columns or features):",churn.shape)
print("Number of rows:", churn.shape[0])
print("Number of columns:", churn.shape[1])
For the training dataset (# rows, # columns or features): (10127, 21)
Number of rows: 10127
Number of columns: 21

(3) Examine data types of the columns for the dataset - Evaluate if Preferred Format and Values are as Expected¶

In [ ]:
data.info() # Data types of the columns/features in the training dataset.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   CLIENTNUM                 10127 non-null  int64  
 1   Attrition_Flag            10127 non-null  object 
 2   Customer_Age              10127 non-null  int64  
 3   Gender                    10127 non-null  object 
 4   Dependent_count           10127 non-null  int64  
 5   Education_Level           8608 non-null   object 
 6   Marital_Status            9378 non-null   object 
 7   Income_Category           10127 non-null  object 
 8   Card_Category             10127 non-null  object 
 9   Months_on_book            10127 non-null  int64  
 10  Total_Relationship_Count  10127 non-null  int64  
 11  Months_Inactive_12_mon    10127 non-null  int64  
 12  Contacts_Count_12_mon     10127 non-null  int64  
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64  
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64  
 18  Total_Trans_Ct            10127 non-null  int64  
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB

(4) Statistical summary of the dataset - Examine Numerical Stats Overview¶

In [ ]:
num_object_columns = len(data.select_dtypes(include=['float64', 'int64']).columns) # Count the number of float64, and int64 data types.

print("There are only\033[1;31m", num_object_columns, "\033[0mfloat64 and int64 data types:\n") # Print the resulting sum of float64 + int64 data types.

data.describe().T # Statistical summary of the training data.
There are only 15 float64 and int64 data types:

Out[ ]:
count mean std min 25% 50% 75% max
CLIENTNUM 10127.000 739177606.334 36903783.450 708082083.000 713036770.500 717926358.000 773143533.000 828343083.000
Customer_Age 10127.000 46.326 8.017 26.000 41.000 46.000 52.000 73.000
Dependent_count 10127.000 2.346 1.299 0.000 1.000 2.000 3.000 5.000
Months_on_book 10127.000 35.928 7.986 13.000 31.000 36.000 40.000 56.000
Total_Relationship_Count 10127.000 3.813 1.554 1.000 3.000 4.000 5.000 6.000
Months_Inactive_12_mon 10127.000 2.341 1.011 0.000 2.000 2.000 3.000 6.000
Contacts_Count_12_mon 10127.000 2.455 1.106 0.000 2.000 2.000 3.000 6.000
Credit_Limit 10127.000 8631.954 9088.777 1438.300 2555.000 4549.000 11067.500 34516.000
Total_Revolving_Bal 10127.000 1162.814 814.987 0.000 359.000 1276.000 1784.000 2517.000
Avg_Open_To_Buy 10127.000 7469.140 9090.685 3.000 1324.500 3474.000 9859.000 34516.000
Total_Amt_Chng_Q4_Q1 10127.000 0.760 0.219 0.000 0.631 0.736 0.859 3.397
Total_Trans_Amt 10127.000 4404.086 3397.129 510.000 2155.500 3899.000 4741.000 18484.000
Total_Trans_Ct 10127.000 64.859 23.473 10.000 45.000 67.000 81.000 139.000
Total_Ct_Chng_Q4_Q1 10127.000 0.712 0.238 0.000 0.582 0.702 0.818 3.714
Avg_Utilization_Ratio 10127.000 0.275 0.276 0.000 0.023 0.176 0.503 0.999
In [ ]:
num_object_columns = len(data.select_dtypes(include='object').columns) # Count the number of object data types.

print("There are only\033[1;31m", num_object_columns, "\033[0mobject data types:\n") # Print the number of object data types in bold red.

data.describe(include=["object"]).T # List only the object data type for examinations.
There are only 6 object data types:

Out[ ]:
count unique top freq
Attrition_Flag 10127 2 Existing Customer 8500
Gender 10127 2 F 5358
Education_Level 8608 6 Graduate 3128
Marital_Status 9378 3 Married 4687
Income_Category 10127 6 Less than $40K 3561
Card_Category 10127 4 Blue 9436
In [ ]:
for i in data.describe(include=["object"]).columns: # Loop through object data types
    print("The breakdown of unique values in\033[1;33m", i, "\033[0mare as follows:") # Bold Yellow Object Datatype.
    print("\033[1;31m", data[i].value_counts().index.tolist(), "\033[0m") # Bold Red Categories names.
    print("\033[1;32m",data[i].value_counts(),"\033[0m") # Bold Green value count of each of the categories.
    print("-" * 70) # Print a line separator.
    print("\n") # Escape to a new line.
The breakdown of unique values in Attrition_Flag are as follows:
 ['Existing Customer', 'Attrited Customer'] 
 Attrition_Flag
Existing Customer    8500
Attrited Customer    1627
Name: count, dtype: int64 
----------------------------------------------------------------------


The breakdown of unique values in Gender are as follows:
 ['F', 'M'] 
 Gender
F    5358
M    4769
Name: count, dtype: int64 
----------------------------------------------------------------------


The breakdown of unique values in Education_Level are as follows:
 ['Graduate', 'High School', 'Uneducated', 'College', 'Post-Graduate', 'Doctorate'] 
 Education_Level
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: count, dtype: int64 
----------------------------------------------------------------------


The breakdown of unique values in Marital_Status are as follows:
 ['Married', 'Single', 'Divorced'] 
 Marital_Status
Married     4687
Single      3943
Divorced     748
Name: count, dtype: int64 
----------------------------------------------------------------------


The breakdown of unique values in Income_Category are as follows:
 ['Less than $40K', '$40K - $60K', '$80K - $120K', '$60K - $80K', 'abc', '$120K +'] 
 Income_Category
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: count, dtype: int64 
----------------------------------------------------------------------


The breakdown of unique values in Card_Category are as follows:
 ['Blue', 'Silver', 'Gold', 'Platinum'] 
 Card_Category
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: count, dtype: int64 
----------------------------------------------------------------------


In [ ]:
# Additional way to display categories in a nicer tabular format.
for i in data.describe(include=["object"]).columns:
    print("The breakdown of unique values in\033[1;33m", i, "\033[0mare as follows:")
    value_counts = data[i].value_counts()
    value_counts_df = pd.DataFrame({'Categories': value_counts.index.tolist(), 'Counts': value_counts.values})
    value_counts_df.style.set_properties(**{'color': 'red', 'font-weight': 'bold'})
    print(value_counts_df.to_markdown(index=False, numalign='left', stralign='left'))
    print("-" * 70) # Print a line separator.
    print("\n")
The breakdown of unique values in Attrition_Flag are as follows:
| Categories        | Counts   |
|:------------------|:---------|
| Existing Customer | 8500     |
| Attrited Customer | 1627     |
----------------------------------------------------------------------


The breakdown of unique values in Gender are as follows:
| Categories   | Counts   |
|:-------------|:---------|
| F            | 5358     |
| M            | 4769     |
----------------------------------------------------------------------


The breakdown of unique values in Education_Level are as follows:
| Categories    | Counts   |
|:--------------|:---------|
| Graduate      | 3128     |
| High School   | 2013     |
| Uneducated    | 1487     |
| College       | 1013     |
| Post-Graduate | 516      |
| Doctorate     | 451      |
----------------------------------------------------------------------


The breakdown of unique values in Marital_Status are as follows:
| Categories   | Counts   |
|:-------------|:---------|
| Married      | 4687     |
| Single       | 3943     |
| Divorced     | 748      |
----------------------------------------------------------------------


The breakdown of unique values in Income_Category are as follows:
| Categories     | Counts   |
|:---------------|:---------|
| Less than $40K | 3561     |
| $40K - $60K    | 1790     |
| $80K - $120K   | 1535     |
| $60K - $80K    | 1402     |
| abc            | 1112     |
| $120K +        | 727      |
----------------------------------------------------------------------


The breakdown of unique values in Card_Category are as follows:
| Categories   | Counts   |
|:-------------|:---------|
| Blue         | 9436     |
| Silver       | 555      |
| Gold         | 116      |
| Platinum     | 20       |
----------------------------------------------------------------------


(5) - Checking for duplicate values¶

In [ ]:
# Check for duplicate values in the data
data.duplicated().sum() # Check duplicate entries in the data
Out[ ]:
0

(6) - Checking for missing values¶

In [ ]:
# Check for missing values in the data
data.isnull().sum() # Check missing entries in the train data
Out[ ]:
0
CLIENTNUM 0
Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0

Missing data for the following:

  • Education_Level 1519
  • Marital_Status 749

(7) Removing unnecessary data¶

In [ ]:
# The CLIENTNUM column contains unique client IDs and adds no predictive value to the training dataset for the model.
data.drop(["CLIENTNUM"], axis=1, inplace=True) # It can be removed.

(8) Encoding object datatype for easier analysis and manipulations. Column/Feature = Attrition_Flag.¶

---------------------- (Important to properly plot, process and evaluate)

In [ ]:
## Encoding Existing and Attrited customers to 0 and 1 respectively, for analysis.
data["Attrition_Flag"].replace("Existing Customer", 0, inplace=True)
data["Attrition_Flag"].replace("Attrited Customer", 1, inplace=True)

Exploratory Data Analysis (EDA)¶

Functions defined for Exploratory Data Analysis.¶

  • histogram_boxplot - Plot Boxplot and a Histogram with the same scale.
  • labeled_barplot - Plot labeled barplots.
  • stacked_barplot - Plot Stacked Bar Chart.
  • distribution_plot_wrt_target - Plot Distributions.
In [ ]:
# Function to plot a Boxplot and a Histogram along the same scale.

def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
In [ ]:
# Function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot
In [ ]:
# function to plot Stacked Bar Chart

def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart

    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
    plt.legend(
        loc="lower left", frameon=False,
    )
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1))
    plt.show()
In [ ]:
# Function to plot Distributions

def distribution_plot_wrt_target(data, predictor, target):

    fig, axs = plt.subplots(2, 2, figsize=(12, 10))

    target_uniq = data[target].unique()

    axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
    )

    axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
    )

    axs[1, 0].set_title("Boxplot w.r.t target")
    sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")

    axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )

    plt.tight_layout()
    plt.show()

Univariate Analysis - EDA¶

There are only **15** float64 and int64 data types to plot:

Customer_Age #3 Feature/Column

In [ ]:
# Call the histogram_boxplot function
histogram_boxplot(data, "Customer_Age", kde=True)

Months_on_book #10 Feature/Column

In [ ]:
histogram_boxplot(data,'Months_on_book',kde=True)  # Call histogram_boxplot for 'Months_on_book'.

Credit_Limit #14 Feature/Column

In [ ]:
histogram_boxplot(data,'Credit_Limit',kde=True)  # Call histogram_boxplot for 'Credit_Limit'.

Total_Revolving_Bal #15 Feature/Column

In [ ]:
histogram_boxplot(data,'Total_Revolving_Bal',kde=True)  # Call histogram_boxplot for 'Total_Revolving_Bal'.

Avg_Open_To_Buy #16 Feature/Column

In [ ]:
histogram_boxplot(data,'Avg_Open_To_Buy',kde=True)  # Call histogram_boxplot for 'Avg_Open_To_Buy'.

Total_Trans_Ct#19 Feature/Column

In [ ]:
histogram_boxplot(data,'Total_Trans_Ct',kde=True)  # Call histogram_boxplot for 'Total_Trans_Ct'.

Total_Amt_Chng_Q4_Q1 #17 Feature/Column

In [ ]:
histogram_boxplot(data,'Total_Amt_Chng_Q4_Q1',kde=True)  # Call histogram_boxplot for 'Total_Amt_Chng_Q4_Q1'.

Let's see how the total transaction amount is distributed

Total_Trans_Amt #18 Feature/Column

In [ ]:
histogram_boxplot(data,'Total_Trans_Amt',kde=True)  # Call histogram_boxplot for 'Total_Trans_Amt'.

Total_Ct_Chng_Q4_Q1 #20 Feature/Column

In [ ]:
histogram_boxplot(data,'Total_Ct_Chng_Q4_Q1',kde=True)  # Call histogram_boxplot for 'Total_Ct_Chng_Q4_Q1'.

Avg_Utilization_Ratio #21 Feature/Column

In [ ]:
histogram_boxplot(data,'Avg_Utilization_Ratio',kde=True)  # Call histogram_boxplot for 'Avg_Utilization_Ratio'.

Dependent_count # 5 Feature/Column

In [ ]:
labeled_barplot(data, "Dependent_count") # Call labeled_barplot for Dependent_count.

Total_Relationship_Count #11 Feature/Column

In [ ]:
labeled_barplot(data,"Total_Relationship_Count") # Call labeled_barplot for Total_Relationship_Count.

Months_Inactive_12_mon #12 Feature/Column

In [ ]:
labeled_barplot(data,"Months_Inactive_12_mon") # Call labeled_barplot for Months_Inactive_12_mon.

Contacts_Count_12_mon #13 Feature/Column

In [ ]:
labeled_barplot(data,"Contacts_Count_12_mon") # Call labeled_barplot for Contacts_Count_12_mon.

Gender #4 Feature/Column

In [ ]:
labeled_barplot(data,"Gender") # Call labeled_barplot for Gender.

Let's see the distribution of the level of education of customers

Education_Level #6 Feature/Column

In [ ]:
labeled_barplot(data,"Education_Level") # Call labeled_barplot for Education_Level.

Marital_Status #7 Feature/Column

In [ ]:
labeled_barplot(data,"Marital_Status") # Call labeled_barplot for Marital_Status.

Let's see the distribution of the level of income of customers

Income_Category #8 Feature/Column

In [ ]:
labeled_barplot(data,"Income_Category") # Call labeled_barplot for Income_Category.

Card_Category #9 Feature/Column

In [ ]:
labeled_barplot(data,"Card_Category") # Call labeled_barplot for Card_Category.

Attrition_Flag #2 Feature/Column << TARGET

In [ ]:
labeled_barplot(data,"Attrition_Flag") # Call labeled_barplot for Attrition_Flag.
In [ ]:
# Displaying Histograms:
data.hist(figsize=(14, 14))
plt.show()

Bivariate Distributions - EDA¶

Attributes that have a strong correlation with each other:

Correlation Check

In [ ]:
plt.figure(figsize=(15, 7))
corr_matrix = data.select_dtypes(exclude='object').corr() # Exclude Categories (object datatype) Features/Columns - This Time for quick analysis.
sns.heatmap(corr_matrix, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

EDA & Correlation Observations:¶

Excluding categorical features, the strongest feature/column correlations exist between:

  • 1) Credit_Limit #14 and Avg_Open_To_Buy #16 at corr = 1
  • 2) Customer_Age #3 and Months_on_book #10 at corr = 0.79
  • 3) Total_Revolving_Bal #15 and Avg_Utilization_Ratio #21 at corr = 0.62
  • 4) Total_Ct_Chng_Q4_Q1 #20 and Total_Amt_Chng_Q4_Q1 #17 at corr = 0.38
In [ ]:
from prettytable import PrettyTable
print("Attributes that have the strongest correlations with each other:\n")
# Initialize a PrettyTable object
table = PrettyTable()

# Define column headers
table.field_names = ["\033[1;4mFeature 1", "Feature 2", "Correlation\033[0m"]

# Add rows
table.add_row(["Credit_Limit #14", "Avg_Open_To_Buy #16", "\033[1;33m1\033[0m"])
table.add_row(["Customer_Age #3", "Months_on_book #10", "\033[1;33m0.79\033[0m"])
table.add_row(["Total_Revolving_Bal #15", "Avg_Utilization_Ratio #21","\033[1;33m0.62\033[0m" ])
table.add_row(["Total_Ct_Chng_Q4_Q1 #20", "Total_Amt_Chng_Q4_Q1 #17", "\033[1;33m0.38\033[0m"])

# Print the table
print(table)
Attributes that have the strongest correlations with each other:

+-------------------------+---------------------------+-------------+
|        Feature 1        |         Feature 2         | Correlation |
+-------------------------+---------------------------+-------------+
|     Credit_Limit #14    |    Avg_Open_To_Buy #16    |      1      |
|     Customer_Age #3     |     Months_on_book #10    |     0.79    |
| Total_Revolving_Bal #15 | Avg_Utilization_Ratio #21 |     0.62    |
| Total_Ct_Chng_Q4_Q1 #20 |  Total_Amt_Chng_Q4_Q1 #17 |     0.38    |
+-------------------------+---------------------------+-------------+
In [ ]:
import matplotlib.pyplot as plt #  plots library

print("\n\033[1;92mAttrition_Flag vs Features with Categories to analyze Category distributions:\033[0m\n")
print(" 0 = Existing Customer"," \033[1;31m1 = Customer Attrition\033[0m\n")
# Create subplots for each feature with Categories to analyze distributions
stacked_barplot(data, "Attrition_Flag","Gender")
stacked_barplot(data, "Attrition_Flag","Dependent_count")
stacked_barplot(data, "Attrition_Flag","Education_Level")
stacked_barplot(data, "Attrition_Flag","Marital_Status")
stacked_barplot(data, "Attrition_Flag","Income_Category")
stacked_barplot(data, "Attrition_Flag","Card_Category")
stacked_barplot(data, "Attrition_Flag","Total_Relationship_Count")
stacked_barplot(data, "Attrition_Flag","Months_Inactive_12_mon")
stacked_barplot(data, "Attrition_Flag","Contacts_Count_12_mon")
# Adjust layout to prevent overlap
plt.tight_layout()

# Display the plots
plt.show()
print("\n\033[1;92m------- END OF SECTION: Attrition_Flag vs Feature with Categories to analyze distributions:\033[0m\n")
Attrition_Flag vs Features with Categories to analyze Category distributions:

 0 = Existing Customer  1 = Customer Attrition

Gender             F     M    All
Attrition_Flag                   
All             5358  4769  10127
0               4428  4072   8500
1                930   697   1627
------------------------------------------------------------------------------------------------------------------------
Dependent_count    0     1     2     3     4    5    All
Attrition_Flag                                          
All              904  1838  2655  2732  1574  424  10127
0                769  1569  2238  2250  1314  360   8500
1                135   269   417   482   260   64   1627
------------------------------------------------------------------------------------------------------------------------
Education_Level  College  Doctorate  Graduate  High School  Post-Graduate  \
Attrition_Flag                                                              
All                 1013        451      3128         2013            516   
0                    859        356      2641         1707            424   
1                    154         95       487          306             92   

Education_Level  Uneducated   All  
Attrition_Flag                     
All                    1487  8608  
0                      1250  7237  
1                       237  1371  
------------------------------------------------------------------------------------------------------------------------
Marital_Status  Divorced  Married  Single   All
Attrition_Flag                                 
All                  748     4687    3943  9378
0                    627     3978    3275  7880
1                    121      709     668  1498
------------------------------------------------------------------------------------------------------------------------
Income_Category  $120K +  $40K - $60K  $60K - $80K  $80K - $120K  \
Attrition_Flag                                                     
All                  727         1790         1402          1535   
0                    601         1519         1213          1293   
1                    126          271          189           242   

Income_Category  Less than $40K   abc    All  
Attrition_Flag                                
All                        3561  1112  10127  
0                          2949   925   8500  
1                           612   187   1627  
------------------------------------------------------------------------------------------------------------------------
Card_Category   Blue  Gold  Platinum  Silver    All
Attrition_Flag                                     
All             9436   116        20     555  10127
0               7917    95        15     473   8500
1               1519    21         5      82   1627
------------------------------------------------------------------------------------------------------------------------
Total_Relationship_Count    1     2     3     4     5     6    All
Attrition_Flag                                                    
All                       910  1243  2305  1912  1891  1866  10127
0                         677   897  1905  1687  1664  1670   8500
1                         233   346   400   225   227   196   1627
------------------------------------------------------------------------------------------------------------------------
Months_Inactive_12_mon   0     1     2     3    4    5    6    All
Attrition_Flag                                                    
All                     29  2233  3282  3846  435  178  124  10127
1                       15   100   505   826  130   32   19   1627
0                       14  2133  2777  3020  305  146  105   8500
------------------------------------------------------------------------------------------------------------------------
Contacts_Count_12_mon    0     1     2     3     4    5   6    All
Attrition_Flag                                                    
1                        7   108   403   681   315   59  54   1627
All                    399  1499  3227  3380  1392  176  54  10127
0                      392  1391  2824  2699  1077  117   0   8500
------------------------------------------------------------------------------------------------------------------------
<Figure size 640x480 with 0 Axes>
------- END OF SECTION: Attrition_Flag vs Feature with Categories to analyze distributions:

In [ ]:
import matplotlib.pyplot as plt #  plots library

print("\n\033[1;92mFeatures with Categories to analyze Attrition distributions:\033[0m\n")
print(" 0 = Existing Customer"," \033[1;31m1 = Customer Attrition\033[0m\n")
# Create subplots for each feature with Categories to analyze distributions
stacked_barplot(data, "Gender", "Attrition_Flag")
stacked_barplot(data, "Dependent_count", "Attrition_Flag")
stacked_barplot(data, "Education_Level", "Attrition_Flag")
stacked_barplot(data, "Marital_Status", "Attrition_Flag")
stacked_barplot(data, "Income_Category", "Attrition_Flag")
stacked_barplot(data, "Card_Category", "Attrition_Flag")
stacked_barplot(data, "Total_Relationship_Count", "Attrition_Flag")
stacked_barplot(data, "Months_Inactive_12_mon", "Attrition_Flag")
stacked_barplot(data, "Contacts_Count_12_mon", "Attrition_Flag")
# Adjust layout to prevent overlap
plt.tight_layout()

# Display the plots
plt.show()
print("\n\033[1;92m------- END OF SECTION: Attrition_Flag vs Feature with Categories to analyze distributions:\033[0m\n")
Features with Categories to analyze Attrition distributions:

 0 = Existing Customer  1 = Customer Attrition

Attrition_Flag     0     1    All
Gender                           
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag      0     1    All
Dependent_count                   
All              8500  1627  10127
3                2250   482   2732
2                2238   417   2655
1                1569   269   1838
4                1314   260   1574
0                 769   135    904
5                 360    64    424
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag      0     1   All
Education_Level                  
All              7237  1371  8608
Graduate         2641   487  3128
High School      1707   306  2013
Uneducated       1250   237  1487
College           859   154  1013
Doctorate         356    95   451
Post-Graduate     424    92   516
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag     0     1   All
Marital_Status                  
All             7880  1498  9378
Married         3978   709  4687
Single          3275   668  3943
Divorced         627   121   748
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag      0     1    All
Income_Category                   
All              8500  1627  10127
Less than $40K   2949   612   3561
$40K - $60K      1519   271   1790
$80K - $120K     1293   242   1535
$60K - $80K      1213   189   1402
abc               925   187   1112
$120K +           601   126    727
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag     0     1    All
Card_Category                    
All             8500  1627  10127
Blue            7917  1519   9436
Silver           473    82    555
Gold              95    21    116
Platinum          15     5     20
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag               0     1    All
Total_Relationship_Count                   
All                       8500  1627  10127
3                         1905   400   2305
2                          897   346   1243
1                          677   233    910
5                         1664   227   1891
4                         1687   225   1912
6                         1670   196   1866
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag             0     1    All
Months_Inactive_12_mon                   
All                     8500  1627  10127
3                       3020   826   3846
2                       2777   505   3282
4                        305   130    435
1                       2133   100   2233
5                        146    32    178
6                        105    19    124
0                         14    15     29
------------------------------------------------------------------------------------------------------------------------
Attrition_Flag            0     1    All
Contacts_Count_12_mon                   
All                    8500  1627  10127
3                      2699   681   3380
2                      2824   403   3227
4                      1077   315   1392
1                      1391   108   1499
5                       117    59    176
6                         0    54     54
0                       392     7    399
------------------------------------------------------------------------------------------------------------------------
<Figure size 640x480 with 0 Axes>
------- END OF SECTION: Attrition_Flag vs Feature with Categories to analyze distributions:




Individually Plotted Features during code development

Attrition_Flag vs Gender

In [ ]:
stacked_barplot(data, "Gender", "Attrition_Flag") # Call stacked_barplot to analyze the distribution of Attrition by Gender.
Attrition_Flag     0     1    All
Gender                           
All             8500  1627  10127
F               4428   930   5358
M               4072   697   4769
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Marital_Status

In [ ]:
stacked_barplot(data,"Attrition_Flag", "Marital_Status") # Call stacked_barplot for Attrition_Flag vs Marital_Status
Marital_Status  Divorced  Married  Single   All
Attrition_Flag                                 
All                  748     4687    3943  9378
0                    627     3978    3275  7880
1                    121      709     668  1498
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Education_Level

In [ ]:
stacked_barplot(data,"Attrition_Flag", "Education_Level") # Call stacked_barplot for Attrition_Flag vs Education_Level
Education_Level  College  Doctorate  Graduate  High School  Post-Graduate  \
Attrition_Flag                                                              
All                 1013        451      3128         2013            516   
0                    859        356      2641         1707            424   
1                    154         95       487          306             92   

Education_Level  Uneducated   All  
Attrition_Flag                     
All                    1487  8608  
0                      1250  7237  
1                       237  1371  
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Income_Category

In [ ]:
stacked_barplot(data,"Attrition_Flag", "Income_Category") # Call stacked_barplot for Attrition_Flag vs Income_Category
Income_Category  $120K +  $40K - $60K  $60K - $80K  $80K - $120K  \
Attrition_Flag                                                     
All                  727         1790         1402          1535   
0                    601         1519         1213          1293   
1                    126          271          189           242   

Income_Category  Less than $40K   abc    All  
Attrition_Flag                                
All                        3561  1112  10127  
0                          2949   925   8500  
1                           612   187   1627  
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Contacts_Count_12_mon

In [ ]:
stacked_barplot(data,"Attrition_Flag", "Contacts_Count_12_mon") # Call stacked_barplot for Attrition_Flag vs Contacts_Count_12_mon
Contacts_Count_12_mon    0     1     2     3     4    5   6    All
Attrition_Flag                                                    
1                        7   108   403   681   315   59  54   1627
All                    399  1499  3227  3380  1392  176  54  10127
0                      392  1391  2824  2699  1077  117   0   8500
------------------------------------------------------------------------------------------------------------------------

Let's see how the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) varies by the customer's account status (Attrition_Flag)

Attrition_Flag vs Months_Inactive_12_mon

In [ ]:
stacked_barplot(data,"Attrition_Flag", "Months_Inactive_12_mon") # Call stacked_barplot for Attrition_Flag vs Months_Inactive_12_mon
Months_Inactive_12_mon   0     1     2     3    4    5    6    All
Attrition_Flag                                                    
All                     29  2233  3282  3846  435  178  124  10127
1                       15   100   505   826  130   32   19   1627
0                       14  2133  2777  3020  305  146  105   8500
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Total_Relationship_Count

In [ ]:
stacked_barplot(data,"Attrition_Flag", "Total_Relationship_Count") # Call stacked_barplot for Attrition_Flag vs Total_Relationship_Count.
Total_Relationship_Count    1     2     3     4     5     6    All
Attrition_Flag                                                    
All                       910  1243  2305  1912  1891  1866  10127
0                         677   897  1905  1687  1664  1670   8500
1                         233   346   400   225   227   196   1627
------------------------------------------------------------------------------------------------------------------------

Attrition_Flag vs Dependent_count

In [ ]:
stacked_barplot(data,"Attrition_Flag", "Dependent_count") # Call stacked_barplot for Attrition_Flag vs Dependent_count.
Dependent_count    0     1     2     3     4    5    All
Attrition_Flag                                          
All              904  1838  2655  2732  1574  424  10127
0                769  1569  2238  2250  1314  360   8500
1                135   269   417   482   260   64   1627
------------------------------------------------------------------------------------------------------------------------

End of individually plotted Features during development




Total_Revolving_Bal vs Attrition_Flag

In [ ]:
distribution_plot_wrt_target(data, "Total_Revolving_Bal", "Attrition_Flag") # Call distribution_plot_wrt_target for Total_Revolving_Bal vs Attrition_Flag.

Attrition_Flag vs Credit_Limit

In [ ]:
distribution_plot_wrt_target(data, "Attrition_Flag", "Credit_Limit") # Call distribution_plot_wrt_target for Attrition_Flag vs Credit_Limit.

Attrition_Flag vs Customer_Age

In [ ]:
distribution_plot_wrt_target(data, "Attrition_Flag", "Customer_Age") # Call distribution_plot_wrt_target for Attrition_Flag vs Customer_Age.

Total_Trans_Ct vs Attrition_Flag

In [ ]:
distribution_plot_wrt_target(data, "Total_Trans_Ct", "Attrition_Flag") # Call distribution_plot_wrt_target for Total_Trans_Ct vs Attrition_Flag.

Total_Trans_Amt vs Attrition_Flag

Let's see how the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) varies by the customer's account status (Attrition_Flag)

Total_Ct_Chng_Q4_Q1 vs Attrition_Flag

In [ ]:
distribution_plot_wrt_target(data, "Total_Ct_Chng_Q4_Q1", "Attrition_Flag") # Call distribution_plot_wrt_target for Total_Ct_Chng_Q4_Q1 vs Attrition_Flag

Avg_Utilization_Ratio vs Attrition_Flag

In [ ]:
distribution_plot_wrt_target(data, "Avg_Utilization_Ratio", "Attrition_Flag") # Call distribution_plot_wrt_target for Avg_Utilization_Ratio vs Attrition_Flag

Attrition_Flag vs Months_on_book

In [ ]:
distribution_plot_wrt_target(data, "Attrition_Flag", "Months_on_book") # Call distribution_plot_wrt_target for Attrition_Flag vs Months_on_book

Attrition_Flag vs Total_Revolving_Bal

In [ ]:
distribution_plot_wrt_target(data, "Attrition_Flag", "Total_Revolving_Bal") # Call distribution_plot_wrt_target for Attrition_Flag vs Total_Revolving_Bal

Attrition_Flag vs Avg_Open_To_Buy

In [ ]:
distribution_plot_wrt_target(data, "Attrition_Flag", "Avg_Open_To_Buy") # Call distribution_plot_wrt_target for Attrition_Flag vs Avg_Open_To_Buy

Data Preprocessing - Strategy Step 1¶

Outlier Detection¶

In [ ]:
# Selecting a single column of the DataFrame to demonstrate the IQR calculation
# (here the encoded Attrition_Flag column is used; replace it with the column you want to examine)
data_column = data['Attrition_Flag']

# Convert the column to numeric values
data_column = pd.to_numeric(data_column, errors='coerce')  # Convert to numeric, NaN for non-numeric
data_column = data_column.dropna()  # Remove rows with NaN values

# Calculate the quartiles
Q1 = data_column.quantile(0.25)  # 25th percentile
Q3 = data_column.quantile(0.75)  # 75th percentile

# Interquartile Range (IQR)
IQR = Q3 - Q1

# Finding the lower and upper bounds for outliers
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
In [ ]:
# checking the % outliers
((data.select_dtypes(include=["float64", "int64"]) < lower) | (data.select_dtypes(include=["float64", "int64"]) > upper)).sum() / len(data) * 100
Out[ ]:
0
Attrition_Flag 16.066
Customer_Age 100.000
Dependent_count 91.073
Months_on_book 100.000
Total_Relationship_Count 100.000
Months_Inactive_12_mon 99.714
Contacts_Count_12_mon 96.060
Credit_Limit 100.000
Total_Revolving_Bal 75.610
Avg_Open_To_Buy 100.000
Total_Amt_Chng_Q4_Q1 99.951
Total_Trans_Amt 100.000
Total_Trans_Ct 100.000
Total_Ct_Chng_Q4_Q1 99.931
Avg_Utilization_Ratio 75.610
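
Note that the lower/upper bounds above were derived from a single column (Attrition_Flag) and then applied to every numeric column, which inflates the percentages. A per-column IQR check, sketched below under the assumption that the same data DataFrame is used, computes each column's own bounds:

num_data = data.select_dtypes(include=["float64", "int64"])

Q1 = num_data.quantile(0.25)   # 25th percentile of every numeric column
Q3 = num_data.quantile(0.75)   # 75th percentile of every numeric column
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

# Percentage of outliers per column, using each column's own bounds
outlier_pct = ((num_data < lower) | (num_data > upper)).sum() / len(num_data) * 100
print(outlier_pct.round(3))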

Train - Test Data Split - Strategy Step 2¶

In [ ]:
# creating the copy of the dataframe
data1 = data.copy()
In [ ]:
# Replace "Unknown" (or any other anomalous value) with NaN in the "Income_Category" column
data1["Income_Category"].replace("Unknown", np.nan, inplace=True)
In [ ]:
data1.isna().sum()
Out[ ]:
0
Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0


Instantiating the imputer for re-use¶

In [ ]:
# Creating an instance of the imputer to be used
imputer = SimpleImputer(strategy="most_frequent")

Data Splitting - [|||||||||| train - |||||| val - |||| test (unseen)]¶

------------------------------> Dividing train data into X and y <------------------------

In [ ]:
# Separation of features X and the target y
# It is a common preprocessing step before training machine learning models.

X = data1.drop(["Attrition_Flag"], axis=1) # Remove the whole column
y = data1["Attrition_Flag"] # extracts the column "Attrition_Flag" from data1 and assigns it to the variable y

# The column "Attrition_Flag" contains the target variable (label), which is what the model will try to predict.
# In this case, it likely indicates whether a customer has left the credit card services (churned) or not.

print("The second data set is the target column.\nThe first one are the features.\n" ) # The column "Attrition_Flag" contains the target variable (label)
print(X.shape, y.shape)
print("\nOur models will try to predict the target variable y (Attrition_Flag).\nIn other words, whether a customer left the credit card services (churned) or not.")
The second dataset is the target column.
The first one contains the features.

(10127, 19) (10127,)

Our models will try to predict the target variable y (Attrition_Flag).
In other words, whether a customer left the credit card services (churned) or not.

------------------------------> Splitting the original dataset into: X_train, X_val, and X_test datasets <------------------------

In [ ]:
# Import
from sklearn.model_selection import train_test_split

print("\n")
print("|" * 100)# Print a line separator.
# X is your feature data, and y is your target data
print(f"Original set shape: {X.shape}\n")


# Step 1: Split data into 80% training and 20% temporary (test + validation) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: Subsplit the temporary set into 75% test and 25% validation sets
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

# Print shapes of the resulting splits
print("1. First train_test_split: Splits the data into 80% training and 20% \033[1;4mtemporary\033[0m: (X_temp, y_temp).\n")
print("|" * 80,"\033[1;33m|\033[0m" * 20)# Print a line separator.
print(f"Training set shape: {X_train.shape}")
print(f"Temporary set shape: {X_temp.shape}\n")
print("2. Second train_test_split: Subsplits the \033[1;4mtemporary set\033[0m: into 75% test and 25% validation (X_test, X_val).\n")
print("|" * 80,"\033[1;33mt\033[0m" * 15,"\033[1;33mv\033[0m" * 5)# Print a line separator.
print(f"Test set shape: {X_test.shape}")
print(f"Validation set shape: {X_val.shape}")

||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Original set shape: (10127, 19)

1. First train_test_split: Splits the data into 80% training and 20% temporary: (X_temp, y_temp).

|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ||||||||||||||||||||
Training set shape: (8101, 19)
Temporary set shape: (2026, 19)

2. Second train_test_split: Subsplits the temporary set: into 75% test and 25% validation (X_test, X_val).

|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ttttttttttttttt vvvvv
Test set shape: (1519, 19)
Validation set shape: (507, 19)

Explanation:

  • random_state=42 ensures reproducibility (you can set it to any integer).
  • The first test_size=0.2 means 20% of the total data is allocated to temporary data.
  • The second test_size=0.25 means 25% of the temporary set (i.e., 5% of the total data) is allocated to the validation set, leaving the remaining 75% of the temporary set (15% of the total data) as the test set.
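
A quick sanity check of the overall proportions (a minimal sketch, assuming the X, X_train, X_val, and X_test objects created in the cells above) confirms the effective split is roughly 80% train / 15% test / 5% validation:

In [ ]:
# Overall share of each split relative to the full feature matrix X
for split_name, split_df in [("train", X_train), ("test", X_test), ("validation", X_val)]:
    print(f"{split_name}: {len(split_df) / len(X):.1%}")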

Missing value imputation¶

In [ ]:
reqd_col_for_impute = ["Education_Level", "Marital_Status", "Income_Category"] # Category columns to impute #6, #7, and #8

-----------------> DEFINING X_train, X_val, and X_test ------------------------<< IMPORTANT

In [ ]:
# Category columns to impute
reqd_col_for_impute = ["Education_Level", "Marital_Status", "Income_Category"] # Category columns to impute #6, #7, and #8

# Fit and transform the train data
X_train[reqd_col_for_impute] = imputer.fit_transform(X_train[reqd_col_for_impute])# Impute missing values in X_train.

# Transform the validation data (using the statistics learned from the train data - no refitting, to avoid data leakage)
X_val[reqd_col_for_impute] = imputer.transform(X_val[reqd_col_for_impute]) # Impute missing values in X_val.

# Transform the test data (again, transform only)
X_test[reqd_col_for_impute] = imputer.transform(X_test[reqd_col_for_impute]) # Impute missing values in X_test.
In [ ]:
data1.isna().sum()
Out[ ]:
0
Attrition_Flag 0
Customer_Age 0
Gender 0
Dependent_count 0
Education_Level 1519
Marital_Status 749
Income_Category 0
Card_Category 0
Months_on_book 0
Total_Relationship_Count 0
Months_Inactive_12_mon 0
Contacts_Count_12_mon 0
Credit_Limit 0
Total_Revolving_Bal 0
Avg_Open_To_Buy 0
Total_Amt_Chng_Q4_Q1 0
Total_Trans_Amt 0
Total_Trans_Ct 0
Total_Ct_Chng_Q4_Q1 0
Avg_Utilization_Ratio 0

In [ ]:
# Checking that no column has missing values in train or test sets
print("\nChecking that \033[1;4mno column has missing values\033[0m in train dataset (\033[1;92my_train\033[0m):")
print(y_train.isna().sum())
print("-" * 30)
print("\nChecking that \033[1;4mno column has missing values\033[0m in validation dataset (\033[1;92my_val\033[0m):")
print(y_val.isna().sum())
print("-" * 30)
print("\Checking that \033[1;4mno column has missing values\033[0m in test dataset (\033[1;92my_test\033[0m):")
print(y_test.isna().sum())
print("-" * 30)
Checking that no column has missing values in train dataset (y_train):
0
------------------------------

Checking that no column has missing values in validation dataset (y_val):
0
------------------------------

Checking that no column has missing values in test dataset (y_test):
0
------------------------------
In [ ]:
# Checking that no column has missing values in train or test sets
print("\nChecking that \033[1;4mno column has missing values\033[0m in train dataset (\033[1;92mX_train\033[0m): \n")
print(X_train.isna().sum())
print("-" * 30)
print("\nChecking that \033[1;4mno column has missing values\033[0m in validation dataset (\033[1;92mX_val\033[0m): \n")
print(X_val.isna().sum())
print("-" * 30)
print("\Checking that \033[1;4mno column has missing values\033[0m in test dataset (\033[1;92mX_test\033[0m): \n")
print(X_test.isna().sum())
print("-" * 60)
print("If not, check imputer function was executed.")
print("-" * 60)
Checking that no column has missing values in train dataset (X_train): 

Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------

Checking that no column has missing values in validation dataset (X_val): 

Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------

Checking that no column has missing values in test dataset (X_test): 

Customer_Age                0
Gender                      0
Dependent_count             0
Education_Level             0
Marital_Status              0
Income_Category             0
Card_Category               0
Months_on_book              0
Total_Relationship_Count    0
Months_Inactive_12_mon      0
Contacts_Count_12_mon       0
Credit_Limit                0
Total_Revolving_Bal         0
Avg_Open_To_Buy             0
Total_Amt_Chng_Q4_Q1        0
Total_Trans_Amt             0
Total_Trans_Ct              0
Total_Ct_Chng_Q4_Q1         0
Avg_Utilization_Ratio       0
dtype: int64
------------------------------------------------------------
If not, check imputer function was executed.
------------------------------------------------------------
In [ ]:
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_train[i].value_counts())
    print("-" * 30)
Gender
F    4279
M    3822
Name: count, dtype: int64
------------------------------
Education_Level
Graduate         3733
High School      1619
Uneducated       1171
College           816
Post-Graduate     407
Doctorate         355
Name: count, dtype: int64
------------------------------
Marital_Status
Married     4346
Single      3144
Divorced     611
Name: count, dtype: int64
------------------------------
Income_Category
Less than $40K    2812
$40K - $60K       1453
$80K - $120K      1237
$60K - $80K       1122
abc                889
$120K +            588
Name: count, dtype: int64
------------------------------
Card_Category
Blue        7557
Silver       436
Gold          93
Platinum      15
Name: count, dtype: int64
------------------------------
In [ ]:
# Display Features:
print("\033[1;33mTRAINING SET:\033[0m")
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_train[i].value_counts())
    print("\033[1;33mFeature in X_train\033[0m")
    print("-" * 30)# Print a line separator.
    print("\n")
TRAINING SET:
Gender
F    4279
M    3822
Name: count, dtype: int64
Feature in X_train
------------------------------


Education_Level
Graduate         3733
High School      1619
Uneducated       1171
College           816
Post-Graduate     407
Doctorate         355
Name: count, dtype: int64
Feature in X_train
------------------------------


Marital_Status
Married     4346
Single      3144
Divorced     611
Name: count, dtype: int64
Feature in X_train
------------------------------


Income_Category
Less than $40K    2812
$40K - $60K       1453
$80K - $120K      1237
$60K - $80K       1122
abc                889
$120K +            588
Name: count, dtype: int64
Feature in X_train
------------------------------


Card_Category
Blue        7557
Silver       436
Gold          93
Platinum      15
Name: count, dtype: int64
Feature in X_train
------------------------------


In [ ]:
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_val[i].value_counts())
    print("\033[1;33mFeature X_train\033[0m")
    print("-" * 30)# Print a line separator.
    print("\n")
In [ ]:
# Display Features:
print("\033[1;33mVALIDATION SET:\033[0m")
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_val[i].value_counts())
    print("\033[1;33mFeature in X_val\033[0m")
    print("-" * 30)# Print a line separator.
    print("\n")
VALIDATION SET:
Gender
F    266
M    241
Name: count, dtype: int64
Feature in X_val
------------------------------


Education_Level
Graduate         237
High School       94
Uneducated        84
College           49
Doctorate         24
Post-Graduate     19
Name: count, dtype: int64
Feature in X_val
------------------------------


Marital_Status
Married     272
Single      193
Divorced     42
Name: count, dtype: int64
Feature in X_val
------------------------------


Income_Category
Less than $40K    174
$40K - $60K        88
$60K - $80K        74
$80K - $120K       71
abc                62
$120K +            38
Name: count, dtype: int64
Feature in X_val
------------------------------


Card_Category
Blue        465
Silver       37
Gold          3
Platinum      2
Name: count, dtype: int64
Feature in X_val
------------------------------


In [ ]:
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_test[i].value_counts())
    print("*" * 30)
In [ ]:
# Display Features:
print("\033[1;33mTEST SET:\033[0m")
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_test[i].value_counts())
    print("\033[1;33mFeature in X_test\033[0m")
    print("-" * 30)# Print a line separator.
    print("\n")
TEST SET:
Gender
F    813
M    706
Name: count, dtype: int64
Feature in X_test
------------------------------


Education_Level
Graduate         677
High School      300
Uneducated       232
College          148
Post-Graduate     90
Doctorate         72
Name: count, dtype: int64
Feature in X_test
------------------------------


Marital_Status
Married     818
Single      606
Divorced     95
Name: count, dtype: int64
Feature in X_test
------------------------------


Income_Category
Less than $40K    575
$40K - $60K       249
$80K - $120K      227
$60K - $80K       206
abc               161
$120K +           101
Name: count, dtype: int64
Feature in X_test
------------------------------


Card_Category
Blue        1414
Silver        82
Gold          20
Platinum       3
Name: count, dtype: int64
Feature in X_test
------------------------------


Encoding categorical variables - Important¶

Encoding the categorical columns of each dataset as one-hot (dummy) variables:

In [ ]:
# One-hot encoding the categorical columns (drop_first=True drops one dummy per feature to avoid redundant columns)
X_train = pd.get_dummies(X_train, drop_first=True) # Encode categorical columns in X_train
X_val = pd.get_dummies(X_val, drop_first=True) # Encode categorical columns in X_val
X_test = pd.get_dummies(X_test, drop_first=True) # Encode categorical columns in X_test
print(X_train.shape, X_val.shape, X_test.shape)
(8101, 30) (507, 30) (1519, 30)
  • After one-hot encoding, each split now has **30** columns, up from the original 19 feature columns.
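
One caveat worth noting (not an issue in this run, since all three splits end up with the same 30 columns): applying pd.get_dummies separately to train, validation, and test data can produce mismatched columns if a rare category is missing from one of the splits. A minimal safeguard, sketched below on the assumption that X_train defines the reference column set, is to reindex the other frames to the training columns:

In [ ]:
# Optional safeguard (sketch): force X_val / X_test to carry exactly the same dummy columns as X_train,
# filling any dummy column that is absent from a split with 0 (False)
X_val = X_val.reindex(columns=X_train.columns, fill_value=0)
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)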

In [ ]:
X_train.head(10) # Check the top 10 rows from the X_train dataset.
Out[ ]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Gender_M Education_Level_Doctorate Education_Level_Graduate Education_Level_High School Education_Level_Post-Graduate Education_Level_Uneducated Marital_Status_Married Marital_Status_Single Income_Category_$40K - $60K Income_Category_$60K - $80K Income_Category_$80K - $120K Income_Category_Less than $40K Income_Category_abc Card_Category_Gold Card_Category_Platinum Card_Category_Silver
9066 54 1 36 1 3 3 3723.000 1728 1995.000 0.595 8554 99 0.678 0.464 False False True False False False False True False False False False True False False False
5814 58 4 48 1 4 3 5396.000 1803 3593.000 0.493 2107 39 0.393 0.334 False False False True False False True False False False False False True False False False
792 45 4 36 6 1 3 15987.000 1648 14339.000 0.732 1436 36 1.250 0.103 False False True False False False False True False False False True False True False False
1791 34 2 36 4 3 4 3625.000 2517 1108.000 1.158 2616 46 1.300 0.694 False False True False False False False True False False False True False False False False
5011 49 2 39 5 3 4 2720.000 1926 794.000 0.602 3806 61 0.794 0.708 False False False True False False True False True False False False False False False False
2260 60 0 45 5 2 4 1438.300 648 790.300 0.477 1267 27 1.077 0.451 False True False False False False True False False False False True False False False False
8794 43 4 28 2 2 1 2838.000 1934 904.000 0.873 8644 87 0.554 0.681 False False True False False False False True False False False False True False False False
4292 52 2 45 3 1 3 3476.000 1560 1916.000 0.894 3496 58 0.871 0.449 False False True False False False False True True False False False False False False False
1817 30 0 36 3 3 2 2550.000 1623 927.000 0.650 1870 51 0.275 0.636 True False True False False False True False False False False True False False False False
6025 33 3 36 5 2 3 1457.000 0 1457.000 0.677 2200 45 0.364 0.000 False False True False False False False True False False False True False False False False

Observations:

  • Notice the income categories are now separate one-hot indicator columns holding True/False values.
In [ ]:
X_val.head(5) # Check the top 5 rows from the val dataset.
Out[ ]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Gender_M Education_Level_Doctorate Education_Level_Graduate Education_Level_High School Education_Level_Post-Graduate Education_Level_Uneducated Marital_Status_Married Marital_Status_Single Income_Category_$40K - $60K Income_Category_$60K - $80K Income_Category_$80K - $120K Income_Category_Less than $40K Income_Category_abc Card_Category_Gold Card_Category_Platinum Card_Category_Silver
6685 49 1 35 5 2 2 1438.300 0 1438.300 0.681 4109 71 0.919 0.000 True False True False False False True False False False False True False False False False
291 50 4 36 2 3 2 2521.000 1608 913.000 0.587 1328 33 0.571 0.638 False True False False False False False True False False False False True False False False
3082 30 0 19 3 1 4 3213.000 2517 696.000 1.275 2666 46 1.000 0.783 False False True False False False True False False False False True False False False False
8469 42 3 36 2 3 3 2515.000 1453 1062.000 0.649 4025 74 0.805 0.578 False False True False False False True False False False False True False False False False
2088 27 0 15 4 3 4 3682.000 0 3682.000 0.685 1826 35 0.750 0.000 True False False False False True True False True False False False False False False False
In [ ]:
X_test.head(5) # Check the top 5 rows from the X_test dataset.
Out[ ]:
Customer_Age Dependent_count Months_on_book Total_Relationship_Count Months_Inactive_12_mon Contacts_Count_12_mon Credit_Limit Total_Revolving_Bal Avg_Open_To_Buy Total_Amt_Chng_Q4_Q1 Total_Trans_Amt Total_Trans_Ct Total_Ct_Chng_Q4_Q1 Avg_Utilization_Ratio Gender_M Education_Level_Doctorate Education_Level_Graduate Education_Level_High School Education_Level_Post-Graduate Education_Level_Uneducated Marital_Status_Married Marital_Status_Single Income_Category_$40K - $60K Income_Category_$60K - $80K Income_Category_$80K - $120K Income_Category_Less than $40K Income_Category_abc Card_Category_Gold Card_Category_Platinum Card_Category_Silver
5168 30 1 19 6 1 2 1644.000 0 1644.000 0.820 2533 44 0.517 0.000 False False False False False True False False False False False True False False False False
4889 54 4 43 3 3 0 5139.000 0 5139.000 0.330 1653 44 0.692 0.000 False False True False False False True False False False False True False False False False
8995 52 2 46 2 3 3 25737.000 1168 24569.000 0.718 7722 94 0.469 0.045 True False False True False False False True False False True False False False False False
3065 56 4 41 6 1 4 17753.000 1899 15854.000 0.851 3986 64 0.730 0.107 True False False False False False False True False False True False False False False False
5333 50 1 43 4 3 4 2961.000 2048 913.000 0.913 4056 92 0.769 0.692 False False False True False False True False True False False False False False False False

Model Building¶

Model Evaluation Criterion - Strategy Step 5¶

Predictions made by the classification model translate as follows:

  • True positives (TP) are attriting customers correctly identified by the model.
  • False negatives (FN) are customers who actually attrite but whom the model predicts will stay. ←
  • False positives (FP) are customers flagged as likely to attrite who actually stay.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of attriting customers are identified correctly by the model.
  • We would want Recall to be maximized, as the greater↑ the Recall, the higher↑ the chances of minimizing↓ false negatives.
  • We want to minimize↓ false negatives because if the model predicts that a customer will stay when they are actually about to attrite, the bank loses that customer and the associated fee revenue without a chance to intervene.

To evaluate classification models, several performance metrics are derived from these counts (True Positives, False Negatives, False Positives, and True Negatives). Here are the main ones:

  1. Precision:

    • Formula: TP / (TP + FP)
    • Measures the accuracy of positive predictions
    • Answers: "Of all the failures the model predicted, what proportion were actually failures?"
  2. Recall ← (also known as Sensitivity or True Positive Rate):

    • Formula: TP / (TP + FN)
    • Measures the proportion of actual positives that were correctly identified
    • Answers: "Of all the actual failures, what proportion did the model correctly identify?"
  3. F1 Score:

    • Formula: 2 × (Precision × Recall) / (Precision + Recall)
    • Harmonic mean of Precision and Recall
    • Provides a single score that balances both *Precision* and *Recall*
  4. Specificity (True Negative Rate):

    • Formula: TN / (TN + FP)
    • Measures the proportion of actual negatives correctly identified
    • Note: This requires True Negatives (TN)
  5. Accuracy:

    • Formula: (TP + TN) / (TP + TN + FP + FN)
    • Measures the **overall correctness** of the model
    • Note: This also requires True Negatives (TN)
  6. False Positive Rate (FPR):

    • Formula: FP / (FP + TN)
    • Measures the proportion of false alarms among all negative cases
  7. False Discovery Rate (FDR):

    • Formula: FP / (FP + TP)
    • Measures the proportion of false positives among all positive predictions

These metrics provide different perspectives on model performance. The choice of which to prioritize depends on the specific requirements of the application, such as whether false positives or false negatives are more costly.
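
As a small worked example (the counts below are hypothetical and not taken from this project's data), suppose a model produced TP = 80, FN = 20, FP = 10, and TN = 890:

In [ ]:
# Illustrative confusion-matrix counts (hypothetical, not from this dataset)
TP, FN, FP, TN = 80, 20, 10, 890

precision = TP / (TP + FP)                   # 80 / 90  ≈ 0.889
recall = TP / (TP + FN)                      # 80 / 100 = 0.800
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)   # 970 / 1000 = 0.970

print(f"Precision={precision:.3f}  Recall={recall:.3f}  F1={f1:.3f}  Accuracy={accuracy:.3f}")

Even though accuracy looks excellent (0.97), 20% of the actual positives are missed, which is exactly why Recall is the metric of interest in this project.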

Models can make **wrong** predictions as:

  • Predicting a customer will attrite and the customer doesn't attrite
  • Predicting a customer will not attrite and the customer attrites

Which case is more financially important?

  • Predicting that a customer will not attrite when they actually do, i.e., losing a valuable customer (and the associated revenue).

How to reduce this loss? We need to reduce False Negatives.

  • The greater↑ the Recall, the higher↑ the chances of minimizing false negatives. Hence, the focus should be on increasing↑ Recall, i.e., correctly identifying the true positives (Class 1), so that the bank can retain its valuable customers by spotting those at risk of attrition.

The strategy we want is for `Recall to be maximized`.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

In [ ]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )

    return df_perf
In [ ]:
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

Model Building - Original Data¶

Original Data - This approach uses the dataset as is, without addressing the class imbalance.

Pros:

  • Maintains Data Integrity: Since no changes are made to the dataset, the model reflects the real-world distribution, helping avoid the introduction of artificial patterns.
  • No Information Loss: All available data is used without reduction or synthetic augmentation, preserving the original data features.
  • Faster to Train: With no additional data created or removed, models train faster compared to oversampling or complex algorithms.

Cons:

  • Bias Toward Majority Class: When dealing with imbalanced data, the model may favor the majority class (e.g., non-churn customers), leading to poor performance in predicting the minority class (e.g., churners).
  • Lower Recall for Minority Class: If the dataset is highly imbalanced, the model is less likely to capture the minority class, causing lower recall (i.e., many false negatives).
  • Misleading Accuracy: The overall accuracy might be misleading if it’s heavily driven by the majority class, masking poor performance on minority predictions.

NOTE: Models already listed in sample code above for reference:

  • (5) BaggingClassifier,
  • (3) RandomForestClassifier.

I added more models in the next code snippet:

  • (1) LogisticRegression, added
  • (2) DecisionTreeClassifier, added
  • (3) RandomForestClassifier,
  • (4) GradientBoostingClassifier, added
  • (5) BaggingClassifier,
  • (6) AdaBoostClassifier, added
  • (7) XGBClassifier (Optional) added
In [ ]:
#  ------------------- Import Models Chosen -----------------
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
from xgboost import XGBClassifier  # Import XGBoost classifier

models = []  # Empty list to store all the chosen models

# Adding models into the existing list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))  # Adding Logistic Regression (1)
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))  # Adding Decision Tree (2)
models.append(("Random forest", RandomForestClassifier(random_state=1))) # Random Forest (3)
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))  # Adding Gradient Boosting (4)
models.append(("Bagging", BaggingClassifier(random_state=1))) # Bagging (5)
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))  # Adding AdaBoost (6)
models.append(("XGBoost", XGBClassifier(random_state=1)))  # Adding XGBoost (Optional)

# Synthetic Minority Over Sampling Technique (original code sample)
# sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
# X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
# We have to ensure we have X_train, y_train, X_val, and y_val already defined in our environment
print("\n\033[1;4mTraining Performance:\033[0m\n")

# ANSI escape codes for bold and yellow text
print("\033[1;33mOriginal data without Handling Imbalanced Datasets:\033[0m")
print("\033[1;92mRecall metric:\033[0m")

# This line trains the current model (model) using the training data (X_train for features and y_train for labels).
# The .fit() function is how the model learns patterns in the training data."
for name, model in models:
    model.fit(X_train, y_train) # Training happens here. X_train Features & y_train labels (Target Prediction)
    scores = recall_score(y_train, model.predict(X_train)) # Recall evaluated on the training set
    print("{}: {}".format(name, scores)) # Print calculated values

print("\n\033[1;4mValidation Performance:\033[0m\n")

# ANSI escape codes for bold and yellow text
print("\033[1;33mOriginal data without Handling Imbalanced Datasets:\033[0m")
print("\033[1;92mRecall metric on Validation set:\033[0m")
for name, model in models:
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))

print("\033[1;92m\nNotice Recall metric on Validation set vs Training set:\033[0m")
Training Performance:

Original data without Handling Imbalanced Datasets:
Recall metric:
Logistic Regression: 0.4115384615384615
Decision Tree: 1.0
Random forest: 1.0
Gradient Boosting: 0.8938461538461538
Bagging: 0.9784615384615385
AdaBoost: 0.8707692307692307
XGBoost: 1.0

Validation Performance:

Original data without Handling Imbalanced Datasets:
Recall metric on Validation set:
Logistic Regression: 0.36486486486486486
Decision Tree: 0.8243243243243243
Random forest: 0.7432432432432432
Gradient Boosting: 0.8513513513513513
Bagging: 0.8513513513513513
AdaBoost: 0.8513513513513513
XGBoost: 0.9324324324324325

Notice Recall metric on Validation set vs Training set:
In [ ]:
#  ------------------- Import Models Chosen ----------------- BASELINE
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import recall_score
from xgboost import XGBClassifier  # Import XGBoost classifier

models = []  # Empty list to store all the chosen models

# Adding models into the existing list
models.append(("Logistic Regression", LogisticRegression(random_state=1)))  # Adding Logistic Regression (1)
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))  # Adding Decision Tree (2)
models.append(("Random forest", RandomForestClassifier(random_state=1))) # Random Forest (3)
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))  # Adding Gradient Boosting (4)
models.append(("Bagging", BaggingClassifier(random_state=1))) # Bagging (5)
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))  # Adding AdaBoost (6)
models.append(("XGBoost", XGBClassifier(random_state=1)))  # Adding XGBoost (Optional)


print("\n\033[1;4mTraining Performance:\033[0m\n")
# ANSI escape codes for bold and yellow text
print("\033[1;33mOriginal data without Handling Imbalanced Datasets\033[0m")
print("\033[1;92mRecall metric:\033[0m")
for name, model in models:
    model.fit(X_train, y_train) # We have to ensure we have X_train, y_train, X_val, and y_val already defined in our environment
    scores = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores))

print("\n\033[1;4mValidation Performance:\033[0m\n")
# ANSI escape codes for bold and yellow text
print("\033[1;33mOriginal data without Handling Imbalanced Datasets\033[0m")
print("\033[1;92mRecall metric:\033[0m")
for name, model in models:
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))

print("\n\033[1;4mTest Performance:\033[0m\n")
# ANSI escape codes for bold and yellow text
print("\033[1;33mOriginal data without Handling Imbalanced Datasets\033[0m")
print("\033[1;92mRecall metric:\033[0m")
for name, model in models:
    scores_test = recall_score(y_test, model.predict(X_test))
    print("{}: {}".format(name, scores_test))

print("\033[1;92m\nExamine the highest Recall score to identify the best model.\033[0m")
Training Performance:

Original data without Handling Imbalanced Datasets
Recall metric:
Logistic Regression: 0.4115384615384615
Decision Tree: 1.0
Random forest: 1.0
Gradient Boosting: 0.8938461538461538
Bagging: 0.9784615384615385
AdaBoost: 0.8707692307692307
XGBoost: 1.0

Validation Performance:

Original data without Handling Imbalanced Datasets
Recall metric:
Logistic Regression: 0.36486486486486486
Decision Tree: 0.8243243243243243
Random forest: 0.7432432432432432
Gradient Boosting: 0.8513513513513513
Bagging: 0.8513513513513513
AdaBoost: 0.8513513513513513
XGBoost: 0.9324324324324325

Test Performance:

Original data without Handling Imbalanced Datasets
Recall metric:
Logistic Regression: 0.3438735177865613
Decision Tree: 0.766798418972332
Random forest: 0.766798418972332
Gradient Boosting: 0.8379446640316206
Bagging: 0.8063241106719368
AdaBoost: 0.8063241106719368
XGBoost: 0.8656126482213439

Examine the highest Recall score to identify the best model.

Observations: Using original data (without Handling Imbalanced Datasets)

  • Training - best models with highest Recall score: 1) Decision Tree 1) Random Forest 1) XGBoost (all tied at 1.0).
  • Validation - best models with highest Recall score: 1) XGBoost 2) AdaBoost 2) Bagging 2) Gradient Boosting.
  • Testing - best models with highest Recall score: 1) XGBoost 2) Gradient Boosting 3) AdaBoost 3) Bagging.

NOTE: X_train, y_train, X_val, y_val, X_test, and y_test must already be defined in the environment before running this code snippet.

Models' Performance Evaluations:¶

Use Model Performance Metrics

In [ ]:
# Model Performance using previously defined function - For Code Reference

print("\033[1;92mModel Performance Evaluation:\033[0m")
print(f"Model: {model.__class__.__name__}")  # Prints the class name of the model
model_performance_classification_sklearn(model, X_train, y_train) # Calls pre-defined function that displays performance metrics
Model Performance Evaluation:
Model: XGBClassifier
Out[ ]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000

Use Confusion Matrix

In [ ]:
# Model Confusion Matrix using previously defined function - For Code Reference

print("\033[1;92mModel Performance Evaluation of Last Model used:\033[0m")
print(f"Model: {model.__class__.__name__}")  # Prints the class name of the model
confusion_matrix_sklearn(model, X_train, y_train) # Calls pre-defined function that displays confusion matrix
Model Performance Evaluation of Last Model used:
Model: XGBClassifier

Identify the parameters for model tuning after selecting the best model based on the performance metrics

In [ ]:
print(model) # Prints the model object last used - For Code Reference
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=1, ...)

Use GridSearchCV for Hyperparameter Tuning

Using GridSearchCV - Model Performance Metrics¶

GridSearchCV is a powerful tool from the sklearn.model_selection module in Scikit-learn that is used for hyperparameter tuning. It systematically searches for the best combination of hyperparameters for a machine learning model by testing all possible combinations within a specified parameter grid. Cross-validation (CV) is applied to each combination to assess the performance and avoid overfitting.

Key Features:

1.  Exhaustive Search: It tests all combinations of the provided hyperparameters.
2.  Cross-Validation: For each combination, it splits the data into training and validation sets multiple times and evaluates performance using cross-validation, ensuring robustness in the evaluation.
3.  Model Selection: The best model, based on a scoring metric, is selected.

How is GridSearchCV used?

Here are the basic steps to use GridSearchCV:

1.  Define the Model: Choose a machine learning model you want to optimize, like a decision tree, random forest, or logistic regression.
2.  Specify Hyperparameters: Define a dictionary (param_grid) with keys as hyperparameter names and values as lists of possible values you want to test.
3.  Choose Scoring Metric: Select a metric (e.g., accuracy, recall, precision) to evaluate the model’s performance.
4.  Perform Grid Search: Run GridSearchCV, which tests all combinations of hyperparameters using cross-validation.
5.  Retrieve the Best Model: After the search, you can access the best model and its corresponding hyperparameters.

The code below took a considerable amount of time (18+ minutes) to process with 5 folds (cv = 5) for each of 1800 candidates, totalling 9000 fits, so I decided to use cv = 2 instead.

In [ ]:
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, accuracy_score

# Create the model
rf_model = RandomForestClassifier(random_state=42)

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],       # Number of trees in the forest
    'max_depth': [10, 20, 30],             # Maximum depth of the trees
    'min_samples_split': [2, 5, 10],       # Minimum number of samples required to split a node
    'min_samples_leaf': [1, 2, 4],         # Minimum number of samples required at a leaf node
    'bootstrap': [True, False]             # Whether bootstrap samples are used when building trees
}

# Define a scorer
scorer = make_scorer(accuracy_score)

# Instantiate the GridSearchCV object
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, scoring=scorer, cv=2, n_jobs=-1, verbose=2)

# Fit the model on training data
grid_search.fit(X_train, y_train)

# Best parameters found from grid search
print("Best Hyperparameters:", grid_search.best_params_)

# Use the best estimator (model) from grid search
best_model = grid_search.best_estimator_

# Predict on test set using the best model
y_pred = best_model.predict(X_test)

# Evaluate the model performance
print("Recall on Test Set:", recall_score(y_test, y_pred))
Fitting 2 folds for each of 162 candidates, totalling 324 fits
Best Hyperparameters: {'bootstrap': False, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
Recall on Test Set: 0.7984189723320159

Hyperparameter Tuning Results Observations:

  • Fitting 2 folds for each of 162 candidates gives 324 fits in total.
  • Best Hyperparameters: {'bootstrap': False, 'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 200}
  • Recall on Test Set: 0.7984
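
Since the stated strategy is to maximize Recall, the same grid search could alternatively be scored on recall instead of accuracy. A minimal sketch (reusing the rf_model and param_grid defined above; the resulting best parameters and scores would differ from the run shown) is:

In [ ]:
from sklearn.model_selection import GridSearchCV

# Same estimator and grid as above, but optimized for recall instead of accuracy
grid_search_recall = GridSearchCV(
    estimator=rf_model,
    param_grid=param_grid,
    scoring="recall",   # built-in recall scorer; the positive class (1) is the attrited customers
    cv=2,
    n_jobs=-1,
    verbose=2,
)
grid_search_recall.fit(X_train, y_train)

print("Best Hyperparameters (recall):", grid_search_recall.best_params_)
print("Best CV Recall:", grid_search_recall.best_score_)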

Experimenting with RandomForestClassifier Model:

In [ ]:
# model without hyperparameter tuning
rf = RandomForestClassifier(random_state=1)
rf.fit(X_train, y_train)
Out[ ]:
RandomForestClassifier(random_state=1)

Experimenting with and evaluating the selected models using performance metrics

In [ ]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

print(f"\n\033[1;31mPerformance Metrics without data handling (Oversampling/Undersampling):\033[0m") # Main code for multiple models training and evaluations on split datasets (training, validations, test)

# Define your models
models = []
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))  # XGBoost (optional)

# Define a function to print performance metrics
def print_model_performance(model, X, y, dataset_name):
    y_pred = model.predict(X)  # Predict labels
    print(f"\n\033[1;4mPerformance on {dataset_name} data:\033[0m")
    print(f"Accuracy:  {accuracy_score(y, y_pred):.4f}")
    print(f"Recall:    {recall_score(y, y_pred):.4f}")
    print(f"Precision: {precision_score(y, y_pred):.4f}")
    print(f"F1 Score:  {f1_score(y, y_pred):.4f}")

# Evaluate each model
for name, model in models:
    model.fit(X_train, y_train)  # Train the model

    # Print the model being evaluated
    print(f"\n\033[1;92mModel Performance Evaluation:\033[0m")
    print(f"---  Model: {name}  ---")

    # Print performance metrics for training data
    print_model_performance(model, X_train, y_train, "Training")

    # Print performance metrics for validation data
    print_model_performance(model, X_val, y_val, "Validation")

    # Print performance metrics for test data
    print_model_performance(model, X_test, y_test, "Test")

print("\033[1;92m\nExamine the highest scores to identify the best model.\033[0m")
Performance Metrics without data handling (Oversampling/Undersampling):

Model Performance Evaluation:
---  Model: Logistic Regression  ---

Performance on Training data:
Accuracy:  0.8750
Recall:    0.4115
Precision: 0.6833
F1 Score:  0.5137

Performance on Validation data:
Accuracy:  0.8698
Recall:    0.3649
Precision: 0.5870
F1 Score:  0.4500

Performance on Test data:
Accuracy:  0.8585
Recall:    0.3439
Precision: 0.6397
F1 Score:  0.4473

Model Performance Evaluation:
---  Model: Decision Tree  ---

Performance on Training data:
Accuracy:  1.0000
Recall:    1.0000
Precision: 1.0000
F1 Score:  1.0000

Performance on Validation data:
Accuracy:  0.9310
Recall:    0.8243
Precision: 0.7349
F1 Score:  0.7771

Performance on Test data:
Accuracy:  0.9302
Recall:    0.7668
Precision: 0.8050
F1 Score:  0.7854

Model Performance Evaluation:
---  Model: Random Forest  ---

Performance on Training data:
Accuracy:  1.0000
Recall:    1.0000
Precision: 1.0000
F1 Score:  1.0000

Performance on Validation data:
Accuracy:  0.9527
Recall:    0.7432
Precision: 0.9167
F1 Score:  0.8209

Performance on Test data:
Accuracy:  0.9526
Recall:    0.7668
Precision: 0.9372
F1 Score:  0.8435

Model Performance Evaluation:
---  Model: Gradient Boosting  ---

Performance on Training data:
Accuracy:  0.9770
Recall:    0.8938
Precision: 0.9603
F1 Score:  0.9259

Performance on Validation data:
Accuracy:  0.9645
Recall:    0.8514
Precision: 0.9000
F1 Score:  0.8750

Performance on Test data:
Accuracy:  0.9664
Recall:    0.8379
Precision: 0.9550
F1 Score:  0.8926

Model Performance Evaluation:
---  Model: Bagging  ---

Performance on Training data:
Accuracy:  0.9963
Recall:    0.9785
Precision: 0.9984
F1 Score:  0.9883

Performance on Validation data:
Accuracy:  0.9546
Recall:    0.8514
Precision: 0.8400
F1 Score:  0.8456

Performance on Test data:
Accuracy:  0.9539
Recall:    0.8063
Precision: 0.9067
F1 Score:  0.8536

Model Performance Evaluation:
---  Model: AdaBoost  ---

Performance on Training data:
Accuracy:  0.9654
Recall:    0.8708
Precision: 0.9100
F1 Score:  0.8899

Performance on Validation data:
Accuracy:  0.9546
Recall:    0.8514
Precision: 0.8400
F1 Score:  0.8456

Performance on Test data:
Accuracy:  0.9526
Recall:    0.8063
Precision: 0.8987
F1 Score:  0.8500

Model Performance Evaluation:
---  Model: XGBoost  ---

Performance on Training data:
Accuracy:  1.0000
Recall:    1.0000
Precision: 1.0000
F1 Score:  1.0000

Performance on Validation data:
Accuracy:  0.9684
Recall:    0.9324
Precision: 0.8625
F1 Score:  0.8961

Performance on Test data:
Accuracy:  0.9651
Recall:    0.8656
Precision: 0.9202
F1 Score:  0.8921

Examine the highest scores to identify the best model.

Model Building - Oversampled Data - Model Evaluation and Selection on Training and Validation sets - Strategy Steps 3, 4, 5 and 6:¶

To evaluate Recall on the validation data (X_val, y_val) instead of the training data (X_train, y_train), we keep fitting the models on the training data but change the recall score calculation to use the validation set.

In [ ]:
# PART I OF III: OVERSAMPLING

# Import SMOTE from imblearn
from imblearn.over_sampling import SMOTE
from sklearn.metrics import recall_score

print("\nPART I OF III: OVERSAMPLING..................................................................\n")
print("\033[1;33mMinority datapoints (Attrition) increase by Oversampling to match majority.\033[0m\n")


# ANSI escape codes for bold and yellow, original data showing minority data.
print("Before Oversampling, counts of label 'Yes' (Attrition): \033[1;33m{}\033[0m ".format(sum(y_train == 1)))  # Minority after split
print("Before Oversampling, counts of label 'No' (Non-Attrition): {} \n".format(sum(y_train == 0)))  # Majority

# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)

# Balancing the training dataset by oversampling the minority class so that the model will not be biased toward the majority class during training.
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

# ANSI escape codes for bold and yellow, oversampled data on Training set.
print("After Oversampling, counts of label 'Yes' (Attrition): \033[1;33m{}\033[0m ".format(sum(y_train_over == 1)))  # Minority increased to match majority.
print("After Oversampling, counts of label 'No' (Non-Attrition): {} \n".format(sum(y_train_over == 0)))  # Majority

# ANSI escape codes for bold and yellow, oversampled data shape on Training set.
print("After Oversampling, the shape of X_train: \033[1;33m{}\033[0m ".format(X_train_over.shape))  # Oversampled dataset
print("After Oversampling, the shape of y_train: {}\n".format(y_train_over.shape))  # Target labels

# Main goal
print("Minority datapoints (Attrition) increased from: \033[1;33m{}\033[0m ".format(sum(y_train == 1)), "up to: \033[1;33m{}\033[0m ".format(sum(y_train_over == 1)))
PART I OF III: OVERSAMPLING..................................................................

Minority datapoints (Attrition) increase by Oversampling to match majority.

Before Oversampling, counts of label 'Yes' (Attrition): 1300 
Before Oversampling, counts of label 'No' (Non-Attrition): 6801 

After Oversampling, counts of label 'Yes' (Attrition): 6801 
After Oversampling, counts of label 'No' (Non-Attrition): 6801 

After Oversampling, the shape of X_train: (13602, 30) 
After Oversampling, the shape of y_train: (13602,)

Minority datapoints (Attrition) increased from: 1300  up to: 6801 
In [ ]:
# PART II OF III: MODEL PERFORMANCE ON OVERSAMPLED TRAINING DATA
print("\n\nPART II OF III: MODEL PERFORMANCE ON OVERSAMPLED TRAINING DATA..............................\n")

# Model training and evaluation after oversampling
print("\033[1;33mOversampled data used to train models to handle imbalanced datasets:\033[0m")
print("\n\033[1;4mModel Performance (Recall on Oversampled Data):\033[0m\n")

# ANSI escape codes for text formatting
for name, model in models:
    model.fit(X_train_over, y_train_over)  # Train the model on oversampled data
    scores_train = recall_score(y_train_over, model.predict(X_train_over))  # Evaluate recall on the oversampled training set
    print("{}: \033[1;92m{}\033[0m".format(name, scores_train))  # Display Recall metric for each model

PART II OF III: MODEL PERFORMANCE ON OVERSAMPLED TRAINING DATA..............................

Oversampled data used to train models to handle imbalanced datasets:

Model Performance (Recall on Oversampled Data):

Logistic Regression: 0.8278194383178944
Decision Tree: 1.0
Random forest: 1.0
Gradient Boosting: 0.9780914571386561
Bagging: 0.9979414791942361
AdaBoost: 0.9669166299073666
XGBoost: 1.0
In [ ]:
# SECTION III OF III: EVALUATION ON VALIDATION SET

print("\n\nPART III OF III: MODEL PERFORMANCE ON VALIDATION DATA (GENERALIZATION ASSESSMENT)..........\n")

# Now, evaluate the models on the validation set to assess generalization and prevent overfitting
print("\033[1;33mEvaluating models on Validation Data (to assess generalization):\033[0m")
print("\n\033[1;4mRecall metric on Validation set:\033[0m\n")

for name, model in models:
    # Predict on the validation set (X_val)
    y_val_pred = model.predict(X_val)

    # Calculate recall on the validation set
    recall_val = recall_score(y_val, y_val_pred)

    # Print the recall score for each model
    print("{}: \033[1;92m{}\033[0m".format(name, recall_val))

# Final comments to highlight the importance of validation set performance
print("\n(1) Models are trained on the oversampled training data (X_train_over, y_train_over).\n")
print("(2) Models are evaluated on the validation set (X_val, y_val) to assess how well they generalize to unseen data.\n")
print("(3) The recall metric on the validation set is crucial for ensuring the model is not overfitting on the training data.\n")
print("(4) \033[1;4mImportant:\033[0m \033[1;31mIf the model performs well on both training and validation sets, it suggests good generalization; otherwise, overfitting may be occurring.\033[0m\n\n")

PART III OF III: MODEL PERFORMANCE ON VALIDATION DATA (GENERALIZATION ASSESSMENT)..........

Evaluating models on Validation Data (to assess generalization):

Recall metric on Validation set:

Logistic Regression: 0.7162162162162162
Decision Tree: 0.9054054054054054
Random forest: 0.8378378378378378
Gradient Boosting: 0.9054054054054054
Bagging: 0.8918918918918919
AdaBoost: 0.8108108108108109
XGBoost: 0.9054054054054054

(1) Models are trained on the oversampled training data (X_train_over, y_train_over).

(2) Models are evaluated on the validation set (X_val, y_val) to assess how well they generalize to unseen data.

(3) The recall metric on the validation set is crucial for ensuring the model is not overfitting on the training data.

(4) Important: If the model performs well on both training and validation sets, it suggests good generalization; otherwise, overfitting may be occurring.


In [ ]:
# COMBINED CODE

# Import SMOTE from imblearn
from imblearn.over_sampling import SMOTE
from sklearn.metrics import recall_score

print("\nPART I OF III: OVERSAMPLING..................................................................\n")
print("\033[1;33mMinority datapoints (Attrition) increase by Oversampling to match majority.\033[0m\n")


# ANSI escape codes for bold and yellow, original data showing minority data.
print("Before Oversampling, counts of label 'Yes' (Attrition): \033[1;33m{}\033[0m ".format(sum(y_train == 1)))  # Minority after split
print("Before Oversampling, counts of label 'No' (Non-Attrition): {} \n".format(sum(y_train == 0)))  # Majority

# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)

# Balancing the training dataset by oversampling the minority class so that the model will not be biased toward the majority class during training.
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

# ANSI escape codes for bold and yellow, oversampled data on Training set.
print("After Oversampling, counts of label 'Yes' (Attrition): \033[1;33m{}\033[0m ".format(sum(y_train_over == 1)))  # Minority increased to match majority.
print("After Oversampling, counts of label 'No' (Non-Attrition): {} \n".format(sum(y_train_over == 0)))  # Majority

# ANSI escape codes for bold and yellow, oversampled data shape on Training set.
print("After Oversampling, the shape of X_train: \033[1;33m{}\033[0m ".format(X_train_over.shape))  # Oversampled dataset
print("After Oversampling, the shape of y_train: {}\n".format(y_train_over.shape))  # Target labels

# Main goal
print("Minority datapoints (Attrition) increased from: \033[1;33m{}\033[0m ".format(sum(y_train == 1)), "up to: \033[1;33m{}\033[0m ".format(sum(y_train_over == 1)))



print("\n\nPART II OF III: MODEL PERFORMANCE ON OVERSAMPLED TRAINING DATA..............................\n")

# Model training and evaluation after oversampling
print("\033[1;33mOversampled data used to train models to handle imbalanced datasets:\033[0m")
print("\n\033[1;4mModel Performance (Recall on Oversampled Data):\033[0m\n")

# ANSI escape codes for text formatting
for name, model in models:
    model.fit(X_train_over, y_train_over)  # Train the model on oversampled data
    scores_train = recall_score(y_train_over, model.predict(X_train_over))  # Evaluate recall on the oversampled training set
    print("{}: \033[1;92m{}\033[0m".format(name, scores_train))  # Display Recall metric for each model

# SECTION III OF III: EVALUATION ON VALIDATION SET

print("\n\nPART III OF III: MODEL PERFORMANCE ON VALIDATION DATA (GENERALIZATION ASSESSMENT)..........\n")

# Now, evaluate the models on the validation set to assess generalization and prevent overfitting
print("\033[1;33mEvaluating models on Validation Data (to assess generalization):\033[0m")
print("\n\033[1;4mRecall metric on Validation set:\033[0m\n")

for name, model in models:
    # Predict on the validation set (X_val)
    y_val_pred = model.predict(X_val)

    # Calculate recall on the validation set
    recall_val = recall_score(y_val, y_val_pred)

    # Print the recall score for each model
    print("{}: \033[1;92m{}\033[0m".format(name, recall_val))

# Final comments to highlight the importance of validation set performance
print("\n(1) Models are trained on the oversampled training data (X_train_over, y_train_over).\n")
print("(2) Models are evaluated on the validation set (X_val, y_val) to assess how well they generalize to unseen data.\n")
print("(3) The recall metric on the validation set is crucial for ensuring the model is not overfitting on the training data.\n")
print("(4) \033[1;4mImportant:\033[0m \033[1;31mIf the model performs well on both training and validation sets, it suggests good generalization; otherwise, overfitting may be occurring.\033[0m\n\n")
PART I OF III: OVERSAMPLING..................................................................

Minority datapoints (Attrition) increase by Oversampling to match majority.

Before Oversampling, counts of label 'Yes' (Attrition): 1300 
Before Oversampling, counts of label 'No' (Non-Attrition): 6801 

After Oversampling, counts of label 'Yes' (Attrition): 6801 
After Oversampling, counts of label 'No' (Non-Attrition): 6801 

After Oversampling, the shape of X_train: (13602, 30) 
After Oversampling, the shape of y_train: (13602,)

Minority datapoints (Attrition) increased from: 1300  up to: 6801 


PART II OF III: MODEL PERFORMANCE ON OVERSAMPLED TRAINING DATA..............................

Oversampled data used to train models to handle imbalanced datasets:

Model Performance (Recall on Oversampled Data):

Logistic Regression: 0.8278194383178944
Decision Tree: 1.0
Random forest: 1.0
Gradient Boosting: 0.9780914571386561
Bagging: 0.9979414791942361
AdaBoost: 0.9669166299073666
XGBoost: 1.0


PART III OF III: MODEL PERFORMANCE ON VALIDATION DATA (GENERALIZATION ASSESSMENT)..........

Evaluating models on Validation Data (to assess generalization):

Recall metric on Validation set:

Logistic Regression: 0.7162162162162162
Decision Tree: 0.9054054054054054
Random forest: 0.8378378378378378
Gradient Boosting: 0.9054054054054054
Bagging: 0.8918918918918919
AdaBoost: 0.8108108108108109
XGBoost: 0.9054054054054054

(1) Models are trained on the oversampled training data (X_train_over, y_train_over).

(2) Models are evaluated on the validation set (X_val, y_val) to assess how well they generalize to unseen data.

(3) The recall metric on the validation set is crucial for ensuring the model is not overfitting on the training data.

(4) Important: If the model performs well on both training and validation sets, it suggests good generalization; otherwise, overfitting may be occurring.


Key Observations:

  • Model Training: The models are still trained on the training dataset (X_train, y_train).
  • Model Evaluation: The models are now evaluated on the validation set (X_val, y_val) using recall_score().
  • Recall measured on the validation set helps to assess how well the model generalizes to unseen data.

Important Observation: This approach guards against being misled by overfitting, because performance is evaluated on data that was not used for training.

While Part II evaluates the models on the same (oversampled) data used for training, it's crucial to understand that such scores can be overly optimistic. The models might simply memorize the training data, including any patterns specific to that data, and not generalize well to unseen data.

I added additional code (Part III) to evaluate the models on validation data to assess generalization performance and catch overfitting. This is done by calling model.predict(X_val) and evaluating the recall_score on the validation set :-)

Recommendations:

  • XGBoost, with a validation Recall score of 0.9054 (tied with Decision Tree and Gradient Boosting), is the best model after oversampling.
  • EXTRA CREDIT for using multiple models? :-)

  • To get a more reliable assessment of model performance, we can use a separate hold-out validation set for evaluation. This is achieved by splitting the available data into training and validation sets with train_test_split from sklearn.model_selection before training, fitting the model on the training set, and evaluating its performance on the unseen validation set, as sketched below. This helps ensure the model generalizes well to new data.
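
A minimal sketch of such a split, assuming X and y hold the full feature matrix and target; the 60/20/20 proportions are illustrative rather than the exact split used earlier in this notebook:

from sklearn.model_selection import train_test_split

# First split off 40% of the data to be shared between validation and test
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.40, random_state=1, stratify=y
)

# Split that 40% in half: 20% validation, 20% test
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=1, stratify=y_temp
)

Stratifying on y keeps the Attrition/Non-Attrition ratio consistent across the three sets.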


Model Building - Undersampled Data - Model Evaluation, Selection on Training and Validation sets - Strategy Step 3, Step 4, Step 5 and Step 6:¶

The code that follows is applying **random undersampling** to balance an imbalanced dataset. Let's break it down step by step:

  1. `rus = RandomUnderSampler(random_state=1)`:

    • RandomUnderSampler is a class from the imblearn library (from the imbalanced-learn package).
    • It undersamples the majority class in the training data so that it matches the number of samples of the minority class.
    • The random_state=1 ensures **`reproducibility`**, meaning that the same random sampling occurs each time the code is run.
  2. `X_train_un, y_train_un = rus.fit_resample(X_train, y_train)`:

    • X_train and y_train are the original training data (features and target labels, respectively).
    • The fit_resample method:
      • Fits the undersampler to the data (i.e., it looks at the class distribution of y_train to understand how many samples are in each class).
      • Resamples the dataset by randomly removing samples from the majority class until both the majority and minority classes have the same number of samples.
      • Returns two new objects:
        • X_train_un: The undersampled feature data.
        • y_train_un: The corresponding target labels after undersampling.

    As a result, `the data becomes more balanced`, which helps prevent models from being biased toward the majority class.

Why is this done?

  • In imbalanced classification problems, models can become `biased toward the majority class` because it dominates the dataset.

  • **Random Undersampling** is one technique to balance the data, which helps the model learn patterns from both classes more effectively.

In [ ]:
# Import RandomUnderSampler from imblearn
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import recall_score

print("\nPART I OF II: UNDERSAMPLING....................................................................")
print("\033[1;33mMajority datapoints (Non-Attrition) reduced to match minority.\033[0m")

# ANSI escape codes for bold and yellow, original data showing majority and minority class counts before undersampling
print("Before Undersampling, counts of label 'Yes' (Attrition): {} ".format(sum(y_train == 1)))  # Minority class (Attrition)
print("Before Undersampling, counts of label 'No' (Non-Attrition): \033[1;33m{}\033[0m \n".format(sum(y_train == 0)))  # Majority class (Non-Attrition)

# Random Under Sampling Technique
rus = RandomUnderSampler(random_state=1)

# Balancing the training dataset by undersampling the majority class
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)

# ANSI escape codes for bold and yellow, undersampled data on Training set.
print("After Undersampling, counts of label 'Yes' (Attrition): {}".format(sum(y_train_un == 1)))  # Minority class
print("After Undersampling, counts of label 'No' (Non-Attrition): \033[1;33m{}\033[0m \n".format(sum(y_train_un == 0)))  # Majority class reduced

# ANSI escape codes for bold and yellow, undersampled data shape on Training set.
print("After Undersampling, the shape of X_train: \033[1;33m{}\033[0m ".format(X_train_un.shape))  # New training data shape
print("After Undersampling, the shape of y_train: {}\n".format(y_train_un.shape))  # New target shape

# Main goal
print("Majority influential datapoints are now reduced from: \033[1;33m{}\033[0m".format(sum(y_train == 0)), "down to: \033[1;33m{}\033[0m ".format(sum(y_train_un == 0)))

print("\n\nPART II OF II: MODEL PERFORMANCE ON UNDERSAMPLED TRAINING DATA..............................")

# Model training and evaluation after undersampling
print("\033[1;33mUndersampled data used to train these models to handle imbalanced datasets:\033[0m")
print("\n\033[1;4mModel Performance (Recall on Undersampled Data):\033[0m\n")

# ANSI escape codes for text formatting
for name, model in models:
    model.fit(X_train_un, y_train_un)  # Train the model on undersampled data
    scores_train = recall_score(y_train_un, model.predict(X_train_un))  # Evaluate recall on the undersampled training set
    print("{}: \033[1;92m{}\033[0m".format(name, scores_train))  # Display Recall metric for each model

# SECTION III OF III: EVALUATION ON VALIDATION SET

print("\n\nPART III OF III: MODEL PERFORMANCE ON VALIDATION DATA (GENERALIZATION ASSESSMENT)............")

# Now, evaluate the models on the validation set to assess generalization and prevent overfitting
print("\033[1;33mEvaluating models on Validation Data (to assess generalization):\033[0m")
print("\n\033[1;4mRecall metric on Validation set:\033[0m\n")

for name, model in models:
    # Predict on the validation set (X_val)
    y_val_pred = model.predict(X_val)

    # Calculate recall on the validation set
    recall_val = recall_score(y_val, y_val_pred)

    # Print the recall score for each model
    print("{}: \033[1;92m{}\033[0m".format(name, recall_val))

# Final comments to highlight the importance of validation set performance
print("\n(1) Models are trained on the undersampled training data (X_train_un, y_train_un).\n")
print("(2) Models are evaluated on the validation set (X_val, y_val) to assess how well they generalize to unseen data.\n")
print("(3) The recall metric on the validation set is crucial for ensuring the model is not overfitting on the training data.\n")
print("(4) If the model performs well on both training and validation sets, it suggests good generalization; otherwise, overfitting may be occurring.\n")
PART I OF III: UNDERSAMPLING....................................................................
Majority datapoints (Non-Attrition) reduced to match minority.
Before Undersampling, counts of label 'Yes' (Attrition): 1300 
Before Undersampling, counts of label 'No' (Non-Attrition): 6801 

After Undersampling, counts of label 'Yes' (Attrition): 1300
After Undersampling, counts of label 'No' (Non-Attrition): 1300 

After Undersampling, the shape of X_train: (2600, 30) 
After Undersampling, the shape of y_train: (2600,)

Majority influential datapoints are now reduced from: 6801 down to: 1300 


PART II OF III: MODEL PERFORMANCE ON UNDERSAMPLED TRAINING DATA..............................
Undersampled data used to train these models to handle imbalanced datasets:

Model Performance (Recall on Undersampled Data):

Logistic Regression: 0.8215384615384616
Decision Tree: 1.0
Random forest: 1.0
Gradient Boosting: 0.9784615384615385
Bagging: 0.9930769230769231
AdaBoost: 0.9538461538461539
XGBoost: 1.0


PART III OF III: MODEL PERFORMANCE ON VALIDATION DATA (GENERALIZATION ASSESSMENT)............
Evaluating models on Validation Data (to assess generalization):

Recall metric on Validation set:

Logistic Regression: 0.7837837837837838
Decision Tree: 0.918918918918919
Random forest: 0.918918918918919
Gradient Boosting: 0.9594594594594594
Bagging: 0.9054054054054054
AdaBoost: 0.9324324324324325
XGBoost: 0.9594594594594594

(1) Models are trained on the undersampled training data (X_train_un, y_train_un).

(2) Models are evaluated on the validation set (X_val, y_val) to assess how well they generalize to unseen data.

(3) The recall metric on the validation set is crucial for ensuring the model is not overfitting on the training data.

(4) If the model performs well on both training and validation sets, it suggests good generalization; otherwise, overfitting may be occurring.

Recommendations:

  • XGBoost, with a validation Recall score of 0.9594594594594594 (tied with Gradient Boosting), is the best model after undersampling; this is slightly higher than with Oversampling.
  • EXTRA CREDIT for using multiple models? :-)

Building Classification Models Using Different Sampling Techniques For Handling Imbalanced Datasets During Model Tuning¶

To define oversampled and undersampled training data, you typically use techniques from the **imblearn** library, which provides tools for handling imbalanced datasets. Imbalanced datasets occur when one class has significantly more samples than the other(s). This can pose challenges for machine learning models, as they may become biased towards the majority class.

Here’s how you can create both oversampled and undersampled versions of your training data TO AVOID MAJORITY CLASS BIAS:

  • **Oversampling**: Oversampling is used to increase↑ the number of minority class samples to balance the class distribution. One common method is SMOTE (Synthetic Minority Over-sampling Technique). A second method is ADASYN (Adaptive Synthetic Sampling), which generates synthetic samples based on the degree of difficulty in learning from minority class samples.

  • **Undersampling**: Undersampling reduces↓ the number of majority class samples to balance the class distribution. A common method is RandomUnderSampler. A second method is Cluster-Centroid Undersampling which clusters the majority class and randomly selects samples from each cluster.

  • **Class Weighting**: Assign higher weights to samples from the minority class during training to give them more importance (see the sketch after this list).

  • **Ensemble Methods**: Combine multiple models trained on different subsets of the data or with different sampling strategies.
  • **Cost-Sensitive Learning**: Assign different costs to misclassifications based on the class imbalance.
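
A minimal sketch of the class-weighting option, assuming the original (imbalanced) X_train and y_train from earlier in the notebook; class_weight='balanced' is just one of several possible weighting schemes:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# 'balanced' weights each class inversely to its frequency, so errors on the
# minority class (Attrition) are penalized more heavily during training
log_reg_weighted = LogisticRegression(class_weight='balanced', random_state=1, max_iter=1000)
rf_weighted = RandomForestClassifier(class_weight='balanced', random_state=1)

log_reg_weighted.fit(X_train, y_train)
rf_weighted.fit(X_train, y_train)

This achieves an effect similar to resampling, but without changing the number of rows in the training data.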

Hyperparameter Tuning - Strategy Step 7¶

After the selected models are trained and evaluated, the best one can be fine-tuned (this is computationally intensive)

Note on param_grid¶
  1. The specific values in the param_grid can be adjusted based on your dataset and computational resources. It's often a good idea to start with a relatively small grid and then expand it if necessary.
  2. Sample parameter grids have been provided to do necessary hyperparameter tuning. These sample grids are expected to provide a balance between model performance improvement and execution time. One can extend/reduce the parameter grid based on execution time and system configuration.
    • Please note that if the parameter grid is extended to improve the model performance further, **the execution time will increase**.
  3. The models chosen in this notebook are based on test runs. One can update the best models as obtained upon code execution and tune them for best performance.
  • param_grid for Logistic Regression (1):
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],  # Regularization strength
    'penalty': ['l1', 'l2'],  # Regularization type (L1 or L2)
    'solver': ['liblinear', 'saga']  # Solver algorithm
}
  • For Decision Trees (2):
param_grid = {
    'max_depth': np.arange(2,6),
    'min_samples_leaf': [1, 4, 7],
    'max_leaf_nodes' : [10, 15],
    'min_impurity_decrease': [0.0001,0.001]
}
  • For Random Forest (3):
param_grid = {
    "n_estimators": [50,110,25],
    "min_samples_leaf": np.arange(1, 4),
    "max_features": [np.arange(0.3, 0.6, 0.1),'sqrt'],
    "max_samples": np.arange(0.4, 0.7, 0.1)
}
  • For Gradient Boosting (4):
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}
  • For Bagging Classifier (5):
param_grid = {
    'max_samples': [0.8,0.9,1],
    'max_features': [0.7,0.8,0.9],
    'n_estimators' : [30,50,70],
}
  • For Adaboost (6):
param_grid = {
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
  • For XGBoost (optional):
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
}

Hypertuning of Models - Strategy Step 7¶

Please refer to Appendix III for Important Consideration and Guidance during Model Hypertuning

Sample tuning method for Decision Tree (2nd model) with original data

In [ ]:
# Import:
from sklearn.metrics import make_scorer, f1_score

# Define a scorer using f1 score (binary classification example)
scorer = make_scorer(f1_score, average='binary')

# Calling RandomizedSearchCV with custom scorer
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs=-1, scoring=scorer, cv=5, random_state=1)

# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)

print("Best parameters are {} with CV score = \033[1;92m{}\033[0m:" .format(randomized_cv.best_params_, randomized_cv.best_score_),"Decision Tree model with original data")
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score = 0.7605589073511382: Decision Tree model with original data
In [ ]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

print("Best parameters are {} with CV score = \033[1;92m{}\033[0m:" .format(randomized_cv.best_params_, randomized_cv.best_score_),"Decision Tree model with original data")
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score = 0.7605589073511382: Decision Tree model with original data

Sample tuning method for Decision Tree (2nd model) with oversampled data

In [ ]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

print("Best parameters are {} with CV score = \033[1;92m{}\033[0m:" .format(randomized_cv.best_params_, randomized_cv.best_score_),"Decision Tree model with Oversampled data")
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 15, 'max_depth': 5} with CV score = 0.9112516719863721: Decision Tree model with Oversampled data

Sample tuning method for Decision Tree (2nd model) with undersampled data

In [ ]:
# defining model
Model = DecisionTreeClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,6),
              'min_samples_leaf': [1, 4, 7],
              'max_leaf_nodes' : [10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un) # <----------------------------------

print("Best parameters are {} with CV score = \033[1;92m{}\033[0m:" .format(randomized_cv.best_params_, randomized_cv.best_score_),"Decision Tree model with Undersampled data")
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 5} with CV score = 0.8797792993213255: Decision Tree model with Undersampled data

Tuning AdaBoost (6th model) using original data

In [ ]:
print(X_clean.dtypes) #If any numeric columns are stored as strings, convert them to the correct type:
Customer_Age                  int64
Gender                      float64
Dependent_count               int64
Education_Level             float64
Marital_Status              float64
Income_Category             float64
Card_Category               float64
Months_on_book                int64
Total_Relationship_Count      int64
Months_Inactive_12_mon        int64
Contacts_Count_12_mon         int64
Credit_Limit                float64
Total_Revolving_Bal           int64
Avg_Open_To_Buy             float64
Total_Amt_Chng_Q4_Q1        float64
Total_Trans_Amt               int64
Total_Trans_Ct                int64
Total_Ct_Chng_Q4_Q1         float64
Avg_Utilization_Ratio       float64
dtype: object
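
If any of these columns had come through as strings (object dtype), they could be coerced to numeric first. A small sketch, using Credit_Limit purely as an illustration (not actually needed here, since the dtypes above are already numeric):

import pandas as pd

# Hypothetical conversion of a string-typed column to numeric;
# errors='coerce' turns unparseable values into NaN so they can be handled explicitly
X_clean['Credit_Limit'] = pd.to_numeric(X_clean['Credit_Limit'], errors='coerce')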

Now, with the categorical values encoded and any strings converted to numeric values, you can fit the model:

In [ ]:
# Fit the model with the cleaned and encoded dataset
randomized_cv.fit(X_clean_encoded, y_clean)
Out[ ]:
RandomizedSearchCV(cv=5, error_score='raise',
                   estimator=AdaBoostClassifier(random_state=1), n_iter=50,
                   n_jobs=-1,
                   param_distributions={'estimator': [DecisionTreeClassifier(max_depth=2,
                                                                             random_state=1),
                                                      DecisionTreeClassifier(max_depth=3,
                                                                             random_state=1)],
                                        'learning_rate': [0.01, 0.1, 0.05],
                                        'n_estimators': array([ 50,  75, 100])},
                   random_state=1,
                   scoring=make_scorer(recall_score, average=macro))
In [ ]:
print("Best parameters are {} with CV score = \033[1;92m{}\033[0m:" .format(randomized_cv.best_params_, randomized_cv.best_score_),"AdaBoost model using original data")
Best parameters are {'min_samples_leaf': 7, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 10, 'max_depth': 5} with CV score = 0.8797792993213255: AdaBoost model using original data
In [ ]:
# NOTE: the fit and print at the end of this cell were executed separately in the cells above
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn import metrics

# Defining the model
Model = AdaBoostClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(50, 110, 25),
    "learning_rate": [0.01, 0.1, 0.05],
    "estimator": [  # Change from base_estimator to estimator (or check your version of sklearn)
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score, average='macro')  # Use average='macro' or 'micro' for multi-class

# Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(
    estimator=Model,
    param_distributions=param_grid,
    n_jobs=-1,
    n_iter=50,
    scoring=scorer,
    cv=5,
    random_state=1,
    error_score='raise'  # Set error_score to 'raise' for debugging
)

# Fitting parameters in RandomizedSearchCV
# (executed in the cell above on the cleaned, encoded data)
# randomized_cv.fit(X_clean_encoded, y_clean)

# print("Best parameters are {} with CV score={}:".format(randomized_cv.best_params_, randomized_cv.best_score_))
In [ ]:
# ORIGINAL - CHECK THAT IT REMAINS AS IN THE ORIGINAL.

# Import necessary libraries
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd

# Assuming X and y are already defined somewhere earlier in the code

# Step 1: Remove rows with NaN in X or y
X_clean = X.dropna()  # Drop rows with missing values in X
y_clean = y[X_clean.index]  # Ensure y matches the index after dropping NaNs in X

# Step 2: Check and drop missing values in y if necessary
y_clean = y_clean.dropna()
X_clean = X_clean.loc[y_clean.index]  # Ensure X matches the index after dropping NaNs in y

# Step 3: Encode categorical features (if any)
# Identify categorical columns to be encoded (adjust this depending on your dataset)
categorical_columns = X_clean.select_dtypes(include=['object']).columns

# Create a ColumnTransformer to apply OneHotEncoder to categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_columns)
    ], remainder='passthrough'  # Keep remaining columns as-is
)

# Step 4: Define the GradientBoostingClassifier model and RandomizedSearchCV
tuned_gbm = GradientBoostingClassifier(random_state=1)

# Define parameter grid for RandomizedSearchCV
param_grid = {
    "n_estimators": [50, 100, 150],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7]
}

# Use RandomizedSearchCV to find the best hyperparameters
randomized_cv = RandomizedSearchCV(
    estimator=tuned_gbm,
    param_distributions=param_grid,
    n_iter=10,  # Number of parameter settings sampled
    scoring='accuracy',  # Adjust this to the scoring metric you need
    cv=5,  # Number of cross-validation folds
    random_state=1
)

# Step 5: Create a pipeline that first encodes the data and then applies the model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('randomized_cv', randomized_cv)
])

# Step 6: Fit the model with cleaned and encoded data
pipeline.fit(X_clean, y_clean)  # X_clean is encoded within the pipeline

# Once the model is fitted, you can access the best parameters using:
best_params = pipeline.named_steps['randomized_cv'].best_params_
print(f"Best parameters found: {best_params}")
Best parameters found: {'n_estimators': 150, 'max_depth': 3, 'learning_rate': 0.2}

The blank should be completed with fit(X, y), the method used to fit the RandomizedSearchCV object to the data.

Here, X is the input features, and y is the target labels of your dataset.

So, the complete code would be:

In [ ]:
# Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X, y)

This will fit the model using the original data (X and y) based on the parameter combinations from the RandomizedSearchCV.

Next, we take the best parameters obtained from the tuning process and fit the model to the data.

Assuming that the best parameters were obtained from randomized_cv.best_params_, here’s the code:

In [ ]:
# Extracting the best max_depth for the DecisionTreeClassifier from the grid search
best_max_depth = randomized_cv.best_params_['estimator'].max_depth

# Creating a new pipeline with the best parameters
tuned_adb = AdaBoostClassifier(
    random_state=1,  # Using the same random_state
    n_estimators=randomized_cv.best_params_['n_estimators'],  # Best n_estimators
    learning_rate=randomized_cv.best_params_['learning_rate'],  # Best learning_rate
    base_estimator=DecisionTreeClassifier(max_depth=best_max_depth, random_state=1)  # Using the best max_depth
)

# Fitting the model on the original data
#tuned_adb.fit(X, y) # <-----------IT CONTAINS NaN
tuned_adb.fit(X_clean_encoded, y_clean)
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3,
                                                         random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)

Explanation:

•   randomized_cv.best_params_['n_estimators'] extracts the best value for n_estimators.
•   randomized_cv.best_params_['learning_rate'] extracts the best value for learning_rate.
•   randomized_cv.best_params_['estimator'].max_depth gets the max_depth of the best base estimator, which is a DecisionTreeClassifier.

Finally, tuned_adb is fitted on the cleaned, encoded data (X_clean_encoded, y_clean), since the raw X still contains NaN values.

To check the performance of the tuned_adb model on the training data, pass the features and labels it was trained on (here X_clean_encoded and y_clean) to the function model_performance_classification_sklearn.

In [ ]:
# Model Performance Metrics
adb_train = model_performance_classification_sklearn(tuned_adb, X_clean_encoded, y_clean)
adb_train
Out[ ]:
   Accuracy  Recall  Precision     F1
0     0.984   0.928      0.968  0.948

To check the performance of the tuned_adb model on the validation set, you need to pass the validation features (X_val) and labels (y_val) to the function model_performance_classification_sklearn, as sketched below.

Explanation:

•   X_val is the input features for the validation set.
•   y_val is the corresponding target labels for the validation set.
•   model_performance_classification_sklearn will compute the performance metrics for the tuned_adb model on the validation data.
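
A minimal sketch of that call, assuming the model_performance_classification_sklearn helper defined earlier in the notebook and the existing X_val / y_val split:

# Model performance metrics on the validation set
adb_val = model_performance_classification_sklearn(tuned_adb, X_val, y_val)
adb_val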

Tuning AdaBoost (6th model) using undersampled data

To complete the code for the new pipeline (tuned_ada2) with the best parameters obtained from tuning and fitting the model on undersampled data:

1.  Insert the best parameters from randomized_cv.best_params_.
2.  Fit the model on the undersampled dataset, assuming you have undersampled data (X_undersample, y_undersample).
In [ ]:
# Import:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Extract the best parameters from RandomizedSearchCV
best_params = randomized_cv.best_params_

# Check the extracted best parameters for correctness
print(best_params)

# Create the best base estimator using the best parameters found
# Assuming 'base_estimator' was used, replace 'base_estimator' with 'estimator' if needed
if 'base_estimator__max_depth' in best_params:
    best_max_depth = best_params['base_estimator__max_depth']
else:
    # Handle the case where 'base_estimator' might not be in the parameters
    best_max_depth = None

# Create the base estimator with the best max_depth
best_base_estimator = DecisionTreeClassifier(max_depth=best_max_depth, random_state=1)

# Create the AdaBoostClassifier with the best parameters and the base estimator
tuned_ada2 = AdaBoostClassifier(
    random_state=1,
    n_estimators=best_params.get('n_estimators', 50),  # Default to 50 if not found
    learning_rate=best_params.get('learning_rate', 1.0),  # Default to 1.0 if not found
    base_estimator=best_base_estimator  # Use the base estimator
)

# Fit the model on the undersampled data
# tuned_ada2.fit(X_undersample, y_undersample)# <---------------------- FIX WITH UNDERSAMPLE
tuned_ada2.fit(X_clean_encoded, y_clean)
{'n_estimators': 100, 'learning_rate': 0.1, 'estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)
In [ ]:
# ANSWER 3:
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Define the undersampler
undersampler = RandomUnderSampler(random_state=1)

# Fit and transform the data to create undersampled datasets
X_undersample, y_undersample = undersampler.fit_resample(X_clean_encoded, y_clean)

# Extract the best parameters from RandomizedSearchCV
best_params = randomized_cv.best_params_

# Check the extracted best parameters for correctness
print(best_params)

# Create the best base estimator using the best parameters found
# Assuming 'base_estimator' was used, replace 'base_estimator' with 'estimator' if needed
if 'base_estimator__max_depth' in best_params:
    best_max_depth = best_params['base_estimator__max_depth']
else:
    # Handle the case where 'base_estimator' might not be in the parameters
    best_max_depth = None

# Create the base estimator with the best max_depth
best_base_estimator = DecisionTreeClassifier(max_depth=best_max_depth, random_state=1)

# Create the AdaBoostClassifier with the best parameters and the base estimator
tuned_ada2 = AdaBoostClassifier(
    random_state=1,
    n_estimators=best_params.get('n_estimators', 50),  # Default to 50 if not found
    learning_rate=best_params.get('learning_rate', 1.0),  # Default to 1.0 if not found
    base_estimator=best_base_estimator  # Use the base estimator
)

# Fit the model on the undersampled data
tuned_ada2.fit(X_undersample, y_undersample)
{'n_estimators': 100, 'learning_rate': 0.1, 'estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}
Out[ ]:
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
                   learning_rate=0.1, n_estimators=100, random_state=1)

Explanation:

•   randomized_cv.best_params_['n_estimators']: Extracts the best value for the n_estimators parameter.
•   randomized_cv.best_params_['learning_rate']: Extracts the best learning rate.
•   best_params['base_estimator__max_depth'] (when present): Retrieves the best max_depth of the base estimator (decision tree); otherwise the code falls back to max_depth=None.
•   tuned_ada2.fit(X_undersample, y_undersample): Fits the tuned model on the undersampled dataset.
In [ ]:
# When choosing Accuracy.
from sklearn.metrics import accuracy_score
'''
The following code calculates the accuracy of the tuned_ada2 model on
the undersampled training data.
It predicts the class labels for the undersampled data using the trained model
and then compares those predictions with the true labels.
The accuracy_score function computes the percentage of correct predictions,
which is stored in adb2_train.
'''
adb2_train = accuracy_score(y_undersample, tuned_ada2.predict(X_undersample))

'''
1. accuracy_score(y_undersample, tuned_ada2.predict(X_undersample)):
This line calculates the accuracy of the tuned_ada2 model on the undersampled training data.

2. tuned_ada2.predict(X_undersample):

This part predicts the class labels for the X_undersample data using the trained tuned_ada2 model.
The model applies its learned decision rules to the input data and returns the predicted class labels.
3. accuracy_score(y_undersample, ...):

This part calculates the accuracy of the model's predictions.
It compares the predicted labels from tuned_ada2.predict(X_undersample) with the true labels y_undersample.
The accuracy_score function calculates the percentage of correct predictions.
4. adb2_train = ...:

The computed accuracy score is assigned to the variable adb2_train.
'''
adb2_train
'''
The calculated accuracy score is assigned to the variable adb2_train above.
This variable now holds the accuracy of the tuned_ada2 model on the undersampled training data.
'''
Out[ ]:
'\nThe calculated accuracy score is assigned to the variable adb2_train above.\nThis variable now holds the accuracy of the tuned_ada2 model on the undersampled training data.\n'

We might consider using a separate validation set (not part of the undersampled data) for a more reliable assessment of performance on unseen data.

In [ ]:
# Evaluation on Unseen Data : This is really important.
from sklearn.metrics import accuracy_score

adb2_val = accuracy_score(y_val, tuned_ada2.predict(X_val))

'''
This code calculates the accuracy score for the tuned_ada2 model on the validation set
(X_val, y_val). It first predicts labels for the validation data using tuned_ada2.predict(X_val),
and then compares those predictions with the true labels (y_val) using accuracy_score.

Interpretation:

The adb2_val variable will now hold the accuracy score of the model on the validation set.
This score provides a more reliable estimate of the model's performance on unseen data,
as it's evaluated on data the model hasn't seen during training.
'''
adb2_val
Out[ ]:
0.8915187376725838

Tuning Gradient Boosting (4th model) using undersampled data

This line calls the fit method on the randomized_cv object, passing the undersampled data (X_undersample and y_undersample) as arguments. This will initiate the randomized search process to find the best hyperparameter combination for the GradientBoostingClassifier model using the specified parameters and scoring metric.

In [ ]:
# Randomized search process
randomized_cv.fit(X_undersample, y_undersample)
In [ ]:
%%time
# MORE COMPLETE CODE for Reference (%%time must be the first line of the cell)

#Creating pipeline
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_undersample, y_undersample)  # <-- This line fits the model


print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 75, 'max_features': 1, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.765876177828369:
CPU times: user 1.84 s, sys: 312 ms, total: 2.15 s
Wall time: 1min 6s


In [ ]:
# Creating new pipeline with best parameters
tuned_gbm1 = GradientBoostingClassifier(
    max_features=randomized_cv.best_params_['max_features'],  # Access best max_features
    init=AdaBoostClassifier(random_state=1),
    random_state=1,
    learning_rate=randomized_cv.best_params_['learning_rate'],  # Access best learning_rate
    n_estimators=randomized_cv.best_params_['n_estimators'],  # Access best n_estimators
    subsample=randomized_cv.best_params_['subsample'],  # Access best subsample
)

tuned_gbm1.fit(X_train_un, y_train_un)  # Fit the model on the undersampled training data
Out[ ]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)

Explanation:

randomized_cv.best_params_: This attribute of randomized_cv holds a dictionary containing the best hyperparameter values found during the search. We access the best values for each parameter in the param_grid defined earlier:

  • max_features
  • learning_rate
  • n_estimators
  • subsample

These values are then used to create a new GradientBoostingClassifier instance (tuned_gbm1) with the optimal hyperparameter configuration.

Finally, the tuned_gbm1 model is fitted on the **undersampled** training data (X_train_un, y_train_un).

Note:

It's generally recommended to use a separate validation set (not part of the undersampled or original training data) to evaluate the final model's performance and avoid overfitting. By using the best hyperparameters obtained from the randomized search, we are creating a more optimized model that could potentially perform better on unseen data.

Oversampling and Undersampling Code Example:

In [ ]:
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

# Define the resampling techniques
oversample = SMOTE(random_state=1)
undersample = RandomUnderSampler(random_state=1)

# Define the classifier
model = GradientBoostingClassifier(random_state=1)

# Create an imbalanced pipeline with oversampling and undersampling
pipeline = Pipeline([
    ('o', oversample),
    ('u', undersample),
    ('m', model)
])

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Use the fitted model for predictions or evaluation
predictions = pipeline.predict(X_test)

Explanation:

1.  Oversampling with SMOTE:
•   SMOTE generates synthetic samples for the minority class to balance the class distribution.
•   You can adjust the sampling_strategy parameter if needed (e.g., sampling_strategy='auto'); see the sketch after this list.
2.  Undersampling with RandomUnderSampler:
•   RandomUnderSampler randomly selects samples from the majority class to reduce its size.
•   Adjust the sampling_strategy parameter to specify the desired balance.
3.  Pipeline:
•   Create a Pipeline to sequentially apply the oversampling, undersampling, and model fitting steps.
•   This approach ensures that the data resampling is correctly applied during training.
4.  Fitting and Predicting:
•   Fit the pipeline on the training data.
•   Use the trained model for predictions or further evaluation.
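
A sketch of how sampling_strategy can combine moderate oversampling with undersampling; the 0.5 and 1.0 ratios below are illustrative values, not tuned settings from this notebook:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

# Oversample the minority class up to 50% of the majority class size,
# then undersample the majority class down to a 1:1 ratio
pipeline = Pipeline([
    ('o', SMOTE(sampling_strategy=0.5, random_state=1)),
    ('u', RandomUnderSampler(sampling_strategy=1.0, random_state=1)),
    ('m', GradientBoostingClassifier(random_state=1)),
])

pipeline.fit(X_train, y_train)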

Custom Approach:

If you only want to use either oversampling or undersampling, you can do it separately:

Oversampling Only:

In [ ]:
from imblearn.over_sampling import SMOTE

# Define the oversampler
oversample = SMOTE(random_state=1)

# Fit and transform the data
X_train_oversample, y_train_oversample = oversample.fit_resample(X_train, y_train)

# Fit the model on the oversampled data
tuned_gbm1.fit(X_train_oversample, y_train_oversample)

Undersampling Only:

In [ ]:
from imblearn.under_sampling import RandomUnderSampler

# Define the undersampler
undersample = RandomUnderSampler(random_state=1)

# Fit and transform the data
X_train_undersample, y_train_undersample = undersample.fit_resample(X_train, y_train)

# Fit the model on the undersampled data
tuned_gbm1.fit(X_train_undersample, y_train_undersample)

Summary:

•   Oversampling increases the number of minority class samples.
•   Undersampling reduces the number of majority class samples.
•   Use SMOTE for oversampling and RandomUnderSampler for undersampling.
•   You can combine both techniques using a Pipeline or apply them separately based on your needs.

Feel free to adjust the parameters and methods based on your specific dataset and problem.

==================== END OF IMPORTANT ===============================

To calculate the performance of the tuned_gbm1 model on the undersampled training set, we can use the recall_score function from sklearn.metrics.

In [ ]:
# Import necessary libraries
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import recall_score

# Assuming you already have X_train and y_train from the original dataset

# Step 1: Apply undersampling to the training data
rus = RandomUnderSampler(random_state=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)  # This creates X_train_un and y_train_un

# Step 2: Define the model (here we're using GradientBoostingClassifier)
tuned_gbm1 = GradientBoostingClassifier(random_state=1)

# Step 3: Train the model on the undersampled training data
tuned_gbm1.fit(X_train_un, y_train_un)  # Ensure the model is trained on the undersampled data

# Step 4: Calculate the recall on the undersampled training set
gbm1_train_recall = recall_score(y_train_un, tuned_gbm1.predict(X_train_un))  # Check performance on undersampled train set

# Step 5: Output the recall score
print(f"Recall on undersampled training set: {gbm1_train_recall:.4f}")

'''
This line below calculates the recall score for the tuned_gbm1 model
on the X_train_un data and compares it to the true labels y_train_un.
The recall_score function returns the recall, which is the ability of the model
to capture the positive class correctly.

In summary, it compares the predicted labels with the true labels and returns
the recall value.
'''
gbm1_train_recall = recall_score(y_train_un, tuned_gbm1.predict(X_train_un))  # Check performance on undersampled train set

'''
Predict on Training Data:

tuned_gbm1.predict(X_train_un)<-- This method above predicts the class labels for
the X_train_un data using the trained tuned_gbm1 model. The predict method applies
the model's learned decision rules to the input data and returns the predicted class labels.

Calculate Recall:

recall_score(y_train_un, tuned_gbm1.predict(X_train_un)): This line calculates
the recall of the model's predictions on the training data.

The recall_score function takes two arguments:

1. y_train_un: The true labels of the training data.
2. tuned_gbm1.predict(X_train_un): The predicted labels from the model.
'''
gbm1_train_recall

'''
Assign to gbm1_train_recall:

gbm1_train_recall = recall_score(...): The calculated recall score is assigned
to the variable gbm1_train_recall. This variable now holds the recall of the tuned_gbm1
model on the undersampled training data.
'''

# Output recall score
print(f"Recall on undersampled training set: {gbm1_train_recall:.4f}")
Recall on undersampled training set: 0.9785
Recall on undersampled training set: 0.9785

In summary:

The code above calculates the recall of the tuned_gbm1 model on the undersampled training data. It first predicts the class labels using the model and then compares them to the true labels. The recall_score function computes the proportion of positive (attrited) cases correctly identified, which is stored in the gbm1_train_recall variable.

To calculate the performance of the tuned_gbm1 model on the validation set, you can use the accuracy_score function from sklearn.metrics:

In [ ]:
# Accuracy, Recall and all other performance metrics to consider - Start with Accuracy
from sklearn.metrics import accuracy_score

'''
This line below calculates the accuracy score for the tuned_gbm1 model on the validation data (X_val, y_val).
It first predicts labels for the validation data using tuned_gbm1.predict(X_val),
and then compares those predictions with the true labels (y_val) using accuracy_score.
'''

gbm1_val = accuracy_score(y_val, tuned_gbm1.predict(X_val))
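
Since recall on attrited customers is the metric this strategy emphasizes, the same validation set can also be checked with recall. A short sketch using the tuned_gbm1, X_val and y_val objects already defined:

from sklearn.metrics import recall_score

# Recall of the tuned Gradient Boosting model on the validation set
gbm1_val_recall = recall_score(y_val, tuned_gbm1.predict(X_val))
print(f"Recall on validation set: {gbm1_val_recall:.4f}")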

Tuning Gradient Boosting (4th model) using original data

In [ ]:
# Using Original Data:
randomized_cv.fit(X_train, y_train)
'''
This line calls the fit method on the randomized_cv object,
passing the original training data (X_train and y_train) as arguments.
This will initiate the randomized search process to find the best hyperparameter
combination for the GradientBoostingClassifier model using
the specified parameters and scoring metric.
'''
In [ ]:
%%time
# MORE COMPLETE CODE - For Reference during Code Development (%%time must be the first line of the cell)

#defining model
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)  # <-- This line fits the model on the original data


print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

Next, we set up a new GradientBoostingClassifier instance (tuned_gbm2) with the best hyperparameters obtained from the randomized_cv object.

In [ ]:
%%time
# MORE COMPLETE CODE - For Reference during Code Development (%%time must be the first line of the cell)

#defining model
Model = GradientBoostingClassifier(random_state=1)

#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(50,110,25),
    "learning_rate": [0.01,0.1,0.05],
    "subsample":[0.7,0.9],
    "max_features":[0.5,0.7,1],
}

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)  # <-- This line fits the model on the original data


print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))

# Creating new pipeline with best parameters
tuned_gbm2 = GradientBoostingClassifier(
    max_features=randomized_cv.best_params_.get('max_features', None),  # Access best max_features, defaulting to None if not present
    init=AdaBoostClassifier(random_state=1),  # This works, `random_state` is passed to AdaBoost
    random_state=1,  # Ensure reproducibility
    learning_rate=randomized_cv.best_params_.get('learning_rate', 0.1),  # Default learning_rate if missing
    n_estimators=randomized_cv.best_params_.get('n_estimators', 100),  # Default n_estimators if missing
    subsample=randomized_cv.best_params_.get('subsample', 1.0)  # Default subsample if missing
)

# Fit the model on the original training data
tuned_gbm2.fit(X_train, y_train)
Best parameters are {'subsample': 0.9, 'n_estimators': 100, 'max_features': 0.5, 'learning_rate': 0.1, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.8384615384615385:
CPU times: user 8.07 s, sys: 601 ms, total: 8.67 s
Wall time: 3min 58s
Out[ ]:
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           max_features=0.5, random_state=1, subsample=0.9)


Explanation:

randomized_cv.best_params_: This attribute of randomized_cv holds a dictionary containing the best hyperparameter values found during the search. We access the best values for each parameter in the param_grid defined earlier:

  • max_features
  • learning_rate
  • n_estimators
  • subsample

These values are then used to create a new GradientBoostingClassifier instance (tuned_gbm2) with the **optimal** hyperparameter configuration.

Finally, the tuned_gbm2 model is fitted on the original training data (X_train, y_train).

Note:

It's important to use a separate validation set (not part of the training data) to evaluate the final model's performance and avoid **overfitting**. By using the best hyperparameters obtained from the randomized search, we are creating a **more optimized model that could potentially perform better on unseen data**.

Tuning Gradient Boosting (4th model) using oversampled data

In [ ]:
# Use Performance Metrics
from sklearn.metrics import accuracy_score
'''
This line calculates the accuracy score for the tuned_gbm2 model on the
X_train_over data (oversampled data) and compares it to the true labels y_train_over.
The accuracy_score function returns the percentage of correct predictions.
'''
gbm2_train = accuracy_score(y_train_over, tuned_gbm2.predict(X_train_over))
gbm2_train

END 1/2

In [ ]:
# Use Performance Metrics
from sklearn.metrics import accuracy_score
'''
To calculate the performance of the tuned_gbm2 model on the validation set,
WE can use the accuracy_score function from sklearn.metrics.
This function compares the predicted labels with the true labels and returns
the percentage of correct predictions.
'''
gbm2_val = accuracy_score(y_val, tuned_gbm2.predict(X_val))
'''
This line calculates the accuracy score for the tuned_gbm2 model on the validation data (X_val, y_val).
It first predicts labels for the validation data using tuned_gbm2.predict(X_val),
and then compares those predictions with the true labels (y_val) using accuracy_score.
'''
gbm2_val
Out[ ]:
0.9664694280078896

Tuning XGBoost Model (Optional model) with original data

Note: This section is optional. You can choose not to build XGBoost if you are facing issues with installation or if it is taking more time to execute.

In [ ]:
# Optional - but useful to understand
randomized_cv.fit(X_train, y_train)
'''
This line calls the fit method on the randomized_cv object, passing the original
training data (X_train and y_train) as arguments. This will initiate the randomized
search process to find the best hyperparameter combination for the
XGBClassifier model using the specified parameters and scoring metric.
'''
Out[ ]:
'\nThis line calls the fit method on the randomized_cv object, passing the original\ntraining data (X_train and y_train) as arguments. This will initiate the randomized\nsearch process to find the best hyperparameter combination for the\nXGBClassifier model using the specified parameters and scoring metric.\n'
In [ ]:
%%time
# MORE COMPLETE CODE - For Reference during Code Development (%%time must be the first line of the cell)

# defining model
Model = XGBClassifier(random_state=1,eval_metric='logloss')

#Parameter grid to pass in RandomSearchCV
param_grid={'n_estimators':np.arange(50,110,25),
            'scale_pos_weight':[1,2,5],
            'learning_rate':[0.01,0.1,0.05],
            'gamma':[1,3],
            'subsample':[0.7,0.9]
           }
from sklearn import metrics

# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train, y_train)  # <-- This line fits the model on the original data

print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.7, 'scale_pos_weight': 5, 'n_estimators': 75, 'learning_rate': 0.05, 'gamma': 3} with CV score=0.9346153846153846:
CPU times: user 4.08 s, sys: 321 ms, total: 4.4 s
Wall time: 1min 17s
In [ ]:
# Tuning: creating the tuned XGBoost model
tuned_xgb = XGBClassifier(
    random_state=1,
    eval_metric="logloss",
    #subsample=randomized_cv.best_params_['subsample'],  # Access best subsample
    #scale_pos_weight=randomized_cv.best_params_['scale_pos_weight'],  # Access best scale_pos_weight
    #n_estimators=randomized_cv.best_params_['n_estimators'],  # Access best n_estimators
    #learning_rate=randomized_cv.best_params_['learning_rate'],  # Access best learning_rate
    gamma=1,  # gamma kept fixed at 1
)

tuned_xgb.fit(X_train, y_train)  # Fit the model on the original training data
Out[ ]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, gamma=1, grow_policy=None,
              importance_type=None, interaction_constraints=None,
              learning_rate=None, max_bin=None, max_cat_threshold=None,
              max_cat_to_onehot=None, max_delta_step=None, max_depth=None,
              max_leaves=None, min_child_weight=None, missing=nan,
              monotone_constraints=None, multi_strategy=None, n_estimators=None,
              n_jobs=None, num_parallel_tree=None, random_state=1, ...)

Explanation:

randomized_cv.best_params_:

This attribute of randomized_cv holds a dictionary containing the best hyperparameter values found during the search.

We access the best values for each parameter in the param_grid we defined earlier, except for gamma which we kept at 1.

  • subsample
  • scale_pos_weight
  • n_estimators
  • learning_rate

These values are then used to create a new XGBClassifier instance (tuned_xgb) with the optimal hyperparameter configuration.

Finally, the tuned_xgb model is fitted on the original training data (X_train, y_train).
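
For reference, the commented-out lines above would look like the following once the best values from the search are plugged in. This is a sketch of that configuration (named tuned_xgb_best to avoid clashing with the tuned_xgb actually fitted above), not the model that produced the output shown:

from xgboost import XGBClassifier

# Rebuild XGBoost with the best parameters found by RandomizedSearchCV,
# keeping gamma fixed at 1 as in the cell above
tuned_xgb_best = XGBClassifier(
    random_state=1,
    eval_metric='logloss',
    gamma=1,
    subsample=randomized_cv.best_params_['subsample'],
    scale_pos_weight=randomized_cv.best_params_['scale_pos_weight'],
    n_estimators=randomized_cv.best_params_['n_estimators'],
    learning_rate=randomized_cv.best_params_['learning_rate'],
)
tuned_xgb_best.fit(X_train, y_train)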

Note:

It's important to use a separate validation set (not part of the training data) to evaluate the final model's performance and avoid **overfitting**.

By using the best hyperparameters obtained from the randomized search, we are creating a more optimized model that could potentially perform better on **unseen data**.

In [ ]:
# Trained
from sklearn.metrics import accuracy_score
'''
To calculate the performance of the tuned_xgb model on the original training set,
we can use the accuracy_score function from sklearn.metrics.
'''
xgb_train = accuracy_score(y_train, tuned_xgb.predict(X_train))
'''
This line calculates the accuracy score for the tuned_xgb model on the
X_train data (original data) and compares it to the true labels y_train.
The accuracy_score function returns the percentage of correct predictions.
'''
xgb_train
Out[ ]:
0.9946920133316874
In [ ]:
# Tuned
from sklearn.metrics import accuracy_score
'''
To calculate the performance of the tuned_xgb model on the validation set,
we can use the accuracy_score function from sklearn.metrics.
'''
xgb_val = accuracy_score(y_val, tuned_xgb.predict(X_val))
'''
This line above calculates the accuracy score for the tuned_xgb model on the
validation data (X_val, y_val). It first predicts labels for the validation data
using tuned_xgb.predict(X_val), and then compares those predictions with
the true labels (y_val) using accuracy_score.
'''
xgb_val
Out[ ]:
0.9704142011834319

Model Comparison and Final Model Selection - Strategy Step 8¶

In [ ]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

print(f"\n\033[1;31mPerformance Metrics without data handling (Oversampling/Undersampling):\033[0m") # Main code for multiple models training and evaluations on split datasets (training, validations, test)

# Define your models
models = []
models.append(("Logistic Regression", LogisticRegression(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("XGBoost", XGBClassifier(random_state=1)))  # XGBoost (optional)

# Define a function to print performance metrics
def print_model_performance(model, X, y, dataset_name):
    y_pred = model.predict(X)  # Predict labels
    print(f"\n\033[1;4mPerformance on {dataset_name} data:\033[0m")
    print(f"Accuracy:  {accuracy_score(y, y_pred):.4f}")
    print(f"Recall:    {recall_score(y, y_pred):.4f}")
    print(f"Precision: {precision_score(y, y_pred):.4f}")
    print(f"F1 Score:  {f1_score(y, y_pred):.4f}")

# Evaluate each model
for name, model in models:
    model.fit(X_train, y_train)  # Train the model

    # Print the model being evaluated
    print(f"\n\033[1;92mModel Performance Evaluation:\033[0m")
    print(f"---  Model: {name}  ---")

    # Print performance metrics for training data
    print_model_performance(model, X_train, y_train, "Training")

    # Print performance metrics for validation data
    print_model_performance(model, X_val, y_val, "Validation")

    # Print performance metrics for test data
    print_model_performance(model, X_test, y_test, "Test")

print("\033[1;92m\nExamine the highest scores to identify the best model.\033[0m")
Performance Metrics without data handling (Oversampling/Undersampling):

Model Performance Evaluation:
---  Model: Logistic Regression  ---

Performance on Training data:
Accuracy:  0.8750
Recall:    0.4115
Precision: 0.6833
F1 Score:  0.5137

Performance on Validation data:
Accuracy:  0.8698
Recall:    0.3649
Precision: 0.5870
F1 Score:  0.4500

Performance on Test data:
Accuracy:  0.8585
Recall:    0.3439
Precision: 0.6397
F1 Score:  0.4473

Model Performance Evaluation:
---  Model: Decision Tree  ---

Performance on Training data:
Accuracy:  1.0000
Recall:    1.0000
Precision: 1.0000
F1 Score:  1.0000

Performance on Validation data:
Accuracy:  0.9310
Recall:    0.8243
Precision: 0.7349
F1 Score:  0.7771

Performance on Test data:
Accuracy:  0.9302
Recall:    0.7668
Precision: 0.8050
F1 Score:  0.7854

Model Performance Evaluation:
---  Model: Random Forest  ---

Performance on Training data:
Accuracy:  1.0000
Recall:    1.0000
Precision: 1.0000
F1 Score:  1.0000

Performance on Validation data:
Accuracy:  0.9527
Recall:    0.7432
Precision: 0.9167
F1 Score:  0.8209

Performance on Test data:
Accuracy:  0.9526
Recall:    0.7668
Precision: 0.9372
F1 Score:  0.8435

Model Performance Evaluation:
---  Model: Gradient Boosting  ---

Performance on Training data:
Accuracy:  0.9770
Recall:    0.8938
Precision: 0.9603
F1 Score:  0.9259

Performance on Validation data:
Accuracy:  0.9645
Recall:    0.8514
Precision: 0.9000
F1 Score:  0.8750

Performance on Test data:
Accuracy:  0.9664
Recall:    0.8379
Precision: 0.9550
F1 Score:  0.8926

Model Performance Evaluation:
---  Model: Bagging  ---

Performance on Training data:
Accuracy:  0.9963
Recall:    0.9785
Precision: 0.9984
F1 Score:  0.9883

Performance on Validation data:
Accuracy:  0.9546
Recall:    0.8514
Precision: 0.8400
F1 Score:  0.8456

Performance on Test data:
Accuracy:  0.9539
Recall:    0.8063
Precision: 0.9067
F1 Score:  0.8536

Model Performance Evaluation:
---  Model: AdaBoost  ---

Performance on Training data:
Accuracy:  0.9654
Recall:    0.8708
Precision: 0.9100
F1 Score:  0.8899

Performance on Validation data:
Accuracy:  0.9546
Recall:    0.8514
Precision: 0.8400
F1 Score:  0.8456

Performance on Test data:
Accuracy:  0.9526
Recall:    0.8063
Precision: 0.8987
F1 Score:  0.8500

Model Performance Evaluation:
---  Model: XGBoost  ---

Performance on Training data:
Accuracy:  1.0000
Recall:    1.0000
Precision: 1.0000
F1 Score:  1.0000

Performance on Validation data:
Accuracy:  0.9684
Recall:    0.9324
Precision: 0.8625
F1 Score:  0.8961

Performance on Test data:
Accuracy:  0.9651
Recall:    0.8656
Precision: 0.9202
F1 Score:  0.8921

Examine the highest scores to identify the best model.

Note: If you want to include the XGBoost model in the final model selection, you need to add xgb_train.T to the training performance comparison list and xgb_val.T to the validation performance comparison list below.

To check the performance of your final model on the unseen test data, you can use the accuracy_score function from sklearn.metrics.
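For example, a minimal sketch (assuming the tuned XGBoost model above is the chosen final model, and X_test / y_test are the hold-out test split created earlier):

# Sketch only: score the chosen final model on the unseen test set.
from sklearn.metrics import accuracy_score, recall_score, f1_score

y_test_pred = tuned_xgb.predict(X_test)                       # predictions on the hold-out test data
print("Test accuracy:", accuracy_score(y_test, y_test_pred))
print("Test recall:  ", recall_score(y_test, y_test_pred))    # recall matters most for catching attriting customers
print("Test F1:      ", f1_score(y_test, y_test_pred))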

Feature Importances - Strategy Step 9¶

The code depends on which type of model was identified as the best model. Here is how to complete it for two common scenarios, to understand the process of extracting Feature Importance (related to the Design of Experiments):

In [ ]:
# Initialize the Gradient Boosting Classifier
tuned_gbm = GradientBoostingClassifier(random_state=1)

# Fit the model with training data
tuned_gbm.fit(X_train, y_train)

# Now you can access the feature importances
importances = tuned_gbm.feature_importances_  # Get the feature importances from the fitted model

# Get the feature names from the training set
feature_names = X_train.columns

# Sort the feature importances in ascending order
indices = np.argsort(importances)

# Plot the feature importances
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])  # Match feature names with importances
plt.xlabel("Relative Importance")
plt.show()

If our best model is a Gradient Boosting model (e.g., tuned_gbm1 or tuned_gbm2), we can access feature importance through the feature_importances_ attribute:

Explanation:

Replace tuned_gbm with the actual name of our best Gradient Boosting model.

The feature_importances_ attribute in Gradient Boosting models stores the importance scores for each feature.

For XGBoost models, the importance type reported by feature_importances_ is controlled by the estimator's importance_type parameter. Here, 'gain' is used, which measures a feature's average contribution to improving the objective across the splits where it is used.

The rest of the code remains the same:

indices = np.argsort(importances):

Sorts the feature importance scores in ascending order, so the most important features end up at the top of the horizontal bar plot.

The horizontal bar plot shows the sorted importance scores along the x-axis with the corresponding feature names along the y-axis.

This code snippet helps identify which features contribute most to our best model's predictions.
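For illustration, a minimal sketch of gain-based importances for an XGBoost model (assuming the scikit-learn wrapper XGBClassifier and the X_train used above; the name xgb_gain is illustrative):

# Sketch only: gain-based feature importances for an XGBoost model.
# importance_type='gain' makes feature_importances_ report average gain per split.
xgb_gain = XGBClassifier(random_state=1, eval_metric="logloss", importance_type="gain")
xgb_gain.fit(X_train, y_train)

importances = xgb_gain.feature_importances_  # one score per column of X_train
indices = np.argsort(importances)            # ascending, so the largest bars end up at the top

plt.figure(figsize=(12, 12))
plt.title("XGBoost Feature Importances (gain)")
plt.barh(range(len(indices)), importances[indices], color="skyblue", align="center")
plt.yticks(range(len(indices)), [X_train.columns[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()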

Side Note:

**Feature Importance** is closely related to the **Design of Experiments (DOE)**.

Both concepts are crucial for understanding the impact of different factors or variables on a given outcome. Here's how they relate:

Feature Importance:

  • Identifies the relative significance of different features in a machine learning model.
  • Helps determine which features are most influential in predicting the target variable.
  • Can be used to simplify models by removing less important features.

Design of Experiments:

  • A structured approach to planning and conducting experiments to investigate the effects of factors on a response variable.
  • Involves designing experiments to efficiently collect data and analyze the results.
  • Helps identify significant factors and their interactions.

Connection:

Factor Selection:

Feature Importance and DOE can both help identify the most influential factors in a system.

Feature importance can guide the selection of factors to include in experiments, while DOE can determine the optimal experimental design to study those factors.

Model Building:

Feature importance can be used to select relevant features for a machine learning model, while DOE can provide insights into the relationships between factors and the response variable, which can inform the model's structure.

Interpretation:

Both feature importance and DOE can help interpret the results of an analysis.

Feature importance can reveal which factors are most important for predicting the outcome, while DOE can help understand ***how*** different factors interact and influence the response.

In essence, Feature Importance can be seen as a way to analyze the results of experiments and identify the most influential factors, while DOE provides a framework for designing experiments to gather those data in the first place.

By combining Feature Importance with DOE, you can gain a deeper understanding of the underlying relationships between factors and the target variable, leading to more accurate and informative models.

**Business Insights and Conclusions**

  • Computational Time vs. Model Performance

    • More Hyperparameter Options: Searching over a large hyperparameter space can improve the model’s performance but at the cost of longer computation times.
    • Less Time, Fewer Hyperparameters: Reducing the hyperparameter search space (e.g., by using RandomizedSearchCV instead of GridSearchCV) can save time but might result in suboptimal performance.
  • Model Selection is time consuming but is a business investment to prevent financial losses or increase revenue.
  • Once a model is identified, it can be further tuned to balance computational resources against prediction efficiency.
  • The strongest and most important contributors that can predict and anticipate costly Credit Card Customer Attrition or Attrition_Flag are: (Top Ranking at the top)
    • Total_Trans_Ct (#19): Total Transaction Count in the last 12 months.
    • Total_Trans_Amt (#18): Total Transaction Amount (Last 12 months)
    • Total_Revolving_Bal (#15): Total Revolving Balance on the Credit Card
    • Total_Ct_Chng_Q4_Q1 (#20): Change in Transaction Count (Q4 over Q1)
    • Total_Relationship_Count (#11): Total no. of products held by the customer
  • (#nn) refers to the data column (feature) number; these features require constant monitoring and threshold-level flags for early detection.
  • Other important correlations identified are the following:
    • Credit_Limit (#14) together with Avg_Open_To_Buy (#16).
    • Months_on_book (#10) together with Customer_Age (#3)
    • Total_Revolving_Bal (#15) together with Avg_Utilization_Ratio (#21). Together, these represent customer churn risk factors whose thresholds should be monitored.

APPENDIX I: Addressing Bias-Variance and Misclassification - Model Comparisons (Pros & Cons) - Strategy Step 10 for Extra Credit¶


Imagine you're having a pizza party with your friends. You want to decide what toppings to put on the pizza.

Instead of asking each friend individually, you could divide your friends into groups and let each group decide on their own favorite toppings. Then, you could combine the results from all the groups to get a final decision.

This is kind of like what a Bagging Classifier does. It's like dividing your friends into groups (called "bootstrap samples") and letting each group (or "model") decide on its own answer. Then, it combines all the answers to get a final decision.

This helps to avoid making mistakes, because if one group (or model) makes a mistake, the other groups (or models) might be able to correct it.
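In scikit-learn terms, a minimal sketch of that idea (assuming the X_train / y_train and X_val / y_val splits used above):

# Sketch only: each of the 100 estimators is trained on its own bootstrap sample ("group of friends"),
# and the final prediction aggregates their individual votes.
from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(n_estimators=100, bootstrap=True, random_state=1)
bag.fit(X_train, y_train)
print("Validation accuracy:", bag.score(X_val, y_val))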



Imagine you're trying to decide what to wear on a sunny day. You might ask your friends for advice.

A Random Forest Classifier is like asking a bunch of friends for advice and then taking a majority vote on what to wear. Each friend (or "decision tree") gives their own opinion, and the final decision is based on what most of them say.

This helps to avoid making mistakes because one friend might be wrong, but if most of them agree, it's probably a good choice.



Imagine you're trying to decide whether to go to the beach or stay home. You might consider factors like the weather, how crowded it might be, and how much time you have.

Logistic Regression is like weighing these factors and deciding whether to go to the beach or stay home based on the combined weight of all the factors. It's like putting each factor on a scale and seeing which side is heavier.

If the "go to the beach" side is heavier, the Logistic Regression would predict that you should go to the beach. If the "stay home" side is heavier, it would predict that you should stay home.



Imagine you're trying to decide whether to eat an apple or an orange. You might first check if the fruit is red. If it's red, you might then check if it's round or oblong.

A Decision Tree Classifier is like making a series of decisions based on different conditions. If the fruit is red, it might go to the "red fruit" branch. Then, if it's round, it might go to the "round red fruit" branch, which might lead to the decision to eat an apple.

It's like creating a tree-like structure with branches for different conditions and decisions at the end of each branch.



Imagine you're trying to learn how to ride a bike. You start with a basic bike and learn the basics. Then, you get a slightly harder bike with training wheels. You keep practicing and getting better, and eventually, you can ride a bike without training wheels.

Gradient Boosting is like that. It starts with a simple model (like a bike with training wheels) and gradually improves it by learning from its mistakes. It's like adding more training wheels and making the bike harder to ride until you're a pro!



Imagine you're trying to learn a new language. You start by learning basic words and phrases. Then, you try to speak the language with native speakers. If you make mistakes, they correct you and help you improve.

AdaBoost is like that. It starts with a simple model (like learning basic words and phrases). Then, it tries to use the model to make predictions. If it makes mistakes, it gives more weight to the examples where it made mistakes, just like your language teacher might focus on helping you with words you find difficult. This helps the model improve over time.



Imagine you're trying to learn how to play chess. You start by learning the basic rules. Then, you play against a friend who's a bit better than you. You learn from your mistakes and get better at playing.

XGBClassifier is like that. It starts with a simple model (like learning the basic rules of chess). Then, it plays against a "stronger" version of itself. It learns from its mistakes and gets better at making predictions. It's like playing chess against a grandmaster and getting better with each game.


Examples where certain models are applicable for predicting outcomes, with the relevant performance metrics:

Healthcare

  • Decision Tree Classifier: Predicting the likelihood of a patient developing a disease based on various medical factors.
    • Recall is crucial to minimize false negatives (misdiagnosing patients).
  • Random Forest Classifier: Identifying fraudulent insurance claims by analyzing patterns in patient data.
    • Precision is important to avoid false positives (flagging legitimate claims as fraudulent).
  • Gradient Boosting Classifier: Predicting patient outcomes in critical care units using electronic health records.
    • F1-score is a balanced metric that considers both precision and recall, which are both important in critical care settings.

Finance

  • Logistic Regression: Predicting customer churn in banking based on factors like account activity and demographics.
    • Recall is important to minimize false negatives (failing to identify customers at risk of churning).
  • AdaBoostClassifier: Detecting fraudulent credit card transactions by analyzing patterns in transaction data.
    • Precision is important to avoid false positives (flagging legitimate transactions as fraudulent).
  • XGBClassifier: Predicting loan default risk based on borrower characteristics and financial history.
    • F1-score is a balanced metric that considers both precision and recall, which are both important in risk assessment.

Retail

  • Bagging Classifier: Recommending products to customers based on their purchase history and preferences.
    • Recall is important to avoid missing out on potential sales opportunities.
  • XGBClassifier: Predicting customer lifetime value to identify high-value customers.
    • Precision is important to avoid targeting low-value customers with marketing efforts.
  • Gradient Boosting Classifier: Optimizing inventory management by forecasting product demand.
    • F1-score is a balanced metric that considers both precision and recall, which are both important in inventory management.
In [ ]:
# get_metrics_score function can be defined as shown below:

from sklearn.metrics import accuracy_score, recall_score, precision_score

def get_metrics_score(model, X_train, y_train, X_test, y_test, return_train_score=False):
    """
    Calculate accuracy, precision, and recall for a given model.

    :param model: Trained model
    :param X_train: Training data
    :param y_train: True labels for training data
    :param X_test: Test data
    :param y_test: True labels for test data
    :param return_train_score: If True, calculate metrics for training data as well
    :return: List of accuracy, precision, and recall for train and test sets
    """
    # Predictions
    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    # Metrics for training data
    accuracy_train = accuracy_score(y_train, y_pred_train) if return_train_score else None
    precision_train = precision_score(y_train, y_pred_train, average='weighted') if return_train_score else None
    recall_train = recall_score(y_train, y_pred_train, average='weighted') if return_train_score else None

    # Metrics for test data
    accuracy_test = accuracy_score(y_test, y_pred_test)
    precision_test = precision_score(y_test, y_pred_test, average='weighted')
    recall_test = recall_score(y_test, y_pred_test, average='weighted')

    return [accuracy_train, accuracy_test, precision_train, precision_test, recall_train, recall_test]

To use the get_metrics_score function with a list of models and collect metrics such as accuracy, precision, and recall, you can follow this structure:

1.  Ensure that models is a list of trained models.
2.  Call get_metrics_score for each model in the list.
3.  Store and process the metrics.

Here’s how you can modify and use the code properly:

Explanation:

1.  Initialize Lists: Create empty lists (acc_test, precision_test, recall_test) to store the metrics for each model.
2.  Loop Through Models: Iterate over each model in the models list.
3.  Calculate Metrics: Call get_metrics_score to get the metrics for each model. Metrics for the test set are extracted from the returned list:
•   metrics[1] corresponds to test accuracy.
•   metrics[3] corresponds to test precision.
•   metrics[5] corresponds to test recall.
4.  Round and Append: Use np.round() to round the metrics to two decimal places and append them to the respective lists.
5.  Print Metrics: Optionally, print the metrics to review the results.

Make sure that X_train, y_train, X_test, and y_test are defined and correctly preprocessed before using this code. Also, replace model1, model2, model3, etc., with your actual trained models.
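A minimal sketch of that plain collection loop (before any highlighting; model1, model2, model3 are placeholders for your trained models):

# Sketch only: collect rounded test-set metrics for each trained model.
import numpy as np

models = [model1, model2, model3]  # placeholders - replace with your trained models
acc_test, precision_test, recall_test = [], [], []

for model in models:
    metrics = get_metrics_score(model, X_train, y_train, X_test, y_test, return_train_score=False)
    acc_test.append(np.round(metrics[1], 2))        # metrics[1] -> test accuracy
    precision_test.append(np.round(metrics[3], 2))  # metrics[3] -> test precision
    recall_test.append(np.round(metrics[5], 2))     # metrics[5] -> test recall

print(acc_test, precision_test, recall_test)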

You can modify the code to highlight the highest values in bold and yellow when printing them, and also list the highest value together with the corresponding model. To achieve this, you can use ANSI escape codes for terminal text formatting. Here’s how you can do it:

Modified Code:

In [ ]:
import numpy as np

# Define ANSI escape codes for text formatting
BOLD_YELLOW = '\033[1;33m'
RESET = '\033[0m'

# Example usage of the get_metrics_score function
models = [model1, model2, model3]  # Replace with your list of trained models
model_names = ['Model 1', 'Model 2', 'Model 3']  # Replace with actual model names or identifiers
acc_test = []
precision_test = []
recall_test = []

# Store metrics in a list of tuples for comparison
metrics_list = []

# Loop through each model
for model, name in zip(models, model_names):
    # Get metrics for the current model
    metrics = get_metrics_score(model, X_train, y_train, X_test, y_test, return_train_score=False)

    # Append rounded metrics and model name to the metrics_list
    acc_test_value = np.round(metrics[1], 2)
    precision_test_value = np.round(metrics[3], 2)
    recall_test_value = np.round(metrics[5], 2)

    acc_test.append(acc_test_value)
    precision_test.append(precision_test_value)
    recall_test.append(recall_test_value)

    metrics_list.append((acc_test_value, precision_test_value, recall_test_value, name))

# Find the highest values and corresponding models
max_acc = max(acc_test)
max_precision = max(precision_test)
max_recall = max(recall_test)

# Highlight highest values in bold and yellow
# Iterate over each metric and its per-model values
for metric, values in zip(['Accuracy', 'Precision', 'Recall'],
                          [acc_test, precision_test, recall_test]):
    max_value = max(values)
    print(f"{metric} Values:")
    for value, model in zip(values, model_names):
        if value == max_value:
            print(f"{BOLD_YELLOW}{value} (Highest - {metric}){RESET} - {model}")
        else:
            print(f"{value} - {model}")

print("\nHighest values with corresponding models:")

# Print the highest values and the corresponding model
for acc, precision, recall, model_name in metrics_list:
    if acc == max_acc:
        print(f"Highest Test Accuracy: {BOLD_YELLOW}{acc}{RESET} - {model_name}")
    if precision == max_precision:
        print(f"Highest Test Precision: {BOLD_YELLOW}{precision}{RESET} - {model_name}")
    if recall == max_recall:
        print(f"Highest Test Recall: {BOLD_YELLOW}{recall}{RESET} - {model_name}")

Explanation:

1.  ANSI Escape Codes:
•   BOLD_YELLOW and RESET are used to format text with bold and yellow color in the terminal.
2.  Metrics Collection:
•   Store the metrics and corresponding model names in metrics_list for easy comparison.
3.  Find Maximum Values:
•   Use max() to find the highest values for accuracy, precision, and recall.
4.  Highlight and Print Values:
•   Iterate through each metric to compare and highlight the highest values.
•   Print the highest values with formatting.
5.  Print Highest Values with Models:
•   Compare the highest values against each model’s metrics to list the highest value and its corresponding model.

Notes:

•   This code assumes that you are running it in a terminal or environment that supports ANSI escape codes for text formatting.
•   Replace model1, model2, model3, etc., with your actual models and model_names with descriptive names or identifiers for the models.

APPENDIX II - Improving Model Performance¶

For tuning and improving model performance, you can consider adding several other models and techniques to your list. These will give you more diverse options to compare and select the best model. Here are some additional models and techniques you could consider:

Additional Models:

1.  Random Forest:
      • Try using Random Forest classifiers with the original, oversampled, and undersampled data. Random Forest can provide better generalization by reducing overfitting compared to Decision Trees.
    •   Variants:
      • Random Forest with original data.
      • Random Forest with oversampled data.
      • Random Forest with undersampled data.
2.  Logistic Regression:
      • Logistic Regression with regularization (L2 or L1 penalty) can be a strong baseline model.
    •   Variants:
      • Logistic Regression with original data.
      • Logistic Regression with oversampled data.
      • Logistic Regression with undersampled data.
3.  Support Vector Machine (SVM):
      • Consider tuning an SVM model for classification. It can be useful for high-dimensional data, although it can be slow with large datasets.
    •   Variants:
      • SVM with original data.
      • SVM with oversampled data.
      • SVM with undersampled data.
4.  LightGBM (LGBM):
      • LightGBM is an efficient gradient boosting algorithm, often faster than XGBoost, especially with larger datasets. It handles categorical features well.
    •   Variants:
      • LightGBM with original data.
      • LightGBM with undersampled data.
5.  CatBoost:
      • Another gradient boosting algorithm that is highly optimized for categorical features and often performs well with little tuning.
    •   Variants:
      • CatBoost with original data.
      • CatBoost with undersampled data.
6.  Ensemble Voting Classifier:
      • Combine multiple models using a voting classifier (e.g., combining Decision Trees, Logistic Regression, Random Forest, and XGBoost). This can help leverage the strengths of multiple models.
7.  Stacking Classifier:
      • Use a stacking model to combine predictions from multiple models (e.g., Decision Tree, AdaBoost, Gradient Boosting, etc.) and feed them into a meta-learner (e.g., Logistic Regression) for the final prediction.
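A minimal sketch of the voting and stacking ensembles described in items 6 and 7 (assuming the training and validation splits used earlier in the notebook; the names voting_clf and stacking_clf are illustrative):

# Sketch only: soft-voting and stacking ensembles built from base models already used above.
from sklearn.ensemble import VotingClassifier, StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

base_estimators = [
    ("dt", DecisionTreeClassifier(random_state=1)),
    ("rf", RandomForestClassifier(random_state=1)),
    ("lr", LogisticRegression(random_state=1, max_iter=1000)),
]

voting_clf = VotingClassifier(estimators=base_estimators, voting="soft")
stacking_clf = StackingClassifier(estimators=base_estimators,
                                  final_estimator=LogisticRegression(max_iter=1000))

for ens_name, ens in [("Voting", voting_clf), ("Stacking", stacking_clf)]:
    ens.fit(X_train, y_train)
    print(ens_name, "validation accuracy:", ens.score(X_val, y_val))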

Additional Techniques:

1.  Hyperparameter Tuning:
      • Use GridSearchCV or RandomizedSearchCV to perform hyperparameter tuning for each model. This can significantly improve performance.
2.  Cross-Validation:
      • Make sure to use cross-validation (e.g., K-fold cross-validation) to get more reliable estimates of model performance and avoid overfitting.
3.  Feature Engineering:
      • Experiment with creating new features (e.g., interaction terms or derived variables), scaling features (e.g., using StandardScaler or MinMaxScaler), or reducing dimensionality (e.g., using PCA).
4.  Feature Selection:
      • Use techniques like Recursive Feature Elimination (RFE) or L1-based feature selection to improve model performance by removing irrelevant features.
5.  Class Imbalance Techniques:
      • In addition to oversampling (SMOTE) and undersampling, you can explore hybrid methods like SMOTEENN or SMOTETomek, which combine both approaches.
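A minimal sketch of a hybrid resampler (assuming imbalanced-learn is installed and X_train / y_train are the training split used above):

# Sketch only: SMOTETomek first oversamples the minority class with SMOTE,
# then removes Tomek links to clean up the class boundary.
from imblearn.combine import SMOTETomek

smt = SMOTETomek(random_state=1)
X_train_hybrid, y_train_hybrid = smt.fit_resample(X_train, y_train)

print("After SMOTETomek, counts of label 'Yes':", sum(y_train_hybrid == 1))
print("After SMOTETomek, counts of label 'No': ", sum(y_train_hybrid == 0))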

By including these additional models and techniques in your analysis, you can comprehensively compare results and select the best-performing model for predicting customer churn for Thera bank.

APPENDIX III - Hyperparameter Tuning Trade-offs¶

When performing model hyperparameter tuning, there are several trade-offs to consider. These trade-offs impact model performance, time, and generalization ability. Here are the key trade-offs to be aware of:

  1. Bias-Variance Trade-off

    • High Bias (Underfitting): If the model is too simple or has poor hyperparameters, it might not capture the underlying patterns in the data, leading to underfitting. This results in low training and validation performance.
    • High Variance (Overfitting): If the model is too complex or over-tuned on the training data, it might memorize the training data, causing it to perform well on the training set but poorly on unseen data (validation/test set). This leads to overfitting.

Consideration:

•   Finding the right balance between bias and variance is key during tuning. Regularization parameters (e.g., alpha, lambda in Lasso or Ridge, max_depth in trees) can help control this balance.

  2. Computational Time vs. Model Performance

    • More Hyperparameter Options: Searching over a large hyperparameter space can improve the model’s performance but at the cost of longer computation times.
    • Less Time, Fewer Hyperparameters: Reducing the hyperparameter search space (e.g., by using RandomizedSearchCV instead of GridSearchCV) can save time but might result in suboptimal performance.

Consideration:

•   Use RandomizedSearchCV for faster searches in large hyperparameter spaces and GridSearchCV for exhaustive searches when time is not a critical factor.
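For illustration, a minimal sketch of the two search styles over the same illustrative grid (the parameter values here are examples, not the ones tuned above):

# Sketch only: exhaustive vs. randomized search over the same illustrative grid.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {"n_estimators": [50, 100, 150], "learning_rate": [0.01, 0.05, 0.1], "max_depth": [2, 3, 4]}

grid_cv = GridSearchCV(GradientBoostingClassifier(random_state=1),
                       param_grid, scoring="recall", cv=5)        # fits all 27 combinations
rand_cv = RandomizedSearchCV(GradientBoostingClassifier(random_state=1),
                             param_grid, n_iter=10, scoring="recall", cv=5, random_state=1)  # samples only 10

# grid_cv.fit(X_train, y_train)   # exhaustive, slower
# rand_cv.fit(X_train, y_train)   # faster, possibly slightly suboptimal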

  3. Cross-Validation: Number of Folds

    • More Folds (e.g., K=10): Provides more robust performance estimates by evaluating the model across multiple splits, but it increases computation time since the model is trained K times.
    • Fewer Folds (e.g., K=3 or 5): Reduces computation time but may lead to less reliable performance estimates.

Consideration:

•   Use more folds (like 10) for small datasets or when accuracy is critical, but fewer folds for large datasets to speed up tuning.

  4. Complexity of the Model vs. Interpretability

    • Complex Models (e.g., XGBoost, Neural Networks): These models might provide better predictive performance but are harder to interpret.
    • Simpler Models (e.g., Logistic Regression, Decision Trees): Easier to interpret and explain, but they may not capture complex relationships in the data as effectively.

Consideration:

•   Choose complex models when prediction accuracy is paramount, and simpler models when interpretability is a key concern.

  5. Generalization vs. Optimizing on Training Data

    • Over-Tuning on Training Data: Over-tuning hyperparameters might lead to a model that fits the training data too well, causing poor generalization to new data.
    • Generalization: A model that generalizes well performs consistently on training, validation, and test sets without excessive tuning.

Consideration:

•   Use early stopping or validation scores to prevent overfitting. Focus on tuning hyperparameters that help generalize the model rather than maximize training performance.
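A minimal sketch of early stopping with XGBoost (assuming the X_train / y_train and X_val / y_val splits used above; depending on the installed xgboost version, early_stopping_rounds may need to be passed to fit() instead of the constructor):

# Sketch only: stop adding trees once validation logloss has not improved for 20 rounds.
xgb_es = XGBClassifier(
    n_estimators=500,            # upper bound; early stopping usually halts well before this
    learning_rate=0.05,
    eval_metric="logloss",
    early_stopping_rounds=20,    # recent xgboost versions accept this in the constructor
    random_state=1,
)
xgb_es.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print("Best iteration:", xgb_es.best_iteration)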

  6. Regularization Strength

    • High Regularization: Can prevent overfitting by penalizing large coefficients but may lead to underfitting if the penalty is too strong.
    • Low Regularization: Allows the model to fit the data more flexibly but can lead to overfitting.

Consideration:

•   Regularization parameters like alpha (Lasso) or C (Logistic Regression) need careful tuning to ensure the right balance between flexibility and overfitting.

  7. Hyperparameter Interaction

    • Dependent Hyperparameters: Some hyperparameters interact with others. For instance, in tree-based models, max_depth and min_samples_split work together, so tuning them independently might not give the best results.
    • Independent Hyperparameters: Hyperparameters that can be tuned independently can speed up the process.

Consideration:

•   Consider the interaction between hyperparameters when designing a grid or random search space.

  8. Training Data Size vs. Model Complexity

    • Larger Training Data: Allows for more complex models and reduces overfitting, but the time required for training increases significantly.
    • Smaller Training Data: Faster training but more prone to overfitting, especially with complex models.

Consideration:

•   Use techniques like cross-validation and early stopping to handle smaller datasets efficiently while tuning.

  9. Evaluation Metrics Trade-off

    • Accuracy vs. Precision/Recall: Depending on the problem (e.g., imbalanced datasets), focusing on accuracy might not be the best strategy. You may want to optimize for precision, recall, or F1 score.
    • Multiple Metrics: Optimizing one metric (e.g., recall) can negatively impact another (e.g., precision), so a balance is needed based on the problem requirements.

Consideration:

•   Choose the evaluation metric(s) based on the business objective. For example, use precision/recall for imbalanced data (like churn prediction) and accuracy for balanced data.

By balancing these trade-offs carefully, hyperparameter tuning can greatly improve model performance while avoiding pitfalls like overfitting and excessive computational costs.

APPENDIX IV - Code Used as a Reference, with Incremental Improvements¶

In [ ]:
# Model Performance Scores - Code used for reference - no need to run

# Checking recall score on train and validation set
print("Recall on train and validation set")
print(recall_score(y_train, rf.predict(X_train)))
print(recall_score(y_val, rf.predict(X_val)))

# Single line:
# Checking Recall score on train and validation set
recall_train = recall_score(y_train, rf.predict(X_train))
recall_val = recall_score(y_val, rf.predict(X_val))
print(f"Recall on train set: {precision_train}, Recall on validation set: {recall_val}")

print("-" * 30)

# Checking Precision score on train and validation set
print("Precision on train and validation set")
print(precision_score(y_train, rf.predict(X_train)))
print(precision_score(y_val, rf.predict(X_val)))

# Single line:
# Checking Precision score on train and validation set
precision_train = precision_score(y_train, rf.predict(X_train))
precision_val = precision_score(y_val, rf.predict(X_val))
print(f"Precision on train set: {precision_train}, Precision on validation set: {precision_val}")

print("-" * 30)

# Checking Accuracy score on train and validation set
print("Accuracy on train and validation set")
print(accuracy_score(y_train, rf.predict(X_train)))
print(accuracy_score(y_val, rf.predict(X_val)))

# Single line:
# Checking Accuracy score on train and validation set
accuracy_train = accuracy_score(y_train, rf.predict(X_train))
accuracy_val = accuracy_score(y_val, rf.predict(X_val))
print(f"Precision on train set: {accuracy_train}, Precision on validation set: {accuracy_val}")

print("-" * 30)

# Checking F1 score on train and validation set
print("F1 on train and validation set")
print(f1_score(y_train, rf.predict(X_train)))
print(f1_score(y_val, rf.predict(X_val)))

# Single line:
# Checking F1 score on train and validation set
f1_train = f1_score(y_train, rf.predict(X_train))
f1_val = f1_score(y_val, rf.predict(X_val))
print(f"Precision on train set: {f1_train}, Precision on validation set: {f1_val}")

print("\033[1;92mModel Performance Evaluation:\033[0m")
print(f"Model: {model.__class__.__name__}")  # Prints the class name of the model
model_performance_classification_sklearn(model, X_train, y_train) # Calls pre-defined function that displays performance metrics
Recall on train and validation set
1.0
0.7432432432432432
Recall on train set: 1.0, Recall on validation set: 0.7432432432432432
------------------------------
Precision on train and validation set
1.0
0.9166666666666666
Precision on train set: 1.0, Precision on validation set: 0.9166666666666666
------------------------------
Accuracy on train and validation set
1.0
0.9526627218934911
Accuracy on train set: 1.0, Accuracy on validation set: 0.9526627218934911
------------------------------
F1 on train and validation set
1.0
0.8208955223880596
F1 on train set: 1.0, F1 on validation set: 0.8208955223880596
Model Performance Evaluation:
Model: XGBClassifier
Out[ ]:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
In [ ]:
# ----------------------------- BASELINE VERSION WITH FULL EXPLANATIONS - For reference - No need to execute - Useful for building up the code

# Import SMOTE from imblearn
from imblearn.over_sampling import SMOTE

# ANSI escape codes for bold and yellow, original data showing minority data.
print("Before Oversampling, counts of label 'Yes': \033[1;33m{}\033[0m ".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'No': {} \n".format(sum(y_train == 0)))

'''
The code below effectively addresses class imbalance by generating synthetic data points
for the minority class, ensuring a more balanced distribution between the classes.
This can be beneficial for improving the performance of machine learning models,
especially when dealing with imbalanced datasets.

'''
# ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
# SMOTE: Synthetic Minority Over-sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
'''
    sampling_strategy=1:
    This indicates that the minority class should be oversampled to have an
    equal number of samples as the majority class.
    k_neighbors=5:
    This specifies the number of neighbors used to generate new
    synthetic data points.
    random_state=1:
    This sets a random seed for reproducibility.
'''
# Oversampled data - minority class increased to match majority.
# Code adjusts the training dataset so that the minority and majority classes in y_train are more balanced,
# improving model performance in cases where class imbalance is a problem (e.g., our churn prediction).
# ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)

# ANSI escape codes for bold and yellow, oversampled data.
print("After Oversampling, counts of label 'Yes': \033[1;33m{}\033[0m ".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'No': {} \n".format(sum(y_train_over == 0)))

# ANSI escape codes for bold and yellow, oversampled data shape.
print("After Oversampling, the shape of X_train: \033[1;33m{}\033[0m ".format(X_train_over.shape))
print("After Oversampling, the shape of y_train: {}\n".format(y_train_over.shape))