Titanic Fatality Prediction Model¶

Introduction¶

Purpose:¶

This notebook is going to explore the Titanic Fatality data provided by Kaggle to determine the best model to predict fatality outcomes based on the feature data documented. This notebook aims to bring a much deeper depth of exploratory knowledge into the predictions, giving hypothetical narratives that are supported by the data that will be processed, explored and modeled.

Goal Model:¶

Our goal model will utilize the Gradient Boosting method using XGBoost to boost the models being produced using the Sckit-Learn library. The Gradient boosting method aims to reduce inherited biases produced by subsequent models while maximizing accuracy rates of the predictions.

In [1]:
import warnings 

import pandas as pd
import numpy as npy
import matplotlib.pyplot as mpl
import seaborn as sn
from xgboost import XGBClassifier

print('All standard libraries available, setup successful.')
All standard libraries available, setup successful.

Data Exploration:¶

In [2]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")

Summary of the data we are going to be working with to train the model.¶

In [3]:
df_train.info()
df_train.shape
display(df_train.head())
df_train.tail()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q

It looks like that the sample size is n=891, and there are a total of 10 feature columns. In the "Age", "Cabin" and "Embarked" columns, there are missing values that would need to be imputed during data configuration and sanitization. Names will be normalized for tokenization of key parts of the name, such as title (e.g Mr, Mrs, Miss, etc.) Family names could be utilize to determine an unlabeled aspect of the nodes and may give slight indications to social hiearchy classifications.

Preprocessing Data Frame:¶

In [4]:
def preprocess(df):
    df = df.copy()

    def normalize_name(x):
            return " ".join([v.strip(",()[].\"'/") for v in x.split(" ")])
    def name_title(x):
        found_keyword = []
        keywords = {"Mr.", "Mrs.", "Miss.", "Rev.", "Master"}
        for keyword in keywords:
            if keyword in x:
                found_keyword.append(keyword)
                return found_keyword
        else: 
            return "NaN"
        
        
    def ticket_number(x):
        if x == "LINE":
            return "NaN"
        return x.split(" ")[-1]
    def ticket_item(x):
        items = x.split(" ")
        if len(items) == 1:
            return "NONE"
        return "_".join(items[0:-1])

    df["Title"] = df["Name"].apply(name_title)
    df["Name"] = df["Name"].apply(normalize_name)
    df["Ticket_number"] = df["Ticket"].apply(ticket_number)
    df["Ticket_item"] = df["Ticket"].apply(ticket_item)
    return df.drop("Ticket", axis=1)

preprocessed_train_df = preprocess(df_train)
preprocessed_serving_df = preprocess(df_test)
print("preprocessing completed") #Debugging phrase 
preprocessing completed
In [5]:
preprocessed_train_df.info()
display(preprocessed_train_df.head())
display(preprocessed_train_df.tail())
preprocessed_serving_df.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 14 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PassengerId    891 non-null    int64  
 1   Survived       891 non-null    int64  
 2   Pclass         891 non-null    int64  
 3   Name           891 non-null    object 
 4   Sex            891 non-null    object 
 5   Age            714 non-null    float64
 6   SibSp          891 non-null    int64  
 7   Parch          891 non-null    int64  
 8   Fare           891 non-null    float64
 9   Cabin          204 non-null    object 
 10  Embarked       889 non-null    object 
 11  Title          891 non-null    object 
 12  Ticket_number  891 non-null    object 
 13  Ticket_item    891 non-null    object 
dtypes: float64(2), int64(5), object(7)
memory usage: 97.6+ KB
PassengerId Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Title Ticket_number Ticket_item
0 1 0 3 Braund Mr Owen Harris male 22.0 1 0 7.2500 NaN S [Mr.] 21171 A/5
1 2 1 1 Cumings Mrs John Bradley Florence Briggs Thayer female 38.0 1 0 71.2833 C85 C [Mrs.] 17599 PC
2 3 1 3 Heikkinen Miss Laina female 26.0 0 0 7.9250 NaN S [Miss.] 3101282 STON/O2.
3 4 1 1 Futrelle Mrs Jacques Heath Lily May Peel female 35.0 1 0 53.1000 C123 S [Mrs.] 113803 NONE
4 5 0 3 Allen Mr William Henry male 35.0 0 0 8.0500 NaN S [Mr.] 373450 NONE
PassengerId Survived Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Title Ticket_number Ticket_item
886 887 0 2 Montvila Rev Juozas male 27.0 0 0 13.00 NaN S [Rev.] 211536 NONE
887 888 1 1 Graham Miss Margaret Edith female 19.0 0 0 30.00 B42 S [Miss.] 112053 NONE
888 889 0 3 Johnston Miss Catherine Helen Carrie female NaN 1 2 23.45 NaN S [Miss.] 6607 W./C.
889 890 1 1 Behr Mr Karl Howell male 26.0 0 0 30.00 C148 C [Mr.] 111369 NONE
890 891 0 3 Dooley Mr Patrick male 32.0 0 0 7.75 NaN Q [Mr.] 370376 NONE
Out[5]:
PassengerId Pclass Name Sex Age SibSp Parch Fare Cabin Embarked Title Ticket_number Ticket_item
0 892 3 Kelly Mr James male 34.5 0 0 7.8292 NaN Q [Mr.] 330911 NONE
1 893 3 Wilkes Mrs James Ellen Needs female 47.0 1 0 7.0000 NaN S [Mrs.] 363272 NONE
2 894 2 Myles Mr Thomas Francis male 62.0 0 0 9.6875 NaN Q [Mr.] 240276 NONE
3 895 3 Wirz Mr Albert male 27.0 0 0 8.6625 NaN S [Mr.] 315154 NONE
4 896 3 Hirvonen Mrs Alexander Helga E Lindqvist female 22.0 1 1 12.2875 NaN S [Mrs.] 3101298 NONE

We now have split up the "Name" feature into more stringent categories. The name normalization might come into play as family names at the time were a label of class and subclass. Titles is a more stringent indicator of class an will possibly have a much more influential part when building the XGBoosted model.

Exploratory modeling¶

In [6]:
print('Correlation heatmap of numerical variables:')

#Only looking at the numeric or binary values to find outliers and correlations that are of interest for data cleaning and refinement.

#Select for columns with numeric values, those with string values will be omitted. 
num_df = preprocessed_train_df.drop("PassengerId", axis = 1).select_dtypes(include=[npy.number])

sn.color_palette("colorblind")

if num_df.shape[1] >= 4:
    mpl.figure(figsize=(10,8)) 
    sn.heatmap(num_df.corr(), annot=True, fmt='.2f', cmap='rocket')
    mpl.title('correlation heatmap of numeric features')
    mpl.tight_layout()
    mpl.show()
else:
    print('Not enough numeric features for correlation analysis')
Correlation heatmap of numerical variables:
No description has been provided for this image

There is some correlations found between the target variable and the numerical features presented in the data-set, most of them seem to be negative correlations.

Survived vs Fare¶

The single comparatively significant positive correlation is between "Survived" and "Fare" feature. This could slightly indicate that if a passenger had a higher fare price, their survival rate was much higher comparatively to those who had signficantly less costly fares.

Fare vs PClass¶

To recap on the data being presented here, the "PClass" feature had been numarized to represent the different Passenger Classes, 0 being the highest class and 3 being the lowest class. As seen on this heatmap, Fare has a negative correlation with Passenger Class, indicating those in a lower class, Pclass=3, had significantly lower fairs than those in the higher Passenger Class, PClass=0.

PClass vs Survived¶

As seen with Fare vs PClass inverse correlation, the lower class that had lower cost fares were at the upper end of the range of integers being presented for Passenger classes. The law of transitive properties relating PClass with Survived through the Fare vs PClass correlation, the inverse correlation between PCLass and survived, indicates that the lower the passenger class, PClass = 3, the less likely the person is going to be Survived=1 for the Survived category. Sematically, this means that those in the lower class have a lower survival rate as the correlation here presents.

Age Distribution Investigation¶

We are going to take a look at what the "Age" distribution looks like to determine if there are any outliers and/or skews that appear which could affect our overall model performance when taking into account of the feature. If there is a skew in either direction of the age distribution, this can potentially disrupt the accuracy of the model and require more resources to run if it is not appropriately weighted. The libraries being used in this notebook are able to take into account for the inaccuracies, we are only touching upon distribution normality weighing from an educational standpoint. The final model would not have to be directly tabulated to take into account for skewed distributions.

In [7]:
display(preprocessed_train_df["Age"].describe())
sn.set_palette("colorblind")
sn.catplot(data = preprocessed_train_df, x="Age", kind="box")
sn.displot(data = preprocessed_train_df, x="Age")
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
Out[7]:
<seaborn.axisgrid.FacetGrid at 0x21ad3301e50>
No description has been provided for this image
No description has been provided for this image

Data Model¶

In [8]:
#Tokenize Names for model compatible format. 
from sklearn.preprocessing import LabelEncoder 
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
warnings.filterwarnings('ignore')

cols = ["Name","Title","Pclass","Sex","Cabin","Embarked", "Ticket_item"]
encoder = LabelEncoder()
for col in cols:
    preprocessed_train_df[col] = encoder.fit_transform(preprocessed_train_df[col].astype(str))
    preprocessed_serving_df[col] = encoder.fit_transform(preprocessed_serving_df[col].astype(str))
    
y_train = preprocessed_train_df["Survived"]
X_train = preprocessed_train_df.drop(["Survived", "PassengerId"], axis=1)

X_test = preprocessed_serving_df.drop('PassengerId', axis = 1)
    
In [12]:
#debug output for Data troubleshooting. Not for Production Use  
#ts_output = preprocessed_train_df
#ts_output.to_csv('debug.csv', index=False)
In [13]:
#Check for Numpy arrays
X = X_train.values if isinstance(X_train, pd.DataFrame) else X_train
y = y_train.values if isinstance(X_train, pd.Series) else y_train
X_test_npy = X_test.values if isinstance(X_test, pd.DataFrame) else X_test

#Configure Stratified K Fold function for 5-fold cross-validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42) #Answer to life
test_preds = npy.zeros(len(X_test_npy))
val_scores = []

for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    print(f"Training fold {fold + 1}...")

    X_tr, X_val = X[train_idx], X[val_idx]
    y_tr, y_val = y[train_idx], y[val_idx]

    model = XGBClassifier(n_estimators=100, max_depth=5, learning_rate=0.2, subsample=0.8, colsample_bytree=0.8, tree_method='gpu_hist', 
        predictor='gpu_predictor', random_state=42, use_label_encoder=False, eval_metric='logloss')

    model.fit(X_tr, y_tr)

    val_pred = model.predict(X_val)
    val_acc = accuracy_score(y_val, val_pred)
    val_scores.append(val_acc)

    test_preds += model.predict(X_test_npy)

final_preds = (test_preds >= 3).astype(int)

output = pd.DataFrame({'PassengerId': df_test.PassengerId, "Survived": final_preds})
output.to_csv('submission.csv', index=False)

print(f"Your submission was successfully saved! CV Scores: {val_scores}")
print(f"Average CV Accuracy: {npy.mean(val_scores):.4f}")
Training fold 1...
Training fold 2...
Training fold 3...
Training fold 4...
Training fold 5...
Your submission was successfully saved! CV Scores: [0.8268156424581006, 0.8539325842696629, 0.8033707865168539, 0.848314606741573, 0.8314606741573034]
Average CV Accuracy: 0.8328

Titanic - Machine Learning from Disaster Competition¶

Results:¶

Public Score : 0.74401