Introduction

As an introductory data science project, I have chosen to explore the data provided by the Titanic: Machine Learning from Disaster competition hosted by Kaggle. The goal of the competition is to build a model that predicts whether a given passenger survived the sinking of the Titanic. In the first part, I performed exploratory data analysis to learn more about the passengers on board. In this second part, I will compare different machine learning algorithms and submit my solution to Kaggle.

We begin by loading the required packages. For the prediction task, we are given a training dataset and a testing dataset; both contain missing data, so we combine them before applying the data munging steps detailed in Part 1.

library(plyr)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(caret)
library(rattle)
library(rpart)
library(rpart.plot)
library(ranger)
library(e1071)
library(caTools)
library(pROC)
library(randomForest)
library(glmnet)
library(knitr)
library(kernlab)
library(party)
library(ggmosaic)
library(ggbiplot)

# Set seed for reproducible results
set.seed(234233343)

# load the training and testing datasets
train_data <- read_csv(file.path('..','data','train.csv'))
test_data <- read_csv(file.path('..','data','test.csv'))
# Preprocessing, data cleanup the same as for exploratory analysis
# initial pre-processing:
train_data <- train_data %>% 
  mutate(Survived = factor(Survived, levels = c(1, 0), labels = c("yes", "no")))

titanic_data <- train_data %>% 
  bind_rows(test_data) %>% 
  mutate(Title = factor(str_extract(Name, "[a-zA-Z]+\\.")))

# Convert variable names to lowercase
names(titanic_data) <- tolower(names(titanic_data))

# Fill in missing values:
#look for missing data
summary(titanic_data)
sapply(titanic_data, function(df){mean(is.na(df))})

# Have missing data for age, embarked, cabin.
# ignore cabin (missing 77% of data)

# impute Embarked:
table(titanic_data$embarked, useNA = "always")
# set missing data to S, as the most common.
titanic_data$embarked[which(is.na(titanic_data$embarked))] <- "S"

# impute ages to be the mean of people with same title:
# which titles have passengers with a missing age?
table(titanic_data$title[is.na(titanic_data$age)])

# get the mean ages for each title
age_dist <- titanic_data %>% 
  group_by(title) %>% 
  summarize(n = n(),
            n_missing = sum(is.na(age)),
            perc_missing = 100*n_missing/n,
            mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE))

age_dist

# missing data for Dr, Master, Miss, Mr, Mrs, Ms.
# because so many values are missing, impute with values taken from 
# normal distribution, rather than just imputing the mean age

for (key in c("Dr.", "Master.", "Miss.", "Mr.", "Mrs.")) { 
  idx_na <- which(titanic_data$title == key & is.na(titanic_data$age))
  age_idx <- which(age_dist$title == key)
  titanic_data$age[idx_na] <- rnorm(length(idx_na), 
                                    age_dist$mean_age[age_idx], 
                                    age_dist$sd_age[age_idx])
}

# Only two passengers have the title "Ms." and one is missing the age; use the other's age to impute.
idx_na <- which(titanic_data$title == "Ms." & is.na(titanic_data$age))
age_idx <- which(age_dist$title == "Ms.")
titanic_data$age[idx_na] <- age_dist$mean_age[age_idx]

# Impute missing fares with the mean fare:
titanic_data <- titanic_data %>% 
  mutate(fare = ifelse(is.na(fare), mean(fare, na.rm = TRUE), fare)) %>% 
  select(-cabin)
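
As a quick sanity check (not part of the original write-up), we can confirm that the only remaining missing values are in survived, which is unavailable for the test rows by design:

# Optional check: after imputation, only 'survived' should still contain NAs
colSums(is.na(titanic_data))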

Further Data Cleanup

We have already seen that there are many different titles assigned to the passengers:

title      sex     Num_Passengers
Capt.      male    1
Col.       male    4
Countess.  female  1
Don.       male    1
Dona.      female  1
Dr.        female  1
Dr.        male    7
Jonkheer.  male    1
Lady.      female  1
Major.     male    2
Master.    male    61
Miss.      female  260
Mlle.      female  2
Mme.       female  1
Mr.        male    757
Mrs.       female  197
Ms.        female  2
Rev.       male    8
Sir.       male    1

While these titles are informative, they add a level of complexity that may have a detrimental effect on the machine learning models we will be using. We will reduce the number of titles to four: Mr., Mrs., Master., and Miss. These titles capture both gender and age information.

titanic_data <- titanic_data %>% 
  mutate(title = as.character(title),
         title = ifelse((title == "Dr.") & sex == "female", "Mrs.", title),
         title = ifelse(title == "Mlle.", "Miss.", title),
         title = ifelse((title != "Miss.") & sex == "female", "Mrs.", title),
         title = ifelse((title != "Master.") & sex == "male", "Mr.", title),
         title = factor(title))
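
To verify the consolidation (a quick check, not in the original analysis), we can tabulate the remaining titles against sex:

# Confirm that only the four consolidated titles remain
count(titanic_data, title, sex)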

Having finished the data munging steps, we will again divide our data into the training and testing datasets.

test_idx <- which(is.na(titanic_data$survived))
training_data <- titanic_data[-test_idx,]
testing_data <- titanic_data[test_idx,]

Data Exploration

Before building our predictive model, we want to explore the data and see which variables might be most important. We begin by examining the impact of each variable on the survival rate.
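
As an illustration, one such plot (the survival proportion within each passenger class) can be sketched as follows, using the training_data object created above:

# Proportion of survivors within each passenger class
training_data %>% 
  ggplot(aes(x = factor(pclass), fill = survived)) +
  geom_bar(position = "fill") +
  labs(x = "Passenger class", y = "Proportion of passengers", fill = "Survived")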

From plots of this kind for each variable, it appears that pclass, sex, and title have strong correlations with survival, while sibsp, parch, and embarked show a weaker (though noticeable) correlation.

The distributions of age and fare do not appear significantly different for the surviving passengers and those that perished. As we saw previously, a number of third-class passengers paid more for their tickets than some first-class passengers. However, once on board, passengers would have been treated according to their travel class.

Predictive Models

Using the insight we have gained in the previous section, we will construct a number of predictive models and compare their performance. To ensure a fair comparison, we will split the provided data into training and test sets and use these to evaluate how well our models do on unseen data. Using the caret package, we create the train and test data sets and define a common training control object.

train_idx <- createDataPartition(training_data$survived, p = 0.7, list = FALSE)

train <- training_data[train_idx,]
test <- training_data[-train_idx,]

train_control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 10,
                              summaryFunction = twoClassSummary,
                              classProbs = TRUE,
                              verboseIter = FALSE)
data  <- list("train" = train, "test" = test)
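
As an optional check, the stratified split produced by createDataPartition should preserve the class balance in both partitions:

# Survival rates should be similar in the train and test partitions
prop.table(table(data$train$survived))
prop.table(table(data$test$survived))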

Using this data, we will test the algorithms covered in every basic machine learning course: Logistic Regression, Decision Trees, and Random Forests.

Logistic Regression

We begin with a logistic regression model, initially including all the variables we determined to be very or somewhat important.

glm.model <- train(survived ~ pclass + sex + sibsp + parch + embarked + title, 
                   data = data$train,
                   method = "glm",
                   metric = "ROC",
                   trControl = train_control,
                   family = binomial(link = "logit"))

Because the logistic regression model provides probabilities of a passenger surviving or perishing, we need to determine the probability threshold that will be used to classify passengers.

# Determine the optimal threshold
glm.pred <- predict(glm.model, data$test, type = "prob")
glm.analysis <- roc(response = data$test$survived, predictor = glm.pred$yes)
glm.error <- cbind(glm.analysis$thresholds, glm.analysis$sensitivities + glm.analysis$specificities)
glm.thresh <- subset(glm.error, glm.error[,2] == max(glm.error[,2]))[,1]
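
The same Youden-optimal threshold can also be obtained directly with pROC's coords() helper; a one-line alternative (equivalent up to ties):

# Equivalent threshold via pROC's built-in helper (Youden's J statistic)
coords(glm.analysis, x = "best", best.method = "youden", ret = "threshold")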

We find that the optimal threshold is 0.457319. We can plot the ROC curve:

#Plot ROC Curve
plot(1 - glm.analysis$specificities, glm.analysis$sensitivities, type = "l",
     ylab = "Sensitivity",
     xlab = "1 - Specificity",
     col = "black",
     lwd = 2,
     main = "ROC Curve for the Logistic Regression Model")
abline(a = 0,b = 1)
abline(v = glm.thresh) #add optimal t to ROC curve

If we use the determined threshold, we can convert the probabilities to classes, and determine the model’s accuracy:

# Turn probabilities into classes, and look at frequencies:
glm.pred$survived <- ifelse(glm.pred[['yes']] > glm.thresh, 1, 0)
glm.pred$survived <- factor(glm.pred$survived, levels = c(1, 0), labels = c("yes", "no"))
#table(glm.pred$survived)
#table(glm.pred$survived, data$test[['survived']])

# Use caret's confusion matrix:
glm.conf <- confusionMatrix(glm.pred$survived, test$survived)

#colAUC(as.integer(glm.pred$survived), test$survived, plotROC = TRUE)
glm.perf <- c(glm.conf$overall["Accuracy"],
              glm.conf$byClass["Sensitivity"],
              glm.conf$byClass["Specificity"])

We obtain:

Metric       Value
Accuracy     0.8270677
Sensitivity  0.8039216
Specificity  0.8414634

We can also investigate the variables’ importance as determined by the logistic regression model:

# look at variable importance:
glm.imp <- varImp(glm.model, scale = FALSE)
plot(glm.imp)

summary(glm.model)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6014  -0.5686   0.3674   0.6272   2.7083  
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.1607     0.5423  -9.517  < 2e-16 ***
## pclass        1.1354     0.1455   7.803 6.07e-15 ***
## sexmale      -0.1709     0.5847  -0.292  0.77012    
## sibsp         0.4038     0.1345   3.003  0.00267 ** 
## parch         0.3174     0.1556   2.040  0.04133 *  
## embarkedQ     0.7122     0.4826   1.476  0.13999    
## embarkedS     0.8322     0.2859   2.911  0.00360 ** 
## titleMiss.    0.3836     0.3873   0.991  0.32189    
## titleMr.      3.7547     0.5790   6.484 8.91e-11 ***
## titleMrs.         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 832.49  on 624  degrees of freedom
## Residual deviance: 513.66  on 616  degrees of freedom
## AIC: 531.66
## 
## Number of Fisher Scoring iterations: 5

From this summary, pclass, sibsp, and the title terms stand out as the strongest predictors, while sex adds little once title is included (titleMrs. is dropped because sex and title are collinear). Let’s look at a simplified model using only pclass, sibsp, and title.

glm_reduced.model <- train(survived ~ pclass + sibsp  + title, 
                   data = data$train,
                   method = "glm",
                   metric = "ROC",
                   trControl = train_control,
                   family = binomial(link = "logit"))


# Determine the optimal threshold
glm_reduced.pred <- predict(glm_reduced.model, data$test, type = "prob")
glm_reduced.analysis <- roc(response = data$test$survived, predictor = glm_reduced.pred$yes)
glm_reduced.error <- cbind(glm_reduced.analysis$thresholds, glm_reduced.analysis$sensitivities + glm_reduced.analysis$specificities)
glm_reduced.thresh <- subset(glm_reduced.error, glm_reduced.error[,2] == max(glm_reduced.error[,2]))[,1]
glm_reduced.thresh
## [1] 0.4866997
#Plot ROC Curve
plot(1 - glm_reduced.analysis$specificities, glm_reduced.analysis$sensitivities, type = "l",
     ylab = "Sensitivity",
     xlab = "1 - Specificity",
     col = "black",
     lwd = 2,
     main = "ROC Curve for the Reduced Logistic Regression Model")
abline(a = 0,b = 1)
abline(v = glm_reduced.thresh) #add optimal t to ROC curve

# Turn probabilities into classes, and look at frequencies:
glm_reduced.pred$survived <- ifelse(glm_reduced.pred[['yes']] > glm_reduced.thresh, 1, 0)
glm_reduced.pred$survived <- factor(glm_reduced.pred$survived, levels = c(1, 0), labels = c("yes", "no"))

# Use caret's confusion matrix:
glm_reduced.conf <- confusionMatrix(glm_reduced.pred$survived, test$survived)

#colAUC(as.integer(glm_reduced.pred$survived), test$survived, plotROC = TRUE)
glm_reduced.perf <- c(glm_reduced.conf$overall["Accuracy"],
              glm_reduced.conf$byClass["Sensitivity"],
              glm_reduced.conf$byClass["Specificity"])

And we obtain:

Metric       Value
Accuracy     0.8421053
Sensitivity  0.7745098
Specificity  0.8841463

We see that the reduced model actually provides better overall accuracy (0.8421053) than the model incorporating more variables (0.8270677).
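
Since both models were trained with the same repeated cross-validation scheme, caret's resamples() helper offers another way to compare them, this time on cross-validated ROC rather than held-out accuracy (a sketch):

# Compare cross-validated performance of the full and reduced logistic regression models
glm.resamples <- resamples(list(full = glm.model, reduced = glm_reduced.model))
summary(glm.resamples)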

As a first solution to the Kaggle competition, we will submit the results obtained with the reduced model.

# Train the model on the full training data set
glm_submit.model <- train(survived ~ pclass + sibsp  + title, 
                   data = training_data,
                   method = "glm",
                   metric = "ROC",
                   trControl = train_control,
                   family = binomial(link = "logit"))
glm_submit.pred <- predict(glm_submit.model, testing_data, type = "prob")

data.frame(PassengerId = testing_data$passengerid,
           Survived = ifelse(glm_submit.pred$yes > glm_reduced.thresh, 1, 0)) %>% 
  write_csv(file.path("logistic_regression.csv"))

Submitting this to the competition, we obtain an accuracy of 78.947%, which at the time of writing places us in 2239th place.

Decision Tree

We will use the recursive partitioning algorithm to create a decision tree model. We begin with all of the variables and allow the algorithm to determine which are the most important.

rpart_model <- train( survived ~ pclass + sex + age + sibsp + parch + fare + embarked + title,
                     data = data$train,
                     method = "rpart",
                     metric = "ROC",
                     trControl = train_control)

rpart_model.pred <- predict(rpart_model, data$test, type = "prob")
rpart_model.pred$survived <- ifelse(rpart_model.pred[['yes']] > 0.5, 1, 0)
rpart_model.pred$survived <- factor(rpart_model.pred$survived, levels = c(1, 0), labels = c("yes", "no"))

rpart_model.conf <- confusionMatrix(rpart_model.pred$survived, data$test$survived)

With our training and testing partitions, the rpart decision tree model achieves the following performance metrics:

Metric       Value
Accuracy     0.8421053
Sensitivity  0.7745098
Specificity  0.8841463

By examining the variable importance (and, below, plotting the fitted tree itself), we can see which variables the model relied on when building the decision tree:

# look at variable importance:
rpart.imp <- varImp(rpart_model, scale = FALSE)
plot(rpart.imp)
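
Since the rattle package is loaded, we can also draw the fitted tree itself; a one-line sketch using the final rpart object that caret stores in the train result:

# Draw the fitted decision tree (caret keeps the rpart fit in finalModel)
fancyRpartPlot(rpart_model$finalModel)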

This model’s performance on our train/test partition is not significantly different from that of the logistic regression model. When we train on the full training set and submit to the competition:

# Train the model on the full training data set
rpart_submit.model <- train(survived ~ pclass + sex + age + sibsp + parch + fare + embarked + title,
                   data = training_data,
                   method = "rpart",
                   metric = "ROC",
                   trControl = train_control)
rpart_submit.pred <- predict(rpart_submit.model, testing_data, type = "prob")

data.frame(PassengerId = testing_data$passengerid,
           Survived = ifelse(rpart_submit.pred$yes > 0.5, 1, 0)) %>% 
  write_csv(file.path("rpart.csv"))

We obtain an accuracy of 79.426% which moves us up 464 places to 1775th place.

Random Forest

The final model we will consider is a random forest model, using the ranger package’s implementation.

ranger_control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 10,
                              classProbs = TRUE,
                              verboseIter = FALSE)

ranger_model <- train(survived ~ pclass + sex + age + sibsp + parch + fare + embarked + title,
                      data = data$train,
                      method = "ranger",
                      tuneLength = 10,
                      importance = "impurity", # required for varImp() on ranger models
                      trControl = ranger_control)

ranger_model.pred <- predict(ranger_model, data$test, type = "prob")
ranger_model.pred$survived <- ifelse(ranger_model.pred[['yes']] > 0.5, 1, 0)
ranger_model.pred$survived <- factor(ranger_model.pred$survived, levels = c(1, 0), labels = c("yes", "no"))

ranger_model.conf <- confusionMatrix(ranger_model.pred$survived, data$test$survived)

With our training and testing partitions, the ranger random forest model achieves the following performance metrics:

Metric       Value
Accuracy     0.8721805
Sensitivity  0.8039216
Specificity  0.9146341

And we can examine the important variables:

# look at variable importance:
ranger.imp <- varImp(ranger_model, scale = FALSE)
plot(ranger.imp)

We again see that the same variables are the most important. The random forest model provides slightly higher accuracy on our training and testing partitions, so we submit this solution to the Kaggle competition:

# Train the model on the full training data set
ranger_submit.model <- train(survived ~ pclass + sex + age + sibsp + parch + fare + embarked + title,
                       data = training_data,
                       method = "ranger",
                       tuneLength = 10,
                       trControl = ranger_control)

ranger_submit.pred <- predict(ranger_submit.model, testing_data, type = "prob")

data.frame(PassengerId = testing_data$passengerid,
           Survived = ifelse(ranger_submit.pred$yes > 0.5, 1, 0)) %>% 
  write_csv(file.path("ranger.csv"))

Interestingly, when presented with the full training set, the ranger model selected a different hyperparameter (mtry = 4) as the best fit. When submitting this model, Kaggle assigns a score of 77.990% accuracy, which is not an improvement over the decision tree model.
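
The tuning parameters selected in each case can be checked directly; caret stores them in the bestTune element of each train object:

# Hyperparameters chosen on the 70% partition vs. on the full training set
ranger_model$bestTune
ranger_submit.model$bestTune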

If we instead make a submission using the previously trained model:

ranger_submit2.pred <- predict(ranger_model, testing_data, type = "prob")

data.frame(PassengerId = testing_data$passengerid,
           Survived = ifelse(ranger_submit2.pred$yes > 0.5, 1, 0)) %>% 
  write_csv(file.path("ranger_mtry2.csv"))

This model achieves a higher accuracy of 79.904% and moves us into 1272nd place, which is in the top 19% of entries for this competition.

Summary and Conclusions

In this project, we have demonstrated three common machine learning models and submitted the results to the Kaggle competition, ranking within the top 19% of almost 7000 submissions.

Where can we go from here? One of the best ways to improve the performance of our model is through feature engineering, extracting more information from the data we are given. In this project, extracting the passenger titles from the names is one example of feature engineering. We could also derive family sizes, as sketched below, to see how family size affects survival rates.
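
A sketch of such a feature, derived from the sibsp and parch columns (with hypothetical variable names):

# Possible engineered feature: total family size aboard, grouped into coarse bins
titanic_data <- titanic_data %>% 
  mutate(family_size = sibsp + parch + 1,
         family_group = cut(family_size,
                            breaks = c(0, 1, 4, Inf),
                            labels = c("single", "small", "large")))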

I have not done everything that could be done with this dataset. Perhaps in the future, I will return to this dataset and see how much of an improvement feature engineering can provide.