Introduction

As an introductory data science project, I have chosen to explore the data provided by the Titanic: Machine Learning from Disaster competition hosted by Kaggle. The goal of the competition is to build the model that best predicts whether a given passenger survived the sinking of the Titanic. In Part 1, I performed an exploratory data analysis to learn more about the passengers on board. In this second part, I will compare different machine learning algorithms and submit my solution to Kaggle.

We begin by loading the required packages and performing the data munging steps described in Part 1. For the current prediction task, we are given a training dataset and a testing dataset. Both datasets have missing data, so we will combine them before applying the data munging steps detailed in Part 1.

library(plyr)
library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(caret)
library(rattle)
library(rpart)
library(rpart.plot)
library(ranger)
library(e1071)
library(caTools)
library(pROC)
library(randomForest)
library(glmnet)
library(knitr)
library(kernlab)
library(party)
library(ggmosaic)
library(ggbiplot)

# Set seed for reproducible results
set.seed(234233343)

# load the training and testing datasets
train_data <- read_csv(file.path('..','data','train.csv'))
test_data <- read_csv(file.path('..','data','test.csv'))
# Preprocessing, data cleanup the same as for exploratory analysis
# initial pre-processing:
train_data <- train_data %>% 
  mutate(Survived = factor(Survived, levels = c(1, 0), labels = c("yes", "no")))

titanic_data <- train_data %>% 
  bind_rows(test_data) %>% 
  mutate(Title = factor(str_extract(Name, "[a-zA-Z]+\\.")))

# Convert variable names to lowercase
names(titanic_data) <- tolower(names(titanic_data))

# Fill in missing values:
#look for missing data
summary(titanic_data)
sapply(titanic_data, function(df){mean(is.na(df))})

# Data is missing for age, fare, embarked, and cabin.
# ignore cabin (missing 77% of data)

# impute Embarked:
table(titanic_data$embarked, useNA = "always")
# set missing data to S, as the most common.
titanic_data$embarked[which(is.na(titanic_data$embarked))] <- "S"

# impute ages based on the ages of people with the same title:
# first, count the missing ages for each title
table(titanic_data$title[is.na(titanic_data$age)])

# get the mean ages for each title
age_dist <- titanic_data %>% 
  group_by(title) %>% 
  summarize(n = n(),
            n_missing = sum(is.na(age)),
            perc_missing = 100*n_missing/n,
            mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE))

age_dist

# missing data for Dr, Master, Miss, Mr, Mrs, Ms.
# because so many values are missing, impute with values taken from 
# normal distribution, rather than just imputing the mean age

for (key in c("Dr.", "Master.", "Miss.", "Mr.", "Mrs.")) { 
  idx_na <- which(titanic_data$title == key & is.na(titanic_data$age))
  age_idx <- which(age_dist$title == key)
  titanic_data$age[idx_na] <- rnorm(length(idx_na), 
                                    age_dist$mean_age[age_idx], 
                                    age_dist$sd_age[age_idx])
}

# Only 2 passengers with title of "Ms." and one is missing the age. Use the existing age to impute.
idx_na <- which(titanic_data$title == "Ms." & is.na(titanic_data$age))
age_idx <- which(age_dist$title == "Ms.")
titanic_data$age[idx_na] <- age_dist$mean_age[age_idx]

# Impute missing fares with the mean fare:
titanic_data <- titanic_data %>% 
  mutate(fare = ifelse(is.na(fare), mean(fare, na.rm = TRUE), fare)) %>% 
  select(-cabin)
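
The title-based age imputation above draws from a normal distribution, which can occasionally return a negative age when the standard deviation is large relative to the mean (a risk for young groups such as "Master."). The following is a minimal sketch of one way to guard against that, not the method used in this analysis. The helper `impute_age_by_title()`, its `min_age` and `max_age` bounds, and the toy data frame are all hypothetical and introduced only for illustration.

```r
# Hypothetical helper (not part of the original analysis): fills missing ages
# with draws from a per-title normal distribution, clamped to [min_age, max_age].
impute_age_by_title <- function(df, min_age = 0, max_age = 80) {
  titles <- as.character(df$title)
  for (key in unique(titles)) {
    in_group <- titles == key
    known <- df$age[in_group & !is.na(df$age)]
    idx_na <- which(in_group & is.na(df$age))
    # Nothing to impute, or nothing to learn from: skip this title
    if (length(idx_na) == 0 || length(known) == 0) next
    mu <- mean(known)
    # sd() is NA for a single observation; sd = 0 reduces to mean imputation
    sigma <- if (length(known) > 1) sd(known) else 0
    draws <- rnorm(length(idx_na), mean = mu, sd = sigma)
    # Clamp the draws so that no imputed age falls outside the assumed range
    df$age[idx_na] <- pmin(pmax(draws, min_age), max_age)
  }
  df
}

# Toy data for illustration only (made-up values)
set.seed(1)
toy <- data.frame(
  title = c("Master.", "Master.", "Master.", "Mr.", "Mr.", "Mr.", "Ms.", "Ms."),
  age   = c(1, 9, NA, 30, 45, NA, 28, NA),
  stringsAsFactors = FALSE
)
toy_imputed <- impute_age_by_title(toy)
toy_imputed
```

With a single known age, as for "Ms.", the sketch falls back to that age, which matches the special case handled separately above. Clamping does distort the tails of the distribution slightly, so it is a trade-off rather than a strict improvement.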

Further Data Cleanup

We have already seen that there are many different titles assigned to the passengers:

title      sex     Num_Passengers
Capt.      male                 1
Col.       male                 4
Countess.  female               1
Don.       male                 1
Dona.      female               1
Dr.        female               1
Dr.        male                 7
Jonkheer.  male                 1
Lady.      female               1
Major.     male                 2
Master.    male                61
Miss.      female             260
Mlle.      female               2
Mme.       female               1
Mr.        male               757
Mrs.       female             197
Ms.        female               2
Rev.       male                 8
Sir.       male                 1

While these titles are informative, they add a level of complexity that may have a detrimental effect on the machine learning models that we will be looking at, since most of them apply to only a handful of passengers. We will reduce the number of titles to 4: Mr., Mrs., Master., and Miss. These titles capture both gender and age information.

titanic_data <- titanic_data %>% 
  mutate(title = as.character(title),
         title = ifelse((title == "Dr.") & sex == "female", "Mrs.", title),
         title = ifelse(title == "Mlle.", "Miss.", title),
         title = ifelse((title != "Miss.") & sex == "female", "Mrs.", title),
         title = ifelse((title != "Master.") & sex == "male", "Mr.", title),
         title = factor(title))
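
To make the collapsing rules easier to check, here is a small sketch that applies the same logic to a few made-up passengers in base R. The helper `collapse_title()` and the toy vectors are hypothetical and exist only for this illustration. The sketch assumes `sex` takes only the values "male" and "female".

```r
# Hypothetical helper mirroring the rules above on plain character vectors:
# males become "Master." or "Mr."; females become "Miss." or "Mrs."
collapse_title <- function(title, sex) {
  title <- as.character(title)
  ifelse(sex == "male",
         ifelse(title == "Master.", "Master.", "Mr."),
         ifelse(title %in% c("Miss.", "Mlle."), "Miss.", "Mrs."))
}

# Toy input for illustration only
toy_title <- c("Dr.", "Dr.", "Mlle.", "Countess.", "Rev.", "Master.", "Ms.", "Miss.")
toy_sex   <- c("female", "male", "female", "female", "male", "male", "female", "female")

collapsed <- collapse_title(toy_title, toy_sex)
data.frame(original = toy_title, sex = toy_sex, collapsed = collapsed)
```

Note that under these rules every adult or unknown female title, including "Ms." and "Mme.", is mapped to "Mrs.", so the collapsed title no longer reflects marital status reliably.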

Having finished the data munging steps, we will again divide our data into the training and testing datasets.

test_idx <- which(is.na(titanic_data$survived))
training_data <- titanic_data[-test_idx,]
testing_data <- titanic_data[test_idx,]
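
The split above relies on the test rows being exactly those with a missing `survived` value. As a hedged aside, negative indexing with `which()` silently returns zero rows if no index matches, so a logical mask is a slightly more defensive way to write the same split. The toy data frame below is made up for illustration.

```r
# Toy data for illustration only: NA in `survived` marks a test-set row
toy <- data.frame(
  passengerid = 1:5,
  survived = factor(c("yes", "no", NA, "no", NA), levels = c("yes", "no"))
)

# Logical mask: TRUE for rows that belong to the test set
is_test <- is.na(toy$survived)
toy_train <- toy[!is_test, ]
toy_test  <- toy[is_test, ]

# The pitfall: with no matches, -which(...) is an empty index and selects no rows
no_match <- which(toy$passengerid > 100)
nrow(toy[-no_match, ])   # zero rows, not all five
nrow(toy[!(toy$passengerid > 100), ])   # all five rows, as intended
```

For the Titanic data the two approaches give the same result, because the combined dataset always contains test rows.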

Data Exploration

Before building our predictive model, we want to explore the data and see which variables might be most important. We begin by examining the impact of each variable on the survival rate.