This notebook analyses the chances of survival of Titanic passengers, based on their socio-economic status (taking ticket class as a proxy), gender, age, and port of embarkation.
The dataset we're using is named titanic_data.csv, and its specifications are available here.
Our base dataset will have the following columns:
- Pclass (ticket class)
- Sex (gender)
- Age (age)
- Embarked (port of embarkation)
Precisely, we will investigate the following questions:
We will first reindex the DataFrame, clean the data if necessary, and identify possible NaN values. We will also remove outliers during an analysis if necessary.
We will also define two helper functions, get_counts and gen_plot, that will prove helpful and save time throughout the analysis.
import pandas as pd
import numpy as np
from scipy import stats
df = pd.read_csv('data/titanic_data.csv')
df.columns
df.head()
Our DataFrame df is indexed with the default method. Let's use the PassengerId as an index instead.
df = df.set_index(['PassengerId'])
df.head()
Let's also check df for missing values in the columns we're interested in, and then decide what to do with them.
def get_nan_count(column):
    '''
    column - the column for which we want the NaN value count.
    This function returns the number of NaN values in a specific column.
    '''
    nan_count = column.isnull().sum()
    return nan_count
get_nan_count(df['Survived'])
get_nan_count(df['Pclass'])
get_nan_count(df['Sex'])
get_nan_count(df['Age'])
get_nan_count(df['Embarked'])
Overall, our analysis will not have to deal with missing values for Ticket Class and Sex. We will ignore the missing values in the Age and Embarked columns.
Because the Age column has 177 null values out of 891 passengers (about 20%), the age-based results might be a little less reliable.
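The five separate calls above can also be collapsed into a single table. The sketch below is only an illustration: it runs on a tiny made-up DataFrame (the name toy and its values are assumptions, not the real Titanic data) and reports both the NaN count and the NaN share per column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data (values are made up for illustration).
toy = pd.DataFrame({
    'Survived': [1, 0, 1, 0],
    'Pclass': [1, 3, 2, 3],
    'Sex': ['female', 'male', 'female', 'male'],
    'Age': [29.0, np.nan, 4.0, np.nan],
    'Embarked': ['C', 'S', None, 'Q'],
})

columns_of_interest = ['Survived', 'Pclass', 'Sex', 'Age', 'Embarked']

# isnull().sum() on a DataFrame counts missing values column by column.
nan_counts = toy[columns_of_interest].isnull().sum()

# Share of missing values per column, in percent.
nan_share = nan_counts / len(toy) * 100

print(pd.DataFrame({'NaN count': nan_counts, 'NaN %': nan_share}))
```

On the real data, the same two lines applied to df would give the counts for all five columns at once.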
Because we're looking at survival rates with 4 different factors, we should define a function that facilitates getting these rates.
survivor_count = df['Survived'].sum()
def get_survival_rate(dataframe, factor):
    '''
    dataframe - the dataframe on which to apply the analysis
    factor - the factor / column to group passengers by. Should be of type string.
    This function groups passengers by the given factor and returns, for each group,
    its share of all survivors (in percent) and its survivor count.
    '''
    by_factor = dataframe.groupby(factor)
    count_by_factor = by_factor['Survived'].sum()
    # Dividing by the total number of survivors gives each group's share of the survivors.
    survival_rate = count_by_factor / survivor_count * 100
    print('Survival rates:', survival_rate, '\n \n', 'Counts: ', count_by_factor)
    return survival_rate, count_by_factor
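Note that get_survival_rate divides by the total number of survivors, so it gives each group's share of the survivors, which is not the same as the survival rate inside each group. The sketch below contrasts the two quantities on a small made-up DataFrame (toy and its values are assumptions for illustration only).

```python
import pandas as pd

# Made-up data: 2 first class, 2 second class, 4 third class passengers.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

grouped = toy.groupby('Pclass')['Survived']

# Share of all survivors that belongs to each group
# (the quantity get_survival_rate computes).
share_of_survivors = grouped.sum() / toy['Survived'].sum() * 100

# Survival rate within each group: survivors in the group / passengers in the group.
# Since Survived is 0 or 1, the group mean is exactly that ratio.
rate_within_group = grouped.mean() * 100

print(pd.DataFrame({'Share of survivors %': share_of_survivors,
                    'Rate within group %': rate_within_group}))
```

The first quantity depends on how large each group is, while the second one does not, which is why the two are worth reading side by side.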
First, let's see how socio-economic status, proxied with the ticket class, impacted survival.
Before anything, we should get a rough idea of how the ticket classes were divided in the total of passengers.
total_passenger_count = len(df['Pclass'])
first_class_count = (df['Pclass'] == 1).sum()
second_class_count = (df['Pclass'] == 2).sum()
third_class_count = (df['Pclass'] == 3).sum()
per_first_class = first_class_count / total_passenger_count * 100
per_second_class = second_class_count / total_passenger_count * 100
per_third_class = third_class_count / total_passenger_count * 100
print('First class percentage = ', per_first_class, ' | Count: ', first_class_count)
print('Second class percentage = ', per_second_class, ' | Count: ', second_class_count)
print('Third class percentage = ', per_third_class, ' | Count: ', third_class_count)
We see that:
Let's now compare these percentages with the survival ones.
get_survival_rate(df, 'Pclass')
With these survival rates, we can see that:
Let's visualize these results to get a better understanding.
Precisely, let's see how survival and death rates evolve according to ticket class, in absolute count and percentage.
We're first going to define the function get_counts, which allows us to get the precise count of deaths and survivals according to a specific quality. We will reuse this function throughout our analysis (except for our age analysis).
def get_counts(dataframe, factor, quality, status):
    '''
    dataframe - the dataframe on which to apply the function (e.g. 'df' for the Titanic dataframe)
    factor - the column on which to perform the analysis (e.g. 'Pclass' for the ticket class)
    quality - the quality on which to perform the analysis (e.g. '3' for the third class)
    status - the status you want to count: survival or death. Should be a Boolean (1 for survival, 0 for death).
    '''
    # Keep only the rows matching the quality, then only those matching the status.
    target = dataframe[factor].where(dataframe[factor] == quality)
    class_count = target.where(dataframe['Survived'] == status).count()
    return class_count
class1_survival_count = get_counts(df, 'Pclass', 1, 1)
class2_survival_count = get_counts(df, 'Pclass', 2, 1)
class3_survival_count = get_counts(df, 'Pclass', 3, 1)
class1_death_count = get_counts(df, 'Pclass', 1, 0)
class2_death_count = get_counts(df, 'Pclass', 2, 0)
class3_death_count = get_counts(df, 'Pclass', 3, 0)
print('SURVIVAL COUNT BY CLASS')
print('First class survival count:', class1_survival_count)
print('Second class survival count:', class2_survival_count)
print('Third class survival count:', class3_survival_count)
print('\nCASUALTIES COUNT BY CLASS')
print('First class death count:', class1_death_count)
print('Second class death count:', class2_death_count)
print('Third class death count:', class3_death_count)
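As a cross-check on the six get_counts calls, the same survival and death counts can be read from a single contingency table. This is a minimal sketch on made-up data (toy is an assumption, not the Titanic dataset), using pd.crosstab.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are ticket classes, columns are the Survived values (0, then 1).
counts = pd.crosstab(toy['Pclass'], toy['Survived'])
counts.columns = ['Deaths', 'Survivals']

print(counts)
```

Each row of the table corresponds to one pair of get_counts calls (status 0 and status 1) for that class.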
Now that we have our counts, we're ready to visualize the results. We're going to define our second helper function, gen_plot, which will generate two bar charts: one plotting absolute counts for the factor being analyzed, and the other plotting percentages.
import matplotlib.pyplot as plt
import seaborn as sns
def gen_plot(survival_array, death_array, by_factor, x_ticks):
    """
    survival_array: a list providing the survival data being analyzed (e.g. [class1_survival_count, class2_survival_count, class3_survival_count])
    death_array: a list providing the death data being analyzed (e.g. [class1_death_count, class2_death_count, class3_death_count])
    by_factor: the factor that is the focus of the analysis, used in the chart titles (e.g. 'by ticket class')
    x_ticks: the labels of the bars on the x axis (e.g. ['First class', 'Second class', 'Third class'])
    """
    abs_survival_list = np.array(survival_array)
    abs_death_list = np.array(death_array)
    N = len(abs_survival_list)
    ind = np.arange(N)
    width = 1 / N
    f, (ax1, ax2) = plt.subplots(1, 2, figsize=(10,5))
    # Absolute count (deaths stacked on top of survivals)
    ax1.bar(ind, abs_survival_list, width, label='Survival', alpha=0.8)
    ax1.bar(ind, abs_death_list, width, color='#d62728', label='Death', alpha=0.8, bottom=abs_survival_list)
    plt.sca(ax1)
    plt.xticks(ind, x_ticks)
    ax1.set_title('Absolute count ' + by_factor)
    ax1.set_ylabel('Count')
    ax1.legend(loc='upper left')
    plt.setp(plt.gca().get_xticklabels(), rotation=45)
    # Percentage (each bar sums to 100)
    per_survival_list = (abs_survival_list / (abs_survival_list + abs_death_list)) * 100
    per_death_list = (abs_death_list / (abs_survival_list + abs_death_list)) * 100
    ax2.bar(ind, per_survival_list, width, label='Survival percentage', alpha=0.8)
    ax2.bar(ind, per_death_list, width, color='#d62728', label='Death percentage', alpha=0.8, bottom=per_survival_list)
    plt.sca(ax2)
    plt.xticks(ind, x_ticks)
    ax2.set_title('Percentage ' + by_factor)
    ax2.set_ylabel('Percentage')
    plt.setp(plt.gca().get_xticklabels(), rotation=45)
    return plt.show()
We can now plot the bar chart for our ticket class analysis.
gen_plot([class1_survival_count, class2_survival_count, class3_survival_count],
[class1_death_count, class2_death_count, class3_death_count],
'by ticket class',
['First class', 'Second class', 'Third class'])
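We can see that there is a clear association between survival and ticket class: the higher your ticket class, the higher your chances of survival. A chi-square test of independence on the survival-by-class contingency table lets us check that this is unlikely to be due to chance.
# chi2_contingency expects a table of counts, not the raw columns
stats.chi2_contingency(pd.crosstab(df['Survived'], df['Pclass']))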
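A chi-square test of independence is normally run on a contingency table of counts rather than on the raw columns. The sketch below shows how the result could be unpacked and read; it uses made-up data (toy is an assumption), so the numbers say nothing about the real Titanic passengers.

```python
import pandas as pd
from scipy import stats

# Made-up data: 10 first class, 10 second class, 20 third class passengers.
toy = pd.DataFrame({
    'Pclass':   [1] * 10 + [2] * 10 + [3] * 20,
    'Survived': [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5 + [1] * 4 + [0] * 16,
})

# Observed counts: one row per survival status, one column per class.
observed = pd.crosstab(toy['Survived'], toy['Pclass'])

# chi2_contingency returns the statistic, the p-value, the degrees of freedom
# and the counts expected if survival were independent of class.
chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print('chi2:', chi2)
print('p-value:', p_value)
print('degrees of freedom:', dof)
```

A small p-value would indicate that survival and ticket class are unlikely to be independent; the degrees of freedom are (rows - 1) * (columns - 1).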
Let's now analyze what impact gender had on chances of survival. First, we're going to reuse the get_counts function we defined for ticket class, to get the count of survivals and deaths by gender.
female_survival_count = get_counts(df, 'Sex', 'female', 1)
male_survival_count = get_counts(df, 'Sex', 'male', 1)
female_death_count = get_counts(df, 'Sex', 'female', 0)
male_death_count = get_counts(df, 'Sex', 'male', 0)
print('SURVIVAL COUNT BY GENDER')
print('Female survival count: ', female_survival_count)
print('Male survival count: ', male_survival_count)
print('\nCASUALTIES COUNT BY GENDER')
print('Female death count: ', female_death_count)
print('Male death count: ', male_death_count)
Let's now visualize the data.
gen_plot([female_survival_count, male_survival_count],
[female_death_count, male_death_count],
'by gender',
['Female', 'Male'])
Let's put a number on this.
First, we need to create a new DataFrame where we will replace non-int values: female will be 1 and male will be 0.
# Work on a copy so that the original DataFrame is left untouched
gender_df = df[['Sex', 'Survived']].copy()
gender_df['Sex'] = gender_df['Sex'].map({'male': 0, 'female': 1})
gender_df[['Sex', 'Survived']].corr(method='pearson')
According to our analysis, there's a moderate correlation between survival and gender.
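For two binary variables, the Pearson coefficient computed above is the same number as the phi coefficient of the 2x2 table. The sketch below illustrates the encoding and the correlation on made-up data (toy and encoded are assumptions for illustration).

```python
import pandas as pd

# Made-up data: 4 women (3 survived) and 4 men (1 survived).
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'female', 'male', 'male', 'male', 'male', 'female'],
    'Survived': [1, 1, 0, 0, 0, 1, 0, 1],
})

# assign() returns a new DataFrame, so the original labels are preserved.
encoded = toy.assign(Sex=toy['Sex'].map({'male': 0, 'female': 1}))

# Pearson correlation between the two 0/1 columns.
r = encoded['Sex'].corr(encoded['Survived'])
print('Correlation between being female and surviving:', r)
```

A positive value means that the category encoded as 1 (here female) goes together with survival; the sign would flip if the encoding were reversed.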
Let's pause here. We know that:
How do the chances of survival of a first class woman compare to those of a third class man?
df.groupby(['Sex', 'Pclass'])['Survived'].mean() * 100
Chances of survival of a first class woman: ≈ 96.80%
Chances of survival of a third class man: ≈ 13.54%
This definitely explains how the movie ended...
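The grouped result is easier to compare when gender and ticket class are laid out as rows and columns. A minimal sketch, on made-up data (toy is an assumption), using unstack on the grouped series:

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'female', 'male', 'male', 'male'],
    'Pclass':   [1, 1, 1, 1, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 1],
})

# Survival rate (in percent) for every gender / class combination.
rates = toy.groupby(['Sex', 'Pclass'])['Survived'].mean() * 100

# Move the Pclass level to the columns: one row per gender, one column per class.
table = rates.unstack('Pclass')
print(table.round(2))
```

Reading across a row shows the effect of class for one gender; reading down a column shows the effect of gender within one class.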
Now we're going to study the impact of age on chances of survival.
We're going to look at the survival rates of children, compared to older people.
Remember that we have 177 NaN (Not a Number) values here. In other words, we're going to have to cut our data by about 20%. This is not ideal, and you should always try to keep as much data as possible, but in cases like this one we don't really have a choice.
First, let's get rid of the NaN values.
df_age = df[['Age' , 'Survived']].dropna(how='any')
df_age['Age'] = (np.floor(df_age['Age'])).astype(int)
df_age.shape
df_age.head()
We counted 177 missing values. After deleting them, we get a new DataFrame df_age with 714 rows.
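The drop-then-floor step, and the share of rows it costs, can be sketched as follows on a tiny made-up DataFrame (toy, toy_age and their values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Made-up data: two of the five ages are missing.
toy = pd.DataFrame({
    'Age':      [22.0, np.nan, 0.42, 35.5, np.nan],
    'Survived': [0, 1, 1, 0, 0],
})

# Drop rows without an age; copy() so the next assignment does not touch toy.
toy_age = toy.dropna(subset=['Age']).copy()

# np.floor rounds down, so a 0.42 year old baby ends up with age 0.
toy_age['Age'] = np.floor(toy_age['Age']).astype(int)

# Share of the rows lost by dropping the missing ages, in percent.
dropped_share = (len(toy) - len(toy_age)) / len(toy) * 100
print('Rows kept:', len(toy_age), '| Share dropped (%):', dropped_share)
```

With the figures quoted above, the same calculation is 177 / 891, or roughly 20% of the rows.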
# Use the original (unrounded) ages here: the floored minimum would be 0
print('Youngest passenger: ' + str(round(df['Age'].min() * 12)) + ' months old')
print('\nOldest passenger: ' + str(df_age['Age'].max()) + ' years old')
We rounded the ages down to whole years for the sake of this analysis. The youngest passenger on board was a few months old, and the oldest was 80 years old.
Now let's order our data and create a new DataFrame, df_ages_survival. It will be indexed by Age. For each age, four columns will give us the following information:
ages_list = df_age['Age'].unique()
ages_list.sort()
ages_list
df_ages_survival = pd.DataFrame(index=ages_list, columns=['Survived', 'Deaths', 'Total', 'Percentage'])
df_ages_survival['Survived'] = df_age.groupby('Age')['Survived'].sum()
df_ages_survival['Total'] = df_age.groupby('Age')['Survived'].count()
df_ages_survival['Deaths'] = df_ages_survival['Total'] - df_ages_survival['Survived']
df_ages_survival['Percentage'] = round(df_age.groupby('Age')['Survived'].mean() * 100, 2)
# for age in ages_list:
# df_ages_survival.loc[age]['Survived'] = (df_age['Age'] == age).where(df_age['Survived'] == 1).sum()
# df_ages_survival.loc[age]['Total'] = (df_age['Age'] == age).sum()
# df_ages_survival.loc[age]['Deaths'] = df_ages_survival.loc[age]['Total'] - df_ages_survival.loc[age]['Survived']
# df_ages_survival.loc[age]['Percentage'] = (df_ages_survival.loc[age]['Survived'] / df_ages_survival.loc[age]['Total']) * 100
#
# df_ages_survival.head()
df_ages_survival.head()
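The same per-age table could also be built in one pass with a named aggregation. This is only a sketch on made-up data (toy_age and summary are hypothetical names), not a replacement for the cell above.

```python
import pandas as pd

# Made-up data: ages already rounded to whole years.
toy_age = pd.DataFrame({
    'Age':      [2, 2, 30, 30, 30, 60],
    'Survived': [1, 1, 1, 0, 0, 0],
})

# sum gives the survivors per age, count gives the passengers per age.
summary = toy_age.groupby('Age')['Survived'].agg(Survived='sum', Total='count')

summary['Deaths'] = summary['Total'] - summary['Survived']
summary['Percentage'] = (summary['Survived'] / summary['Total'] * 100).round(2)

print(summary)
```

Because groupby sorts its keys by default, the resulting index is already ordered by age.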
Let's take a look at a scatter plot of our data, and see if we can identify any outliers.
x = df_ages_survival['Percentage'].index
y = df_ages_survival['Percentage']
plt.scatter(x, y)
m, b = np.polyfit(x, y, 1)
plt.plot(x, y, '.')
plt.title('Survival rates by age')
plt.xlabel('Age in years')
plt.ylabel('Percentage of survivors')
plt.show()
We can identify a tendency: low survival percentages tend to appear at high ages, and high survival percentages at low ages. Apparently, the younger you were, the better your chances of survival.
However, there are three dots on the upper right side of the graph that seem to be outliers. They are not representative of the general tendency: let's identify them, remove them from the DataFrame, plot it again and fit a regression line.
df_ages_survival = df_ages_survival.drop(df_ages_survival[(df_ages_survival['Percentage'] == 100) & (df_ages_survival.index > 50)].index)
x = df_ages_survival['Percentage'].index
y = df_ages_survival['Percentage']
plt.scatter(x, y)
m, b = np.polyfit(x, y, 1)
plt.plot(x, y, '.')
plt.plot(x, m*x + b, '-')
plt.title('Survival rates by age')
plt.xlabel('Age in years')
plt.ylabel('Percentage of survivors')
plt.show()
The downward slope of the regression line suggests that the younger you were, the higher your chances of survival. Let's get information about the regression line (especially the slope):
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
print('Slope: ', slope)
print('Intercept: ', intercept)
print('r_value: ', r_value)
print('r_squared: ', r_value ** 2)
print('p_value: ', p_value)
print('std_error: ', std_err)
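To make the slope and intercept concrete, the sketch below fits a line to a few made-up (age, survival percentage) points and uses it to estimate the percentage at a given age. The arrays ages and rates are assumptions for illustration, not values from the dataset.

```python
import numpy as np
from scipy import stats

# Made-up points: survival percentage decreasing with age.
ages = np.array([0, 10, 20, 30, 40, 50])
rates = np.array([72.0, 58.0, 51.0, 39.0, 31.0, 19.0])

slope, intercept, r_value, p_value, std_err = stats.linregress(ages, rates)

# The slope is the change in survival percentage per additional year of age.
print('Change per year of age:', slope)

# The fitted line gives an estimate for any age: y = slope * x + intercept.
predicted_at_25 = slope * 25 + intercept
print('Estimated survival percentage at age 25:', predicted_at_25)

# r squared is the share of the variance explained by the line.
print('r squared:', r_value ** 2)
```

A negative slope means the percentage goes down as age goes up, and r squared indicates how closely the points follow the line.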
Now this is enough to get some insight, but maybe there's a better way to visualize the data. Let's plot a kernel density estimate of ages according to survival status:
age_bins = np.arange(0, 100, 4)
# distplot is deprecated in recent seaborn versions; histplot with kde=True draws the same kind of chart
sns.histplot(df.loc[df['Survived'] == 0, 'Age'].dropna(), bins=age_bins, stat='density', kde=True, color='#d62728', label='Did not survive')
sns.histplot(df.loc[df['Survived'] == 1, 'Age'].dropna(), bins=age_bins, stat='density', kde=True, label='Survived')
plt.title('Age distribution among survival classes')
plt.ylabel('Density')
plt.legend()
plt.show()
We can definitely see a spike in the distribution of surviving passengers at the youngest ages, indicating that children had a higher survival rate.
We're going to carry on and strengthen our analysis by shedding more light on two specific ages, so that we also have a benchmark. We will use a bar chart.
We will analyze ages 6 and 12, but you can call the function we're going to define with the age of your choice.
Below is a new function similar to get_counts. It applies the "less than or equal to" condition of our analysis and returns the survival and death counts for both groups, taking a user-defined age as an argument.
def get_age_counts(age):
    '''
    age - the age limit on which to apply the analysis
    This function returns, in order, the survival counts for people up to the specified age and for people older than it,
    then the death counts for people up to the specified age and for people older than it.
    '''
    younger_target = df['Age'].where(df['Age'] <= age)
    older_target = df['Age'].where(df['Age'] > age)
    younger_survival_count = younger_target.where(df['Survived'] == 1).count()
    older_survival_count = older_target.where(df['Survived'] == 1).count()
    younger_death_count = younger_target.where(df['Survived'] == 0).count()
    older_death_count = older_target.where(df['Survived'] == 0).count()
    return younger_survival_count, older_survival_count, younger_death_count, older_death_count
up_to_6_survival_count, over_6_survival_count, up_to_6_death_count, over_6_death_count = get_age_counts(6)
up_to_12_survival_count, over_12_survival_count, up_to_12_death_count, over_12_death_count = get_age_counts(12)
print('AGE LIMIT = 6')
print('Survival count for children up to 6: ', up_to_6_survival_count)
print('Survival count for people over 6: ', over_6_survival_count)
print('Death count for children up to 6: ', up_to_6_death_count)
print('Death count for people over 6: ', over_6_death_count)
print('\nAGE LIMIT = 12')
print('Survival count for children up to 12: ', up_to_12_survival_count)
print('Survival count for people over 12: ', over_12_survival_count)
print('Death count for children up to 12: ', up_to_12_death_count)
print('Death count for people over 12: ', over_12_death_count)
Now we're ready to build our bar charts. Looking at the graphs, our over 6 and over 12 counts and percentages will probably look identical.
# Plot for age 6
gen_plot([up_to_6_survival_count, over_6_survival_count],
[up_to_6_death_count, over_6_death_count],
'- 6 years old',
['Up to 6', 'Over 6'])
# Plot for age 12
gen_plot([up_to_12_survival_count, over_12_survival_count],
[up_to_12_death_count, over_12_death_count],
'- 12 years old',
['Up to 12', 'Over 12'])
It would seem that children up to 6 years old had a higher chance of survival than people over 6 years old.
It would also seem that children up to 12 years old had a higher chance of survival than people over 12 years old, but lower than people up to 6 years old.
This supports our first hypothesis: the younger you were, the higher your chances of survival.
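The two thresholds used above generalize to a set of age bands. The sketch below bins ages with pd.cut and computes a survival rate per band; the data, the band edges and the labels are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    'Age':      [1, 5, 6, 9, 12, 20, 35, 50, 64, 80],
    'Survived': [1, 1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Hypothetical band edges: each band includes its upper edge (6 falls in '0-6').
bins = [0, 6, 12, 18, 60, 100]
labels = ['0-6', '7-12', '13-18', '19-60', '61+']

# include_lowest=True keeps age 0 in the first band.
toy['AgeBand'] = pd.cut(toy['Age'], bins=bins, labels=labels, include_lowest=True)

# observed=True leaves out bands that contain no passenger.
rate_by_band = toy.groupby('AgeBand', observed=True)['Survived'].mean() * 100
print(rate_by_band.round(2))
```

Bands with very few passengers give unstable percentages, so it is worth checking the size of each band before comparing them.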
Our final analysis will look at chances of survival based on the port of embarkation.
Titanic passengers embarked from three different ports:
Before anything, we should get a rough idea of how passengers were divided among the ports of embarkation.
total_passenger_count = len(df['Embarked'])
cherbourg_count = (df['Embarked'] == 'C').sum()
queenstown_count = (df['Embarked'] == 'Q').sum()
southampton_count = (df['Embarked'] == 'S').sum()
per_cherbourg = cherbourg_count / total_passenger_count * 100
per_queenstown = queenstown_count / total_passenger_count * 100
per_southampton = southampton_count / total_passenger_count * 100
print('Cherbourg percentage = ', per_cherbourg, ' | Count: ', cherbourg_count)
print('Queenstown percentage = ', per_queenstown, ' | Count: ', queenstown_count)
print('Southampton percentage = ', per_southampton, ' | Count: ', southampton_count)
We see that:
get_survival_rate(df, 'Embarked')
With these survival rates, we can see that:
Let's visualize these results to get a better understanding.
Precisely, let's see how survival and death rates evolve according to port of embarkation, in absolute count and percentage. We'll use the get_counts function once more.
cherbourg_survival_count = get_counts(df, 'Embarked', 'C', 1)
queenstown_survival_count = get_counts(df, 'Embarked', 'Q', 1)
southampton_survival_count = get_counts(df, 'Embarked', 'S', 1)
cherbourg_death_count = get_counts(df, 'Embarked', 'C', 0)
queenstown_death_count = get_counts(df, 'Embarked', 'Q', 0)
southampton_death_count = get_counts(df, 'Embarked', 'S', 0)
print('SURVIVAL COUNT BY PORT OF EMBARKATION')
print('Cherbourg survival count:', cherbourg_survival_count)
print('Queenstown survival count:', queenstown_survival_count)
print('Southampton survival count:', southampton_survival_count)
print('\nCASUALTIES COUNT BY PORT OF EMBARKATION')
print('Cherbourg death count:', cherbourg_death_count)
print('Queenstown death count:', queenstown_death_count)
print('Southampton death count:', southampton_death_count)
gen_plot([cherbourg_survival_count, queenstown_survival_count, southampton_survival_count],
[cherbourg_death_count, queenstown_death_count, southampton_death_count],
'by port of embarkation',
['Cherbourg', 'Queenstown', 'Southampton'])
I don't suppose women and children were overrepresented at any port of embarkation. We can make two hypotheses then:
We can test the first hypothesis (class distribution).
# Cherbourg distribution per class
cherbourg_df = df[df['Embarked'] == 'C']
southampton_df = df[df['Embarked'] == 'S']
count_cherbourg_first = (cherbourg_df['Pclass'] == 1).sum()
count_cherbourg_second = (cherbourg_df['Pclass'] == 2).sum()
count_cherbourg_third = (cherbourg_df['Pclass'] == 3).sum()
per_cherbourg_first = count_cherbourg_first / cherbourg_count * 100
per_cherbourg_second = count_cherbourg_second / cherbourg_count * 100
per_cherbourg_third = count_cherbourg_third / cherbourg_count * 100
# Southampton distribution per class
count_southampton_first = (southampton_df['Pclass'] == 1).sum()
count_southampton_second = (southampton_df['Pclass'] == 2).sum()
count_southampton_third = (southampton_df['Pclass'] == 3).sum()
per_southampton_first = count_southampton_first / southampton_count * 100
per_southampton_second = count_southampton_second / southampton_count * 100
per_southampton_third = count_southampton_third / southampton_count * 100
print('Cherbourg first class percentage = ', per_cherbourg_first)
print('Cherbourg second class percentage = ', per_cherbourg_second)
print('Cherbourg third class percentage = ', per_cherbourg_third)
print('\nSouthampton first class percentage = ', per_southampton_first)
print('Southampton second class percentage = ', per_southampton_second)
print('Southampton third class percentage = ', per_southampton_third)
It looks like our first hypothesis was right:
This explains why the survival rate is higher for Cherbourg passengers than for Southampton passengers.
Let's repeat the combined analysis (gender and ticket class) we did previously, adding the port of embarkation to the equation.
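The class distribution per port can also be obtained in a single call with a row-normalized contingency table. A minimal sketch on made-up data (toy is an assumption, not the real passenger list):

```python
import pandas as pd

# Made-up data: 4 passengers from Cherbourg (C), 4 from Southampton (S).
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'C', 'C', 'S', 'S', 'S', 'S'],
    'Pclass':   [1, 1, 2, 3, 1, 3, 3, 3],
})

# normalize='index' divides each row by its total, so every port sums to 100%.
class_share = pd.crosstab(toy['Embarked'], toy['Pclass'], normalize='index') * 100
print(class_share.round(2))
```

Each row then reads as "among the passengers who embarked at this port, what percentage travelled in each class".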
df.groupby(['Sex', 'Pclass', 'Embarked'])['Survived'].mean() * 100
Chances of survival of a first class woman from Cherbourg: ≈ 97.67%
Chances of survival of a third class man from Southampton: ≈ 12.83%
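When passengers are split by three factors at once, some groups become very small, so it helps to look at the group size next to each rate. The sketch below does this on made-up data (toy and the column names rate and n are assumptions for illustration).

```python
import pandas as pd

# Made-up data for illustration only.
toy = pd.DataFrame({
    'Sex':      ['female', 'female', 'male', 'male', 'male', 'male', 'female'],
    'Pclass':   [1, 1, 3, 3, 3, 3, 3],
    'Embarked': ['C', 'C', 'S', 'S', 'S', 'S', 'Q'],
    'Survived': [1, 1, 0, 0, 1, 0, 1],
})

# For each combination: the survival rate and the number of passengers behind it.
summary = toy.groupby(['Sex', 'Pclass', 'Embarked'])['Survived'].agg(rate='mean', n='count')
summary['rate'] = summary['rate'] * 100

print(summary)
```

A rate of 100% based on a single passenger carries far less weight than a rate based on several hundred, which the n column makes visible.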
From this analysis, we would be inclined to conclude that passengers had higher chances of survival if:
Conversely, being an old third class man from Southampton lowered your chances of survival.
However, there are a few things we should keep in mind: