Day 36: Project — Data Analysis

Oct 2, 2024

Applying Pandas Techniques to Analyze the Titanic Dataset — A Step-by-Step Guide

Harshil Chovatiya | matplotlib, numpy, pandas, python

Introduction

Welcome to Day 36 of our Python learning journey! Today, we will put all the Pandas skills we’ve learned into practice by analyzing a real-world dataset. This project will involve cleaning and preparing the data, performing exploratory data analysis (EDA), and drawing meaningful insights. By the end of this blog, you’ll have a solid understanding of how to approach data analysis projects using Pandas.

Choosing a Dataset

For this project, we will use the Titanic dataset, a classic dataset that contains information about the passengers who were on board the Titanic. The dataset includes variables such as age, sex, class, and survival status, making it a rich source for analysis.

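You can download the Titanic dataset from Kaggle. The code below assumes the training file (train.csv in the Kaggle download, 891 rows) has been saved as titanic.csv in the working directory.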

Loading the Dataset

Let’s start by loading the dataset into a Pandas DataFrame.

import pandas as pd
# Load the Titanic dataset
df = pd.read_csv('titanic.csv')
# Display the first few rows of the DataFrame
print(df.head())

Output:

   PassengerId  Survived  Pclass  ... Embarked
0            1         0       3  ...        S
1            2         1       1  ...        C
2            3         1       3  ...        S
3            4         1       1  ...        S
4            5         0       3  ...        S

Data Cleaning and Preparation

Before we can analyze the data, we need to clean and prepare it. This step involves handling missing values, converting data types, and creating new features if necessary.

1. Handling Missing Values

First, let’s check for missing values in the dataset.

# Check for missing values
print(df.isnull().sum())

Output:

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
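Raw counts are easier to judge as a share of all rows. The sketch below simply re-enters the three non-zero counts from the output above and divides by the 891 rows in the dataset; the variable names are illustrative only.

```python
import pandas as pd

# Non-zero missing-value counts copied from the output above
missing_counts = pd.Series({"Age": 177, "Cabin": 687, "Embarked": 2})
n_rows = 891  # number of passengers in the dataset

# Share of rows with a missing value, as a percentage
missing_pct = (missing_counts / n_rows * 100).round(1)
print(missing_pct)
```

On the loaded DataFrame, `df.isnull().mean()` returns the same information as fractions without retyping the counts.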

We can see that the ‘Age’, ‘Cabin’, and ‘Embarked’ columns have missing values. We’ll handle these as follows:

Fill missing ‘Age’ values with the median age.
Drop the ‘Cabin’ column since it has too many missing values.
Fill missing ‘Embarked’ values with the mode.

# Fill missing 'Age' values with the median age
# (assign the result back: fillna(..., inplace=True) on a selected column
#  may not update df in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Drop the 'Cabin' column
df.drop(columns=['Cabin'], inplace=True)
# Fill missing 'Embarked' values with the mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
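To see what median and mode imputation actually do, here is a minimal sketch on a tiny hand-made DataFrame. The `toy` data is hypothetical and not taken from the Titanic file; it is written with plain assignment, which behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not rows from the Titanic file
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, 26.0, np.nan],
    "Embarked": ["S", "C", None, "S", "S"],
})

# Median of the known ages (22, 26, 38) replaces the missing ages
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
# The most frequent port replaces the missing port
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
print(toy.isnull().sum())
```

After the two assignments, no missing values remain in either column.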

2. Converting Data Types

Next, we’ll convert the ‘Sex’ and ‘Embarked’ columns to categorical data types.

# Convert 'Sex' and 'Embarked' columns to categorical data types
df['Sex'] = df['Sex'].astype('category')
df['Embarked'] = df['Embarked'].astype('category')
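A categorical column stores each distinct label once and represents the rows as small integer codes. The sketch below shows this on a hypothetical four-row column rather than the real data.

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})
toy["Sex"] = toy["Sex"].astype("category")

# The distinct labels, and the integer code assigned to each row
categories = list(toy["Sex"].cat.categories)
codes = toy["Sex"].cat.codes.tolist()

print(toy["Sex"].dtype)
print(categories)
print(codes)
```

The labels are sorted alphabetically by default, so each row's code is the position of its label in that sorted list.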

3. Creating New Features

Creating new features can provide additional insights. We’ll create a ‘FamilySize’ feature by adding ‘SibSp’ (siblings and spouses aboard) and ‘Parch’ (parents and children aboard), plus 1 for the passenger themselves.

# Create 'FamilySize' feature
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

Exploratory Data Analysis (EDA)

Now that our data is clean and prepared, we can perform exploratory data analysis to uncover patterns and insights.

1. Descriptive Statistics

Let’s start by examining the descriptive statistics of the numeric columns.

# Display descriptive statistics for the numeric columns
print(df.describe())

Output:

       PassengerId    Survived      Pclass  ...  FamilySize
count   891.000000  891.000000  891.000000  ...  891.000000
mean    446.000000    0.383838    2.308642  ...    1.904602
std     257.353842    0.486592    0.836071  ...    1.613459
min       1.000000    0.000000    1.000000  ...    1.000000
25%     223.500000    0.000000    2.000000  ...    1.000000
50%     446.000000    0.000000    3.000000  ...    1.000000
75%     668.500000    1.000000    3.000000  ...    2.000000
max     891.000000    1.000000    3.000000  ...   11.000000

2. Survival Rate by Sex

We’ll analyze the survival rate by sex to see if there were differences between males and females.

# Survival rate by sex
survival_rate_by_sex = df.groupby('Sex')['Survived'].mean()
print(survival_rate_by_sex)

Output:

Sex
female    0.742038
male      0.188908
Name: Survived, dtype: float64
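Because ‘Survived’ is coded as 0 or 1, its mean within a group is the share of that group that survived. A rate is easier to trust when the group size is shown next to it, so this sketch (on hypothetical toy rows) reports both.

```python
import pandas as pd

# Hypothetical toy rows, not the real passengers
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# mean = survival rate, count = number of passengers in the group
summary = toy.groupby("Sex")["Survived"].agg(["mean", "count"])
print(summary)
```

The same `agg(["mean", "count"])` call can be applied to `df` to see how many passengers stand behind each rate.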

3. Survival Rate by Age Group

We’ll create age groups and analyze the survival rate for each group.

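# Create age groups; bins are right-inclusive, e.g. (0, 12] is 'Child'
age_bins = [0, 12, 18, 35, 60, 120]
age_labels = ['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior']
df['AgeGroup'] = pd.cut(df['Age'], bins=age_bins, labels=age_labels)
# Survival rate by age group (observed=False keeps every category in the result)
survival_rate_by_age_group = df.groupby('AgeGroup', observed=False)['Survived'].mean()
print(survival_rate_by_age_group)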

Output:

AgeGroup
Child          0.593220
Teenager       0.393939
Young Adult    0.367021
Adult          0.416667
Senior         0.227273
Name: Survived, dtype: float64
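It helps to know exactly where `pd.cut` puts ages that fall on a bin edge. By default each bin includes its right edge and excludes its left one, which the following sketch checks with a few hand-picked ages (the `ages` values are illustrative, not from the dataset).

```python
import pandas as pd

# Hand-picked ages, including values that sit exactly on bin edges
ages = pd.Series([0.42, 12, 12.5, 18, 35, 60, 61])
bins = [0, 12, 18, 35, 60, 120]
labels = ["Child", "Teenager", "Young Adult", "Adult", "Senior"]

# Default right=True: the intervals are (0, 12], (12, 18], (18, 35], ...
groups = pd.cut(ages, bins=bins, labels=labels)
print(pd.DataFrame({"Age": ages, "AgeGroup": groups}))
```

An age of exactly 12 lands in ‘Child’ and 18 in ‘Teenager’. An age of exactly 0 would fall outside the first interval and become NaN unless `include_lowest=True` is passed.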

4. Survival Rate by Class

We’ll analyze the survival rate by passenger class to see if class played a role in survival.

# Survival rate by class
survival_rate_by_class = df.groupby('Pclass')['Survived'].mean()
print(survival_rate_by_class)

Output:

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
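Class and sex can also be looked at together. One possible approach, sketched here on hypothetical toy rows, is a pivot table with class as rows and sex as columns; this is an optional extension rather than part of the steps above.

```python
import pandas as pd

# Hypothetical toy rows, not the real passengers
toy = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 1],
    "Sex": ["female", "male", "female", "male", "male", "male"],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Each cell is the survival rate for one class/sex combination
table = toy.pivot_table(index="Pclass", columns="Sex",
                        values="Survived", aggfunc="mean")
print(table)
```

Running the same call on `df` shows whether the class effect holds within each sex.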

5. Correlation Matrix

We’ll create a correlation matrix to see the relationships between numerical features.

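# Correlation matrix (numeric columns only; text columns such as 'Name'
# would raise an error in pandas 2.0 and later)
correlation_matrix = df.corr(numeric_only=True)
print(correlation_matrix)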

Output:

             PassengerId  Survived    ... FamilySize
PassengerId     1.000000 -0.005007  ...   -0.057673
Survived       -0.005007  1.000000  ...   -0.016871
Pclass          0.285904 -0.338481  ...    0.083081
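Pearson correlation is only defined for numeric columns, so text columns have to be left out of the calculation. A minimal sketch on hypothetical toy data shows how `numeric_only=True` does this.

```python
import pandas as pd

# Hypothetical toy data with one text column
toy = pd.DataFrame({
    "Name": ["a", "b", "c", "d"],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 1, 3],
    "Fare": [7.25, 71.28, 53.10, 8.05],
})

# 'Name' is skipped; only the numeric columns are correlated
corr = toy.corr(numeric_only=True)
print(corr)
```

In this toy data every survivor is in class 1 and every non-survivor in class 3, so the Survived/Pclass entry is exactly -1.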

Data Visualization

Visualizing data helps in understanding patterns and insights more effectively. We’ll use the Seaborn library for creating visualizations.

import seaborn as sns
import matplotlib.pyplot as plt

# Set the style of the visualization
sns.set(style='whitegrid')

# Survival rate by sex
plt.figure(figsize=(8, 6))
sns.barplot(x='Sex', y='Survived', data=df)
plt.title('Survival Rate by Sex')
plt.show()

Output:

[Figure: bar plot of survival rate by sex]

Example:

# Survival rate by age group
plt.figure(figsize=(8, 6))
sns.barplot(x='AgeGroup', y='Survived', data=df)
plt.title('Survival Rate by Age Group')
plt.show()

Output:

[Figure: bar plot of survival rate by age group]

Example:

# Survival rate by class
plt.figure(figsize=(8, 6))
sns.barplot(x='Pclass', y='Survived', data=df)
plt.title('Survival Rate by Class')
plt.show()

Output:

[Figure: bar plot of survival rate by class]

Example:

# Correlation matrix heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Output:

[Figure: correlation matrix heatmap]

Drawing Insights

Based on our analysis, we can draw several insights:

Sex and Survival: Females survived at a much higher rate than males (about 74% versus 19%). This is consistent with a “women and children first” policy during the evacuation.
Age and Survival: Children (age <= 12) had the highest survival rate of any age group (about 59%), which is also consistent with priority being given to women and children.
Class and Survival: Survival fell with passenger class: about 63% in first class, 47% in second, and 24% in third. This may reflect socioeconomic differences and unequal access to lifeboats.
Family Size and Survival: The correlation between ‘FamilySize’ and ‘Survived’ is close to zero (about -0.02), so family size shows at most a very weak linear relationship with survival. Comparing survival rates per family size would be needed to say more.
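A single correlation coefficient can hide differences between individual family sizes. One way to look closer, sketched here on hypothetical toy rows, is to group by ‘FamilySize’ and report the survival rate and group size for each value.

```python
import pandas as pd

# Hypothetical toy rows, not the real passengers
toy = pd.DataFrame({
    "SibSp": [0, 1, 1, 4, 0, 0],
    "Parch": [0, 0, 2, 2, 0, 1],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# Same feature definition as in the data preparation step
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Survival rate and number of passengers per family size
by_size = toy.groupby("FamilySize")["Survived"].agg(["mean", "count"])
print(by_size)
```

Applied to `df`, the `count` column also shows which family sizes have too few passengers for the rate to mean much.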

Conclusion

In this blog, we have demonstrated how to analyze a dataset using Pandas, focusing on the Titanic dataset. We covered data cleaning and preparation, exploratory data analysis, and data visualization. By following these steps, you can approach any data analysis project with confidence and derive meaningful insights from your data.

We hope you found this project informative and useful. As you continue to practice and apply these techniques, you’ll become more proficient in data analysis and be able to tackle more complex datasets and projects.

Happy coding, and see you on Day 37!