Day 36: Project — Data Analysis
Applying Pandas Techniques to Analyze the Titanic Dataset — A Step-by-Step Guide

Introduction
Welcome to Day 36 of our Python learning journey! Today, we will put all the Pandas skills we’ve learned into practice by analyzing a real-world dataset. This project will involve cleaning and preparing the data, performing exploratory data analysis (EDA), and drawing meaningful insights. By the end of this blog, you’ll have a solid understanding of how to approach data analysis projects using Pandas.
Choosing a Dataset
For this project, we will use the Titanic dataset, a classic dataset that contains information about the passengers who were on board the Titanic. The dataset includes variables such as age, sex, class, and survival status, making it a rich source for analysis.
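You can download the Titanic dataset from Kaggle. The code below assumes the training file (891 rows) is saved as titanic.csv in your working directory.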
Loading the Dataset
Let’s start by loading the dataset into a Pandas DataFrame.
import pandas as pd
# Load the Titanic dataset
df = pd.read_csv('titanic.csv')
# Display the first few rows of the DataFrame
print(df.head())
Output:
PassengerId Survived Pclass ... Embarked
0 1 0 3 ... S
1 2 1 1 ... C
2 3 1 3 ... S
3 4 1 1 ... S
4 5 0 3 ... S
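Before cleaning anything, it helps to check the shape of the DataFrame and the type Pandas inferred for each column. The sketch below is only an illustration: it reads a tiny hand-made sample (hypothetical rows, not the real file) through io.StringIO so it runs without titanic.csv. With the real data you would call the same methods on df.

```python
import io
import pandas as pd

# Tiny hand-made sample laid out like a few Titanic columns.
# These rows are illustrative only; they are not the real dataset.
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.dtypes)   # column types inferred by read_csv
sample.info()          # non-null count per column
```

Note that the empty 'Age' cell is read as NaN, which is why the column is stored as a float even though the other ages are whole numbers.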
Data Cleaning and Preparation
Before we can analyze the data, we need to clean and prepare it. This step involves handling missing values, converting data types, and creating new features if necessary.
1. Handling Missing Values
First, let’s check for missing values in the dataset.
# Check for missing values
print(df.isnull().sum())
Output:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
We can see that the ‘Age’, ‘Cabin’, and ‘Embarked’ columns have missing values. We’ll handle these as follows:
- Fill missing ‘Age’ values with the median age.
- Drop the ‘Cabin’ column, since 687 of its 891 values (about 77%) are missing.
- Fill missing ‘Embarked’ values with the mode.
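# Fill missing 'Age' values with the median age
# (assign the result back; calling fillna(inplace=True) on a selected column
# triggers chained-assignment warnings in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].median())
# Drop the 'Cabin' column
df.drop(columns=['Cabin'], inplace=True)
# Fill missing 'Embarked' values with the mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])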
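To see the three cleaning steps end to end, here is a minimal sketch on a hypothetical four-row sample (the values are made up for illustration). It first computes the share of missing values per column, which is the number that motivates dropping 'Cabin', then applies the same steps as plain assignments and confirms that nothing is left missing.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample with gaps in the same three columns
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
})

# Fraction of missing values per column (0.0 to 1.0)
missing_share = sample.isnull().mean()
print(missing_share)

# The same three steps as in the text, written as assignments
sample["Age"] = sample["Age"].fillna(sample["Age"].median())
sample = sample.drop(columns=["Cabin"])
sample["Embarked"] = sample["Embarked"].fillna(sample["Embarked"].mode()[0])

# Total number of missing values left in the sample
print(sample.isnull().sum().sum())
```

Running the same isnull().sum() check on df after cleaning is a quick way to confirm the real dataset has no gaps left.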
2. Converting Data Types
Next, we’ll convert the ‘Sex’ and ‘Embarked’ columns to categorical data types.
# Convert 'Sex' and 'Embarked' columns to categorical data types
df['Sex'] = df['Sex'].astype('category')
df['Embarked'] = df['Embarked'].astype('category')
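If you are curious what the conversion changes, the short sketch below (using a hypothetical four-value Series) shows that a categorical column stores each distinct label once and keeps a small integer code per row.

```python
import pandas as pd

# Hypothetical values standing in for the 'Sex' column
sex = pd.Series(["male", "female", "female", "male"])
sex_cat = sex.astype("category")

# The distinct labels, sorted alphabetically by default
print(sex_cat.cat.categories)

# The integer code stored for each row (an index into the categories)
print(sex_cat.cat.codes.tolist())
```

This is why categorical columns usually take less memory than plain strings when a column has only a few distinct values.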
3. Creating New Features
Creating new features can provide additional insights. We’ll create a ‘FamilySize’ feature that adds the ‘SibSp’ (siblings and spouses aboard) and ‘Parch’ (parents and children aboard) columns, plus one for the passenger themselves.
# Create 'FamilySize' feature
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
Exploratory Data Analysis (EDA)
Now that our data is clean and prepared, we can perform exploratory data analysis to uncover patterns and insights.
1. Descriptive Statistics
Let’s start by examining the descriptive statistics of the dataset.
# Display descriptive statistics
print(df.describe(include='all'))
Output:
PassengerId Survived Pclass ... Embarked FamilySize
count 891.000000 891.000000 891.00000 ... 891.0000 891.000000
mean 446.000000 0.383838 2.30864 ... 0.1991 1.123462
std 257.353842 0.486592 0.83666 ... 0.3984 0.883550
min 1.000000 0.000000 1.00000 ... 0.0000 1.000000
25% 224.000000 0.000000 2.00000 ... 0.0000 1.000000
50% 446.000000 0.000000 3.00000 ... 0.0000 1.000000
75% 668.000000 1.000000 3.00000 ... 0.0000 1.000000
max 891.000000 1.000000 3.00000 ... 0.0000 1.000000
2. Survival Rate by Sex
We’ll analyze the survival rate by sex to see if there were differences between males and females.
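# Survival rate by sex
# ('Sex' is categorical now; observed=True keeps only categories present in the data)
survival_rate_by_sex = df.groupby('Sex', observed=True)['Survived'].mean()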
print(survival_rate_by_sex)
Output:
Sex
female 0.742038
male 0.188908
Name: Survived, dtype: float64
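A mean on its own hides how many passengers stand behind it. As a small sketch on hypothetical rows, we can ask groupby for the mean and the group size together with agg.

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0],
})

# Survival rate (mean) and number of passengers (count) per group
summary = sample.groupby("Sex")["Survived"].agg(["mean", "count"])
print(summary)
```

The same agg call works on df and makes it easier to judge whether a rate comes from a large or a small group.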
3. Survival Rate by Age Group
We’ll create age groups and analyze the survival rate for each group.
# Create age groups
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 120], labels=['Child', 'Teenager', 'Young Adult', 'Adult', 'Senior'])
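# Survival rate by age group
# (pd.cut returns a categorical column; observed=True keeps only groups present in the data)
survival_rate_by_age_group = df.groupby('AgeGroup', observed=True)['Survived'].mean()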
print(survival_rate_by_age_group)
Output:
AgeGroup
Child 0.593220
Teenager 0.393939
Young Adult 0.367021
Adult 0.416667
Senior 0.227273
Name: Survived, dtype: float64
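It is worth knowing where the bin edges fall. By default pd.cut includes the right edge of each bin, so a 12-year-old lands in 'Child' and an 18-year-old in 'Teenager'. The sketch below checks this on a few hand-picked ages (hypothetical values chosen to sit on or near the edges).

```python
import pandas as pd

# Hand-picked ages on or near the bin edges
ages = pd.Series([0.42, 12, 12.5, 18, 35, 36, 80])

groups = pd.cut(
    ages,
    bins=[0, 12, 18, 35, 60, 120],
    labels=["Child", "Teenager", "Young Adult", "Adult", "Senior"],
)

# Each bin is (left, right]: the right edge belongs to the bin
print(groups.tolist())
```

If you would rather have each bin include its left edge instead, pd.cut accepts right=False.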
4. Survival Rate by Class
We’ll analyze the survival rate by passenger class to see if class played a role in survival.
# Survival rate by class
survival_rate_by_class = df.groupby('Pclass')['Survived'].mean()
print(survival_rate_by_class)
Output:
Pclass
1 0.629630
2 0.472826
3 0.242363
Name: Survived, dtype: float64
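Sex and class can also be looked at together. One possible way, sketched here on hypothetical rows, is a pivot table with one row per sex and one column per class.

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# Mean of 'Survived' for every (Sex, Pclass) combination
table = sample.pivot_table(
    values="Survived", index="Sex", columns="Pclass", aggfunc="mean"
)
print(table)
```

On the real df the same call gives a 2 x 3 table of survival rates, which shows whether the class effect holds for both sexes.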
5. Correlation Matrix
We’ll create a correlation matrix to see the relationships between numerical features.
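# Correlation matrix
# (numeric_only=True skips text columns such as 'Name' and 'Ticket',
# which would otherwise raise an error in pandas 2.x)
correlation_matrix = df.corr(numeric_only=True)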
print(correlation_matrix)
Output:
PassengerId Survived ... FamilySize
PassengerId 1.000000 -0.005007 ... -0.057673
Survived -0.005007 1.000000 ... -0.016871
Pclass 0.285904 -0.338481 ... 0.083081
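The full matrix is a lot to read when we mostly care about one column. A minimal sketch, again on hypothetical rows, pulls out just the correlations with 'Survived' and sorts them.

```python
import pandas as pd

# Hypothetical rows, for illustration only; 'Name' stands in for a text column
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 1, 3, 2, 3],
    "Fare": [7.25, 71.28, 53.10, 8.05, 13.00, 7.90],
    "Name": ["a", "b", "c", "d", "e", "f"],
})

corr_with_survived = (
    sample.corr(numeric_only=True)["Survived"]  # text columns are skipped
    .drop("Survived")                           # drop the self-correlation of 1.0
    .sort_values()                              # most negative first
)
print(corr_with_survived)
```

Keep in mind that these are Pearson correlations, so they only measure linear relationships between numeric columns.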
Data Visualization
Visualizing data helps in understanding patterns and insights more effectively. We’ll use the Seaborn library for creating visualizations.
import seaborn as sns
import matplotlib.pyplot as plt
# Set the style of the visualization (set_theme is the current name for sns.set)
sns.set_theme(style='whitegrid')
# Survival rate by sex
plt.figure(figsize=(8, 6))
sns.barplot(x='Sex', y='Survived', data=df)
plt.title('Survival Rate by Sex')
plt.show()
Output: bar chart of survival rate by sex.
# Survival rate by age group
plt.figure(figsize=(8, 6))
sns.barplot(x='AgeGroup', y='Survived', data=df)
plt.title('Survival Rate by Age Group')
plt.show()
Output: bar chart of survival rate by age group.
# Survival rate by class
plt.figure(figsize=(8, 6))
sns.barplot(x='Pclass', y='Survived', data=df)
plt.title('Survival Rate by Class')
plt.show()
Output: bar chart of survival rate by class.
# Correlation matrix heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
Output: heatmap of the correlation matrix.
Drawing Insights
Based on our analysis, we can draw several insights:
- Sex and Survival: Females had a much higher survival rate than males (about 74% versus 19%). This could be due to the “women and children first” policy during the evacuation.
- Age and Survival: Children (age <= 12) had the highest survival rate of any age group (about 59%). This is also consistent with the priority given to women and children.
- Class and Survival: Passengers in first class had a higher survival rate (about 63%) than those in second (47%) and third class (24%). This may reflect socioeconomic differences and unequal access to lifeboats.
- Family Size and Survival: The correlation between ‘FamilySize’ and ‘Survived’ is slightly negative (about -0.02), which hints that passengers with smaller families survived marginally more often. An effect this small is weak evidence on its own.
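The family size point rests on a single correlation value, so a group-by is a more direct way to check it. The sketch below shows one way to do that on hypothetical rows (the numbers are made up and say nothing about the real passengers). On df you would skip the first two steps, since 'FamilySize' already exists.

```python
import pandas as pd

# Hypothetical rows, for illustration only
sample = pd.DataFrame({
    "SibSp": [0, 1, 1, 0, 4, 4],
    "Parch": [0, 0, 1, 0, 2, 2],
    "Survived": [0, 1, 1, 1, 0, 0],
})

# Same feature as in the text: relatives aboard plus the passenger
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# Survival rate and number of passengers for each family size
by_family = sample.groupby("FamilySize")["Survived"].agg(["mean", "count"])
print(by_family)
```

Looking at the rate per family size, along with the count behind it, shows patterns that a single correlation number can hide.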
Conclusion
In this blog, we have demonstrated how to analyze a dataset using Pandas, focusing on the Titanic dataset. We covered data cleaning and preparation, exploratory data analysis, and data visualization. By following these steps, you can approach any data analysis project with confidence and derive meaningful insights from your data.
We hope you found this project informative and useful. As you continue to practice and apply these techniques, you’ll become more proficient in data analysis and be able to tackle more complex datasets and projects.
Happy coding, and see you on Day 37!