Exploratory Data Analysis (EDA) With Python (2024)

Exploratory Data Analysis (EDA) is a method statisticians use to reveal patterns and important results in a dataset, mainly by visualizing it with various graphs. In data analysis, EDA is an important step for spotting and understanding the valuable patterns within the data.

Requirements:
  1. Python
  2. Pandas
  3. Seaborn
  4. Matplotlib
  5. NumPy

In this post I perform a simple EDA on the Titanic dataset available on Kaggle. You can download it from here.

Importing Required Python Libraries.
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter magic: render plots inline (notebook only)
%matplotlib inline

# Set the default matplotlib figure size
plt.rcParams['figure.figsize'] = (10.0, 8.0)
Importing the Dataset using Pandas
titanic_df = pd.read_csv('train.csv')
Check first 5 rows of the dataset
titanic_df.head()
Check Column Names
titanic_df.columns
Information about the dataset.
titanic_df.info()
Number of Passengers in each class
titanic_df.groupby('Pclass')['Pclass'].count()
Plot the classes

As we can see there are 3 classes of passengers in the Dataset. Class 1 has 216 passengers, Class 2 has 184 passengers and class 3 has 491 passengers. Now let’s plot classes using Seaborn.
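
The plotting call for this step is not shown in the text. A minimal sketch of what it might look like, assuming a recent seaborn release in which `catplot` is the figure-level categorical plot. The tiny `sample_df` is made-up stand-in data, not the Kaggle file, and the `Agg` backend is only there so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for titanic_df (the real data comes from train.csv)
sample_df = pd.DataFrame({"Pclass": [1, 1, 2, 3, 3, 3]})

# One bar per passenger class; bar height is the number of passengers
g = sns.catplot(x="Pclass", data=sample_df, kind="count", order=[1, 2, 3])
g.set_axis_labels("Class", "Passengers")

# Heights of the drawn bars, useful as a quick sanity check
bar_heights = [patch.get_height() for patch in g.ax.patches]
```

On the real data the same call would be made with `data=titanic_df`.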

Plot By Sex
titanic_df.groupby('Sex')['Sex'].count()

The output above gives the number of male and female passengers.

sns.catplot(x='Sex', data=titanic_df, kind='count', aspect=1.5)

Number of Women and Men in each passenger Class

# Number of men and women in each passenger class
titanic_df.groupby(['Sex', 'Pclass'])['Sex'].count()
# Again use seaborn to group by sex and class
g = sns.catplot(x='Pclass', data=titanic_df, hue='Sex', kind='count', aspect=1.75)
g.set_xlabels('Class')

Now let’s look at the number of males and females who survived the Titanic, grouped by class.

titanic_df.pivot_table(values='Survived', index='Sex', columns='Pclass', aggfunc='sum', margins=True)
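
To make the pivot table above easier to read, here is a small sketch on made-up data. The six-passenger `sample_df` is hypothetical and only borrows the column names from train.csv. Because Survived is coded 0/1, summing it counts survivors.

```python
import pandas as pd

# Hypothetical six-passenger sample with the same column names as train.csv
sample_df = pd.DataFrame({
    "Sex":      ["male", "female", "female", "male", "female", "male"],
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# Survived is 0/1, so summing it counts survivors in each Sex x Pclass cell;
# margins=True appends an "All" row and column holding the totals
survivors = sample_df.pivot_table(
    values="Survived", index="Sex", columns="Pclass",
    aggfunc="sum", margins=True,
)
print(survivors)
```

The "All" row and column are the quickest way to check that the per-cell counts add up to the overall number of survivors.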
not_survived = titanic_df[titanic_df['Survived'] == 0]
sns.catplot(x='Survived', data=titanic_df, kind='count')
len(not_survived)

Now let’s look at the number of people who didn’t survive in each class, grouped by sex.

not_survived.pivot_table(values='Survived', index='Sex', columns='Pclass', aggfunc=len, margins=True)
Passengers who survived and who didn’t survive grouped by class and sex
table = pd.crosstab(index=[titanic_df.Survived, titanic_df.Pclass], columns=[titanic_df.Sex, titanic_df.Embarked])
table.unstack()
table.columns, table.index
# set_levels returns a new index, so assign it back
table.columns = table.columns.set_levels(['Female', 'Male'], level=0)
table.columns = table.columns.set_levels(['Cherbourg', 'Queenstown', 'Southampton'], level=1)
print('Average and median age of passengers are %0.f and %0.f years old, respectively' % (titanic_df.Age.mean(), titanic_df.Age.median()))
# Rows with at least one missing value
titanic_df[titanic_df.isnull().any(axis=1)].head()
age = titanic_df['Age'].dropna()
age_dist = sns.histplot(age, kde=True)
age_dist.set_title("Distribution of Passengers' Ages")

Another way to plot a histogram of ages is shown below

titanic_df['Age'].hist(bins=50)

Now let’s check the datatypes of different columns in dataset

titanic_df['Parch'].dtype, titanic_df['SibSp'].dtype, len(titanic_df.Cabin.dropna())

Now let’s create a function to define those who are children.

def male_female_child(passenger):
    age, sex = passenger
    if age < 16:
        return 'child'
    else:
        return sex

titanic_df['person'] = titanic_df[['Age', 'Sex']].apply(male_female_child, axis=1)
titanic_df[:10]
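
The same labelling can also be done without a row-wise apply. The sketch below is one possible vectorized alternative using `np.where`, not the approach used in this post; `sample_df` is a hypothetical stand-in for titanic_df. It also shows what happens to passengers whose age is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real frame is titanic_df
sample_df = pd.DataFrame({
    "Age": [4.0, 30.0, np.nan, 15.0, 16.0],
    "Sex": ["male", "female", "male", "female", "male"],
})

# NaN < 16 evaluates to False, so passengers with a missing age keep
# their Sex label, exactly as in the row-wise function
sample_df["person"] = np.where(sample_df["Age"] < 16, "child", sample_df["Sex"])
print(sample_df["person"].tolist())
```

Note that a 16-year-old is not counted as a child, since the comparison is strict.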

Let’s draw a count plot of passengers split by sex, children and class

sns.catplot(x='Pclass', data=titanic_df, kind='count', hue='person', order=[1,2,3], hue_order=['child','female','male'], aspect=2)

Count number of men, women and children

titanic_df['person'].value_counts()

Do the same as above, but split the passengers into either survived or not.

sns.catplot(x='Pclass', data=titanic_df, kind='count', hue='person', col='Survived', order=[1,2,3], hue_order=['child','female','male'], aspect=1.25, height=5)

There are many more children in third class than in first and second class, although one might have expected more children in 1st and 2nd class than in 3rd class.

kde plot, Distribution of Passengers’ Ages

Grouped by Gender
fig = sns.FacetGrid(titanic_df, hue='Sex', aspect=4)
fig.map(sns.kdeplot, 'Age', fill=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.set(title='Distribution of Age Grouped by Gender')
fig.add_legend()
fig = sns.FacetGrid(titanic_df, hue='person', aspect=4)
fig.map(sns.kdeplot, 'Age', fill=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.add_legend()
Grouped By Class
fig = sns.FacetGrid(titanic_df, hue='Pclass', aspect=4)
fig.map(sns.kdeplot, 'Age', fill=True)
oldest = titanic_df['Age'].max()
fig.set(xlim=(0, oldest))
fig.set(title='Distribution of Age Grouped by Class')
fig.add_legend()

From the plot above, the ages in class 1 look roughly normally distributed, whereas the distributions for classes 2 and 3 are skewed towards 20- and 30-year-old passengers.
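
A visual impression of skew can be cross-checked with a number. The sketch below assumes pandas' built-in sample skewness is an acceptable summary and uses made-up ages, so the values say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical ages: class 1 is symmetric around 40, class 3 has most
# passengers near 20 plus one much older passenger (a long right tail)
sample_df = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3, 3],
    "Age":    [20.0, 40.0, 60.0, 18.0, 20.0, 22.0, 60.0],
})

# Sample skewness per class: about 0 for a symmetric distribution,
# positive when the tail extends towards older ages
age_skew = sample_df.groupby("Pclass")["Age"].skew()
print(age_skew)
```

On the real data, the equivalent call would be `titanic_df.groupby('Pclass')['Age'].skew()`; missing ages are skipped by default.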

What cabins did the Passengers stay in?
deck = titanic_df['Cabin'].dropna()
deck.head()

Grab the first letter of each cabin code

d = []
for c in deck:
    d.append(c[0])
d[0:10]
from collections import Counter
Counter(d)
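
The loop and Counter above can also be expressed with pandas string methods. This is a sketch of an alternative rather than the method used in this post, and the `cabins` Series is hypothetical sample data.

```python
import pandas as pd

# Hypothetical Cabin values, including missing ones as in train.csv
cabins = pd.Series(["C85", None, "E46", "C123", None, "G6"], name="Cabin")

# .str[0] takes the first character of each cabin code
deck_letters = cabins.dropna().str[0]

# value_counts plays the role of Counter(d)
deck_counts = deck_letters.value_counts()
print(deck_counts)
```

Because `deck_letters` keeps the original row index, it can later be joined back to the passengers it came from.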

Now let’s plot the cabin counts. First convert the d list into a data frame, then rename its column to Cabin.

cabin_df = DataFrame(d)
cabin_df.columns = ['Cabin']
sns.catplot(x='Cabin', data=cabin_df, kind='count', order=['A','B','C','D','E','F','G','T'], aspect=2, palette='winter_d')
# Drop the T cabin
cabin_df = cabin_df[cabin_df['Cabin'] != 'T']

Then replot the cabin counts as above

sns.catplot(x='Cabin', data=cabin_df, kind='count', order=['A','B','C','D','E','F','G'], aspect=2, palette='Greens_d')
# Below is a link to the list of matplotlib colormaps
url = 'http://matplotlib.org/api/pyplot_summary.html?highlight=colormaps#matplotlib.pyplot.colormaps'
import webbrowser
webbrowser.open(url)

Where did the passengers come from, i.e. at which port did they board the ship?

sns.catplot(x='Embarked', data=titanic_df, kind='count', hue='Pclass', hue_order=[1,2,3], aspect=2, order=['C','Q','S'])

From the figure above, one may conclude that almost all of the passengers who boarded at Queenstown were in third class. On the other hand, many who boarded at Cherbourg were in first class. The largest share of passengers boarded at Southampton, of whom 353 were in third class, 164 in second class and 127 in first class. In such cases, one may need to look at the economic situation in these towns at that time to understand why, for example, most passengers who boarded at Queenstown were in third class.

titanic_df.Embarked.value_counts()

For tabulated values, use the pandas crosstab method instead of a seaborn count plot

port = pd.crosstab(index=[titanic_df.Pclass], columns=[titanic_df.Embarked])
port.columns = ['Cherbourg', 'Queenstown', 'Southampton']
port
port.index
port.columns
port.index = ['First', 'Second', 'Third']
port
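
Raw counts make it hard to compare ports of different sizes. One option, sketched here on hypothetical data, is to let crosstab normalize each column so that every port shows the share of its passengers in each class.

```python
import pandas as pd

# Hypothetical sample with the same column names as train.csv
sample_df = pd.DataFrame({
    "Pclass":   [1, 3, 3, 1, 2, 3, 3, 1],
    "Embarked": ["C", "Q", "Q", "S", "S", "S", "S", "C"],
})

# normalize="columns" divides each column by its total, so every port sums to 1
share = pd.crosstab(index=sample_df["Pclass"], columns=sample_df["Embarked"],
                    normalize="columns")
print(share.round(2))
```

Read column by column, the table answers questions such as "what fraction of the passengers from this port were in third class?".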

Who was alone and who was with parents or siblings?

titanic_df[['SibSp','Parch']].head()
# Alone data frame, i.e. the passenger has no siblings or parents
# (.copy() avoids pandas' SettingWithCopyWarning when adding a column)
alone_df = titanic_df[(titanic_df['SibSp'] == 0) & (titanic_df['Parch'] == 0)].copy()
# Add the Alone column
alone_df['Alone'] = 'Alone'
# Not-alone data frame, i.e. the passenger has either a sibling or a parent
not_alone_df = titanic_df[(titanic_df['SibSp'] != 0) | (titanic_df['Parch'] != 0)].copy()
not_alone_df['Alone'] = 'With family'
# Merge the above data frames and sort by index
comb = [alone_df, not_alone_df]
titanic_df = pd.concat(comb).sort_index()
[len(alone_df), len(not_alone_df)]
alone_df.head()

Not Alone Dataframe

not_alone_df.head()
titanic_df.head()
""" Another way to perform the above
titanic_df['Alone'] = titanic_df.SibSp + titanic_df.Parch
titanic_df.loc[titanic_df['Alone'] > 0, 'Alone'] = 'With family'
titanic_df.loc[titanic_df['Alone'] == 0, 'Alone'] = 'Alone'
"""
fg = sns.catplot(x='Alone', data=titanic_df, kind='count', hue='Pclass', col='person', hue_order=[1,2,3], palette='Blues')
fg.set_xlabels('Status')
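
The Alone label can also be derived in a single step without splitting and re-merging the data frame. The following is a sketch of that idea using `np.where`; `sample_df` and `family_size` are illustrative names on made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; SibSp and Parch are the family-count columns from train.csv
sample_df = pd.DataFrame({
    "SibSp": [0, 1, 0, 2],
    "Parch": [0, 0, 3, 1],
})

# A passenger is alone only when both counts are zero
family_size = sample_df["SibSp"] + sample_df["Parch"]
sample_df["Alone"] = np.where(family_size == 0, "Alone", "With family")
print(sample_df["Alone"].tolist())
```

This keeps the column as plain strings from the start, so no numeric column is overwritten with labels.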

From the figure above, it is clear that most children traveled with family in third class. For men, most traveled alone in third class. On the other hand, the number of female passengers who traveled either with family or alone among the second and third class is comparable. However, more women traveled with family than alone in first class.

Factors Affecting Survival

Now let’s look at the factors that helped someone survive the sinking. We start this analysis by adding a new column to the titanic data frame: the map method maps the Survived column to the new column, with 0 becoming 'no' and 1 becoming 'yes'.

titanic_df['Survivor'] = titanic_df.Survived.map({0:'no', 1:'yes'})
titanic_df.head()
Class Factor

Survived vs. class Grouped by gender

sns.catplot(x='Pclass', y='Survived', hue='person', data=titanic_df, kind='point', order=[1,2,3], hue_order=['child','female','male'])

From the figure above, being male or traveling in third class reduces one’s chance of survival.
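
The survival rates behind a plot like this can also be tabulated directly. The sketch below uses a hypothetical eight-passenger sample and relies on the fact that the mean of a 0/1 column is the fraction of ones.

```python
import pandas as pd

# Hypothetical sample with the columns used in the plot
sample_df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "person":   ["female", "female", "male", "male",
                 "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column = fraction that survived in each group;
# unstack turns the person level into columns for easier reading
rates = sample_df.groupby(["Pclass", "person"])["Survived"].mean().unstack("person")
print(rates)
```

Each cell is a survival rate between 0 and 1, with one row per class and one column per person type.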

sns.catplot(x='Survivor', data=titanic_df, hue='Pclass', kind='count', palette='Pastel2', hue_order=[1,2,3], col='person')
Age Factor

Linear plot of age vs. survived

sns.lmplot(x='Age', y='Survived', data=titanic_df)

There seems to be a general linear trend between age and the Survived field. The plot shows that the older the passenger, the lower the chance of survival.

Survived vs. Age grouped by Sex
sns.lmplot(x='Age', y='Survived', data=titanic_df, hue='Sex')

Older women have a higher survival rate than older men, as shown in the figure above. Also, older women have a higher survival rate than younger women, the opposite of the trend for male passengers.

Survived vs. Age grouped by class
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=titanic_df, palette='winter', hue_order=[1,2,3])

In all three classes, the chance to survive reduced as the passengers got older.

# Create generation bins
generations = [10, 20, 40, 60, 80]
sns.lmplot(x='Age', y='Survived', hue='Pclass', data=titanic_df, x_bins=generations, hue_order=[1,2,3])
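
A tabular counterpart to the binned plot is to cut ages into intervals and average Survived within each. This sketch is a different technique from seaborn's `x_bins`, which, as far as I understand, treats a list as bin centers, whereas `pd.cut` takes bin edges. The edges and the sample data below are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample; the real frame is titanic_df
sample_df = pd.DataFrame({
    "Age":      [5.0, 8.0, 25.0, 35.0, 45.0, 70.0],
    "Survived": [1, 1, 1, 0, 0, 0],
})

# Assumed bin edges; with the default right=True each bin is (left, right]
edges = [0, 10, 20, 40, 60, 80]
sample_df["age_bin"] = pd.cut(sample_df["Age"], bins=edges)

# observed=False keeps empty bins in the result (their rate is NaN)
rate_by_bin = sample_df.groupby("age_bin", observed=False)["Survived"].mean()
print(rate_by_bin)
```

Empty bins show up as NaN rather than 0, which avoids mistaking "no passengers" for "no survivors".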

Deck Factor

titanic_df.columns
# .copy() so that adding the Deck column does not trigger a pandas warning
titanic_DF = titanic_df.dropna(subset=['Cabin']).copy()
d[0:10]
len(titanic_DF), len(d)
titanic_DF['Deck'] = d
titanic_DF = titanic_DF[titanic_DF.Deck != 'T']
titanic_DF.head()
sns.catplot(x='Deck', y='Survived', data=titanic_DF, kind='point', order=['A','B','C','D','E','F','G'])

There does not seem to be any relation between deck and the survival rate as shown in the above figure!

Family Status Factor
sns.catplot(x='Alone', y='Survived', data=titanic_df, kind='point', palette='winter')

The survival rate appears to drop significantly for those who were alone. However, let’s check whether gender or age plays a role. From the figure below, one may conclude that the survival rate for women and children is much higher than that of men, as was concluded previously and as anticipated. However, the difference in survival rate between those who were with family and those who were alone is not significant for either gender or for children. Moreover, the survival rate for women and children increases for those who were alone. For men, the survival rate diminishes slightly for those who were alone compared with those who were with family.

sns.catplot(x='Alone', y='Survived', data=titanic_df, kind='point', palette='winter', hue='person', hue_order=['child', 'female', 'male'])
# Let's split it by class now!
sns.catplot(x='Alone', y='Survived', data=titanic_df, kind='point', palette='summer', hue='person', hue_order=['child', 'female', 'male'], col='Pclass', col_order=[1,2,3])
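
When comparing survival rates across small groups, it helps to see how many passengers each rate is based on. A minimal sketch on hypothetical data, reporting the rate and the group size side by side:

```python
import pandas as pd

# Hypothetical sample with the columns used in the plots above
sample_df = pd.DataFrame({
    "Alone":    ["Alone", "Alone", "Alone",
                 "With family", "With family", "With family"],
    "person":   ["male", "male", "female", "male", "child", "child"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# mean = survival rate, size = number of passengers the rate is based on
summary = sample_df.groupby(["person", "Alone"])["Survived"].agg(["mean", "size"])
print(summary)
```

A rate computed from only a handful of passengers should be read with caution, which is one reason a difference between groups may not be significant.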
Author: Rev. Leonie Wyman
