Lecture 9A - Motivating the need for EDA¶
We will begin soon! Until then, feel free to use the chat to socialize, and enjoy the music!

Firas Moosvi
Class Outline¶
Motivating Exploratory Data Analyses (30 mins)
Motivating the need for EDA¶
bullet_data.csv is available here: https://github.com/firasm/bits/raw/master/bullet_data.csv
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
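# Note: pandas_profiling has since been renamed ydata-profiling; on newer installs use:
# from ydata_profiling import ProfileReport
from pandas_profiling import ProfileReport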
sns.set_theme(style="white", font_scale=1.3)
df = pd.read_csv('https://github.com/firasm/bits/raw/master/bullet_data.csv')
df.head()
| | x | y | bullet | zone |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.0 | OutsidePlane |
| 1 | 0 | 1 | 0.0 | OutsidePlane |
| 2 | 0 | 2 | 0.0 | OutsidePlane |
| 3 | 0 | 3 | 0.0 | OutsidePlane |
| 4 | 0 | 4 | 0.0 | OutsidePlane |
# Use our standard tool first:
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| x | 87500.0 | 124.500000 | 72.168619 | 0.0 | 62.0 | 124.5 | 187.0 | 249.0 |
| y | 87500.0 | 174.500000 | 101.036462 | 0.0 | 87.0 | 174.5 | 262.0 | 349.0 |
| bullet | 68526.0 | 0.008741 | 0.093086 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
# Use the advanced profiling tool next: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/
ProfileReport(df).to_notebook_iframe()
Hmm… well, that's not super helpful. describe() didn't organize the data the way we wanted (it skips the non-numeric zone column entirely), and profile_report is just overkill…
Let’s try and figure out some more info manually.
# Note: sorted() cannot order NaN reliably, which is why nan lands mid-list in the output
print("The zones are: {0}".format(sorted(set(df['zone']))),"\n")
print("Columns are: {0}".format(list(df.columns)),"\n")
print("Values for the 'bullet' column are {0}".format(sorted(df['bullet'].unique())),"\n")
The zones are: ['A', 'B', 'C', 'D', 'E', 'OutsidePlane', 'Unknown']
Columns are: ['x', 'y', 'bullet', 'zone']
Values for the 'bullet' column are [0.0, nan, 1.0]
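The bullet column holds three kinds of values: 0.0, 1.0, and missing (NaN). Listing the unique values says nothing about how often each one occurs, so here is a rough sketch of how you might count them, missing values included. The series below is a made-up stand-in for df['bullet'], not the real data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'bullet' column: 1.0, 0.0, or NaN (missing)
bullet = pd.Series([0.0, np.nan, 1.0, 0.0, np.nan, 1.0, 0.0], name="bullet")

# dropna=False keeps NaN as its own category instead of silently dropping it
counts = bullet.value_counts(dropna=False)

# Count the missing entries directly
n_missing = int(bullet.isna().sum())

# Drop NaN before sorting: comparisons with NaN are always False,
# so sorted() can leave it anywhere in the list
observed_values = sorted(bullet.dropna().unique())

print(counts)
print(n_missing, observed_values)
```

On the real data, the same calls on df['bullet'] would show how many rows have no value recorded, which the count column of describe() only hints at.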
Let’s wrangle the data a bit to try and see what’s going on:
# First, only consider the bullet 'hits':
hits_df = df[df['bullet']==1]
hits_df.sample(5)
| | x | y | bullet | zone |
|---|---|---|---|---|
| 40103 | 114 | 203 | 1.0 | B |
| 41846 | 119 | 196 | 1.0 | B |
| 41500 | 118 | 200 | 1.0 | B |
| 27223 | 77 | 273 | 1.0 | C |
| 28154 | 80 | 154 | 1.0 | B |
# Then, let's group by 'zone' and count the rows in each group
# reset_index() turns 'zone' back into a regular column, so the result has a continuous 0..n index
summary = hits_df.groupby('zone').count().reset_index()
summary
| | zone | x | y | bullet |
|---|---|---|---|---|
| 0 | A | 83 | 83 | 83 |
| 1 | B | 259 | 259 | 259 |
| 2 | C | 83 | 83 | 83 |
| 3 | D | 47 | 47 | 47 |
| 4 | E | 111 | 111 | 111 |
| 5 | Unknown | 16 | 16 | 16 |
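The table above repeats the same number three times because count() reports the non-missing entries in every remaining column. If all you want is one hit count per zone, size() is one possible alternative. A small sketch on made-up data (`toy_hits` and the column name `hits` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy version of hits_df: every row is a bullet hit
toy_hits = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [5, 4, 3, 2, 1],
    "bullet": [1.0] * 5,
    "zone": ["A", "B", "B", "C", "B"],
})

# count(): one column of non-null counts per remaining column
per_column = toy_hits.groupby("zone").count().reset_index()

# size(): a single number of rows per group, named 'hits' here
per_zone = toy_hits.groupby("zone").size().reset_index(name="hits")

print(per_column)
print(per_zone)
```

The two agree whenever a group has no missing values; they differ only when some column contains NaN, since count() skips those entries and size() does not.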
# Now let's visualize the table above:
sns.countplot(data=hits_df, y='zone',
              order=sorted(set(df['zone'])), color='skyblue')
plt.ylabel('')
plt.title('Bullet hit count by Airplane Zone')
plt.xlabel('Bullet hits')
sns.despine()
# Another Visualization
df['outline'] = np.where(df['zone']=='OutsidePlane',0,1)
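# pivot() needs keyword arguments (positional ones were removed in pandas 2.0)
sns.heatmap(data=df.pivot(index='x', columns='y', values='outline'), cmap='Greys')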
plt.axis('off')
(0.0, 350.0, 250.0, 0.0)
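# pivot() needs keyword arguments (positional ones were removed in pandas 2.0)
sns.heatmap(data=df.pivot(index='x', columns='y', values='bullet'), cmap='Spectral')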
plt.axis('off')
(0.0, 350.0, 250.0, 0.0)
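Both heatmaps rely on reshaping the long table (one row per x, y pair) into a 2D grid, which is the job pivot() does. Here is a minimal sketch of that reshaping step alone, on a made-up 2 x 3 grid, written with the keyword arguments that pandas 2.0 and later require. The frame `toy` and its zone labels are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical 2 x 3 grid in the same long format as bullet_data.csv
toy = pd.DataFrame({
    "x": [0, 0, 0, 1, 1, 1],
    "y": [0, 1, 2, 0, 1, 2],
    "zone": ["OutsidePlane", "A", "A", "OutsidePlane", "B", "OutsidePlane"],
})

# 1 where the pixel belongs to the plane, 0 where it is outside
toy["outline"] = np.where(toy["zone"] == "OutsidePlane", 0, 1)

# Rows come from 'x', columns from 'y', cell values from 'outline'
grid = toy.pivot(index="x", columns="y", values="outline")

print(grid)
```

Because x becomes the row index, x runs down the vertical axis of the heatmap and y runs across it, which matches the 250 by 350 extent reported by plt.axis('off') above.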
Debrief¶
And that’s it!