Lecture 9A - Motivating the need for EDA

Lecture 9A - Motivating the need for EDA

We will begin soon! Until then, feel free to use the chat to socialize, and enjoy the music!

../../../_images/viz2.jpg


Photo by RODNAE Productions from Pexels
Firas Moosvi

Class Outline

  1. Motivating Exploratory Data Analyses (30 mins)

Motivating the need for EDA

bullet_data.csv is available here: https://github.com/firasm/bits/raw/master/bullet_data.csv

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport

sns.set_theme(style="white",
              font_scale=1.3)
df = pd.read_csv('https://github.com/firasm/bits/raw/master/bullet_data.csv')
df.head()
x y bullet zone
0 0 0 0.0 OutsidePlane
1 0 1 0.0 OutsidePlane
2 0 2 0.0 OutsidePlane
3 0 3 0.0 OutsidePlane
4 0 4 0.0 OutsidePlane
# Use our standard tool first:

df.describe().T
count mean std min 25% 50% 75% max
x 87500.0 124.500000 72.168619 0.0 62.0 124.5 187.0 249.0
y 87500.0 174.500000 101.036462 0.0 87.0 174.5 262.0 349.0
bullet 68526.0 0.008741 0.093086 0.0 0.0 0.0 0.0 1.0
# Use the advanced profiling tool next: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/

ProfileReport(df).to_notebook_iframe()

mmm… well that’s not super helpful.

describe() didn’t quite organize the data like the way we wanted, and profile_report is just overkill…

Let’s try and figure out some more info manually.

print("The zones are: {0}".format(sorted(set(df['zone']))),"\n")

print("Columns are: {0}".format(list(df.columns)),"\n")

print("Values for 'bullet' column is {0}".format(sorted(df['bullet'].unique())),"\n")
The zones are: ['A', 'B', 'C', 'D', 'E', 'OutsidePlane', 'Unknown'] 

Columns are: ['x', 'y', 'bullet', 'zone'] 

Values for 'bullet' column is [0.0, nan, 1.0] 

Let’s wrangle the data a bit to try and see what’s going on:

# First, only consider the bullet 'hits':

hits_df = df[df['bullet']==1]
hits_df.sample(5)
x y bullet zone
40103 114 203 1.0 B
41846 119 196 1.0 B
41500 118 200 1.0 B
27223 77 273 1.0 C
28154 80 154 1.0 B
# Then, let's groupby the "zone" and look at the resulting dataframe
# I have "reset" the index of the groupby object so we can have a continuous index
summary = hits_df.groupby('zone').count().reset_index()
summary
zone x y bullet
0 A 83 83 83
1 B 259 259 259
2 C 83 83 83
3 D 47 47 47
4 E 111 111 111
5 Unknown 16 16 16
# Now let's visualize the table above:

sns.countplot(data=hits_df, 
              y='zone', order = sorted(set(df['zone'])),color='skyblue')
plt.ylabel('')
plt.title('Bullet hit count by Airplane Zone')
plt.xlabel('Bullet hits')
sns.despine()
# Another Visualization

df['outline'] = np.where(df['zone']=='OutsidePlane',0,1)

sns.heatmap(data=df.pivot('x','y','outline'),cmap='Greys')
plt.axis('off')
(0.0, 350.0, 250.0, 0.0)
sns.heatmap(data=df.pivot('x','y','bullet'),cmap='Spectral')
plt.axis('off')
(0.0, 350.0, 250.0, 0.0)

Debrief

And that’s it!