Lecture 9A - Motivating the need for EDA¶


Firas Moosvi

Class Outline¶

Motivating Exploratory Data Analyses (30 mins)

Motivating the need for EDA¶

bullet_data.csv is available here: https://github.com/firasm/bits/raw/master/bullet_data.csv

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
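# Note: pandas-profiling has since been renamed ydata-profiling;
# on newer installs use `from ydata_profiling import ProfileReport` instead.
from pandas_profiling import ProfileReport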

sns.set_theme(style="white",
              font_scale=1.3)

df = pd.read_csv('https://github.com/firasm/bits/raw/master/bullet_data.csv')
df.head()

	y	zone
0	0	OutsidePlane
1	1	OutsidePlane
2	2	OutsidePlane
3	3	OutsidePlane
4	4	OutsidePlane

# Use our standard tool first:

df.describe().T

	count	mean	std	min	25%	50%	75%	max
x	87500.0	124.500000	72.168619	0.0	62.0	124.5	187.0	249.0
y	87500.0	174.500000	101.036462	0.0	87.0	174.5	262.0	349.0
bullet	68526.0	0.008741	0.093086	0.0	0.0	0.0	0.0	1.0

# Use the advanced profiling tool next: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/

ProfileReport(df).to_notebook_iframe()

Hmm… well, that’s not super helpful.

describe() didn’t organize the data the way we wanted, and ProfileReport is just overkill…

Let’s try and figure out some more info manually.

print("The zones are: {0}".format(sorted(set(df['zone']))),"\n")

print("Columns are: {0}".format(list(df.columns)),"\n")

print("Values in the 'bullet' column are: {0}".format(sorted(df['bullet'].unique())),"\n")

The zones are: ['A', 'B', 'C', 'D', 'E', 'OutsidePlane', 'Unknown'] 

Columns are: ['x', 'y', 'bullet', 'zone'] 

Values in the 'bullet' column are: [0.0, nan, 1.0] 
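The 'bullet' column contains nan alongside 0.0 and 1.0, so it can help to count how often each value occurs and where the missing values fall. Here is a minimal sketch on a small made-up frame (the `toy` data below is hypothetical and only mirrors the column names of bullet_data.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same columns as bullet_data.csv
toy = pd.DataFrame({
    "x": [0, 0, 1, 1, 2, 2],
    "y": [0, 1, 0, 1, 0, 1],
    "bullet": [np.nan, 0.0, 1.0, 0.0, 1.0, np.nan],
    "zone": ["OutsidePlane", "A", "A", "B", "B", "OutsidePlane"],
})

# Count every distinct value, keeping NaN as its own category
bullet_counts = toy["bullet"].value_counts(dropna=False)
print(bullet_counts)

# Count the missing 'bullet' values within each zone
missing_by_zone = toy["bullet"].isna().groupby(toy["zone"]).sum()
print(missing_by_zone)
```

On the real data, the same two calls could be run on `df` in place of `toy`.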

Let’s wrangle the data a bit to try and see what’s going on:

# First, only consider the bullet 'hits':

hits_df = df[df['bullet']==1]
hits_df.sample(5)

	x	y	bullet	zone
40103	114	203	1.0	B
41846	119	196	1.0	B
41500	118	200	1.0	B
27223	77	273	1.0	C
28154	80	154	1.0	B

# Then, let's groupby the "zone" and look at the resulting dataframe
# I have "reset" the index of the groupby object so we can have a continuous index
summary = hits_df.groupby('zone').count().reset_index()
summary

	zone	x	y	bullet
0	A	83	83	83
1	B	259	259	259
2	C	83	83	83
3	D	47	47	47
4	E	111	111	111
5	Unknown	16	16	16
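In the table above, count() repeats the same number in every column because it counts non-null entries per column. If only one count per zone is needed, groupby(...).size() is one alternative. A small sketch, using a hypothetical `hits` frame rather than the real hits_df:

```python
import pandas as pd

# Hypothetical stand-in for hits_df (rows where bullet == 1)
hits = pd.DataFrame({
    "x": [10, 11, 12, 13, 14],
    "y": [20, 21, 22, 23, 24],
    "bullet": [1.0, 1.0, 1.0, 1.0, 1.0],
    "zone": ["A", "B", "B", "C", "B"],
})

# One row per zone with a single 'hits' column, largest count first
hits_per_zone = (
    hits.groupby("zone")
        .size()
        .reset_index(name="hits")
        .sort_values("hits", ascending=False)
)
print(hits_per_zone)
```

Here "hits" is just an illustrative column name; any label can be passed to reset_index(name=...).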

# Now let's visualize the table above:

sns.countplot(data=hits_df,
              y='zone',
              order=sorted(set(df['zone'])),
              color='skyblue')
plt.ylabel('')
plt.title('Bullet hit count by Airplane Zone')
plt.xlabel('Bullet hits')
sns.despine()

# Another Visualization

df['outline'] = np.where(df['zone']=='OutsidePlane',0,1)

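# pivot() arguments are keyword-only in pandas 2.x, so name them explicitly
sns.heatmap(data=df.pivot(index='x', columns='y', values='outline'), cmap='Greys')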
plt.axis('off')

(0.0, 350.0, 250.0, 0.0)

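# pivot() arguments are keyword-only in pandas 2.x, so name them explicitly
sns.heatmap(data=df.pivot(index='x', columns='y', values='bullet'), cmap='Spectral')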
plt.axis('off')

(0.0, 350.0, 250.0, 0.0)
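Both heatmaps rely on pivot to turn the long table (one row per x, y pair) into a 2D grid that sns.heatmap can draw. A minimal sketch of that reshaping step, on a hypothetical 2-by-3 grid (`long_df` is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (x, y) cell
long_df = pd.DataFrame({
    "x": [0, 0, 0, 1, 1, 1],
    "y": [0, 1, 2, 0, 1, 2],
    "bullet": [np.nan, 0.0, 1.0, 0.0, 1.0, np.nan],
})

# Rows are indexed by x, columns by y, and each cell holds the 'bullet' value
grid = long_df.pivot(index="x", columns="y", values="bullet")
print(grid)
```

Cells that are NaN stay NaN in the grid; as far as I know, sns.heatmap leaves such cells blank rather than coloring them.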

Debrief¶

And that’s it!

DATA 301
