Lecture 9 - Choosing an appropriate visualization & EDA¶

Announcements (5 mins)¶

Schedule changes for next week
- Labs/Office hours on Wed/Thurs/Fri cancelled next week
- Test 4 window will be moved to Sunday 6PM - Tuesday 6 PM
Material for next two weeks is shuffled a bit (see Canvas announcement)
Prepare for next week by requesting your free Tableau for Students License
Heads up: Approaching “end of Term crunch”! Stay on top of your deadlines!

Part 1: Choosing an appropriate data visualization (15 mins)¶

See Lecture Slides below…

Part 2: Motivating the need for EDA¶

bullet_data.csv is available here: https://github.com/firasm/bits/raw/master/bullet_data.csv

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

sns.set_theme(style="white",
              font_scale=1.3)

df = pd.read_csv('https://github.com/firasm/bits/raw/master/bullet_data.csv')
df.head()

	x	y	bullet	zone
0	0	0	0.0	OutsidePlane
1	0	1	0.0	OutsidePlane
2	0	2	0.0	OutsidePlane
3	0	3	0.0	OutsidePlane
4	0	4	0.0	OutsidePlane

# Use our standard tool first:

df.describe().T

	count	mean	std	min	25%	50%	75%	max
x	87500.0	124.500000	72.168619	0.0	62.0	124.5	187.0	249.0
y	87500.0	174.500000	101.036462	0.0	87.0	174.5	262.0	349.0
bullet	68526.0	0.008741	0.093086	0.0	0.0	0.0	0.0	1.0

mmm… well that’s not super helpful. Let’s try and figure out some more info manually

print("The zones are: {0}".format(sorted(set(df['zone']))),"\n")

print("Columns are: {0}".format(list(df.columns)),"\n")

print("Values for 'bullet' column is either 1 or NA","\n")

The zones are: ['A', 'B', 'C', 'D', 'E', 'OutsidePlane', 'Unknown'] 

Columns are: ['x', 'y', 'bullet', 'zone'] 

Values for 'bullet' column is either 1 or NA 

Let’s wrangle the data a bit to try and see what’s going on:

# First, only consider the bullet 'hits':

hits_df = df[df['bullet']==1]
hits_df.sample(5)

	x	y	bullet	zone
55479	158	179	1.0	D
28477	81	127	1.0	B
42537	121	187	1.0	D
27572	78	272	1.0	C
43224	123	174	1.0	D

# Then, let's groupby the "zone" and look at the resulting dataframe
# I have "reset" the index of the groupby object so we can have a continuous index

summary = hits_df.groupby('zone').count().reset_index()
summary

	zone	x	y	bullet
0	A	83	83	83
1	B	259	259	259
2	C	83	83	83
3	D	47	47	47
4	E	111	111	111
5	Unknown	16	16	16

# Now let's visualize the table above:

sns.countplot(data=hits_df, 
              y='zone', order = sorted(set(df['zone'])),color='skyblue')
plt.ylabel('')
plt.title('Bullet hit count by Airplane Zone')
plt.xlabel('Bullet hits')
sns.despine()

../../_images/Lecture9_14_0.png

df['outline'] = np.where(df['zone']=='OutsidePlane',0,1)

sns.heatmap(data=df.pivot('x','y','outline'),cmap='Greys')
plt.axis('off')

(0.0, 350.0, 250.0, 0.0)

../../_images/Lecture9_15_1.png

sns.heatmap(data=df.pivot('x','y','bullet'),cmap='Spectral')
plt.axis('off')

(0.0, 350.0, 250.0, 0.0)

../../_images/Lecture9_16_1.png

Part 3: Judicious use of Colours¶

See Lecture Slides below…

Lecture Slides¶

Videos Note about Labs 5 and 6