Lecture 4A - Introduction to Data visualizations & EDA¶

We will begin at around 12:00 PM! Until then, feel free to use the chat to socialize, and enjoy the music!



Photo by RODNAE Productions from Pexels
July 27, 2021
Firas Moosvi

Announcements¶

  • Labs 4, 5, and 6 are available for you to look at and accept

  • From now on, we will try to give feedback as you submit things, rather than waiting until after the deadline + grace period.

  • If you are feeling lost or frustrated about Python, come to Wednesday’s “get back on the bus” class - I will try to catch people up and go over some common issues I’m seeing.

    • For those of you who are doing fine, you’re still welcome to come and ask seaborn-related questions.

Class Outline¶

  1. Announcements (5 mins)

  2. Final Exam Information

  3. Introduction to Lab 4 (5 mins)

  4. Introduction to Milestone 2 (10 mins)

    • Method Chaining

    • Importing your own Python functions → Demo Repository

  5. Introduction to Data Visualizations

  6. Importance of Data Visualizations

  7. Introduction to Exploratory Data Analyses

Final Exam Details¶

Here are some details about the final exam.

  • The final exam will be a take-home exam, delivered as a GitHub Classroom assignment (similar to the labs in this course)

  • There will be a 48 hour window during which you can start the exam at any time, and you must end it within the window.

  • The final exam window will OPEN Sunday, Aug 15 at 18:00 (on Gradescope).

  • The final exam window will CLOSE Tuesday, Aug 17 at 18:00.

  • Format: you will get to work with a dataset you haven’t seen before and be asked to do a comprehensive data analysis, some of it guided, and some of it unguided.

    • You will need to do all the steps of a Data Analysis.

    • You will be given some research questions to answer, based on the data.

    • You will also need to come up with your own research questions and answer them.

  • The exam will not be proctored or invigilated, but the same rules as the Test apply.

  • Remember that you will need to accept a GH Classroom link (just like with the labs) and then submit your repository link once you are finished with the exam.

Final Exam Details Continued¶

  • You will find the link to accept the GH Classroom link on Gradescope.

  • You will also need to commit to the repo and push to GitHub at various points during the exam (I will have instructions in the exam for when you should be committing, and pushing).

  • The exam is designed to be completed in 2–2.5 hours, but you will have the full 48-hour window to spend on it. I highly recommend that you block out a chunk of time and finish the exam in one sitting. You have other exams to deal with as well, and just because I give you 48 hours does not mean you have to use all of it!

  • You will need to make sure your JupyterLab installation is functioning, Git and Python are working, and all the packages used in the course are installed. This will be a required aspect of the final exam!

  • You will need to frequently commit your work using the Terminal (i.e. NOT GitHub Desktop or the web uploader!), so please make sure you know how to do that. It will be just like the labs and milestones; if you’ve been keeping up, I don’t expect you’ll have a problem.

  • The exam will cover everything in the course EXCEPT Tableau and Excel. With git, you will be expected to demonstrate proficiency with the basic commands while you are doing the exam, but there are no specific questions about git.

  • I will post an Ed Discussion note about the Final Exam; if you have questions about it, you can ask them there.
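Since committing from the Terminal is required, the basic loop is worth rehearsing ahead of time. Here is a self-contained sketch that practices it in a throwaway local repository (the file name, commit message, and identity are placeholders; during the exam you run the same `add`/`commit` steps inside your cloned GitHub Classroom repo and finish with `git push`):

```shell
# Practice the commit workflow in a disposable local repo.
# (Everything here is a placeholder; in the exam, work inside
# your cloned GitHub Classroom repository instead.)
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "student@example.com"   # placeholder identity
git config user.name "Student"

echo "analysis work" > analysis.ipynb         # stand-in for your notebook
git add analysis.ipynb                        # stage the file you changed
git commit -q -m "Complete guided analysis section"
git log --oneline                             # shows the new commit
```

In the real exam repo you would finish with `git push` to send the commits to GitHub; that step is omitted here because this demo repo has no remote.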

General Rules for the Test¶

Read them carefully!

  • You will have 90 minutes to complete the test (unless you have an accommodation from the DRC).

  • You must complete the test BY YOURSELF (no friends, no tutors, no classmates, no humans - cats and dogs in the room are fine).

  • You will not be able to ask us questions during the test - do your best with your interpretation of the question.

  • The test is open-book, open-notes, open-web EXCEPT you CANNOT use websites that help you cheat such as Chegg, Course Hero, Slader and other similar websites that have tutors answering questions you upload (Stack Overflow is allowed).

  • Using Google to search for concepts is NOT cheating. For example, you can search for definitions of terms and commands.

  • If you accidentally come across the same or a similar test question on Google, resist the temptation to keep reading, and just close your browser tab.

  • You can also use any code editor, or JupyterLab, etc to run and test your code.

  • Any form of communication with other humans, terrestrial or extraterrestrial, is not allowed (Discord, Slack, WhatsApp, Telegram, Signal, iMessage, SMS, MMS, etc.) and IS CHEATING.

  • Do NOT share test questions with anyone - that IS CHEATING.

  • Do not be anxious about the test! If you don’t do well - review the material and try again next week - we will take the better of the two scores!

  • Overall, do not stress! You will be fine :-)

Introduction to Lab 4¶

Introduction to Milestone 2¶

Task 1. Set up an “Analysis Pipeline” (20%)¶

Common steps of a Data Analysis Pipeline¶

Here are some common steps of an analysis pipeline (the order isn’t set, and not all elements are necessary):

  1. Load Data

    • Check file types and encodings.

    • Check delimiters (space, comma, tab).

    • Skip rows and columns as needed.

  2. Clean Data

    • Remove columns not being used.

    • Deal with “incorrect” data.

    • Deal with missing data.

  3. Process Data

    • Create any new columns needed that are combinations or aggregates of other columns (examples include weighted averages, categorizations, groups, etc.).

    • Find-and-replace operations (examples include replacing the string ‘Strongly Agree’ with the number 5).

    • Other substitutions as needed.

    • Deal with outliers.

  4. Wrangle Data

    • Restructure data format (columns and rows).

    • Merge other data sources into your dataset.

  5. Exploratory Data Analysis

  6. Data Analysis

  7. Export reports/data analyses and visualizations
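As a rough sketch of steps 1–4 in pandas (the inline CSV, column names, and Likert mapping below are invented for illustration, not from a course dataset):

```python
import io

import pandas as pd

# 1. Load: a toy tab-delimited file with a junk first row to skip
raw = "junk line\ncourse\tresponse\trating\nDS100\tStrongly Agree\t\nDS100\tAgree\t4\n"
df = pd.read_csv(io.StringIO(raw), sep="\t", skiprows=1)

# 2. Clean: drop an unused column and rows with missing data
df = df.drop(columns=["course"]).dropna(subset=["rating"])

# 3. Process: find-and-replace, e.g. map a Likert string to a number
df["response"] = df["response"].replace({"Strongly Agree": 5, "Agree": 4})

# 4. Wrangle: restructure the columns into long format
tidy = df.melt(value_vars=["response", "rating"])
print(tidy)
```

In Milestone 2 these steps get wrapped up into a single method-chained pipeline, as shown below.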

Method chaining¶

import pandas as pd
import numpy as np
from sklearn.datasets import load_wine

data = load_wine() # load the built-in wine dataset

# Method chaining begins

df = (   
    pd.DataFrame(data.data,columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x['alcohol']>14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)

df
data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
df = pd.DataFrame(data['data'],columns = data['feature_names'])
df.head()
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0
df = df.rename(columns={'color_intensity':'ci'})
df
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins ci hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0
... ... ... ... ... ... ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740.0
174 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750.0
175 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835.0
176 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840.0
177 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560.0

178 rows × 13 columns

# Add a 0/1 flag column based on two conditions
# Inspired by:
# https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
df['color_filter'] = np.where((df['hue'] > 1) & (df['ci'] > 7), 1, 0)
df.head()
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins ci hue od280/od315_of_diluted_wines proline color_filter
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 0
df = (   
    pd.DataFrame(data.data,columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x['alcohol']>14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)
df
alcohol ci hue
0 14.83 5.20 1.08
1 14.75 5.40 1.25
2 14.39 5.25 1.02
3 14.38 4.90 1.04
4 14.38 7.50 1.20
5 14.37 7.80 0.86
6 14.34 13.00 0.57
7 14.30 6.20 1.07
8 14.23 5.64 1.04
9 14.22 5.10 0.89
10 14.22 6.38 0.94
11 14.21 5.24 0.87
12 14.20 6.75 1.05
13 14.19 8.70 1.23
14 14.16 9.70 0.62
15 14.13 9.20 0.61
16 14.12 5.00 1.17
17 14.10 6.20 1.07
18 14.10 5.75 1.25
19 14.06 5.65 1.09
20 14.06 5.05 1.06
21 14.02 4.70 1.04

Import your own functions¶

See example in this Demo Repository
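The idea, roughly: move reusable helpers into a `.py` file next to your notebook and import them. The sketch below writes a tiny module to disk so it can run on its own; the module and function names (`project_functions`, `load_and_process`) are placeholders, not the actual names from the Demo Repository:

```python
import sys
from pathlib import Path

# Write a tiny module to disk so this example is self-contained;
# in a real project you'd create project_functions.py by hand.
Path("project_functions.py").write_text(
    "def load_and_process(path):\n"
    "    # placeholder for your real method-chained pipeline\n"
    "    return str(path).upper()\n"
)

sys.path.insert(0, ".")  # make sure the current folder is importable

# Import your own function just like any library function
from project_functions import load_and_process

print(load_and_process("data/raw.csv"))  # DATA/RAW.CSV
```

Keeping pipeline code in a module like this keeps your analysis notebook short and lets several notebooks share one tested implementation.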

Introduction to Data Visualizations (Back at 1 PM)¶

Slides available for download here.

from IPython.display import IFrame
IFrame("../../../Class4A.pdf", width=900, height=800)

Motivating the need for EDA¶

bullet_data.csv is available here: https://github.com/firasm/bits/raw/master/bullet_data.csv

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport

sns.set_theme(style="white",
              font_scale=1.3)
df = pd.read_csv('https://github.com/firasm/bits/raw/master/bullet_data.csv')
df.head()
# Use our standard tool first:

df.describe().T
# Use the advanced profiling tool next: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/

ProfileReport(df).to_notebook_iframe()



Hmm… well, that’s not super helpful.

describe() didn’t quite organize the data the way we wanted, and profile_report is overkill here.


Let’s try and figure out some more info manually.

print("The zones are: {0}".format(sorted(set(df['zone']))),"\n")

print("Columns are: {0}".format(list(df.columns)),"\n")

print("Values for 'bullet' column are {0}".format(sorted(df['bullet'].unique())),"\n")

Let’s wrangle the data a bit to try and see what’s going on:

# First, only consider the bullet 'hits':

hits_df = df[df['bullet']==1]
hits_df.sample(5)
# Then, let's groupby the "zone" and look at the resulting dataframe
# I have "reset" the index of the groupby object so we can have a continuous index
summary = hits_df.groupby('zone').count().reset_index()
summary
zone x y bullet
0 A 83 83 83
1 B 259 259 259
2 C 83 83 83
3 D 47 47 47
4 E 111 111 111
5 Unknown 16 16 16
# Now let's visualize the table above:

sns.countplot(data=hits_df, 
              y='zone', order = sorted(set(df['zone'])),color='skyblue')
plt.ylabel('')
plt.title('Bullet hit count by Airplane Zone')
plt.xlabel('Bullet hits')
sns.despine()
# Another Visualization

# Flag whether each (x, y) point lies inside the plane outline
df['outline'] = np.where(df['zone'] == 'OutsidePlane', 0, 1)

# Plane outline
sns.heatmap(data=df.pivot(index='x', columns='y', values='outline'), cmap='Greys')
plt.axis('off')
plt.show()

# Bullet hits on the same grid
sns.heatmap(data=df.pivot(index='x', columns='y', values='bullet'), cmap='Spectral')
plt.axis('off')
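What `pivot` is doing here, in miniature (toy 2×2 data mimicking the long format of `bullet_data.csv`; the values are made up):

```python
import pandas as pd

# Long format: one row per (x, y) grid cell, like bullet_data.csv
long_df = pd.DataFrame({
    "x": [0, 0, 1, 1],
    "y": [0, 1, 0, 1],
    "bullet": [1, 0, 0, 1],
})

# pivot reshapes long -> wide: x values become the row index,
# y values become the columns, and 'bullet' fills the grid
grid = long_df.pivot(index="x", columns="y", values="bullet")
print(grid)
```

The resulting wide grid is exactly the matrix shape that `sns.heatmap` expects.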

Debrief¶

And that’s it!