Lecture 4A - Introduction to Data Visualizations & EDA¶
We will begin at around 12:00 PM! Until then, feel free to use the chat to socialize, and enjoy the music!

Firas Moosvi
Announcements¶
Labs 4, 5, and 6 are available for you to look at and accept
From now on, we will try to give feedback as you submit things, rather than waiting until after the deadline + grace period.
If you are feeling lost or frustrated about Python, come to Wednesday's class for a "get back on the bus" session - I will try to catch people up and go over some common issues I'm seeing.
For those of you who are doing fine, you're still welcome to come and ask seaborn-related questions.
Class Outline¶
Announcements (5 mins)
Final Exam Information
Introduction to Lab 4 (5 mins)
Introduction to Milestone 2 (10 mins)
Method Chaining
Importing your own Python functions → Demo Repository
Introduction to Data Visualizations
Importance of Data Visualizations
Introduction to Exploratory Data Analyses
Final Exam Details¶
Here are some details about the final exam.
The final exam will be a take home final exam, delivered as a GitHub Classroom assignment (similar to labs in this course)
There will be a 48 hour window during which you can start the exam at any time, and you must end it within the window.
The final exam window will OPEN Sunday, Aug 15 at 18:00 (on Gradescope).
The final exam window will CLOSE Tuesday, Aug 17 at 18:00.
Format: you will get to work with a dataset you havenât seen before and be asked to do a comprehensive data analysis, some of it guided, and some of it unguided.
You will need to do all the steps of a Data Analysis.
You will be given some research questions to answer, based on the data.
You will also need to come up with your own research questions and answer them.
The exam will not be proctored or invigilated, but the same rules as the Test apply.
Remember that you will need to accept a GH Classroom link (just like with the labs) and then submit your repository link once you are finished with the exam.
Final Exam Details Continued¶
You will find the link to accept the GH Classroom link on Gradescope.
You will also need to commit to the repo and push to GitHub at various points during the exam (I will have instructions in the exam for when you should be committing, and pushing).
The exam is designed to be completed in 2-2.5 hours, but you will have the full 48-hour window to spend on it. I highly recommend that you block out a chunk of time and finish the exam in one sitting. You have other exams to deal with as well, and just because I give you 48 hours does not mean you have to use all of it!!
You will need to make sure your JupyterLab installation is functioning, Git and Python are working, and all the packages used in the course are installed. This will be a required aspect of the final exam!
You will need to frequently commit your work using the Terminal (i.e. NOT GitHub Desktop, or the web uploader!), so please make sure you know how to do that. It will be just like the labs and milestones; if you've been keeping up, I don't expect you'll have a problem.
The exam will cover everything in the course EXCEPT Tableau and Excel. With git, you will be expected to demonstrate proficiency with the basic commands while you are doing the exam, but there are no specific questions about git.
I will post an Ed Discussion note about the Final Exam, if you have questions about it, you can ask them there.
General Rules for the Test¶
Read them carefully!
You will have 90 minutes to complete the test (unless you have an accommodation from the DRC).
You must complete the test BY YOURSELF (no friends, no tutors, no classmates, no humans - cats and dogs in the room are fine).
You will not be able to ask us questions during the test - do your best with your interpretation of each question.
The test is open-book, open-notes, open-web EXCEPT you CANNOT use websites that help you cheat, such as Chegg, Course Hero, Slader, and other similar sites where tutors answer questions you upload (Stack Overflow is allowed).
Using Google to search for concepts is NOT cheating. For example, you can search for definitions of terms and commands.
If you accidentally come across the same or a similar test question on Google, resist the temptation to keep reading, and just close your browser tab.
You can also use any code editor, or JupyterLab, etc to run and test your code.
Any form of communication with other humans, terrestrial or extraterrestrial, is not allowed (Discord, Slack, WhatsApp, Terminal, Signal, iMessage, SMS, MMS, etc.) and IS CHEATING.
Do NOT share test questions with anyone - that IS CHEATING.
Do not be anxious about the test! If you don't do well, review the material and try again next week - we will take the better of the two scores!
Overall, do not stress! You will be fine :-)
Introduction to Milestone 2¶
Task 1. Set up an "Analysis Pipeline" (20%)¶
Common steps of a Data Analysis Pipeline¶
Here are some common steps of an analysis pipeline (the order isn't set, and not all elements are necessary):
Load Data
Check file types and encodings.
Check delimiters (space, comma, tab).
Skip rows and columns as needed.
Clean Data
Remove columns not being used.
Deal with "incorrect" data.
Deal with missing data.
Process Data
Create any new columns needed that are combinations or aggregates of other columns (examples include weighted averages, categorizations, groups, etc.).
Find-and-replace operations (for example, replacing the string "Strongly Agree" with the number 5).
Other substitutions as needed.
Deal with outliers.
Wrangle Data
Restructure data format (columns and rows).
Merge other data sources into your dataset.
Exploratory Data Analysis
Data Analysis
Export reports/data analyses and visualizations
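The Load/Clean/Process steps above can be sketched as a minimal pandas pipeline. The file contents, column names, and the "Strongly Agree" recoding below are made-up placeholders for illustration, not the Milestone 2 data:

```python
import io
import pandas as pd

# Hypothetical raw file: tab-delimited, with a survey column to recode
raw = io.StringIO(
    "name\tresponse\tscore\textra\n"
    "ann\tStrongly Agree\t91\tx\n"
    "bob\t\t85\tx\n"
)

# 1. Load: check the delimiter explicitly
df = pd.read_csv(raw, sep="\t")

# 2. Clean: drop a column we aren't using, then drop rows with missing responses
df = df.drop(columns=["extra"]).dropna(subset=["response"])

# 3. Process: find-and-replace, e.g. "Strongly Agree" -> 5
df["response_num"] = df["response"].replace({"Strongly Agree": 5})

print(df)
```

Each step here maps onto one bullet in the list above; in a real pipeline these would usually be chained or wrapped in functions.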
Method chaining¶
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()  # load the wine dataset bundled with scikit-learn

# Method chaining begins
df = (
    pd.DataFrame(data.data, columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x["alcohol"] > 14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)
df
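If scikit-learn isn't installed, the same chaining pattern can be tried on a hand-made DataFrame. The numbers below are invented stand-ins for the wine data, just to show the chain running end to end:

```python
import pandas as pd
import numpy as np

# A tiny, made-up stand-in for the wine data
toy = pd.DataFrame({
    "alcohol": [14.5, 13.9, 14.2],
    "color_intensity": [7.5, 6.0, 5.1],
    "hue": [1.2, 0.9, 1.1],
})

out = (
    toy
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x["alcohol"] > 14]      # keep only high-alcohol rows
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
)
print(out)
```

Each line returns a new DataFrame, so the next method in the chain picks up where the previous one left off; no intermediate variables are needed.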
data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df.head()
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
df = df.rename(columns={'color_intensity':'ci'})
df
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | ci | hue | od280/od315_of_diluted_wines | proline | color_filter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 | 0 |
174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 | 0 |
175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 | 0 |
176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 | 0 |
177 | 14.13 | 4.10 | 2.74 | 24.5 | 96.0 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560.0 | 0 |
178 rows Ă 14 columns
# Inspired by:
# https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
df['color_filter'] = np.where((df['hue'] > 1) & (df['ci'] > 7), 1, 0)
df.head()
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | ci | hue | od280/od315_of_diluted_wines | proline | color_filter |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
df = (
    pd.DataFrame(data.data, columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x["alcohol"] > 14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)
df
| | alcohol | ci | hue |
|---|---|---|---|
0 | 14.83 | 5.20 | 1.08 |
1 | 14.75 | 5.40 | 1.25 |
2 | 14.39 | 5.25 | 1.02 |
3 | 14.38 | 4.90 | 1.04 |
4 | 14.38 | 7.50 | 1.20 |
5 | 14.37 | 7.80 | 0.86 |
6 | 14.34 | 13.00 | 0.57 |
7 | 14.30 | 6.20 | 1.07 |
8 | 14.23 | 5.64 | 1.04 |
9 | 14.22 | 5.10 | 0.89 |
10 | 14.22 | 6.38 | 0.94 |
11 | 14.21 | 5.24 | 0.87 |
12 | 14.20 | 6.75 | 1.05 |
13 | 14.19 | 8.70 | 1.23 |
14 | 14.16 | 9.70 | 0.62 |
15 | 14.13 | 9.20 | 0.61 |
16 | 14.12 | 5.00 | 1.17 |
17 | 14.10 | 6.20 | 1.07 |
18 | 14.10 | 5.75 | 1.25 |
19 | 14.06 | 5.65 | 1.09 |
20 | 14.06 | 5.05 | 1.06 |
21 | 14.02 | 4.70 | 1.04 |
Import your own functions¶
See example in this Demo Repository
Introduction to Data Visualizations (Back at 1 PM)¶
Slides available for download here.
from IPython.display import IFrame
IFrame("../../../Class4A.pdf", width=900, height=800)
Motivating the need for EDA¶
bullet_data.csv is available here: https://github.com/firasm/bits/raw/master/bullet_data.csv
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
from pandas_profiling import ProfileReport
sns.set_theme(style="white",
font_scale=1.3)
df = pd.read_csv('https://github.com/firasm/bits/raw/master/bullet_data.csv')
df.head()
# Use our standard tool first:
df.describe().T
# Use the advanced profiling tool next: https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/
ProfileReport(df).to_notebook_iframe()
Hmm… well, that's not super helpful. describe() didn't quite organize the data the way we wanted, and ProfileReport is just overkill here.
Let's try to figure out some more info manually.
print("The zones are: {0}".format(sorted(set(df['zone']))),"\n")
print("Columns are: {0}".format(list(df.columns)),"\n")
print("Values for the 'bullet' column are {0}".format(sorted(df['bullet'].unique())),"\n")
Let's wrangle the data a bit to try and see what's going on:
# First, only consider the bullet 'hits':
hits_df = df[df['bullet']==1]
hits_df.sample(5)
# Then, let's groupby the "zone" and look at the resulting dataframe
# I have "reset" the index of the groupby object so we can have a continuous index
summary = hits_df.groupby('zone').count().reset_index()
summary
| | zone | x | y | bullet |
|---|---|---|---|---|
0 | A | 83 | 83 | 83 |
1 | B | 259 | 259 | 259 |
2 | C | 83 | 83 | 83 |
3 | D | 47 | 47 | 47 |
4 | E | 111 | 111 | 111 |
5 | Unknown | 16 | 16 | 16 |
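The groupby-then-count pattern above can be reproduced on a tiny made-up frame (the zone labels and counts here are invented, not the bullet data):

```python
import pandas as pd

# Minimal stand-in for hits_df: one row per bullet hit
toy = pd.DataFrame({
    "zone": ["A", "B", "B", "A", "C"],
    "bullet": [1, 1, 1, 1, 1],
})

# Count rows per zone; reset_index turns the group labels back into a column
counts = toy.groupby("zone").count().reset_index()
print(counts)
```

Since .count() counts non-null entries in every remaining column, each column of the result holds the same per-zone row counts, which is why the summary table above repeats the same numbers across x, y, and bullet.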
# Now let's visualize the table above:
sns.countplot(data=hits_df,
              y='zone', order=sorted(set(df['zone'])), color='skyblue')
plt.ylabel('')
plt.title('Bullet hit count by Airplane Zone')
plt.xlabel('Bullet hits')
sns.despine()
# Another Visualization
df['outline'] = np.where(df['zone'] == 'OutsidePlane', 0, 1)

sns.heatmap(data=df.pivot(index='x', columns='y', values='outline'), cmap='Greys')
plt.axis('off')

sns.heatmap(data=df.pivot(index='x', columns='y', values='bullet'), cmap='Spectral')
plt.axis('off')
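The pivot step is what turns the long (x, y, value) table into the grid the heatmap needs; a minimal sketch on toy coordinates (the values below are made up):

```python
import pandas as pd

# Long format: one row per (x, y) coordinate
toy = pd.DataFrame({
    "x": [0, 0, 1, 1],
    "y": [0, 1, 0, 1],
    "bullet": [1, 0, 0, 1],
})

# Wide format: one row per x, one column per y, cells filled from 'bullet'
grid = toy.pivot(index="x", columns="y", values="bullet")
print(grid)
```

Note that recent pandas versions require the keyword form (index=, columns=, values=); the old positional call pivot('x', 'y', 'bullet') no longer works.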
Debrief¶
And that's it!