Lecture 8 - Introduction to Data visualizations
We will begin soon! Until then, feel free to use the chat to socialize, and enjoy the music!
Class Outline#
Announcements (1 min)
Final Exam Information (1 min)
Method Chaining in Practice (10 mins)
Introduction to Data Visualizations (30 mins)
Exploratory Data Analysis (15 mins)
Principles of Effective Data Visualizations (20 mins)
Choosing an appropriate data visualization (15 mins)
Judicious use of Colours (3 mins)
Announcements#
Reminder: Test 3 will be this week, on Thursday
If you are feeling lost, confused, overwhelmed, or frustrated about Python or this class, come to my Student Hours today and on Thursday and I will stay as long as needed to get people back on track!
If you are doing fine, you’re still welcome to come and ask seaborn-related questions.
Final Exam Information#
Here are some details about the final exam.
The final exam will be a take home final exam, delivered as a GitHub Classroom assignment (similar to labs in this course)
There will be a 72-hour window during which you can start the exam at any time, but you must finish it before the window closes.
The final exam window will OPEN Sunday December 11th at 14:00 (2 PM)
The final exam window will CLOSE Wednesday December 14th at 14:00 (2 PM).
The exam must be submitted on Gradescope before the close of the final exam window.
Format: you will get to work with a dataset you have not seen before and be asked to do a comprehensive data analysis, some of it guided, and some of it unguided.
You will need to do all the steps of a Data Analysis.
You will be given some research questions to answer, based on the data.
You will also need to come up with your own research questions and answer them.
The exam will not be proctored or invigilated, but the same rules as the Tests apply: the exam must be done individually. It is open-book, open-notes, and open-web, but you may not communicate with other humans or use cheating websites such as Chegg, CourseHero, Slader, etc.
If you have questions about the exam during the window, you can post them on Ed Discussion as a Private question and I will respond. I will usually not be able to answer content-related questions.
Remember that you will need to accept a GH Classroom link (just like with the labs) and then submit your repository link once you are finished with the exam.
The GH Classroom link will be available on Canvas, inside a “Quiz” called “Final Exam”.
You will also need to commit to the repo and push to GitHub at various points during the exam (the exam will include instructions for when you should be committing and pushing).
The exam is designed to be completed in about 2.5 hours, but you will have the full 72-hour window to work on it. I highly recommend that you block out a chunk of time and finish the exam in one sitting. Just because I give you 72 hours does not mean you have to use all of it! You have other exams as well, so make sure you budget your time and energy accordingly.
Of course, you’re welcome to take breaks (food, sleep, bathroom breaks, etc.) as needed; you do not need to do the whole exam in one sitting.
You will need to make sure your JupyterLab installation is functioning, Git and Python are working, and all the packages used in the course are installed. This will be a required aspect of the final exam!
You will need to frequently commit your work using the Terminal (i.e. NOT GitHub Desktop or the web uploader!), so please make sure you know how to do that. It will be just like the labs and milestones; if you’ve been keeping up, I don’t expect you’ll have a problem.
The exam will cover everything in the course EXCEPT Tableau and Excel. You will be expected to demonstrate proficiency with the basic git commands while you are doing the exam, but there will be no specific questions about git.
Important: If you believe that the COSC 301 take-home final will conflict with your other scheduled exams, please contact me ASAP (before Oct. 28th, 2022) so we can work something out. A conflict occurs when you have 3 or more exams (including COSC 301) within the 72-hour window.
Method Chaining in Practice#
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
data = load_wine()  # load the wine dataset bundled with scikit-learn
# Method chaining begins
df = (
    pd.DataFrame(data.data, columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x["alcohol"] > 14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)
df
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
data = load_wine()
data
data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df.head()
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
df = df.rename(columns={"color_intensity": "ci"})
df
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | ci | hue | od280/od315_of_diluted_wines | proline |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
173 | 13.71 | 5.65 | 2.45 | 20.5 | 95.0 | 1.68 | 0.61 | 0.52 | 1.06 | 7.70 | 0.64 | 1.74 | 740.0 |
174 | 13.40 | 3.91 | 2.48 | 23.0 | 102.0 | 1.80 | 0.75 | 0.43 | 1.41 | 7.30 | 0.70 | 1.56 | 750.0 |
175 | 13.27 | 4.28 | 2.26 | 20.0 | 120.0 | 1.59 | 0.69 | 0.43 | 1.35 | 10.20 | 0.59 | 1.56 | 835.0 |
176 | 13.17 | 2.59 | 2.37 | 20.0 | 120.0 | 1.65 | 0.68 | 0.53 | 1.46 | 9.30 | 0.60 | 1.62 | 840.0 |
177 | 14.13 | 4.10 | 2.74 | 24.5 | 96.0 | 2.05 | 0.76 | 0.56 | 1.35 | 9.20 | 0.61 | 1.60 | 560.0 |
178 rows × 13 columns
df["color_filter"] = np.where((df["hue"] > 1) & (df["ci"] > 7), 1, 0)
df.head()
# Inspired by:
# https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | ci | hue | od280/od315_of_diluted_wines | proline | color_filter |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 | 0 |
2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.80 | 3.24 | 0.30 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
3 | 14.37 | 1.95 | 2.50 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.80 | 0.86 | 3.45 | 1480.0 | 0 |
4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.80 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |
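To make the np.where step above easier to follow, here is a minimal sketch on a tiny hand-made table. The values and the names toy and both are made up for illustration and are not part of the wine dataset. It shows how the two comparisons are combined into one boolean mask before np.where turns that mask into 1s and 0s.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values (an assumption for illustration only)
toy = pd.DataFrame({"hue": [1.2, 0.9, 1.1, 1.3], "ci": [7.5, 8.0, 6.0, 9.1]})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than >.
both = (toy["hue"] > 1) & (toy["ci"] > 7)

# np.where(condition, value_if_true, value_if_false)
toy["color_filter"] = np.where(both, 1, 0)

print(toy)
```

Only the rows where both conditions hold get a 1; a row that satisfies just one of them gets a 0.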
df = (
    pd.DataFrame(data.data, columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x["alcohol"] > 14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)
df
| | alcohol | ci | hue |
---|---|---|---|
0 | 14.83 | 5.20 | 1.08 |
1 | 14.75 | 5.40 | 1.25 |
2 | 14.39 | 5.25 | 1.02 |
3 | 14.38 | 4.90 | 1.04 |
4 | 14.38 | 7.50 | 1.20 |
5 | 14.37 | 7.80 | 0.86 |
6 | 14.34 | 13.00 | 0.57 |
7 | 14.30 | 6.20 | 1.07 |
8 | 14.23 | 5.64 | 1.04 |
9 | 14.22 | 5.10 | 0.89 |
10 | 14.22 | 6.38 | 0.94 |
11 | 14.21 | 5.24 | 0.87 |
12 | 14.20 | 6.75 | 1.05 |
13 | 14.19 | 8.70 | 1.23 |
14 | 14.16 | 9.70 | 0.62 |
15 | 14.13 | 9.20 | 0.61 |
16 | 14.12 | 5.00 | 1.17 |
17 | 14.10 | 6.20 | 1.07 |
18 | 14.10 | 5.75 | 1.25 |
19 | 14.06 | 5.65 | 1.09 |
20 | 14.06 | 5.05 | 1.06 |
21 | 14.02 | 4.70 | 1.04 |
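As a quick check that the chained version and the step-by-step version do the same thing, the sketch below runs both on a small hand-made DataFrame and compares the results. The data values and the names raw, step, and chained are assumptions chosen for illustration; using a toy table also means the example runs without scikit-learn installed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wine data (made-up values, illustration only)
raw = pd.DataFrame(
    {
        "alcohol": [14.2, 13.1, 14.8, 14.5],
        "color_intensity": [7.5, 8.0, 6.0, 9.1],
        "hue": [1.2, 0.9, 1.1, 1.3],
    }
)

# Step-by-step version: one intermediate assignment per operation
step = raw.rename(columns={"color_intensity": "ci"})
step["color_filter"] = np.where((step["hue"] > 1) & (step["ci"] > 7), 1, 0)
step = step[step["alcohol"] > 14]
step = step.sort_values("alcohol", ascending=False)
step = step.reset_index(drop=True)
step = step[["alcohol", "ci", "hue"]]

# Chained version: the same operations, each applied to the previous result
chained = (
    raw
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x["hue"] > 1) & (x["ci"] > 7), 1, 0))
    .loc[lambda x: x["alcohol"] > 14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)

print(chained.equals(step))
```

Inside the chain, the lambda receives the DataFrame as it exists at that point, which is why it can refer to the renamed ci column. Note also that color_filter is computed and then dropped by the final column selection, in this sketch as in the chain above.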
Import your own functions#
See example in this Demo Repository
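One possible way to package a method chain for reuse is to wrap it in a function. This is only a sketch and may differ from what the Demo Repository does: the function name process_wine, the parameter min_alcohol, and the file name project_functions.py are all hypothetical.

```python
import pandas as pd


def process_wine(raw: pd.DataFrame, min_alcohol: float = 14) -> pd.DataFrame:
    """Rename, filter, sort, and select columns in one method chain.

    Hypothetical helper: the name and the min_alcohol parameter are
    assumptions for illustration, not taken from the course materials.
    """
    return (
        raw
        .rename(columns={"color_intensity": "ci"})
        .loc[lambda x: x["alcohol"] > min_alcohol]
        .sort_values("alcohol", ascending=False)
        .reset_index(drop=True)
        .loc[:, ["alcohol", "ci", "hue"]]
    )


# If the function above were saved in a file named project_functions.py
# (hypothetical name) next to the notebook, it could be imported with:
#     from project_functions import process_wine

# Toy data with made-up values, used only to exercise the function
demo = pd.DataFrame(
    {
        "alcohol": [14.2, 13.1, 14.8, 14.5],
        "color_intensity": [7.5, 8.0, 6.0, 9.1],
        "hue": [1.2, 0.9, 1.1, 1.3],
    }
)

result = process_wine(demo)
print(result)
```

Keeping the chain in a function means the notebook only needs one import and one call, and the threshold can be changed without editing the chain itself.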
Introduction to Data Visualizations#
Slides available for download here and here.
from IPython.display import IFrame
IFrame("../../../Class8A.pdf", width=900, height=800)
from IPython.display import IFrame
IFrame("../../../Class8C.pdf", width=900, height=800)