Lecture 8A - Introduction to Data visualizations¶

We will begin soon! Until then, feel free to use the chat to socialize, and enjoy the music!

Firas Moosvi

Class Outline¶

Announcements (5 mins)
Final Exam Information (5 mins)
Introduction to Data Visualizations
Importance of Data Visualizations

Announcements¶

Milestone 3 has been extended by 48 hours.
- New deadline is Saturday Oct. 30, 2021 (+ 48 hour grace period if you need it)
- Only submit your work once you are ready to receive feedback!
- I encourage you to submit early so Prajeet can start giving you feedback right away!

Reminder: Bonus Test 2 will be available later this week if you missed Test 2 or didn’t do well on it.

If you are feeling lost, confused, overwhelmed, or frustrated about Python or this class, come to my Student Hours today and on Wednesday and I will stay as long as needed to get people back on track!
- If you are doing fine, you’re still welcome to come and ask seaborn-related questions.

Final Exam Information¶

Here are some details about the final exam.

The final exam will be a take-home exam, delivered as a GitHub Classroom assignment (similar to the labs in this course).
There will be a 24 hour window during which you can start the exam at any time, and you must end it within the window.
The final exam window will OPEN Sunday, Dec 12 at 18:00.
The final exam window will CLOSE Monday, Dec 13 at 18:00.
Format: you will get to work with a dataset you haven’t seen before and be asked to do a comprehensive data analysis, some of it guided, and some of it unguided.
- You will need to do all the steps of a Data Analysis.
- You will be given some research questions to answer, based on the data.
- You will also need to come up with your own research questions and answer them.
The exam will not be proctored or invigilated, but the same rules as the Test apply.
Remember that you will need to accept a GH Classroom link (just like with the labs) and then submit your repository link once you are finished with the exam.
The GH Classroom link will be available on Canvas, inside a “Quiz” called “Final Exam”.

Final Exam Details Continued¶

You will also need to commit to the repo and push to GitHub at various points during the exam (I will have instructions in the exam for when you should be committing, and pushing).
The exam is designed to be completed in 2.5 - 3 hours, but you will have the full 24-hour window to spend on it. I highly recommend that you block out a chunk of time and finish the exam in one sitting. You have other exams to deal with as well, and just because I give you 24 hours does not mean you have to use all of it!
You will need to make sure your JupyterLab installation is functioning, Git and Python are working, and all the packages used in the course are installed. This will be a required aspect of the final exam!
You will need to frequently commit your work using the Terminal (i.e. NOT GitHub Desktop or the web uploader!), so please make sure you know how to do that. It will be just like the labs and milestones; if you’ve been keeping up, I don’t expect you’ll have a problem.
The exam will cover everything in the course EXCEPT Tableau and Excel. You will be expected to demonstrate proficiency with the basic git commands while you are doing the exam, but there are no specific questions about git.
I will post an Ed Discussion note about the Final Exam. If you have questions about it, you can ask them there.

General Rules for the Test¶

Read them carefully!

You must complete the test BY YOURSELF (no friends, no tutors, no classmates, no humans - cats and dogs in the room are fine).
You will not be able to ask us questions during the test - do your best with your best interpretation of the question.
The test is open-book, open-notes, open-web EXCEPT you CANNOT use websites that help you cheat such as Chegg, Course Hero, Slader and other similar websites that have tutors answering questions you upload (Stack Overflow is allowed).
Using Google to search for concepts is NOT cheating. For example, you can search for definitions of terms and for commands.
If you accidentally come across the same or a similar test question on Google, resist the temptation to keep reading, and just close your browser tab.
You can also use any code editor, JupyterLab, etc. to run and test your code.
Any form of communication with other humans, terrestrial or extraterrestrial, is not allowed (Discord, Slack, WhatsApp, Terminal, Signal, iMessage, SMS, MMS, etc…) and IS CHEATING.
Do NOT share test questions with anyone - that IS CHEATING.
Do not be anxious about the test!
Overall, do not stress! You will be fine :-)

Introduction to Data Visualizations¶

Slides available for download here.

from IPython.display import IFrame
IFrame("../../../Class8A.pdf", width=900, height=800)

Optional: Method Chaining in practice¶

import pandas as pd
import numpy as np
from sklearn.datasets import load_wine

data = load_wine() # load the wine dataset that ships with scikit-learn

# Method chaining begins

df = (   
    pd.DataFrame(data.data,columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x['alcohol']>14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)

df

import pandas as pd
import numpy as np
from sklearn.datasets import load_wine

data = load_wine() 

data.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

df = pd.DataFrame(data['data'],columns = data['feature_names'])
df.head()

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	color_intensity	hue	od280/od315_of_diluted_wines	proline
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0

df = df.rename(columns={'color_intensity':'ci'})
df

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	ci	hue	od280/od315_of_diluted_wines	proline
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...
173	13.71	5.65	2.45	20.5	95.0	1.68	0.61	0.52	1.06	7.70	0.64	1.74	740.0
174	13.40	3.91	2.48	23.0	102.0	1.80	0.75	0.43	1.41	7.30	0.70	1.56	750.0
175	13.27	4.28	2.26	20.0	120.0	1.59	0.69	0.43	1.35	10.20	0.59	1.56	835.0
176	13.17	2.59	2.37	20.0	120.0	1.65	0.68	0.53	1.46	9.30	0.60	1.62	840.0
177	14.13	4.10	2.74	24.5	96.0	2.05	0.76	0.56	1.35	9.20	0.61	1.60	560.0

178 rows × 13 columns

df['color_filter'] = np.where((df['hue'] > 1) & (df['ci'] > 7), 1, 0)
df.head()
# Inspired by:
# https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/

	alcohol	malic_acid	ash	alcalinity_of_ash	magnesium	total_phenols	flavanoids	nonflavanoid_phenols	proanthocyanins	ci	hue	od280/od315_of_diluted_wines	proline	color_filter
0	14.23	1.71	2.43	15.6	127.0	2.80	3.06	0.28	2.29	5.64	1.04	3.92	1065.0	0
1	13.20	1.78	2.14	11.2	100.0	2.65	2.76	0.26	1.28	4.38	1.05	3.40	1050.0	0
2	13.16	2.36	2.67	18.6	101.0	2.80	3.24	0.30	2.81	5.68	1.03	3.17	1185.0	0
3	14.37	1.95	2.50	16.8	113.0	3.85	3.49	0.24	2.18	7.80	0.86	3.45	1480.0	0
4	13.24	2.59	2.87	21.0	118.0	2.80	2.69	0.39	1.82	4.32	1.04	2.93	735.0	0
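
The wine rows displayed in these notes all get a `color_filter` of 0, so it can be hard to see what the condition does. Here is a minimal sketch on a tiny DataFrame where some rows do satisfy both conditions. The name `toy` and its values are made up for illustration and are not taken from the wine dataset.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only (not from the wine dataset)
toy = pd.DataFrame({"hue": [1.20, 0.86, 1.23, 1.04], "ci": [7.50, 7.80, 8.70, 5.64]})

# 1 where BOTH conditions hold, 0 otherwise.
# Each condition needs its own parentheses because & binds tighter than >.
toy["color_filter"] = np.where((toy["hue"] > 1) & (toy["ci"] > 7), 1, 0)

print(toy["color_filter"].tolist())  # [1, 0, 1, 0]
```

Only the first and third rows have both `hue > 1` and `ci > 7`, so only those rows get a 1.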

df = (   
    pd.DataFrame(data.data,columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x['alcohol']>14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)

df

	alcohol	ci	hue
0	14.83	5.20	1.08
1	14.75	5.40	1.25
2	14.39	5.25	1.02
3	14.38	4.90	1.04
4	14.38	7.50	1.20
5	14.37	7.80	0.86
6	14.34	13.00	0.57
7	14.30	6.20	1.07
8	14.23	5.64	1.04
9	14.22	5.10	0.89
10	14.22	6.38	0.94
11	14.21	5.24	0.87
12	14.20	6.75	1.05
13	14.19	8.70	1.23
14	14.16	9.70	0.62
15	14.13	9.20	0.61
16	14.12	5.00	1.17
17	14.10	6.20	1.07
18	14.10	5.75	1.25
19	14.06	5.65	1.09
20	14.06	5.05	1.06
21	14.02	4.70	1.04
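
One way to convince yourself that a chain does the same thing as the step-by-step version is to run both and compare. This is a minimal sketch, assuming a small made-up DataFrame (`raw`, hypothetical values) in place of the wine data so that it runs without scikit-learn. Unlike the chain above, it keeps the `color_filter` column instead of dropping it at the end.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the wine data (values are made up)
raw = pd.DataFrame({
    "alcohol": [14.23, 13.20, 14.38, 14.19],
    "color_intensity": [5.64, 4.38, 7.50, 8.70],
    "hue": [1.04, 1.05, 1.20, 1.23],
})

# Step-by-step version: each line overwrites the intermediate result
steps = raw.rename(columns={"color_intensity": "ci"})
steps["color_filter"] = np.where((steps["hue"] > 1) & (steps["ci"] > 7), 1, 0)
steps = steps[steps["alcohol"] > 14]
steps = steps.sort_values("alcohol", ascending=False)
steps = steps.reset_index(drop=True)

# Chained version: the same operations, with no intermediate variables
chained = (
    raw
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x["hue"] > 1) & (x["ci"] > 7), 1, 0))
    .loc[lambda x: x["alcohol"] > 14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
)

print(chained.equals(steps))        # True
print(chained["alcohol"].tolist())  # [14.38, 14.23, 14.19]
```

The `lambda x: ...` inside `.assign` and `.loc` refers to the DataFrame as it exists at that point in the chain, which is why the renamed `ci` column can be used right away.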

Import your own functions¶

See example in this Demo Repository
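
As a rough sketch of the idea (this is not the contents of the demo repository), a method chain can be wrapped in a function, saved in a `.py` file, and imported like any other module. The module name `project_functions` and the function name `load_and_process` are hypothetical. The sketch writes the file to a temporary folder only so that it is self-contained; in a project you would create the file by hand in your repository.

```python
import sys
import tempfile
import textwrap
from pathlib import Path

import pandas as pd

# Hypothetical module contents; normally this lives in its own .py file
module_source = textwrap.dedent('''
    import pandas as pd

    def load_and_process(df):
        """Keep rows with alcohol above 14, sorted from highest to lowest."""
        return (
            df
            .loc[lambda x: x["alcohol"] > 14]
            .sort_values("alcohol", ascending=False)
            .reset_index(drop=True)
        )
''')

# Write the module to a temporary folder and make that folder importable
module_dir = Path(tempfile.mkdtemp())
(module_dir / "project_functions.py").write_text(module_source)
sys.path.insert(0, str(module_dir))

import project_functions  # imported just like pandas or numpy

toy = pd.DataFrame({"alcohol": [13.2, 14.4, 14.1]})  # made-up values
result = project_functions.load_and_process(toy)
print(result["alcohol"].tolist())  # [14.4, 14.1]
```

Keeping the processing steps in one function means every notebook in the project can reuse the same cleaned data instead of copying the chain around.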

DATA 301
