Lecture 8 - Introduction to Data Visualizations#

We will begin soon! Until then, feel free to use the chat to socialize, and enjoy the music!

[Image: viz.jpg. Photo by RODNAE Productions from Pexels]
Firas Moosvi

Class Outline#

  1. Announcements (1 min)

  2. Final Exam Information (1 min)

  3. Method Chaining in Practice (10 mins)

  4. Introduction to Data Visualizations (30 mins)

  5. Exploratory Data Analysis (15 mins)

  6. Principles of Effective Data Visualizations (20 mins)

  7. Choosing an appropriate data visualization (15 mins)

  8. Judicious use of Colours (3 mins)

Announcements#

  • Reminder: Test 3 will be this week, on Thursday

  • If you are feeling lost, confused, overwhelmed, or frustrated about Python or this class, come to my Student Hours today and on Thursday and I will stay as long as needed to get people back on track!

    • If you are doing fine, you’re still welcome to come and ask seaborn-related questions.

Final Exam Information#

Here are some details about the final exam.

  • The final exam will be a take-home exam, delivered as a GitHub Classroom assignment (similar to the labs in this course).

  • There will be a 72-hour window. You can start the exam at any time during the window, but you must finish it before the window closes.

  • The final exam window will OPEN Sunday December 11th at 14:00 (2 PM)

  • The final exam window will CLOSE Wednesday December 14th at 14:00 (2 PM).

  • The exam must be submitted on Gradescope before the close of the final exam window.

  • Format: you will get to work with a dataset you have not seen before and be asked to do a comprehensive data analysis, some of it guided, and some of it unguided.

    • You will need to do all the steps of a Data Analysis.

    • You will be given some research questions to answer, based on the data.

    • You will also need to come up with your own research questions and answer them.

  • The exam will not be proctored or invigilated, but the same rules as the Tests apply: the exam must be done individually and on your own. It is open-book, open-notes, and open-web, but you may not communicate with other humans or use cheating websites such as Chegg, CourseHero, Slader, etc.

  • If you have questions about the Exam during the window, you can post them on Ed Discussion as a Private question and I will respond. I will usually not be able to answer content-related questions.

  • Remember that you will need to accept a GH Classroom link (just like with the labs) and then submit your repository link once you are finished with the exam.

  • The GH Classroom link will be available on Canvas, inside a “Quiz” called “Final Exam”.

  • You will also need to commit to the repo and push to GitHub at various points during the exam (the exam will include instructions for when you should commit and push).

  • The exam is designed to be completed in about 2.5 hours, but you will have the full 72-hour window to work on it. I highly recommend that you block out a chunk of time and finish the exam in one sitting. Just because I give you 72 hours does not mean you have to use all of it! You have other exams as well, so budget your time and energy accordingly.

  • Of course, you’re welcome to take breaks (food, sleep, bathroom) as needed; you do not have to do the whole exam in one sitting.

  • You will need to make sure your JupyterLab installation is functioning, Git and Python are working, and all the packages used in the course are installed. This will be a required aspect of the final exam!

  • You will need to commit your work frequently using the Terminal (i.e. NOT GitHub Desktop or the web uploader!), so please make sure you know how to do that. It will be just like the labs and milestones; if you’ve been keeping up, I don’t expect you’ll have a problem.

  • The exam will cover everything in the course EXCEPT Tableau and Excel. With git, you will be expected to demonstrate proficiency with the basic commands while you are doing the Exam, but there will be no specific questions about git.

Important: If you believe that the COSC 301 take-home final will conflict with your other scheduled exams, please contact me ASAP (Before Oct. 28th, 2022) so we can work something out. A conflict will occur when you have 3 or more exams (including COSC 301) within the 72-hour window.
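
One way to confirm your setup before the exam window opens is a short check script. This is a minimal sketch, not an official course tool. The package list is an assumption: pandas, numpy, seaborn, and scikit-learn appear in this lecture, while matplotlib and jupyterlab are guesses, so edit the list to match what the labs actually used.

```python
import importlib.util
import shutil

# Assumed package list: edit it to match the packages used in your labs.
REQUIRED_PACKAGES = ["pandas", "numpy", "seaborn", "sklearn", "matplotlib", "jupyterlab"]


def check_environment(packages=REQUIRED_PACKAGES):
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    # find_spec returns None when a top-level module cannot be located.
    status = {name: importlib.util.find_spec(name) is not None for name in packages}
    # shutil.which returns None when the executable is not on the PATH.
    status["git (command line)"] = shutil.which("git") is not None
    return status


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name:20s} {'OK' if found else 'MISSING'}")
```

Anything reported as MISSING is worth fixing well before the window opens, since a missing package shows up as a `ModuleNotFoundError` as soon as you try to import it.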

Method Chaining in Practice#

import pandas as pd
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()  # load the wine dataset that ships with scikit-learn

# Method chaining begins

df = (
    pd.DataFrame(data.data, columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})  # shorten the column name
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))  # 1 where hue > 1 and ci > 7, else 0
    .loc[lambda x: x["alcohol"] > 14]  # keep rows with alcohol above 14
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)  # renumber rows after filtering and sorting
    .loc[:, ["alcohol", "ci", "hue"]]  # keep only three columns
)

df
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine

data = load_wine()

data
data.keys()
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
df = pd.DataFrame(data["data"], 
                  columns=data["feature_names"])
df.head()
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0
df = df.rename(columns={"color_intensity": "ci"})
df
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins ci hue od280/od315_of_diluted_wines proline
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0
... ... ... ... ... ... ... ... ... ... ... ... ... ...
173 13.71 5.65 2.45 20.5 95.0 1.68 0.61 0.52 1.06 7.70 0.64 1.74 740.0
174 13.40 3.91 2.48 23.0 102.0 1.80 0.75 0.43 1.41 7.30 0.70 1.56 750.0
175 13.27 4.28 2.26 20.0 120.0 1.59 0.69 0.43 1.35 10.20 0.59 1.56 835.0
176 13.17 2.59 2.37 20.0 120.0 1.65 0.68 0.53 1.46 9.30 0.60 1.62 840.0
177 14.13 4.10 2.74 24.5 96.0 2.05 0.76 0.56 1.35 9.20 0.61 1.60 560.0

178 rows × 13 columns
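
The reassignment `df = df.rename(...)` matters because `rename` returns a new DataFrame by default and leaves the original untouched. The sketch below illustrates this with a hypothetical two-row frame whose values are copied from the first two rows shown above.

```python
import pandas as pd

# Tiny stand-in for the wine data (first two rows of the table above).
wine = pd.DataFrame({"alcohol": [14.23, 13.20], "color_intensity": [5.64, 4.38]})

renamed = wine.rename(columns={"color_intensity": "ci"})

print(list(wine.columns))     # ['alcohol', 'color_intensity'] (original unchanged)
print(list(renamed.columns))  # ['alcohol', 'ci']
```

Because each method hands back a DataFrame, the next method can be called directly on the result, which is what makes chaining possible.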

df["color_filter"] = np.where((df["hue"] > 1) &
                              (df["ci"] > 7), 1, 0)
df.head()
# Inspired by:
# https://www.dataquest.io/blog/tutorial-add-column-pandas-dataframe-based-on-if-else-condition/
alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins ci hue od280/od315_of_diluted_wines proline color_filter
0 14.23 1.71 2.43 15.6 127.0 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065.0 0
1 13.20 1.78 2.14 11.2 100.0 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050.0 0
2 13.16 2.36 2.67 18.6 101.0 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185.0 0
3 14.37 1.95 2.50 16.8 113.0 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480.0 0
4 13.24 2.59 2.87 21.0 118.0 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735.0 0
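
Two details in the cell above are easy to trip over. Each comparison needs its own parentheses because `&` binds more tightly than `>`, and `np.where(condition, 1, 0)` returns 1 where the condition is True and 0 everywhere else. The sketch below uses three (alcohol, ci, hue) rows taken from the final output of this section and checks that the `.assign` form used in the chain produces the same column as direct assignment.

```python
import numpy as np
import pandas as pd

rows = pd.DataFrame(
    {"alcohol": [14.38, 14.37, 14.83], "ci": [7.50, 7.80, 5.20], "hue": [1.20, 0.86, 1.08]}
)

# Direct assignment: adds the column to an existing DataFrame (a copy here).
direct = rows.copy()
direct["color_filter"] = np.where((direct["hue"] > 1) & (direct["ci"] > 7), 1, 0)

# Chain-friendly form: assign returns a new DataFrame with the extra column.
chained = rows.assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))

print(chained["color_filter"].tolist())  # [1, 0, 0]
print(direct.equals(chained))            # True
```

Only the first row passes both conditions: the second fails on hue and the third fails on ci.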
df = (
    pd.DataFrame(data.data, columns=data.feature_names)
    .rename(columns={"color_intensity": "ci"})
    .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
    .loc[lambda x: x["alcohol"] > 14]
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
    .loc[:, ["alcohol", "ci", "hue"]]
)
df
    alcohol     ci   hue
 0    14.83   5.20  1.08
 1    14.75   5.40  1.25
 2    14.39   5.25  1.02
 3    14.38   4.90  1.04
 4    14.38   7.50  1.20
 5    14.37   7.80  0.86
 6    14.34  13.00  0.57
 7    14.30   6.20  1.07
 8    14.23   5.64  1.04
 9    14.22   5.10  0.89
10    14.22   6.38  0.94
11    14.21   5.24  0.87
12    14.20   6.75  1.05
13    14.19   8.70  1.23
14    14.16   9.70  0.62
15    14.13   9.20  0.61
16    14.12   5.00  1.17
17    14.10   6.20  1.07
18    14.10   5.75  1.25
19    14.06   5.65  1.09
20    14.06   5.05  1.06
21    14.02   4.70  1.04
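
When a chain gives an unexpected result, it can help to look at the intermediate DataFrame between steps. One option, not shown in the lecture, is `DataFrame.pipe`, which passes the DataFrame to any function and continues the chain with whatever that function returns. In the sketch below, `log_shape` is a hypothetical helper and the five rows are copied from the tables in this section.

```python
import pandas as pd


def log_shape(df, label):
    """Print the shape at this point in the chain and pass the data through unchanged."""
    print(f"{label}: {df.shape}")
    return df


sample = pd.DataFrame(
    {
        "alcohol": [14.23, 13.20, 13.16, 14.37, 13.24],
        "ci": [5.64, 4.38, 5.68, 7.80, 4.32],
        "hue": [1.04, 1.05, 1.03, 0.86, 1.04],
    }
)

result = (
    sample
    .pipe(log_shape, "start")          # prints start: (5, 3)
    .loc[lambda x: x["alcohol"] > 14]
    .pipe(log_shape, "after filter")   # prints after filter: (2, 3)
    .sort_values("alcohol", ascending=False)
    .reset_index(drop=True)
)

print(result["alcohol"].tolist())  # [14.37, 14.23]
```

Once the chain behaves as expected, the `.pipe(log_shape, ...)` lines can be deleted without touching the other steps.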

Import your own functions#

See example in this Demo Repository
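
The linked repository is not reproduced here, so the following is only a sketch of the general idea: wrap the chain in a function, save it in a module, and import it from your notebook. The file name `project_functions.py` and the function name `process_wine` are hypothetical. The function takes an already-loaded DataFrame so that the example does not depend on scikit-learn.

```python
import numpy as np
import pandas as pd


def process_wine(raw, min_alcohol=14):
    """Apply the lecture's method chain to a DataFrame with alcohol, color_intensity, and hue columns."""
    return (
        raw
        .rename(columns={"color_intensity": "ci"})
        .assign(color_filter=lambda x: np.where((x.hue > 1) & (x.ci > 7), 1, 0))
        .loc[lambda x: x["alcohol"] > min_alcohol]
        .sort_values("alcohol", ascending=False)
        .reset_index(drop=True)
        .loc[:, ["alcohol", "ci", "hue"]]
    )


# If this function lived in project_functions.py next to the notebook, you could write:
#     from project_functions import process_wine
raw = pd.DataFrame(
    {"alcohol": [14.23, 13.20, 14.37], "color_intensity": [5.64, 4.38, 7.80], "hue": [1.04, 1.05, 0.86]}
)
print(process_wine(raw))
```

Keeping the chain in a function means every notebook that imports it runs exactly the same processing steps, and a fix only has to be made in one place.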

Introduction to Data Visualizations#

Slides available for download here and here.

from IPython.display import IFrame

IFrame("../../../Class8A.pdf", width=900, height=800)
from IPython.display import IFrame

IFrame("../../../Class8C.pdf", width=900, height=800)