Lab 3B: Data analysis with Python
In this lab you will apply your new Python skills to working with data, finally!
This lab is a bit shorter because I wanted to make sure you have enough time to work on your projects. You’ll get lots of practice with Data Analysis using your project data sets.
This lab must be completed individually.
Where provided, match the Sample Output as closely as you can.
Accept the lab
To accept this lab on GitHub Classroom, go to Canvas –> Course Content –> Lab Links and Solutions.
Objectives
In this lab you will:
Practice working with data using the pandas package
Practice some data wrangling and mathematical operations.
Practice working with the seaborn package.
Practice creating data visualizations.
# Usually all the import statements are at the top of the file
import pandas as pd
import seaborn as sns
import numpy as np
import os
import matplotlib.pyplot as plt
Task 1: Working with data using pandas (12 marks)
In this part of the lab, we will practice loading in a sample data set using pandas, and doing some basic operations.
1.1: Load in the data (4 marks)
There is a CSV file (pokemon.csv) inside a directory called data within the lab3A directory.
Your task is to use the pandas read_csv() function to read this dataset, assign it to a dataframe called df, and then print its head (the first 5 rows of the dataframe).
Hint: don’t forget to first import pandas as pd to use read_csv and other pandas functions.
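As a rough illustration of the read_csv() and head() pattern, the sketch below reads a tiny made-up CSV from an in-memory string instead of the lab's pokemon.csv. The column names and values are invented for this example and are not the lab data.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (NOT the lab data)
csv_text = """Name,HP,Generation
Alpha,45,1
Beta,60,1
Gamma,80,2
"""

# read_csv accepts a file path, a URL, or any file-like object
toy = pd.read_csv(io.StringIO(csv_text))

# head() returns the first 5 rows by default (here there are only 3)
print(toy.head())
```

For the lab itself you would pass the path to the CSV file rather than a StringIO object.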
### Your solution here
1.2: How many total pokemon are there in the dataset? (2 marks)
Make sure to use the dataframe.count() function to print the total number of entries in each column of the dataframe before you answer!
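One thing worth knowing is that count() reports the non-missing entries per column, so columns with gaps can show a smaller number than the number of rows. A minimal sketch on an invented toy dataframe (the column names are made up):

```python
import numpy as np
import pandas as pd

# Invented toy data with one missing value in the "Partner" column
toy = pd.DataFrame(
    {
        "Name": ["Alpha", "Beta", "Gamma"],
        "Partner": ["Delta", np.nan, "Epsilon"],
    }
)

# count() gives the number of non-missing entries in each column
counts = toy.count()
print(counts)

# len() gives the number of rows, regardless of missing values
n_rows = len(toy)
print(n_rows)
```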
### Your solution here
1.3: Create a new dataframe df2 that only includes the Pokemon from the first generation. (2 marks)
Hint: Remember that you can subset dataframes using the [] syntax. More on this here
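Subsetting with [] is usually done by passing in a boolean mask. The sketch below shows the idea on an invented toy dataframe; the column name "Group" and its values are assumptions for illustration only.

```python
import pandas as pd

# Invented toy data (not the lab dataset)
toy = pd.DataFrame(
    {
        "Name": ["Alpha", "Beta", "Gamma", "Delta"],
        "Group": [1, 2, 1, 3],
    }
)

# Comparing a column to a value gives a boolean Series (the "mask")
mask = toy["Group"] == 1

# Passing the mask inside [] keeps only the rows where it is True
subset = toy[mask]
print(subset)
```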
### Your solution here
1.4: Print ONLY the mean HP, Attack, Defense, and Speed of all pokemon in the first generation using pandas functions (4 marks)
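One possible pattern here is to select several columns with a list inside [] and then call mean() on the result. A small sketch on invented numbers (the column names "A", "B" and "C" are placeholders):

```python
import pandas as pd

# Invented toy data (not the lab dataset)
toy = pd.DataFrame(
    {
        "A": [40, 60, 80],
        "B": [50, 70, 90],
        "C": [1, 2, 3],
    }
)

# A list of column names inside [] selects just those columns;
# mean() then returns one mean per selected column
means = toy[["A", "B"]].mean()
print(means)
```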
### Your solution here
Task 2: Plotting using Seaborn (and matplotlib) (Total: 16 marks)
matplotlib is a Python library for data visualization.
seaborn is a statistical data visualization library built on top of matplotlib that provides a high-level interface for drawing statistical graphics and some convenient functions for plotting data frames.
You may need to install seaborn and matplotlib:
conda install seaborn
conda install matplotlib
Just in case they are not the latest versions, go ahead and update them:
conda update matplotlib
conda update seaborn
# Themes and colours in Seaborn
# There are five preset seaborn themes: darkgrid, whitegrid, dark, white, and ticks.
# They are each suited to different applications and personal preferences.
# You can see what they look like [here](https://seaborn.pydata.org/tutorial/aesthetics.html#seaborn-figure-styles)
# Just for fun, we're going to set the theme to be a nice one:
sns.set_theme(
    style="ticks",
    font_scale=1.3,  # This scales the fonts slightly higher
)
# And we're going to remove the top and right axis lines
plt.rc("axes.spines", top=False, right=False)
2.1: Load data (1 mark)
Without downloading the csv file to your repo, load the “BCCDC_COVID19.csv” file using the direct URL: “BCCDC_COVID19_Dashboard_Case_Details.csv”.
DO NOT DOWNLOAD THE DATA TO YOUR REPOSITORY!
Use the pandas module/package and the read_csv() function to load the data by passing in the URL, and then save the data in a variable called df.
# Your Solution here
2.2: Counts of cases by Sex (2 marks)
Using sns.countplot(), plot the number of all female and male cases.
Set the title to be “Number of COVID-19 cases by Sex”.
Hint: The documentation above contains some examples that might help you get started
2.3: Counts of cases by Age Group (2 marks)
Plot the counts of cases by age group, and order the y-axis by increasing age (use the order parameter of the countplot() function).
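To show how the order parameter and a title work together, here is a minimal sketch on an invented toy column. The column name "size" and its categories are made up, and the Agg backend is selected only so that the snippet runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import pandas as pd
import seaborn as sns

# Invented toy data (not the COVID-19 dataset)
toy = pd.DataFrame(
    {"size": ["small", "large", "medium", "small", "large", "small"]}
)

# The categories in the order we want them to appear on the axis
size_order = ["small", "medium", "large"]

# y= puts the categories on the y-axis; order= fixes their sequence
ax = sns.countplot(data=toy, y="size", order=size_order)
ax.set_title("Toy counts by size")
```

Without order, the categories typically appear in the order they are first encountered in the data, which is rarely the order you want for something like age groups.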

# Your Solution here
2.4: Cases by regional health authority (3 marks)
Using the set() data type, find all the different regions in df['HA']. In the next step, calculate the count of all cases in each of those regions. Finally, using plt.barh, plot a horizontal bar chart of the number of cases in each region.
Hint: More information and examples can be found in link
Hint: Your plot doesn’t have to look exactly like this, but please do explore the possible color palettes. You can specify the colour palette by passing in the keyword like this: palette='colorblind'.
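The three steps (unique values with set(), a count per value, then plt.barh) could be sketched as follows on an invented toy column. The column name "region" and its values are assumptions; sorting the set is my own addition so the bar order is reproducible.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is needed

import matplotlib.pyplot as plt
import pandas as pd

# Invented toy data (not the COVID-19 dataset)
toy = pd.DataFrame(
    {"region": ["North", "South", "North", "East", "North", "South"]}
)

# Step 1: set() keeps each distinct value once (sets are unordered,
# so we sort to get a stable order)
regions = sorted(set(toy["region"]))

# Step 2: count the rows that belong to each region
counts = [int((toy["region"] == r).sum()) for r in regions]

# Step 3: horizontal bars, one per region
plt.barh(regions, counts)
plt.xlabel("Number of rows")
```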
2.5: Data Wrangling I (2 marks)
Task: Add a new column to the dataframe that converts the “Reported_Date” column to a datetime object
To do this, add a new column to the dataset that turns the “Reported_Date” column into a proper datetime object so we can do operations on it.
Hint: Use the to_datetime() function to first convert it into a datetime object, and then remove the timezone information and HH:MM:SS using .dt.date.
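A minimal sketch of the to_datetime() and .dt.date combination on invented date strings. The column names "when_text" and "when_date" are hypothetical and chosen only for this example.

```python
import pandas as pd

# Invented toy data: dates stored as plain strings
toy = pd.DataFrame({"when_text": ["2021-03-01", "2021-03-05"]})

# to_datetime() parses the strings into pandas datetime values
parsed = pd.to_datetime(toy["when_text"])

# .dt.date keeps only the calendar date (a Python date object per row)
toy["when_date"] = parsed.dt.date
print(toy)
```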
# Your Solution here
2.6: Data Wrangling II (1 mark)
Task: Find the earliest reported case and the latest reported case of COVID-19 in the dataset
You should use the pandas .min() and .max() functions here, now that your date string is converted to a DateTime object.
Sample Output
The earliest reported case of COVID-19 was: 2020-01-26
The latest reported case of COVID-19 was: 2020-10-23
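For reference, .min() and .max() work on date-like columns the same way they do on numbers. A small sketch on invented dates (not the lab data):

```python
import pandas as pd

# Invented toy dates, deliberately out of order
dates = pd.to_datetime(pd.Series(["2021-03-05", "2021-03-01", "2021-03-09"]))

# min() and max() return the earliest and latest values
earliest = dates.min()
latest = dates.max()

print(f"Earliest toy date: {earliest.date()}")
print(f"Latest toy date: {latest.date()}")
```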
# Your Solution here
2.7: Data Wrangling III (2 marks)
Task: Create a new column in the data frame called “days_since”.
This column will be of type integer, and will simply show the days since the first reported case of COVID-19.
Hint: Subtracting the earliest reported date from the Reported_Date_Object column will get you most of the way there. After that, the only thing left to do is to turn the result (a timedelta object) into an integer using .dt.days.
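The subtraction described in the hint could look roughly like this on invented dates. Note one assumption: this sketch keeps the column as a pandas datetime column (it does not apply .dt.date first). If your column holds plain Python date objects, you may need to convert it back with pd.to_datetime() before .dt.days is available.

```python
import pandas as pd

# Invented toy data stored as a pandas datetime column
toy = pd.DataFrame(
    {"when": pd.to_datetime(["2021-03-01", "2021-03-05", "2021-03-09"])}
)

# Subtracting a single date from the column gives a timedelta per row
elapsed = toy["when"] - toy["when"].min()

# .dt.days turns each timedelta into a whole number of days
toy["days_since"] = elapsed.dt.days
print(toy)
```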
# Your Solution here
2.8: Plot the COVID-19 cases over time by sex (3 marks)
Using sns.displot, plot the histogram of female and male cases over time.
Hint 1: Here is a nice tutorial of all the different options that are possible when creating a histogram.
2.9: BONUS - For a bonus mark, move the legend to the top left of the plot (1 mark)
Remember, the maximum mark you can get on a lab is 100% and bonus marks cannot be “moved” to different assignments.
# Your Solution here
2.10: BONUS - For a bonus mark, order the Health authorities on the y-axis (1 mark)
You can order it by some metric (either alphabetically, or by total counts, or some other sensible metric).
Remember, the maximum mark you can get on a lab is 100% and bonus marks cannot be “moved” to different assignments.
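Two of the orderings mentioned above (alphabetical and by total count) can be built with plain pandas, and the resulting list can then be handed to a plotting function that accepts an order. A minimal sketch on invented values:

```python
import pandas as pd

# Invented toy data (not the COVID-19 dataset)
toy = pd.Series(["North", "South", "North", "East", "North", "South"])

# value_counts() sorts by count, largest first, so its index is an ordering
by_count = toy.value_counts().index.tolist()

# sorted() on the unique values gives an alphabetical ordering
alphabetical = sorted(toy.unique())

print(by_count)
print(alphabetical)
```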