Task 4: COVID-19 EDA II#
For this task, you’ll do some further analysis on the .csv file you made from Task 3.
Remember to run the following commands below to import all the necessary packages to run this Task.
import pandas as pd
import numpy as np
import matplotlib
4.1: Importing filtered data#
First, import the datafiltered.csv
file back into a dataframe and call the variable df
.
# Your solution here
Then, remove the additional index column that was made when you exported the file.
Sample output#
date |
new_cases |
new_deaths |
new_tests |
incident_rate |
|
---|---|---|---|---|---|
0 |
2020-04-20 |
3047 |
433 |
7841 |
0.388598 |
1 |
2020-04-21 |
2256 |
454 |
28095 |
0.080299 |
2 |
2020-04-22 |
2729 |
534 |
44248 |
0.0616751 |
3 |
2020-04-23 |
3370 |
437 |
37083 |
0.0908772 |
4 |
2020-04-24 |
2646 |
464 |
95273 |
0.0277728 |
5 |
2020-04-25 |
3021 |
420 |
38676 |
0.0781105 |
6 |
2020-04-26 |
2357 |
415 |
24113 |
0.0977481 |
7 |
2020-04-27 |
2324 |
260 |
26678 |
0.087113 |
8 |
2020-04-28 |
1739 |
333 |
37554 |
0.0463067 |
9 |
2020-04-29 |
2091 |
382 |
38589 |
0.0541864 |
# Your solution here
Finally, print out 10 random rows in df
to ensure your data has been imported correctly. (Hint: There is a pandas function that does this)
# Your solution here
4.2: Grouping data#
If we want to know the average number of rates by the months in the year, we can use a groupby
object to better understand how seasonal changes can affect COVID-19.
For this task, you should group the dataframe by month, then select the new_cases
, new_deaths
column, and then calculate the average value by month.
Make sure the output is displayed without using the print()
function, remember that the last cell in a jupyter notebook will always be displayed.
Sample Output#
date |
new_cases |
new_deaths |
---|---|---|
4 |
2515.09 |
405 |
5 |
937.839 |
182.516 |
6 |
259.067 |
46.8 |
7 |
216.839 |
12.5161 |
8 |
679.355 |
11.129 |
9 |
996 |
6 |
For this sub-task, a research question that this table would’ve answered might look something like this:
“What is the mean number of new cases and deaths by month?”
# Your solution here
4.3: Cumulative Sums#
A cumulative sum is a partial sum of each row, growing in total as you go down the list.
Using the original dataframe, df
, create two new columns titled cumulative_new_cases
and cumulative_new_tests
.
Make sure the output is displayed without using the print()
function, remember that the last cell in a jupyter notebook will always be displayed.
Sample Output#
date |
new_cases |
new_deaths |
new_tests |
incident_rate |
cumulative_new_cases |
cumulative_new_tests |
|
---|---|---|---|---|---|---|---|
0 |
2020-04-20 |
3047 |
433 |
7841 |
0.388598 |
3047 |
7841 |
1 |
2020-04-21 |
2256 |
454 |
28095 |
0.080299 |
5303 |
35936 |
2 |
2020-04-22 |
2729 |
534 |
44248 |
0.0616751 |
8032 |
80184 |
3 |
2020-04-23 |
3370 |
437 |
37083 |
0.0908772 |
11402 |
117267 |
4 |
2020-04-24 |
2646 |
464 |
95273 |
0.0277728 |
14048 |
212540 |
5 |
2020-04-25 |
3021 |
420 |
38676 |
0.0781105 |
17069 |
251216 |
6 |
2020-04-26 |
2357 |
415 |
24113 |
0.0977481 |
19426 |
275329 |
7 |
2020-04-27 |
2324 |
260 |
26678 |
0.087113 |
21750 |
302007 |
8 |
2020-04-28 |
1739 |
333 |
37554 |
0.0463067 |
23489 |
339561 |
9 |
2020-04-29 |
2091 |
382 |
38589 |
0.0541864 |
25580 |
378150 |
For this sub-task, a research question that this table would’ve answered might look something like this:
“What is the sum as time progresses for the number of new cases and new tests?”
# Your solution here
4.4: Visualising Data#
4.4.1: Set the Seaborn figure theme and scale up the text in the figures#
There are five preset Seaborn styles (or themes): darkgrid
, whitegrid
, dark
, white
, and ticks
.
They are each suited to different applications and personal preferences.
You can see what they look like here.
Hint: You will need to use the font_scale
property of the set_theme()
function in Seaborn.
Once you’ve done that, create the same plot as in 1.1 and confirm that it looks bigger. Once you are able to confirm this, you will see all subsequent plots in this Jupyter Notebook using the same theme.
Remember to copy this code above to your other Jupyter Notebooks as well!!
# Your solution here
4.4.2: Visualize the COVID-19 dataset#
You’ve previously done work on the COVID-19 dataset, filtering through and analysing the data.
However, data is best represented visually (usually), so we should take our data and visually represent some points.
You’re task is to create a simple plot of the cumulative_new_cases
column.
Sample Output#
Note: We have left off the themes, axis-labels, axis titles, and plot titles so you can spend some time interpreting what you’re plotting. Make sure your plot has all the components of what makes an effective plot!
# Your solution here