Task 3: COVID-19 EDA#
For this task, you’ll do some further data analysis on a given dataset that can be found here.
Remember to run the commands below to import all the packages needed for this Task.
Attribution#
The analysis in this Task is based on this tutorial by Aakash NS at freeCodeCamp. You are encouraged to go through the tutorial to dive much deeper into the analysis.
The data for this project is believed to come from Our World in Data, after a great deal of processing and cleaning.
import pandas as pd
import numpy as np
3.1: Importing data#
Load the dataset from a URL using pandas, and store the dataframe in a variable called `data`.
The URL for the dataset is:
Once you’ve loaded the dataset, print out the first 13 rows.
Note: do NOT use the `print()` command.
# Your solution here
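If you get stuck, here is a minimal sketch of one possible approach. The URL below is a hypothetical placeholder; substitute the actual dataset URL given above.

```python
import pandas as pd

# Hypothetical placeholder URL -- replace with the dataset URL from above.
csv_url = "https://example.com/italy-covid-daywise.csv"

data = pd.read_csv(csv_url)

# The last expression in a notebook cell is rendered automatically,
# so no print() call is needed.
data.head(13)
```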
As you can see, we are examining a dataset that details COVID-19 infection rates in Italy, and some values are missing (`NaN` values).
Many datasets have similar gaps, so we need to do some cleanup before we can do any useful analysis.
3.2: Further Analysis#
Find where the first non-NaN value of `new_tests` occurs, then print the next 5 rows after that.
Note: do NOT use the `print()` command.
Sample Output#
|     | date       | new_cases | new_deaths | new_tests |
|-----|------------|-----------|------------|-----------|
| 111 | 2020-04-20 | 3047      | 433        | 7841      |
| 112 | 2020-04-21 | 2256      | 454        | 28095     |
| 113 | 2020-04-22 | 2729      | 534        | 44248     |
| 114 | 2020-04-23 | 3370      | 437        | 37083     |
| 115 | 2020-04-24 | 2646      | 464        | 95273     |
# Your solution here
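If you need a hint, here is one possible approach, assuming the dataframe still has its default integer index: `Series.first_valid_index()` returns the index label of the first non-NaN entry.

```python
# Index label of the first non-NaN value in the new_tests column.
first_idx = data["new_tests"].first_valid_index()

# .loc slicing is inclusive on both ends, so this selects the 5 rows
# that come after the first non-NaN row.
data.loc[first_idx + 1 : first_idx + 5]
```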
Once you’ve done that, use the `count()` method on the dataframe to figure out how many non-NaN values there are in each column.
Sample Output#
Your output should be similar to this: the numbers have to be exactly the same, but the format of the table can be slightly different.
|            | 0   |
|------------|-----|
| date       | 248 |
| new_cases  | 248 |
| new_deaths | 248 |
| new_tests  | 135 |
# Your solution here
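A minimal sketch: `count()` reports the number of non-NaN values in each column.

```python
# Number of non-NaN values per column.
data.count()
```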
Figuring out how many missing values there are is a very important component of an exploratory data analysis.
3.3: Adding a column#
Create a new column called `incident_rate` that is calculated by dividing `new_cases` by `new_tests`.
Sample Output#
Here is what the output should look like for the beginning of the dataset:
|   | date       | new_cases | new_deaths | new_tests | incident_rate |
|---|------------|-----------|------------|-----------|---------------|
| 0 | 2019-12-31 | 0 | 0 | nan | nan |
| 1 | 2020-01-01 | 0 | 0 | nan | nan |
| 2 | 2020-01-02 | 0 | 0 | nan | nan |
| 3 | 2020-01-03 | 0 | 0 | nan | nan |
| 4 | 2020-01-04 | 0 | 0 | nan | nan |
| 5 | 2020-01-05 | 0 | 0 | nan | nan |
| 6 | 2020-01-06 | 0 | 0 | nan | nan |
| 7 | 2020-01-07 | 0 | 0 | nan | nan |
| 8 | 2020-01-08 | 0 | 0 | nan | nan |
| 9 | 2020-01-09 | 0 | 0 | nan | nan |
And here are a few non-NaN rows so you can make sure you calculated the `incident_rate` correctly:
|     | date       | new_cases | new_deaths | new_tests | incident_rate |
|-----|------------|-----------|------------|-----------|---------------|
| 111 | 2020-04-20 | 3047 | 433 | 7841  | 0.388598  |
| 112 | 2020-04-21 | 2256 | 454 | 28095 | 0.080299  |
| 113 | 2020-04-22 | 2729 | 534 | 44248 | 0.0616751 |
# Your solution here
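One possible sketch: dividing one column by another performs element-wise division, and any row where `new_tests` is `NaN` produces `NaN` in the new column.

```python
# Element-wise division; NaN in new_tests propagates to incident_rate.
data["incident_rate"] = data["new_cases"] / data["new_tests"]
data.head(10)
```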
3.4: Average incident rate#
This new column estimates the incidence rate of COVID-19. Now let’s calculate the average value of the column.
Sample Output#
0.02343722903508291
# Your solution here
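A one-line sketch; note that `mean()` skips `NaN` values by default.

```python
# NaN entries are excluded from the average automatically.
data["incident_rate"].mean()
```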
3.5: Creating a new dataset#
Sometimes a dataset contains rows that are not useful for the current analysis. In that case, we should keep only the rows that have the information we need, but it’s generally not a good idea to delete the data from the CSV itself. Additionally, we should keep a variable around that contains the full dataset in case we need it in the future.
For our purposes in this analysis, since we need the `new_tests` column, we can safely drop any row that has a `NaN` value in that column.
Create a new dataframe called `data_filtered` that only has the remaining rows after completing the above step.
Sample Output#
|     | date       | new_cases | new_deaths | new_tests | incident_rate |
|-----|------------|-----------|------------|-----------|---------------|
| 111 | 2020-04-20 | 3047 | 433 | 7841  | 0.388598  |
| 112 | 2020-04-21 | 2256 | 454 | 28095 | 0.080299  |
| 113 | 2020-04-22 | 2729 | 534 | 44248 | 0.0616751 |
| 114 | 2020-04-23 | 3370 | 437 | 37083 | 0.0908772 |
| 115 | 2020-04-24 | 2646 | 464 | 95273 | 0.0277728 |
| 116 | 2020-04-25 | 3021 | 420 | 38676 | 0.0781105 |
| 117 | 2020-04-26 | 2357 | 415 | 24113 | 0.0977481 |
| 118 | 2020-04-27 | 2324 | 260 | 26678 | 0.087113  |
| 119 | 2020-04-28 | 1739 | 333 | 37554 | 0.0463067 |
| 120 | 2020-04-29 | 2091 | 382 | 38589 | 0.0541864 |
# Your solution here
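One way to do this is sketched below: `dropna` with the `subset` argument drops only rows that are missing `new_tests`, and it returns a new dataframe, leaving the original `data` untouched.

```python
# Keep only the rows where new_tests is not NaN; data itself is unchanged.
data_filtered = data.dropna(subset=["new_tests"])
data_filtered.head(10)
```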
3.6: Describe#
Now that you’ve managed to filter your data, you can start using more complex analysis functions.
You’ve already had experience using `df.describe().T` in the previous task.
Repeat the same analysis on the new `data_filtered` dataframe.
# Your solution to output `df.describe().T` for numerical columns:
# Your solution to output `df.describe().T` for categorical columns:
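A sketch following the pattern from the previous task (run each line in its own cell, since only the last expression in a cell is rendered). The `include="object"` call assumes the non-numeric columns, such as `date`, were read in as plain strings; it will raise an error if the dataframe has no string columns.

```python
# Summary statistics for the numerical columns.
data_filtered.describe().T

# Summary statistics for the categorical (string) columns,
# assuming date was loaded as a string rather than a datetime.
data_filtered.describe(include="object").T
```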
3.7: Initial Thoughts#
3.7.1: Use this section to record your observations.#
Does anything jump out at you as surprising or particularly interesting? Feel free to make additional plots as needed to explore your data set.
Where do you think you’ll go with exploring this dataset? Feel free to take notes in this section and use it as a scratch pad.
Any content in this area will only be marked for effort and completeness.
# Your observations here:#
Obs 1
Obs 2
…
3.8: Exporting data#
Great job! You’ve done some analysis and are now ready to examine the dataset further in the next Task!
But before we do that, save the filtered dataset to a new file called `datafiltered.csv` in the `data` folder of this repository.
# Your solution here
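A minimal sketch, assuming a `data` folder already exists at the root of the repository. `index=False` leaves the integer index out of the file; drop that argument if you want to keep it.

```python
# Write the filtered dataframe to data/datafiltered.csv.
data_filtered.to_csv("data/datafiltered.csv", index=False)
```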