Task 3: COVID-19 EDA#
For this task, you’ll do some further data analysis on a given dataset that can be found here.
Remember to run the commands below to import all the packages needed for this Task.
Attribution#
The analysis in this Task is based on this tutorial by Aakash NS at freeCodeCamp. You are encouraged to go through the tutorial to dive much deeper into the analysis.
The data for this project is believed to come from Our World in Data, after a great deal of processing and cleaning.
import pandas as pd
import numpy as np
3.1: Importing data#
Load the dataset from a URL using pandas, and store the dataframe in a variable called `data`.
The URL for the dataset is:
Once you’ve loaded the dataset, print out the first 13 rows.
Note: do NOT use the `print()` command.
# Your solution here
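If you get stuck, here is a minimal sketch of one possible approach. The URL below is a hypothetical placeholder; substitute the actual dataset URL given above.

```python
import pandas as pd

# Hypothetical placeholder URL -- replace with the dataset URL from above.
csv_url = "https://example.com/italy-covid-daywise.csv"

data = pd.read_csv(csv_url)

# The last expression in a notebook cell is rendered automatically,
# so no print() call is needed.
data.head(13)
```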
As you can see, we are examining a dataset that details COVID-19 infection rates in Italy, and some values are missing (`NaN` values).
Many datasets have similar gaps, so we need to do some cleanup before we can do any useful analysis.
3.2: Further Analysis#
Find where the first non-NaN value of `new_tests` occurs, then print the next 5 rows after that.
Note: do NOT use the `print()` command.
Sample Output#
|     | date       | new_cases | new_deaths | new_tests |
|-----|------------|-----------|------------|-----------|
| 111 | 2020-04-20 | 3047      | 433        | 7841      |
| 112 | 2020-04-21 | 2256      | 454        | 28095     |
| 113 | 2020-04-22 | 2729      | 534        | 44248     |
| 114 | 2020-04-23 | 3370      | 437        | 37083     |
| 115 | 2020-04-24 | 2646      | 464        | 95273     |
# Your solution here
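If you need a hint, here is one possible approach, assuming the dataframe still has its default integer index: `Series.first_valid_index()` returns the index label of the first non-NaN entry.

```python
# Index label of the first non-NaN value in the new_tests column.
first_idx = data["new_tests"].first_valid_index()

# .loc slicing is inclusive on both ends, so this selects the 5 rows
# that come after the first non-NaN row.
data.loc[first_idx + 1 : first_idx + 5]
```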
Once you’ve done that, use the `count()` method on the dataframe to figure out how many non-NaN values there are in each column.
Sample Output#
Your output should be similar to this: the numbers have to be exactly the same, but the format of the table can be slightly different.
|            | 0   |
|------------|-----|
| date       | 248 |
| new_cases  | 248 |
| new_deaths | 248 |
| new_tests  | 135 |
# Your solution here
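A minimal sketch: `count()` reports the number of non-NaN values in each column.

```python
# Number of non-NaN values per column.
data.count()
```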
Figuring out how many missing values there are is a very important component of an exploratory data analysis.
3.3: Adding a column#
Create a new column called `incident_rate` that is calculated by dividing `new_cases` by `new_tests`.
Sample Output#
Here is what the output should look like for the beginning of the dataset:
|   | date       | new_cases | new_deaths | new_tests | incident_rate |
|---|------------|-----------|------------|-----------|---------------|
| 0 | 2019-12-31 | 0 | 0 | nan | nan |
| 1 | 2020-01-01 | 0 | 0 | nan | nan |
| 2 | 2020-01-02 | 0 | 0 | nan | nan |
| 3 | 2020-01-03 | 0 | 0 | nan | nan |
| 4 | 2020-01-04 | 0 | 0 | nan | nan |
| 5 | 2020-01-05 | 0 | 0 | nan | nan |
| 6 | 2020-01-06 | 0 | 0 | nan | nan |
| 7 | 2020-01-07 | 0 | 0 | nan | nan |
| 8 | 2020-01-08 | 0 | 0 | nan | nan |
| 9 | 2020-01-09 | 0 | 0 | nan | nan |
And here are a few non-NaN rows so you can make sure you calculated the `incident_rate` correctly:
|     | date       | new_cases | new_deaths | new_tests | incident_rate |
|-----|------------|-----------|------------|-----------|---------------|
| 111 | 2020-04-20 | 3047 | 433 | 7841  | 0.388598  |
| 112 | 2020-04-21 | 2256 | 454 | 28095 | 0.080299  |
| 113 | 2020-04-22 | 2729 | 534 | 44248 | 0.0616751 |
# Your solution here
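One possible sketch: dividing one column by another performs element-wise division, and any row where `new_tests` is `NaN` produces `NaN` in the new column.

```python
# Element-wise division; NaN in new_tests propagates to incident_rate.
data["incident_rate"] = data["new_cases"] / data["new_tests"]
data.head(10)
```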
3.4: Average incident rate#
This new column estimates the incidence rate of COVID-19. Now let’s calculate the average value of the column.
Sample Output#
0.02343722903508291
# Your solution here
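A one-line sketch; note that `mean()` skips `NaN` values by default.

```python
# NaN entries are excluded from the average automatically.
data["incident_rate"].mean()
```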
3.5: Creating a new dataset#
Sometimes a dataset contains rows that are not useful for the current analysis. In that case, we should keep only the rows that have the information we need, but it’s generally not a good idea to delete the data from the CSV itself. Additionally, we should keep a variable around that contains the full dataset in case we need it in the future.
For our purposes in this analysis, since we need the `new_tests` column, we can safely drop any row that has a `NaN` value in that column.
Create a new dataframe called `data_filtered` that only has the remaining rows after completing the above step.
Sample Output#
|     | date       | new_cases | new_deaths | new_tests | incident_rate |
|-----|------------|-----------|------------|-----------|---------------|
| 111 | 2020-04-20 | 3047 | 433 | 7841  | 0.388598  |
| 112 | 2020-04-21 | 2256 | 454 | 28095 | 0.080299  |
| 113 | 2020-04-22 | 2729 | 534 | 44248 | 0.0616751 |
| 114 | 2020-04-23 | 3370 | 437 | 37083 | 0.0908772 |
| 115 | 2020-04-24 | 2646 | 464 | 95273 | 0.0277728 |
| 116 | 2020-04-25 | 3021 | 420 | 38676 | 0.0781105 |
| 117 | 2020-04-26 | 2357 | 415 | 24113 | 0.0977481 |
| 118 | 2020-04-27 | 2324 | 260 | 26678 | 0.087113  |
| 119 | 2020-04-28 | 1739 | 333 | 37554 | 0.0463067 |
| 120 | 2020-04-29 | 2091 | 382 | 38589 | 0.0541864 |
# Your solution here
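One way to do this is sketched below: `dropna` with the `subset` argument drops only rows that are missing `new_tests`, and it returns a new dataframe, leaving the original `data` untouched.

```python
# Keep only the rows where new_tests is not NaN; data itself is unchanged.
data_filtered = data.dropna(subset=["new_tests"])
data_filtered.head(10)
```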
3.6: Describe#
Now that you’ve managed to filter your data, you can start using more complex analysis functions.
You’ve already had experience using `df.describe().T` in the previous task.
Repeat the same analysis on the new `data_filtered` dataframe.
# Your solution to output `df.describe().T` for numerical columns:
# Your solution to output `df.describe().T` for categorical columns:
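A sketch following the pattern from the previous task (run each line in its own cell, since only the last expression in a cell is rendered). The `include="object"` call assumes the non-numeric columns, such as `date`, were read in as plain strings; it will raise an error if the dataframe has no string columns.

```python
# Summary statistics for the numerical columns.
data_filtered.describe().T

# Summary statistics for the categorical (string) columns,
# assuming date was loaded as a string rather than a datetime.
data_filtered.describe(include="object").T
```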
3.7: Initial Thoughts#
3.7.1: Use this section to record your observations.#
Does anything jump out at you as surprising or particularly interesting? Feel free to make additional plots as needed to explore your data set.
Where do you think you’ll go with exploring this dataset? Feel free to take notes in this section and use it as a scratch pad.
Any content in this area will only be marked for effort and completeness.
# Your observations here:#
Obs 1
Obs 2
…
3.8: Exporting data#
Great job! You’ve done some analysis and are now ready to examine the dataset further in the next Task!
But before we do that, save the filtered dataset to a new file called `datafiltered.csv` in the `data` folder of this repository.
# Your solution here
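A minimal sketch, assuming a `data` folder already exists at the root of the repository. `index=False` leaves the integer index out of the file; drop that argument if you want to keep it.

```python
# Write the filtered dataframe to data/datafiltered.csv.
data_filtered.to_csv("data/datafiltered.csv", index=False)
```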