Task 3: COVID-19 EDA#

For this task, you'll do some further data analysis on a given dataset that can be found here.

Run the commands below to import the packages needed for this Task.

Attribution#

The analysis in this Task is based on this tutorial by Aakash NS at freeCodeCamp. You are encouraged to go through the tutorial to dig deeper into the analysis.

The data for this project is believed to be from Our World in Data after a great deal of processing and cleaning.

```python
import pandas as pd
import numpy as np
```

3.1: Importing data#

Load the dataset from a URL using pandas, store the dataframe into a variable called data.

The URL for the dataset is:

Once you’ve loaded the dataset, print out the first 13 rows of the dataset.

Note, do NOT use the print() command.

# Your solution here
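A minimal sketch of this step, assuming nothing beyond what the task describes. The real code would pass the dataset URL straight to `pd.read_csv`; since the URL isn't given here, a small inline CSV stands in so the snippet is self-contained:

```python
import io
import pandas as pd

# In the real solution, pd.read_csv takes the dataset URL directly:
# data = pd.read_csv(url)
# Stand-in data so this sketch runs on its own:
csv_text = """date,new_cases,new_deaths,new_tests
2019-12-31,0,0,
2020-01-01,0,0,
"""
data = pd.read_csv(io.StringIO(csv_text))

# Ending the cell with this expression (no print()) lets the
# notebook render the table nicely.
data.head(13)
```

Leaving `data.head(13)` as the last expression in the cell is what makes the notebook display it without `print()`.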

As you can see, we are examining a dataset that details COVID-19 infection rates in Italy. Notice that some values are missing from the dataset (NaN values).

Many datasets have similar gaps, so we need to do some cleanup before we can do any useful analysis.

3.2: Further Analysis#

Find where the first non-NaN value of new_tests occurs, then display the 5 rows starting from that row.

Note, do NOT use the print() command.

Sample Output#

|     | date       | new_cases | new_deaths | new_tests |
|-----|------------|-----------|------------|-----------|
| 111 | 2020-04-20 | 3047      | 433        | 7841      |
| 112 | 2020-04-21 | 2256      | 454        | 28095     |
| 113 | 2020-04-22 | 2729      | 534        | 44248     |
| 114 | 2020-04-23 | 3370      | 437        | 37083     |
| 115 | 2020-04-24 | 2646      | 464        | 95273     |

# Your solution here
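One way to sketch this step, using `Series.first_valid_index()`. The toy DataFrame below stands in for the loaded `data`; in the real solution you would use the DataFrame you loaded in 3.1:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset.
data = pd.DataFrame({
    "date": ["2020-04-18", "2020-04-19", "2020-04-20",
             "2020-04-21", "2020-04-22"],
    "new_tests": [np.nan, np.nan, 7841.0, 28095.0, 44248.0],
})

# Index label of the first non-NaN entry in new_tests.
first = data["new_tests"].first_valid_index()

# .loc slicing is inclusive on both ends, so this selects
# 5 rows starting at `first`.
data.loc[first:first + 4]
```

Note that `.loc` slices by label and includes the end label, unlike ordinary Python slicing.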

Once you’ve done that, use the count() function on the DataFrame to figure out how many non-NaN values there are in each column.

Sample Output#

Your output should be similar to this: the numbers have to be exactly the same, but the format of the table can be slightly different.

|            | 0   |
|------------|-----|
| date       | 248 |
| new_cases  | 248 |
| new_deaths | 248 |
| new_tests  | 135 |

# Your solution here
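A quick illustration of what `count()` reports, again on stand-in data rather than the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame: new_tests has one missing value.
data = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03"],
    "new_cases": [0, 1, 2],
    "new_tests": [np.nan, 10.0, 20.0],
})

# count() returns the number of non-NaN entries per column.
counts = data.count()
counts
```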

Figuring out how many values are missing is a very important part of an exploratory data analysis.

3.3: Adding a column#

Create a new column called incident_rate that is calculated by dividing new_cases by new_tests.

Sample Output#

Here is what the output should look like for the beginning of the dataset:

|   | date       | new_cases | new_deaths | new_tests | incident_rate |
|---|------------|-----------|------------|-----------|---------------|
| 0 | 2019-12-31 | 0         | 0          | nan       | nan           |
| 1 | 2020-01-01 | 0         | 0          | nan       | nan           |
| 2 | 2020-01-02 | 0         | 0          | nan       | nan           |
| 3 | 2020-01-03 | 0         | 0          | nan       | nan           |
| 4 | 2020-01-04 | 0         | 0          | nan       | nan           |
| 5 | 2020-01-05 | 0         | 0          | nan       | nan           |
| 6 | 2020-01-06 | 0         | 0          | nan       | nan           |
| 7 | 2020-01-07 | 0         | 0          | nan       | nan           |
| 8 | 2020-01-08 | 0         | 0          | nan       | nan           |
| 9 | 2020-01-09 | 0         | 0          | nan       | nan           |

And here are a few non-NaN rows so you can make sure you calculated the incident_rate correctly:

|     | date       | new_cases | new_deaths | new_tests | incident_rate |
|-----|------------|-----------|------------|-----------|---------------|
| 111 | 2020-04-20 | 3047      | 433        | 7841      | 0.388598      |
| 112 | 2020-04-21 | 2256      | 454        | 28095     | 0.080299      |
| 113 | 2020-04-22 | 2729      | 534        | 44248     | 0.0616751     |

# Your solution here

This new column gives the incidence rate of COVID-19. Next, calculate the average value of the column.

Sample Output#

0.02343722903508291

# Your solution here
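The average is a one-liner; note that `.mean()` skips NaN values by default, so the missing early rows don't drag the result down. Sketch on stand-in data:

```python
import numpy as np
import pandas as pd

# Toy column standing in for the real incident_rate values.
data = pd.DataFrame({"incident_rate": [0.388598, 0.080299, np.nan]})

# mean() ignores NaN entries by default (skipna=True).
average_rate = data["incident_rate"].mean()
average_rate
```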

3.5: Creating a new dataset#

Sometimes a dataset contains rows that are not useful for the current analysis. In that case, we should keep only the rows that have the information we need, but it’s generally not a good idea to delete the data from the CSV itself. Additionally, we should keep a variable around that contains the full dataset, in case we need it in the future.

For our purposes in this analysis, since we need the new_tests column, we can safely drop any row that has a NaN value in the new_tests column.

Create a new dataframe called data_filtered that only contains the remaining rows after completing the above step.

Sample Output#

|     | date       | new_cases | new_deaths | new_tests | incident_rate |
|-----|------------|-----------|------------|-----------|---------------|
| 111 | 2020-04-20 | 3047      | 433        | 7841      | 0.388598      |
| 112 | 2020-04-21 | 2256      | 454        | 28095     | 0.080299      |
| 113 | 2020-04-22 | 2729      | 534        | 44248     | 0.0616751     |
| 114 | 2020-04-23 | 3370      | 437        | 37083     | 0.0908772     |
| 115 | 2020-04-24 | 2646      | 464        | 95273     | 0.0277728     |
| 116 | 2020-04-25 | 3021      | 420        | 38676     | 0.0781105     |
| 117 | 2020-04-26 | 2357      | 415        | 24113     | 0.0977481     |
| 118 | 2020-04-27 | 2324      | 260        | 26678     | 0.087113      |
| 119 | 2020-04-28 | 1739      | 333        | 37554     | 0.0463067     |
| 120 | 2020-04-29 | 2091      | 382        | 38589     | 0.0541864     |

# Your solution here
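A sketch of the filtering step using `dropna(subset=...)`, which drops only rows that are NaN in the named column and leaves the original DataFrame untouched (stand-in data again):

```python
import numpy as np
import pandas as pd

# Toy frame: one row with new_tests missing, one without.
data = pd.DataFrame({
    "date": ["2020-01-01", "2020-04-20"],
    "new_cases": [0.0, 3047.0],
    "new_tests": [np.nan, 7841.0],
})

# Keep the full dataset in `data`; the filtered copy only drops
# rows where new_tests is NaN.
data_filtered = data.dropna(subset=["new_tests"])
data_filtered
```

Because `dropna` returns a new DataFrame, `data` still holds the full dataset for later use.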

3.6: Describe#

Now that you’ve managed to filter your data, you can start using more complex analysis functions.

You’ve already had experience using the df.describe().T function in the previous task.

Repeat the same analysis you did in the previous task on the new data_filtered object.

# Your solution to output `df.describe().T` for numerical columns:
# Your solution to output `df.describe().T` for categorical columns:
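A sketch of both calls on stand-in data; the `include="object"` argument is what switches `describe()` to the categorical (object-dtype) columns:

```python
import pandas as pd

# Toy frame standing in for data_filtered: one object column,
# one numeric column.
data_filtered = pd.DataFrame({
    "date": ["2020-04-20", "2020-04-21"],
    "new_cases": [3047.0, 2256.0],
})

# Numerical summary, transposed so each data column becomes a row.
num_summary = data_filtered.describe().T

# Summary of the categorical (object-dtype) columns.
cat_summary = data_filtered.describe(include="object").T
```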

3.4. Initial Thoughts#

3.4.1. Use this section to record your observations.#

Does anything jump out at you as surprising or particularly interesting? Feel free to make additional plots as needed to explore your data set.

Where do you think you’ll go with exploring this dataset? Feel free to take notes in this section and use it as a scratch pad.

Any content in this area will only be marked for effort and completeness.

# Your observations here:#

  • Obs 1

  • Obs 2

3.7: Exporting data#

Great job! You’ve done some analysis and now are ready to further examine the dataset in the next Task!

But before we do that, save the filtered dataset to a new file called datafiltered.csv in the data folder of this repository.

# Your solution here
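A sketch of the export with `to_csv`. The stand-in DataFrame below replaces the real filtered data; `index=False` keeps the row labels out of the file, and the `makedirs` call is only a safeguard in case the data folder doesn't exist yet:

```python
import os
import pandas as pd

# Toy frame standing in for the real filtered dataset.
data_filtered = pd.DataFrame({
    "date": ["2020-04-20"],
    "new_cases": [3047.0],
})

# Ensure the target folder exists, then write the CSV without
# the index column.
os.makedirs("data", exist_ok=True)
data_filtered.to_csv("data/datafiltered.csv", index=False)
```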