Practice with Altair - Part 2

Introduction

Learning Outcomes

In this lab you will:

  • Use the Altair library in Python to create the following types of plots:

    • Scatterplots

    • Bar/column plots

    • Line plots

    • Grouped bar plots

    • Box plots

    • Jitter/strip plots

    • Heatmaps

  • Extract insight(s) from a visualization

  • Summarize the benefits and disadvantages of two plot types showing the same data

# Load packages

import pandas as pd
import altair as alt
import numpy

# Need to enable this to allow work with larger datasets (https://altair-viz.github.io/user_guide/faq.html)
alt.data_transformers.enable('json')
DataTransformerRegistry.enable('json')

Exercise 1: Load the dataset

rubric={correctness:1}

contagious_diseases = pd.read_csv('https://raw.githubusercontent.com/firasm/bits/master/us_contagious_diseases.csv')

Exercise 2: Explore the dataset

rubric={reasoning:2, accuracy:1, quality:1}

Task 2.1: Explore your dataset by following the instructions below

# First, let's understand the columns

print("The dataframe columns are: ", list(contagious_diseases.columns), "\n")

# disease: disease type, there are 7 options. Can check the names with: `contagious_diseases['disease'].unique()`
# count: total number of reported cases across all the "weeks_reporting"
# weeks_reporting: number of weeks in that year for which cases were reported

print("- The diseases are: ", list(contagious_diseases['disease'].unique()),"\n")

years_list = sorted(list(contagious_diseases['year'].unique()))
print("- The years range from {0:.0f} to {1:.0f} and there are {2} years: ".format(years_list[0],
                                                                           years_list[-1],
                                                                           len(years_list)),"\n")

states_list = list(contagious_diseases['state'].unique())
print("- There is data for {0} states.".format(len(states_list)), "\n")

# Do a similar thing to explore the other columns: "weeks_reporting", "population", and "count". 
# Note: You do not have to follow the same format, but do print it out for us to read

### YOUR SOLUTION HERE
The dataframe columns are:  ['disease', 'state', 'year', 'weeks_reporting', 'count', 'population'] 

- The diseases are:  ['Hepatitis A', 'Measles', 'Mumps', 'Pertussis', 'Polio', 'Rubella', 'Smallpox'] 

- The years range from 1928 to 2011 and there are 84 years:  

- There is data for 51 states. 

Task 2.2: Select two states and a disease name from your exploration above.

# Now, let's wrangle and clean the data a bit:

# Counts are pretty variable between diseases and states, so to compare them we should come up with an "annual rate".
# Also, since the population varies so much between states, we should normalize the annual rate to cases per 100,000 people
# Also, not all reports are for exactly 52 weeks, so we will need to adjust for this
# Wrangled annual rate column is then:

contagious_diseases['annual_rate'] = (contagious_diseases['count'] / contagious_diseases['population']
                             ) * 100000 * (52 / contagious_diseases['weeks_reporting'])

# Pertussis and Polio have a bunch of missing data so we're just going to ignore those diseases for now
contagious_diseases = contagious_diseases[~contagious_diseases['disease'].isin(['Pertussis','Polio'])]

# Finally, let's pick a disease and two states and start exploring the data

### YOUR SOLUTION HERE

name_state = "FILL_IN_STATE_NAME_HERE"
name_state2 = "FILL_IN_STATE_NAME_HERE"
name_disease = "FILL_IN_DISEASE_NAME_HERE"
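To make the `annual_rate` arithmetic above concrete, here is a minimal sketch of the same formula for a single row, written in plain Python. The function name `annual_rate_per_100k` is hypothetical and not part of the lab, and the numbers are made up. The guard for zero reporting weeks is an assumption about how you might want to treat such rows; the pandas expression above does not raise an error in that case but produces `inf` or `NaN` instead.

```python
import math


def annual_rate_per_100k(count, population, weeks_reporting):
    """Cases per 100,000 people, scaled up to a full 52-week year."""
    if weeks_reporting == 0:
        # No reporting weeks: treat the rate as undefined rather than zero.
        return math.nan
    return (count / population) * 100_000 * (52 / weeks_reporting)


# Made-up example: 500 cases reported over 26 weeks in a population of 1,000,000.
# 500 / 1,000,000 * 100,000 is 50 per 100,000, doubled because only half a year was reported.
half_year = annual_rate_per_100k(500, 1_000_000, 26)
no_reports = annual_rate_per_100k(0, 1_000_000, 0)
print(half_year, no_reports)
```

Rows with very few reporting weeks get scaled up a lot, so their rates are worth treating with some caution when you read your plots.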

Exercise 3: Line plot of the population change

rubric={viz:5}

Task 3.1: Create a line plot of the populations of two states of your choice from 1928 to 2011

# Let's create conditional objects so we can more easily get subsets of data from our pandas df

cond_state2 = contagious_diseases['state'].isin([name_state,
                                                name_state2])
cond_disease = contagious_diseases["disease"] == name_disease

df_state2_disease = contagious_diseases[cond_state2 & cond_disease]

### YOUR SOLUTION HERE

Exercise 4: Boxplot

rubric={vis:5, reasoning:5}

Task 4.1: Create a boxplot of all the states’ populations from 1930 to 2010 in 10-year intervals

  • If you think this is not an effective plot, I would probably agree with you. You can complain about it in 4.2!

# Again let's limit our dataset using conditional objects
cond_years = contagious_diseases["year"].isin(range(1930,2015,10))

# Hint: supply the data as: contagious_diseases[cond_years]

### YOUR SOLUTION HERE

Task 4.2: Explain why the plot above is not effective in describing how the population across all states has changed over time. Which plot would you make instead? (There’s no need to actually make the plot.)

Your answer here

Exercise 5: Plot the rates for a particular disease and state

rubric={vis:3, reasoning:2}

Task 5.1: Choose a plot type, and plot the annual rate of the disease and state you picked above. Justify your selection and extract any insights

  • You can browse the Altair gallery for the possible plot types and the syntax

  • Remember to label the axes and titles

# Here is the wrangled data for a disease and the first state you picked earlier:
cond_state = contagious_diseases["state"] == name_state 
df_state_disease = contagious_diseases[cond_state & cond_disease]

### YOUR SOLUTION HERE

Task 5.2: Justify your plot type and list any insights you extracted from your plot

Your answer here

Exercise 6: Plot the rates for a particular disease and two states

rubric={vis:3, reasoning:2}

Task 6.1: Create a new plot, now showing the data for two states overlaid for the same disease

# Let's first create the conditional object and wrangle our data again

df_state2_disease = contagious_diseases[cond_state2 & cond_disease]

### YOUR SOLUTION HERE

Task 6.2: Explain what insights you can extract from this about the disease and states you selected. From your plot, estimate the year in which the incidence rate fell to negligible levels in each state.

Your answer here

Exercise 7: Introduction of vaccines

rubric={vis:5}

According to the Centers for Disease Control and Prevention (CDC), vaccines for the diseases we are considering were introduced in the United States in the following years:

### Create a dataframe with the vaccine data 
vaccines = pd.DataFrame.from_dict({"Hepatitis A": 1995 , "Measles": 1963, "Mumps": 1967, "Pertussis": 1914, "Polio": 1955, "Rubella":1969, "Smallpox":1800},
                                    orient = "index").reset_index()
vaccines = vaccines.rename(columns={"index":"Disease", 0:"Year Introduced"})

vaccines
       Disease  Year Introduced
0  Hepatitis A             1995
1      Measles             1963
2        Mumps             1967
3    Pertussis             1914
4        Polio             1955
5      Rubella             1969
6     Smallpox             1800

Task 7.1: Assign your plot in Exercise 3.1 to a variable, then add a line to that plot marking the year vaccines were introduced

### YOUR SOLUTION HERE

Exercise 8: Plot the aggregate annual rate for a disease across ALL states

rubric={vis:3,reasoning:2}

Task 8.1: Wrangle the data and then plot the aggregate annual rate for a disease across the country

  • Note: to be clear, we do not want to see the data for every state plotted individually

# Some wrangling to get you the data you need

df_con_usavg = contagious_diseases.groupby(["disease", "year"]).agg(
    {"annual_rate": numpy.nanmean}).reset_index()

### YOUR SOLUTION HERE
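To see what the aggregation above does, here is a small sketch on made-up data. The `toy` dataframe and its values are invented, and the pandas `"mean"` aggregation is swapped in for `numpy.nanmean`; pandas skips missing values when computing a group mean, so the two should agree for this purpose.

```python
import numpy as np
import pandas as pd

# Made-up state-level rates for one year; one state has no usable rate
toy = pd.DataFrame({
    "disease": ["Measles", "Measles", "Measles", "Mumps", "Mumps"],
    "state": ["A", "B", "C", "A", "B"],
    "year": [1950, 1950, 1950, 1950, 1950],
    "annual_rate": [100.0, 300.0, np.nan, 10.0, 30.0],
})

# One row per (disease, year); the NaN row is ignored by the mean
toy_usavg = toy.groupby(["disease", "year"], as_index=False)["annual_rate"].mean()
print(toy_usavg)
```

Note that this is an unweighted mean of the state rates, so a small state counts as much as a large one. It is not the same as dividing total cases by total population.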

Task 8.2: List any insights you extracted from your plot

Your answer here

Exercise 9: Faceted plot for all diseases

rubric={vis:10,reasoning:5}

Task 9.1: Create a plot, faceted by disease, to show the aggregate annual rate across the US for all diseases

  • Letting the annual-rate axis differ between facets is generally bad practice (and it is here too), but you may do so to visualize the data better. Justify your choice below, whether or not you do.

### YOUR SOLUTION HERE

Task 9.2: Explain whether or not the plot above is an effective visualization. Consider also your decision to allow (or not allow) the y-axis to be different for each plot. Discuss the pros and cons of the choice.

Your answer here

Task 9.3: List the insights you were able to extract from this plot.

Your answer here

Exercise 10: Create a heatmap of the annual counts for a disease

rubric={vis:20,reasoning:5}

Task 10.1: Create a heatmap of the annual counts for a disease, and add a vertical line at the year the vaccine was introduced.

### YOUR SOLUTION HERE

Task 10.2: Critique your plot. Is this an effective way to visualize the spread of the disease and the impact of the vaccine? List any insights you extracted from your plot.

Your answer here

Congratulations! You have now done a full data analysis and produced a heatmap similar to the one the Wall Street Journal published. You’re well on your way to becoming a true graphics master.

(Optional): Compare and Contrast

rubric={reasoning:5}

Skim through the Wall Street Journal article and find the heatmap for your disease. Compare and contrast the effectiveness of your heatmap vs. the one from the article. Think about the effective design principles we discussed in lectures. How would you improve your plot or the WSJ plot? (Note: there is no need to actually improve your heatmap; just write about how you would improve it if time permitted.)

Your answer here