Lecture 5#

Principles of Effective Visualizations#

Lecture learning goals#

By the end of the lecture you will be able to:

  1. Follow guidelines for best practices in visualization design.

  2. Avoid overplotting via 2D distribution plots.

  3. Adjust axes extents and formatting.

  4. Modify titles of several figure elements.

Readings#

This lecture’s readings are all from Fundamentals of Data Visualization. They are quite short, which is why there are three of them.


# Run this cell to ensure that altair plots show up in the exported HTML
# and that the R cell magic works
import altair as alt

# Save a vega-lite spec and a PNG blob for each plot in the notebook
alt.renderers.enable('mimetype')
# Handle large data sets without embedding them in the notebook
alt.data_transformers.enable('data_server')
alt.data_transformers.disable_max_rows()

# Load the R cell magic
%load_ext rpy2.ipython

Many of the schematics for the guidelines shown in today's lecture are from the Cato Institute's guidelines, which you can find in full here. Many other guidelines are available via this post and this spreadsheet. The Data Visualization Society has more interesting resources on style guidelines.

Another good way to learn is from other people’s mistakes. Here The Economist critiques its own plots. Not all of them are related to concepts we teach in class, but they are still worthwhile considerations to keep in mind when visualizing your data.

It is important to remember that many of these are guidelines, and there are times when you can break them, e.g. using colors other than the good defaults when the categories already have a pre-existing color association, such as fruits or political parties. This article discusses a few more cases where breaking a guideline worked well. The gain is not always large (such as with the 3D ice cubes), but breaking a guideline without ruining the visualization can produce something different that stands out, which could be important for making a visualization more memorable.

Overplotting#

%%R -o diamonds
# Copy diamonds df from R to Python

options(tidyverse.quiet = TRUE) 
library(tidyverse)

theme_set(theme_light(base_size = 18))

Plotting all the points in this df (there are around 50,000!) takes a little bit of time and causes the plot to become saturated so that we can’t see individual observations.

diamonds.shape
(53940, 10)
alt.Chart(diamonds).mark_point().encode(
    alt.X('carat'),
    alt.Y('price'))
../_images/Lecture5_8_0.png

Reducing marker size and lowering opacity only helps somewhat; there are still many overplotted areas in the chart.

A better approach in this case is to create a 2D histogram, where both the x and y axes are binned. This lays a mesh over the chart area, and the number of observations is counted in each cell, just like a histogram but with bins in 2D instead of 1D. A 2D histogram is a type of heatmap where the count is mapped to color. You could also use a mark that maps the count to size, which might even be more effective, but that is not as commonly seen.
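To make the counting step concrete, here is a minimal sketch of what a 2D histogram computes, using NumPy's `np.histogram2d`. The data is synthetic and only stands in for two numeric columns (it is not the diamonds data), and the choice of 40 bins per axis simply mirrors the `maxbins=40` used in this lecture; Altair picks "nice" bin edges, so its bins will generally differ.

```python
import numpy as np

# Synthetic stand-in for two numeric columns (assumption: NOT the diamonds data)
rng = np.random.default_rng(0)
x = rng.exponential(scale=0.8, size=10_000)
y = 4000 * x + rng.normal(scale=500, size=x.size)

# Bin both axes at once: counts[i, j] is the number of points whose
# x value falls in x-bin i and whose y value falls in y-bin j
counts, x_edges, y_edges = np.histogram2d(x, y, bins=40)

# Locate the densest cell of the mesh
i, j = np.unravel_index(counts.argmax(), counts.shape)
print(f"Densest cell: x in [{x_edges[i]:.2f}, {x_edges[i + 1]:.2f}), "
      f"y in [{y_edges[j]:.0f}, {y_edges[j + 1]:.0f}), "
      f"count = {int(counts[i, j])}")
```

Every observation lands in exactly one cell, so the counts sum to the number of rows. The heatmap is just this count matrix drawn with color.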

alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40)),
    alt.Y('price', bin=alt.Bin(maxbins=40)),
    alt.Color('count()'))
../_images/Lecture5_11_0.png

Here we can see more clearly that a small area is much denser than the others, although they looked similar in the saturated plot. How can we zoom into this area?

ggplot#

In ggplot, there are more options for 2D distribution plots.

%%R
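# theme_light() is a complete theme, so adding it after theme(text = ...)
# would discard the text size; set the size via base_size instead
theme_set(theme_light(base_size = 18))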
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_point()
../_images/Lecture5_14_0.png

As with geom_histogram, the binning is done by the geom, without explicitly binning the axes as in Altair.

%%R -w 600
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_bin2d()
../_images/Lecture5_16_0.png

Instead of squares, hexagonal bins can be used. These have theoretically superior qualities over squares, such as a more natural notion of neighbors (one step in any direction, instead of distinguishing diagonal from orthogonal neighbors). Their more circular shape also ensures that data points contributing to the count of a hexagonal bin are never as far from the bin center as a point in the corner of a square bin can be.

%%R -w 600
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex()
../_images/Lecture5_18_0.png
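The claim that hexagons keep points closer to the bin center can be checked with a little geometry. The sketch below compares a square and a regular hexagon of equal area (an assumption made here so that the comparison is fair) and computes the distance from the center to the farthest corner of each.

```python
import math

area = 1.0  # compare bins of equal area

# Square: side = sqrt(area); the farthest point is a corner at half the diagonal
square_corner = math.sqrt(area) * math.sqrt(2) / 2

# Regular hexagon: area = (3 * sqrt(3) / 2) * side**2, and the distance from
# the center to a corner equals the side length
hex_corner = math.sqrt(2 * area / (3 * math.sqrt(3)))

print(f"square: {square_corner:.3f}, hexagon: {hex_corner:.3f}")
```

For equal areas, the farthest point in the hexagon is roughly 12% closer to the center than the corner of the square.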

We can also create two-dimensional KDEs in ggplot. These work just like 1D KDEs, except that the kernel on each data point extends in two dimensions (so it looks a bit like a tent).

%%R -w 600
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_density_2d_filled()
../_images/Lecture5_20_0.png
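To illustrate what "a kernel on each data point extending in two dimensions" means, here is a simplified NumPy sketch. It assumes an isotropic Gaussian kernel with one hand-picked bandwidth for both axes and uses synthetic data; ggplot chooses bandwidths per axis automatically, so this is an illustration of the idea rather than a reproduction of `geom_density_2d_filled()`. The function name `gaussian_kde_2d` is hypothetical.

```python
import numpy as np

def gaussian_kde_2d(x, y, grid_x, grid_y, bandwidth):
    """Average one isotropic 2D Gaussian kernel per data point over a grid."""
    gx, gy = np.meshgrid(grid_x, grid_y, indexing="ij")
    # Squared distance from every grid node to every data point: (nx, ny, n)
    dist_sq = (gx[..., None] - x) ** 2 + (gy[..., None] - y) ** 2
    kernels = np.exp(-dist_sq / (2 * bandwidth**2)) / (2 * np.pi * bandwidth**2)
    # The density estimate is the mean of the per-point kernels
    return kernels.mean(axis=-1)

# Synthetic data (assumption: standard normal in both dimensions)
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = rng.normal(size=200)

grid = np.linspace(-6, 6, 61)
density = gaussian_kde_2d(x, y, grid, grid, bandwidth=0.5)

# A density should integrate to about 1 over the plane
cell_area = (grid[1] - grid[0]) ** 2
total_mass = density.sum() * cell_area
print(f"Total probability mass on the grid: {total_mass:.3f}")
```

The resulting `density` matrix is what a filled density plot colors and what a contour plot draws level lines through.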

In addition to indicating the density with color, we could also use ridges/contours, similar to a topographic map. This is akin to looking at a mountain range from above, so small circles indicate sharp peaks.

%%R -w 600
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_density_2d()
../_images/Lecture5_22_0.png

Axis ranges#

In many cases the most convenient way might be to filter the data before sending it to the chart. This way you are using the efficient pandas methods to do the heavy lifting, and you avoid the slowdown from plotting many points and then zooming.

The axis range is set with the domain parameter of alt.Scale. To set an axis range narrower than the extent of the data, we also need to include clip=True in the mark; otherwise marks will be drawn outside the plot area. We also need to increase the number of bins to get higher resolution in the zoomed-in region. Sometimes the range is padded with a bit of extra space automatically; if this is undesired, nice=False can be set inside alt.Scale.

All these steps should reinforce that it is usually better to filter the data and let Altair handle the plotting.

# Zoom in via the scale domain: clip the marks and raise maxbins for resolution
alt.Chart(diamonds).mark_rect(clip=True).encode(
    alt.X('carat', bin=alt.Bin(maxbins=400), scale=alt.Scale(domain=(0, 1))),
    alt.Y('price', bin=alt.Bin(maxbins=400), scale=alt.Scale(domain=(0, 2000))),
    alt.Color('count()'))
../_images/Lecture5_24_0.png
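For comparison, the filter-first alternative recommended above could look roughly like the sketch below. It uses a small synthetic data frame with `carat` and `price` columns as a stand-in for diamonds (the generated values are made up for illustration), and the same limits as the domain-based zoom: carat between 0 and 1, price between 0 and 2000.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with the same column names (assumption: NOT the real data)
rng = np.random.default_rng(1)
df = pd.DataFrame({"carat": rng.uniform(0.2, 3.0, size=5_000)})
df["price"] = (
    3000 * df["carat"] ** 2 + rng.normal(0, 300, size=len(df))
).clip(lower=300)

# Keep only the rows inside the region we want to look at;
# between() includes both endpoints by default
zoomed = df[df["carat"].between(0, 1) & df["price"].between(0, 2000)]

print(f"{len(zoomed)} of {len(df)} rows remain after filtering")
```

Passing a filtered frame like `zoomed` to `alt.Chart` means the default axis ranges and a moderate `maxbins` already focus on the region of interest, with no need for `clip=True` or a very large number of bins.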

Scales can be reversed.

# Invert
alt.Chart(diamonds).mark_rect(clip=True).encode(
    alt.X('carat', bin=alt.Bin(maxbins=400), scale=alt.Scale(domain=(0, 1))),
    alt.Y('price', bin=alt.Bin(maxbins=400), scale=alt.Scale(domain=(0, 2000), reverse=True)),
    alt.Color('count()'))
../_images/Lecture5_26_0.png

This is not usually that useful for the x and y axes, but remember that color, size, etc. are all scales in Altair, so they can be reversed with the same syntax! This is quite convenient and we will see more of it in the following lectures.

# Invert
alt.Chart(diamonds).mark_rect(clip=True).encode(
    alt.X('carat', bin=alt.Bin(maxbins=400), scale=alt.Scale(domain=(0, 1))),
    alt.Y('price', bin=alt.Bin(maxbins=400), scale=alt.Scale(domain=(0, 2000), reverse=True)),
    alt.Color('count()', scale=alt.Scale(reverse=True)))
../_images/Lecture5_28_0.png

ggplot#

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    scale_x_continuous(limits = c(0, 1)) +
    scale_y_continuous(limits = c(0, 2000))
../_images/Lecture5_30_0.png

By default ggplot removes observations outside the visible domain, so any large marks that extend both inside and outside the limits, such as bars, will be cut out. This is good because it makes it hard for people to make poor visualization choices, such as zooming in on bar charts instead of showing the entire domain starting from zero. In situations where you do need to include such partial graphics, you can set the out of bounds (oob) parameter to scales::oob_keep as described in https://scales.r-lib.org/reference/oob.html.

You can see that there is a bit of empty space or padding on each side of the x-axis, to the left of 0 and to the right of 1. If we want to get rid of this, we can set expand = expansion(mult = c(0, 0)) in the scale we’re using. The vector contains the padding at the lower and upper ends of the axis, expressed as a multiple of the current axis range. So if we wanted some space on the right side, we could use mult = c(0, 0.05) or similar instead. More details here.

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    scale_x_continuous(limits = c(0, 1), expand = expansion(mult = c(0, 0))) +
    scale_y_continuous(limits = c(0, 2000))
../_images/Lecture5_32_0.png
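The arithmetic behind multiplicative padding can be sketched in a few lines. The helper name `padded_range` is hypothetical and only mimics the calculation described above: each end of the axis is pushed outward by a fraction of the axis range. The 5% default on both sides reflects ggplot's default for continuous scales.

```python
def padded_range(limits, mult=(0.05, 0.05)):
    """Return axis limits after padding each end by a fraction of the range."""
    lower, upper = limits
    span = upper - lower
    return (lower - mult[0] * span, upper + mult[1] * span)

default_padding = padded_range((0, 1))           # padding on both sides
no_padding = padded_range((0, 1), mult=(0, 0))   # axis ends at the limits
right_only = padded_range((0, 1), mult=(0, 0.05))

print(default_padding, no_padding, right_only)
```

Because the padding is proportional to the range, the same `mult` values add 100 units of space on a 0 to 2000 price axis but only 0.05 on a 0 to 1 carat axis.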

To reverse the axis, we can set trans = 'reverse'. We also need to give the limits in the opposite order. Other transforms include log10, for which there is also the shortcut scale_x_log10.

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    scale_x_continuous(limits = c(0, 1), expand = expansion(mult = c(0, 0))) +
    scale_y_continuous(limits = c(2000, 0), trans = 'reverse')
../_images/Lecture5_34_0.png

Just as in Altair, color scales can be controlled the same way as axis scales (the color for the hexagons is set via fill rather than color).

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    scale_x_continuous(limits = c(0, 1), expand = expansion(mult = c(0, 0))) +
    scale_y_continuous(limits = c(2000, 0), trans = 'reverse') +
    scale_fill_continuous(trans = 'reverse')
../_images/Lecture5_36_0.png

Axis label formatting#

All label formats can be found here. Notable ones include %, $, e, s.
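Altair passes these format strings on to d3-format, whose syntax is modeled after Python's format specification mini-language. As a rough way to get a feel for some of the format types before putting them on an axis, Python's built-in `format` can be used as in the sketch below. This is an analogy rather than the exact output Altair produces: for example, d3 writes the exponent without zero padding, and the `s` (SI prefix) and `$` types have no direct Python equivalent.

```python
price = 18823
share = 0.256

scientific = format(price, ".2e")   # like the 'e' format type
percent = format(share, ".1%")      # like the '%' format type
grouped = format(price, ",")        # thousands separator

print(scientific, percent, grouped)
```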

# Take a sample of the data to make the following examples clearer
diamonds = diamonds.sample(1000, random_state=1010)

Scientific notation (10^ or e+) can be useful internally, but can be confusing for communicating to a more general audience.

alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40)),
    alt.Y('price', bin=alt.Bin(maxbins=40), axis=alt.Axis(format='e')),
    alt.Color('count()'))
../_images/Lecture5_40_0.png

SI (International System of Units) prefixes, such as k for thousands, are often easier to digest.

alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40)),
    alt.Y('price', bin=alt.Bin(maxbins=40), axis=alt.Axis(format='s')),
    alt.Color('count()'))
../_images/Lecture5_42_0.png

Prefixing the format type with ~ removes trailing zeros.

alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40)),
    alt.Y('price', bin=alt.Bin(maxbins=40), axis=alt.Axis(format='~s')),
    alt.Color('count()'))
../_images/Lecture5_44_0.png

Formatters can also be combined.

alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40)),
    alt.Y('price', bin=alt.Bin(maxbins=40), axis=alt.Axis(format='$~s')),
    alt.Color('count()'))
../_images/Lecture5_46_0.png

The same format keys can be used for the legend.

alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40)),
    alt.Y('price', bin=alt.Bin(maxbins=40), axis=alt.Axis(format='$~s')),
    alt.Color('count()', legend=alt.Legend(format='s')))
../_images/Lecture5_48_0.png
alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40)),
    alt.Y('price', bin=alt.Bin(maxbins=40), axis=alt.Axis(format='$~s', tickCount=1)),
    alt.Color('count()', legend=alt.Legend(format='s')))
../_images/Lecture5_49_0.png

The number of ticks can be modified via tickCount, but not for binned data.

alt.Chart(diamonds.sample(100)).mark_point().encode(
    alt.X('carat', axis=alt.Axis(tickCount=40)),
    alt.Y('price', axis=alt.Axis(tickCount=2)))
../_images/Lecture5_51_0.png

You can also remove an axis altogether.

alt.Chart(diamonds.sample(100)).mark_point().encode(
    alt.X('carat', axis=None),
    alt.Y('price', axis=alt.Axis(tickCount=2)))
../_images/Lecture5_53_0.png

And set a different theme.

alt.themes.names()
['dark',
 'default',
 'fivethirtyeight',
 'ggplot2',
 'latimes',
 'none',
 'opaque',
 'quartz',
 'urbaninstitute',
 'vox']
alt.themes.enable('dark')
alt.Chart(diamonds.sample(100)).mark_point().encode(
    alt.X('carat', axis=None),
    alt.Y('price', axis=alt.Axis(tickCount=2)))
../_images/Lecture5_56_0.png

ggplot#

The scales package helps with the formatting in ggplot.

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    scale_y_continuous(labels = scales::label_scientific())
../_images/Lecture5_59_0.png
%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    scale_y_continuous(labels = scales::label_number_si())
../_images/Lecture5_60_0.png
%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    scale_y_continuous(labels = scales::label_dollar())
../_images/Lecture5_61_0.png

The legend can be formatted via the same syntax.

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    scale_y_continuous(labels = scales::label_dollar()) +
    scale_fill_continuous(labels = scales::label_number_si())
../_images/Lecture5_63_0.png

The scales package also helps us set the number of ticks (breaks) on an axis.

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    scale_y_continuous(
        labels = scales::label_dollar(),
        breaks = scales::pretty_breaks(n = 10)) +
    scale_fill_continuous(labels = scales::label_number_si())
../_images/Lecture5_65_0.png

You can remove an axis.

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    theme(axis.title.x=element_blank(),
          axis.text.x=element_blank(),
          axis.ticks.x=element_blank())
../_images/Lecture5_67_0.png

Or set a theme that hides all axis objects.

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    theme_void()
../_images/Lecture5_69_0.png

The classic theme is nice. There are many more sophisticated themes in the ggthemes package.

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    theme_classic()
../_images/Lecture5_71_0.png

Figure, axis, and legend titles#

When doing EDA, axis titles etc. don’t matter that much, since you are the primary person interpreting them. In communication, however, your plots often need to be interpretable on their own without explanation. Setting descriptive titles is a big part of this; please see the required readings for more info.

Axis titles should be capitalized and contain spaces, rather than being raw variable names with underscores.

# Set back to the default theme
alt.themes.enable('default')
alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40), title='Carat'),
    alt.Y('price', bin=alt.Bin(maxbins=40), title='Price'),
    color='count()')
../_images/Lecture5_73_0.png

The legend title is controlled inside the encoding channel that it describes.

alt.Chart(diamonds).mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40), title='Carat'),
    alt.Y('price', bin=alt.Bin(maxbins=40), title='Price'),
    alt.Color('count()', title='Number of Records'))
../_images/Lecture5_75_0.png
alt.Chart(diamonds, title='Higher carat diamonds are more expensive').mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40), title='Carat'), 
    alt.Y('price', bin=alt.Bin(maxbins=40), title='Price'),
    alt.Color('count()', title='Number of Records'))
../_images/Lecture5_76_0.png

A subtitle is a property of the title.

(alt.Chart(
    diamonds,
    title=alt.TitleParams(
        text='Higher carat diamonds are more expensive',
        subtitle='But most diamonds are of low carat'))
 .mark_rect().encode(
    alt.X('carat', bin=alt.Bin(maxbins=40), title='Carat'), 
    alt.Y('price', bin=alt.Bin(maxbins=40), title='Price'),
    alt.Color('count()', title='Number of Records')))
../_images/Lecture5_78_0.png

ggplot#

%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    labs(x = 'Carat', y = 'Price')
../_images/Lecture5_80_0.png
%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    labs(x = 'Carat', y = 'Price', fill = 'Number')
../_images/Lecture5_81_0.png
%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    labs(x = 'Carat', y = 'Price', fill = 'Number', title = 'Diamonds') +
    scale_y_continuous(labels = scales::label_dollar())
../_images/Lecture5_82_0.png
%%R
ggplot(diamonds) +
    aes(x = carat,
        y = price) +
    geom_hex() +
    labs(x = 'Carat', y = 'Price', fill = 'Number', title = 'Diamonds', subtitle='Small diamonds') +
    scale_y_continuous(labels = scales::label_dollar())
../_images/Lecture5_83_0.png