Lecture 6

Lecture learning goals

By the end of the lecture you will be able to:

  1. Choose appropriate color schemes for your data.

  2. Use pre-made and custom color schemes.

  3. Selectively highlight and annotate data with color and text.

  4. Directly label data instead of using legends.

Required readings

This lecture’s readings are both from Fundamentals of Data Visualization.


Facilitate interpretation through informed color choices

In general, when presenting continuous data, a perceptually uniform colormap (such as viridis) is often the most suitable choice. This type of colormap ensures that equal steps in data are perceived as equal steps in color space. The human brain perceives changes in lightness to represent changes in the data more accurately than changes in hue. Therefore, colormaps with monotonically increasing lightness throughout the colormap will be easier to interpret for the viewer. More details and examples of such colormaps are available in the matplotlib documentation, and many of the core design principles are outlined in this entertaining talk.

Color vision deficiency is common; red-green color blindness in particular affects about 8% of men and 0.5% of women. Guidelines for making your visualizations more accessible to those with reduced color vision will in many cases also improve the interpretability of your graphs for people who have standard color vision. If you are unsure how your plot will look for someone who sees colors differently than you, this website lets you upload an image and simulate different color vision deficiencies. A colormap designed specifically to look the same for people with and without the most common color vision deficiency is cividis. In addition to careful color choices, visualization clarity can be improved by using different shapes for each grouping.

The jet rainbow colormap should be avoided for many reasons, including that the sharp transitions between colors introduce visual thresholds that do not represent the underlying continuous data. Another issue is luminance (brightness). For example, your eye is drawn to the yellow and cyan regions, because their luminance is higher. This can have the unfortunate effect of highlighting features in your data that don’t exist, misleading your viewers! Since higher values are not always lighter, your graph is also not going to translate well to greyscale. More details about jet can be found in this blog post and this series of posts. A better alternative when you really need small differences in your data to stand out is to use the turbo rainbow color scheme.
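The luminance argument can be checked with a rough calculation. The sketch below uses the standard sRGB relative luminance formula as a stand-in for perceived lightness (an approximation, not a full perceptual model) and a handful of approximate anchor colors for viridis and jet; the helper names are hypothetical.

```python
def relative_luminance(hex_color):
    """Approximate brightness (0 = black, 1 = white) of a '#rrggbb' color."""
    channels = [int(hex_color[i:i + 2], 16) / 255 for i in (1, 3, 5)]
    # Undo the sRGB gamma encoding before weighting the channels
    linear = [c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
              for c in channels]
    r, g, b = linear
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def is_monotonic(values):
    """True if the values only increase or only decrease."""
    pairs = list(zip(values, values[1:]))
    return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)


# Approximate anchor colors, from the low end to the high end of each colormap
viridis = ['#440154', '#3b528b', '#21918c', '#5ec962', '#fde725']
jet = ['#000080', '#0000ff', '#00ffff', '#ffff00', '#ff0000', '#800000']

for name, colors in [('viridis', viridis), ('jet', jet)]:
    luminance = [round(relative_luminance(c), 2) for c in colors]
    print(name, luminance, 'monotonic:', is_monotonic(luminance))
```

With these anchors, viridis gets steadily brighter, while jet peaks around cyan and yellow and then darkens again towards red, which is why it does not survive a conversion to greyscale.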

If you are interested in reading more about how color choices determine what we see, this is an interesting article.

Color schemes/maps

Categorical

# Run this cell to ensure that altair plots show up in the exported HTML
# and that the R cell magic works
import altair as alt

# Save a vega-lite spec and a PNG blob for each plot in the notebook
alt.renderers.enable('mimetype')
# Handle large data sets without embedding them in the notebook
# (this transformer is provided by the separate altair_data_server package)
alt.data_transformers.enable('data_server')

# Load the R cell magic
%load_ext rpy2.ipython

The default categorical colormap used in Altair is “Tableau10”, which consists of 10 colors and starts with a blue, an orange, and a red color.

import altair as alt
from vega_datasets import data

iris = data.iris()

alt.Chart(iris).mark_circle(size=100).encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('species'))
../_images/Lecture6_9_0.png

You can change the colormap (or colorscheme) by specifying its name as a string to scheme inside alt.Scale. All the available colormaps can be viewed on this page, which also lists what type of data the colormap is useful for (categorical, sequential, diverging, cyclic).

alt.Chart(iris).mark_circle(size=100).encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('species', scale=alt.Scale(scheme='dark2')))
../_images/Lecture6_11_0.png

If you don’t like any of the premade colormaps, you can make your own. It can be really fun to experiment with different colors and I encourage you to do so. However, please keep in mind that a lot of knowledge and consideration has gone into the existing color scales, so there are good reasons to use them for the final versions of your plots, especially for communication purposes, at least until you have become more knowledgeable about these topics yourself.

Below I use three colors by name. You can see all the available names in the first image here (one of the colors, “rebecca purple”, has a touching story to it). You can also specify colors directly as hex codes, which are defined from #000000 for black (“zero color”) to #ffffff for white (“full color”); there is an example with ggplot below. This is very useful when trying to replicate a plot that someone else has made: you can use a color picker tool in GIMP, Paint, or similar to get the exact HTML code from an image, and then use it in your plot as a string. If you don’t have any software with that functionality installed, you can use this online color picker tool.

colors = ['coral', 'steelblue', 'rebeccapurple']
alt.Chart(iris).mark_circle(size=100).encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('species', scale=alt.Scale(range=colors)))
../_images/Lecture6_13_0.png

Sequential

When encoding a numerical variable as color, a perceptually uniform sequential colormap will be chosen so that numerical changes are accurately represented as color changes. It is usually a good idea to have the low values be the ones closest to the background color, as with the light blue ones below.

alt.Chart(iris).mark_circle(size=100).encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('petalWidth'))
../_images/Lecture6_16_0.png

You can change the colorscheme to any of the ones listed here.

alt.Chart(iris).mark_circle(size=100).encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('petalWidth', scale=alt.Scale(scheme='greenblue')))
../_images/Lecture6_18_0.png

“Viridis” is a well-researched color scheme, originally developed for matplotlib and now used in many different places. Compared to the ones above, you see changes in detail slightly better because of the larger number of hues used, which could also give rise to a very slight extra highlighting effect (for example when going from green to yellow), as we discussed in the intro video.

alt.Chart(iris).mark_circle(size=100).encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('petalWidth', scale=alt.Scale(scheme='viridis')))
../_images/Lecture6_20_0.png

You can reverse a color scale the same way we learned to reverse axis scales.

alt.Chart(iris).mark_circle(size=100).encode(
    x='petalWidth',
    y='petalLength',
    color=alt.Color('petalWidth', scale=alt.Scale(scheme='viridis', reverse=True)))
../_images/Lecture6_22_0.png

Diverging

If we were to map a variable that has a natural midpoint, such as a correlation that is defined from -1 to 1, it is not that helpful to use the default colormap, since it will make values close to zero seem more important than values close to -1.

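# Keep only the numeric columns; newer pandas versions raise an error
# when corr() meets text columns such as the country names
corr_df = data.gapminder().select_dtypes('number').corr().stack().reset_index(name='corr')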
alt.Chart(corr_df).mark_rect().encode(
    x='level_0',
    y='level_1',
    tooltip='corr', 
    color=alt.Color('corr')).properties(width=200, height=200)
../_images/Lecture6_24_0.png

Instead we can choose a color scheme that is more suitable for showing diverging values, and define the color domain manually to match the range of our variable. An alternative to setting the color scheme explicitly would have been to set domainMid=0, in which case Altair understands that this is a diverging variable with a natural midpoint and uses the default diverging color scheme.

(alt.Chart(corr_df).mark_rect().encode(
    x='level_0',
    y='level_1',
    tooltip='corr',
    color=alt.Color('corr', scale=alt.Scale(domain=(-1, 1), scheme='purpleorange')))
 .properties(width=200, height=200)) 
../_images/Lecture6_26_0.png

ggplot

Categorical

The default categorical colormap in ggplot is not explicitly designed, but rather created by selecting equally spaced colors from the color wheel.

%%R

options(tidyverse.quiet = TRUE) 
library(tidyverse)

theme_set(theme_light(base_size = 18))

%%R

ggplot(iris) + 
    aes(x = Petal.Width,
        y = Petal.Length,
        color = Species) +
    geom_point(size = 5)
../_images/Lecture6_31_0.png

Not all useful color maps are collected in one place; they are available through different functions and packages. For example, the color maps from ColorBrewer are accessible via scale_color|fill_brewer|distiller (use the brewer suffix for categorical and distiller for sequential values).

%%R
ggplot(iris) + 
    aes(x = Petal.Width,
        y = Petal.Length,
        color = Species) +
    geom_point(size = 5) +
    scale_color_brewer(palette = 'Dark2')

All R color maps can be viewed in this repo. The Tableau colors used in Altair are accessible via the ggthemes package.

%%R
ggplot(iris) + 
    aes(x = Petal.Width,
        y = Petal.Length,
        color = Species) +
    geom_point(size = 5) +
    ggthemes::scale_color_tableau()
../_images/Lecture6_35_0.png

We could also set the color scale manually. Let’s use the same colors as in the Altair example, but this time via their HTML codes instead.

%%R
ggplot(iris) + 
    aes(x = Petal.Width,
        y = Petal.Length,
        color = Species) +
    geom_point(size = 5) +
    scale_color_manual(values = c('#FF7F50', '#4682B4', '#663399'))
../_images/Lecture6_37_0.png

Sequential

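The default color map for numerical values goes from dark blue (low values) to light blue (high values), the opposite direction of Altair’s default; ggplot’s default theme has a grey rather than a white background.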

%%R
ggplot(iris) + 
    aes(x = Petal.Width,
        y = Petal.Length,
        color = Petal.Width) +
    geom_point(size = 5)
../_images/Lecture6_40_0.png

It can be changed to the viridis color map.

%%R
ggplot(iris) + 
    aes(x = Petal.Width,
        y = Petal.Length,
        color = Petal.Width) +
    geom_point(size = 5) +
    scale_color_viridis_c()
../_images/Lecture6_42_0.png

Reversing is possible via the same techniques as for axes, but it does not look great since the color legend is sorted “upside down”.

%%R
ggplot(iris) + 
    aes(x = Petal.Width,
        y = Petal.Length,
        color = Petal.Width) +
    geom_point(size = 5) +
    scale_color_viridis_c(trans = 'reverse')
../_images/Lecture6_44_0.png

There is a special syntax for colormaps that preserves the orientation of the legend while reversing.

%%R
ggplot(iris) + 
    aes(x = Petal.Width,
        y = Petal.Length,
        color = Petal.Width) +
    geom_point(size = 5) +
    scale_color_viridis_c(direction = -1)
../_images/Lecture6_46_0.png

Diverging

Like in Altair, it is not that informative to use the default color map for diverging values.

%%R -i corr_df
library(tidyverse)

ggplot(corr_df) +
    aes(x = level_0,
        y = level_1,
        fill = corr) +
    geom_tile()
../_images/Lecture6_49_0.png

The default bluered tableau diverging color map can be used via ggthemes.

%%R -i corr_df
ggplot(corr_df) +
    aes(x = level_0,
        y = level_1,
        fill = corr) +
    geom_tile() +
    ggthemes::scale_fill_gradient2_tableau()
../_images/Lecture6_51_0.png

However, this sets blue as high values by default, which goes against people’s intuition since red is often used for “hot” and blue for “cold”. We can either reverse the colormap, or use one from ColorBrewer instead.

%%R -i corr_df
ggplot(corr_df) +
    aes(x = level_0,
        y = level_1,
        fill = corr) +
    geom_tile() +
    scale_fill_distiller(palette = 'PuOr')
../_images/Lecture6_53_0.png

Defining the colormap limits ensures that low and high values of the same magnitude are equally highlighted.

%%R -i corr_df
ggplot(corr_df) +
    aes(x = level_0,
        y = level_1,
        fill = corr) +
    geom_tile() +
    scale_fill_distiller(palette = 'PuOr', limits = c(-1, 1))
../_images/Lecture6_55_0.png
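Correlations are bounded by -1 and 1, so the limits can be typed in directly. For other diverging variables, symmetric limits can be derived from the data instead. A minimal sketch, with a hypothetical helper name and made-up numbers:

```python
def symmetric_limits(values):
    """Return (-m, m) where m is the largest absolute value in `values`."""
    largest = max(abs(v) for v in values)
    return (-largest, largest)


# Hypothetical measurements that are not centered on zero
changes = [-0.2, 0.1, 0.45, 0.8]
print(symmetric_limits(changes))  # (-0.8, 0.8)
```

The resulting pair could then be passed as limits in ggplot or as the domain of alt.Scale in Altair.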

Highlighting with colors and text labels

We can also use color to highlight manually selected elements in plots, for example the year with the highest wheat price in the figure below.

# Reduce the number of bars for clarity; copy so that we can add a column safely
wheat = data.wheat().query('year > 1700').copy()

# Flag the year with the highest wheat price in a new boolean column
wheat['highlight'] = wheat['wheat'] == wheat['wheat'].max()

alt.Chart(wheat).mark_bar().encode(
    x='year:O',
    y="wheat",
    color='highlight')
../_images/Lecture6_58_0.png

The legend is not that useful here, so let’s remove it.

alt.Chart(wheat).mark_bar().encode(
    x='year:O',
    y="wheat",
    color=alt.Color('highlight', legend=None))
../_images/Lecture6_61_0.png

Adding an annotation in the form of the exact price can be helpful.

bars = alt.Chart(wheat).mark_bar().encode(
    x='year:O',
    y="wheat",
    color=alt.Color('highlight', legend=None))
bars + bars.mark_text(dy=-5).encode(text='wheat')
../_images/Lecture6_63_0.png

If we want to override the color, we need to set it in the encoding. Setting it in the mark would not work since we are building off a chart which has the encoding color set, and this has higher precedence than color set in the mark. To pass a literal value in the encoding (instead of asking altair to look for a column with this name in the dataframe), we can use alt.value().

bars = alt.Chart(wheat).mark_bar().encode(
    x='year:O',
    y="wheat",
    color=alt.Color('highlight', legend=None))
bars + bars.mark_text(dy=-5).encode(text='wheat', color=alt.value('black'))
../_images/Lecture6_65_0.png

Now that we are supplying the exact value, we no longer need the gridlines, which are there to help infer values (and also make exact comparisons between graphical elements far away from each other).

bars = alt.Chart(wheat).mark_bar().encode(
    x='year:O',
    y=alt.Y('wheat', axis=alt.Axis(grid=False)),
    color=alt.Color('highlight', legend=None))
bars + bars.mark_text(dy=-5).encode(text='wheat', color=alt.value('black'))
../_images/Lecture6_67_0.png

Generally, having an outline of a plot is not that aesthetically pleasing. It works well in altair when we have the gridlines since they melt together, but now that they are gone, let’s also remove the outline.

bars = alt.Chart(wheat).mark_bar().encode(
    x='year:O',
    y=alt.Y('wheat', axis=alt.Axis(grid=False)),
    color=alt.Color('highlight', legend=None))
(bars + bars.mark_text(dy=-5).encode(text='wheat', color=alt.value('black'))).configure_view(strokeWidth=0)
../_images/Lecture6_69_0.png

We can label only the highlighted year by filtering the data frame.

bars + alt.Chart(wheat.query('year == 1810')).mark_text(dy=-5).encode(
    x='year:O',
    y=alt.Y("wheat",axis=alt.Axis(grid=False)),
    text='wheat')
../_images/Lecture6_71_0.png

To set a custom text, we can use alt.value again.

bars + alt.Chart(wheat.query('year == 1810')).mark_text(dy=-5, dx=-30).encode(
    x='year:O',
    y="wheat",
    text=alt.value('The record year'))
../_images/Lecture6_73_0.png

To set multiple values, we could either add an annotation column to our existing data frame, or create a new dataframe as below.

import pandas as pd

annot_wheat = pd.DataFrame({'year': [1730, 1810], 'wheat': [26, 99], 'text': ['The lowest year', 'The record year']})
annot_wheat
   year  wheat             text
0  1730     26  The lowest year
1  1810     99  The record year

bars + alt.Chart(annot_wheat).mark_text(dy=-5).encode(
    x='year:O',
    y="wheat",
    text='text')
../_images/Lecture6_76_0.png

To avoid the overlap of the new annotation and the blue bars, we would have to create the annotations in two separate steps and change their text position or color accordingly. If using two separate steps, we can also use the alt.value() technique, which avoids us having to create the new data frame.

ggplot

Using the dataframe with the highlight column, we can set the fill accordingly.

%%R -i wheat
ggplot(wheat) +
    aes(x = year,
        y = wheat,
        fill = highlight) +
    geom_bar(stat = 'identity', color = 'white') +
    ggthemes::scale_fill_tableau()
../_images/Lecture6_80_0.png

And remove the legend.

%%R 
ggplot(wheat) +
    aes(x = year,
        y = wheat,
        fill = highlight) +
    geom_bar(stat = 'identity', color = 'white') + 
    ggthemes::scale_fill_tableau() +
    theme(legend.position = 'none')
../_images/Lecture6_82_0.png

To add annotations, we can use geom_text with the label aesthetic.

%%R 
ggplot(wheat) +
    aes(x = year,
        y = wheat,
        fill = highlight,
        label = wheat) +
    geom_bar(stat = 'identity', color = 'white') + 
    geom_text(vjust=-0.3) +
    ggthemes::scale_fill_tableau() +
    theme(legend.position = 'none')
../_images/Lecture6_84_0.png

To give the labels the same colors as the bars, we can set the color aesthetic and add the corresponding color scale.

%%R
ggplot(wheat) +
    aes(x = year,
        y = wheat,
        fill = highlight,
        label = wheat,
        color = highlight) +
    geom_bar(stat = 'identity', color = 'white') + 
    geom_text(vjust=-0.3) +
    ggthemes::scale_fill_tableau() +
    ggthemes::scale_color_tableau() +
    theme(legend.position = 'none')
../_images/Lecture6_86_0.png

Now we can remove the gridlines.

%%R
ggplot(wheat) +
    aes(x = year,
        y = wheat,
        fill = highlight,
        label = wheat,
        color = highlight) +
    geom_bar(stat = 'identity', color = 'white') + 
    geom_text(vjust=-0.3) +
    ggthemes::scale_fill_tableau() +
    ggthemes::scale_color_tableau() +
    theme(legend.position = 'none',
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank())
../_images/Lecture6_88_0.png

If you want your label to represent the count (which we normally calculate in the geom for ggplot), you can set it to label = stat(count); in ggplot2 3.3.0 and later, after_stat(count) is the preferred spelling.

To set a specific annotation text, we could either use the same approach as in Altair of adding a new column to our data frame, or we could use the annotate function.

%%R
ggplot(wheat) +
    aes(x = year,
        y = wheat,
        fill = highlight) +
    geom_bar(stat = 'identity', color = 'white') + 
    annotate('text', label = 'The record year', x = 1800, y = 102) +
    ggthemes::scale_fill_tableau() +
    theme(legend.position = 'none')
../_images/Lecture6_90_0.png

Direct labeling instead of using a legend

In the example below, the legend is not in the same order as where the lines end, which can make it a bit less intuitive to read.

stocks = data.stocks()

alt.Chart(stocks).mark_line().encode(
    x='date',
    y='price',
    color='symbol')
../_images/Lecture6_92_0.png

We can align the ordering of these two by calculating the order of the lines at the latest date and then passing the labels in this order as a list to the sort parameter.

stock_order = (
    stocks
    .loc[stocks['date'] == stocks['date'].max()]
    .sort_values('price', ascending=False))
stock_order
    symbol       date   price
436   GOOG 2010-03-01  560.19
559   AAPL 2010-03-01  223.02
245   AMZN 2010-03-01  128.82
368    IBM 2010-03-01  125.55
122   MSFT 2010-03-01   28.80

alt.Chart(stocks).mark_line().encode(
    x='date',
    y='price',
    color=alt.Color('symbol', sort=stock_order['symbol'].tolist()))
../_images/Lecture6_95_0.png

The titles of categorical axes and legends are often not that informative, and in many cases we can remove them.

alt.Chart(stocks).mark_line().encode(
    x='date',
    y='price',
    color=alt.Color(
        'symbol',
        sort=stock_order['symbol'].tolist(),
        legend=alt.Legend(title=None)))
../_images/Lecture6_97_0.png

We can use the annotation approach from above to label the lines directly, and get rid of the legend altogether.

lines = alt.Chart(stocks).mark_line().encode(
    x='date',
    y='price',
    color=alt.Color('symbol', legend=None))

text = alt.Chart(stock_order).mark_text(dx=25).encode(
    x='date',
    y='price',
    text='symbol',
    color='symbol')

lines + text
../_images/Lecture6_99_0.png
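Filtering on the overall latest date works here because all five series end on the same day. If some series ended earlier, they would get no label. A sketch of a more general way to find the label positions is to keep the last observation of each group; the toy prices below are made up so that one series stops a month early.

```python
import pandas as pd

# Hypothetical prices where series B ends earlier than the others
toy_stocks = pd.DataFrame({
    'symbol': ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C'],
    'date': pd.to_datetime([
        '2010-01-01', '2010-02-01', '2010-03-01',
        '2010-01-01', '2010-02-01',
        '2010-01-01', '2010-02-01', '2010-03-01']),
    'price': [10.0, 12.0, 11.0, 30.0, 28.0, 20.0, 21.0, 25.0]})

# Keep the last observation of each symbol, then order by price so the
# labels follow the vertical order of the line ends
last_points = (
    toy_stocks
    .sort_values('date')
    .groupby('symbol')
    .tail(1)
    .sort_values('price', ascending=False))
print(last_points['symbol'].tolist())  # ['B', 'C', 'A']
```

The resulting data frame has the same columns as stock_order, so it could be used in its place for the text layer.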

ggplot

%%R -i stocks
ggplot(stocks) + 
    aes(x = date,
        y = price,
        color = symbol) +
    geom_line() + 
    ggthemes::scale_color_tableau()
../_images/Lecture6_101_0.png

%%R
ggplot(stocks) + 
    aes(x = date,
        y = price,
        color = symbol) +
    geom_line() +
    ggthemes::scale_color_tableau() +
    theme(legend.position = 'none')
../_images/Lecture6_102_0.png

Here we use the same approach with geom_text and label as we did above. The difference is that we’re explicitly setting the data inside geom_text to use the dataframe that has been filtered to contain the max year only.

%%R -i stock_order
ggplot(stocks) + 
    aes(x = date,
        y = price,
        color = symbol,
        label = symbol) +
    geom_line() +
    geom_text(data = stock_order, vjust=-1) +
    ggthemes::scale_color_tableau() +
    theme(legend.position = 'none')
../_images/Lecture6_104_0.png

You can try the ggrepel package to help you with annotations. It’s pretty cool!

%%R -i stock_order
library(ggrepel)

extrema <- 
  stocks %>%
    group_by(symbol) %>%
    slice(which.max(date))

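# Shift the label positions to the right of the line ends
# (assuming the dates arrive as POSIXct, 5E7 seconds is roughly a year and a half)
extrema['date'] = extrema['date'] + 5E7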

ggplot(stocks) + 
    aes(x = date,
        y = price,
        color = symbol,
        label = symbol) +
    geom_line() +
    ggthemes::scale_color_tableau() +
    geom_text_repel(
        data = extrema,
        aes(date, label = symbol),
        min.segment.length = Inf)
../_images/Lecture6_106_0.png