Book Summary: Fundamentals of Data Visualization by Claus O. Wilke

Rating: 8.0/10.

Part 1: From Data to Visualization

All figures should be reproducible from data and code. You should not have to make manual adjustments in Illustrator, or you will be dissuaded from updating them, or you may forget how they were generated. Figures may be “ugly” (aesthetically unpleasing), “bad” (unclear and confusing), or “wrong” (objectively incorrect).

An aesthetic is any visual component that represents information, like position, color, line type, etc. Variables are either discrete or continuous, and discrete variables can be ordered or unordered. It’s important to choose the right aesthetics for the variables.

Coordinate systems. Default is Cartesian system. Logarithmic is good for ratios or anything multiplicative. Less common are square root scales for areas, and polar scales for geographic or periodic data.
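
A small sketch of why log scales suit ratios (the ratio values are made up for illustration): on a log axis, halving and doubling sit equally far from the neutral point 1, while on a linear axis they do not.

```python
import math

# Hypothetical ratios relative to a baseline of 1 (illustrative values only).
ratios = [0.25, 0.5, 1, 2, 4]

# On a linear axis, distance from the neutral point 1 is lopsided.
linear_offsets = [r - 1 for r in ratios]

# On a log axis, halving and doubling are equally far from the neutral point.
log_offsets = [math.log2(r) for r in ratios]

print(linear_offsets)  # [-0.75, -0.5, 0, 1, 3]
print(log_offsets)     # [-2.0, -1.0, 0.0, 1.0, 2.0]
```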

Color scales. Colors in a qualitative scale should look equivalent to each other, as the categories have no order. Sequential scales are for ordered data, eg: from light to dark. Diverging scales are for data with a neutral midpoint. Accent scales are for highlighting specific elements against a neutral background.

Bar plots. When labels are too long, don’t rotate them, better to swap the x and y axes. If there is a natural ordering, use that, otherwise order by size, don’t order randomly or alphabetically. When using grouped bar plots, there is a tradeoff: the position-encoded variable is easier to read and the color-encoded variable is harder. Stacked bar plots are good when the sum is important. Dot plots are a good alternative when you don’t want to start from zero (eg: comparing life expectancies across countries).

Visualizing distributions. Histograms are popular, although they depend on the choice of bin width. Density plots are an alternative, similarly you choose the kernel and bandwidth. You must be careful to not generate misleading densities where values are impossible (eg: age < 0). When visualizing multiple distributions, can use overlapping density plots or separate plots, don’t do stacked or overlapping histograms.
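
To see how a density estimate can leak into impossible values, here is a minimal hand-rolled Gaussian kernel density estimate. The function name, the ages, and the bandwidth are all assumptions chosen for illustration, not taken from the book.

```python
import math

def gaussian_kde(data, bandwidth):
    """Return a function that estimates density using a Gaussian kernel."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centered on that observation.
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

ages = [0.5, 1, 2, 3, 5, 8, 13, 21, 34, 55]  # hypothetical ages in years
density = gaussian_kde(ages, bandwidth=5.0)

# The estimate is strictly positive even where age < 0 is impossible.
print(density(-2.0) > 0)  # True
```

A larger bandwidth smooths more and leaks further past zero, so the choice matters most when many observations sit near a hard boundary.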

Disadvantage of histograms and density plots is they depend on the choice of bin width or bandwidth. Alternative is CDF plots, which have no such parameter and let you easily read off percentiles. Log transform is good when data is highly skewed. QQ plots are good for visualizing to what extent a dataset follows a given distribution.
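
A sketch of an empirical CDF and how a percentile is read off it. The helper names and the data are hypothetical, and no bin width or bandwidth is involved.

```python
def ecdf(data):
    """Return sorted values and the fraction of observations <= each value."""
    xs = sorted(data)
    n = len(xs)
    ps = [(i + 1) / n for i in range(n)]
    return xs, ps

def percentile_from_ecdf(xs, ps, q):
    """Smallest observed value whose cumulative fraction reaches q."""
    for x, p in zip(xs, ps):
        if p >= q:
            return x
    return xs[-1]

values = [12, 3, 7, 9, 15, 1, 5, 20, 8, 10]  # illustrative data
xs, ps = ecdf(values)

print(percentile_from_ecdf(xs, ps, 0.5))  # 8
print(percentile_from_ecdf(xs, ps, 0.9))  # 15
```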

Visualizing many distributions at once. Boxplot is a simple way, with median, 25/75 percentiles, fences, and outliers. Violin plots show distributions, strip charts show clouds of points with jitter. Sina plots combine violin with strip chart. A ridgeline plot is similar to a violin plot turned sideways, with the distributions staggered along the vertical axis.
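
The numbers behind a boxplot can be sketched as follows. This assumes the common Tukey convention of fences at 1.5 times the interquartile range and the "inclusive" quantile method from the standard library. Plotting libraries may compute quartiles slightly differently, and the data is made up.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # illustrative data with one extreme value

# Quartiles (25th, 50th, 75th percentiles) using linear interpolation.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences: points beyond them are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3)            # 3.25 5.5 7.75
print(lower_fence, upper_fence)  # -3.5 14.5
print(outliers)                  # [30]
```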

Visualizing proportions. Some statisticians dislike pie charts, but they are fine for small datasets and simple proportions; they are poor for showing trends across multiple proportions. For this, stacked bars are fine as long as the number of categories is small. Otherwise, consider having a separate plot for each category as a density plot relative to the whole.

Visualizing nested proportions. Sometimes each data point has multiple categorical attributes, like material x era. Then pie charts and bar charts are bad, use mosaic plots or treemaps, where each combination of two categories is a rectangle within a whole. Can also use pie charts with a nested color scheme. If there are more than two attributes, can use parallel sets, but this can get messy.

Visualizing associations between continuous variables. Scatterplots are the most basic, good for encoding two variables. Bubble plots encode a third variable using bubble size, but size is harder to perceive than position. For more than a few variables, scatterplots are awkward, so it’s better to use correlograms, which show correlations between every pair of variables, but they obscure the relationships somewhat as they don’t show the actual data. PCA is an alternative, plot the weightings of the first two PCs. For paired data, can use a scatterplot with a y=x line, or a slopegraph, which connects each pair with a line so that the slope shows the size of the difference.
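
A correlogram is a picture of a pairwise correlation matrix. A minimal sketch of computing one by hand, with hypothetical variable names and values:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical variables; each list is one column of a dataset.
columns = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # rises with a
    "c": [5, 4, 3, 2, 1],    # falls as a rises
    "d": [2, 1, 4, 3, 5],    # noisy relationship with a
}

names = list(columns)
# One correlation per pair of variables: the cells of the correlogram.
corr = {(p, q): pearson(columns[p], columns[q]) for p in names for q in names}

for p in names:
    print(p, [round(corr[(p, q)], 2) for q in names])
```

Each cell collapses a whole scatterplot into one number, which is why the underlying data is no longer visible.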

Visualizing time series. Most basic is a scatterplot, but can connect together the points into a line plot to be more visually appealing. If there are multiple trends, good to color all of the lines and label them directly instead of having a legend. If there are two variables, can make a connected scatterplot (also called phase portrait) parameterized by time, but make sure to label the direction and time scale.

Visualizing trends. Sometimes it is more important to convey the general trend than specific points. Simple method is to use a moving average, but this ends a bit short of the actual endpoint. Alternative is LOESS or cubic splines, but the trend can look quite different depending on which spline you use. Also the coefficients are not independent, so it may be better to fit a functional form like a linear or exponential curve. When there is a combination of trends, helpful to break down into long-term, seasonal, and random components.
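
Why a moving average ends short: it only produces a value where a full window fits. A minimal sketch, with a made-up series and window size:

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values (full windows only)."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

series = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12]  # illustrative time series
smoothed = moving_average(series, window=3)

# The smoothed series has window - 1 fewer points than the input.
print(len(series), len(smoothed))  # 10 8
```

When the averages are plotted at the center of each window, the missing points are split between the start and the end of the series.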

Visualizing geospatial data. Need to choose a projection: Mercator is bad at preserving area for the whole world, but okay if the section of map is small. Several other projections preserve area. Author dislikes maps that shrink down Alaska to a much smaller area. Layers can be used to overlay many things over a map. Choropleths color the map according to some attribute; better to use a small number of discrete colors (4-6) than a continuous color scale. Choropleths can be misleading when low-density areas have large area. Alternative is a cartogram, which resizes states by population. Also, can arrange a small graph per state in the rough layout of the US map, easier to find specific states this way than when ordered alphabetically.
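
A sketch of binning a continuous attribute into a handful of discrete color classes for a choropleth. The cut points, region names, and values are hypothetical, and how to choose the breaks is left open here.

```python
from bisect import bisect_right

def classify(value, breaks):
    """Index of the class containing value, from 0 to len(breaks)."""
    return bisect_right(breaks, value)

# Hypothetical cut points: four breaks give five color classes.
breaks = [10, 50, 100, 500]

# Hypothetical population densities per region.
densities = {"A": 3, "B": 42, "C": 50, "D": 260, "E": 1200}

classes = {region: classify(d, breaks) for region, d in densities.items()}
print(classes)  # {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
```

Each class index would then map to one color of a sequential scale.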

Visualizing uncertainty. Tempting to represent probability by area under a bell curve, but people are better at perceiving discrete objects than areas, so better to have random squares or discrete dots under curve (quantile dot plot). Don’t have too many dots or it will be perceived as continuous. Error bars may be understood to be standard deviation (measuring variance of data) or standard error (measuring uncertainty of estimate) so don’t confuse them. Common error is to misunderstand uncertainty estimates as deterministic, eg, bounds for possible values. Two ways to prevent this are: using graded error bars, and for curve fits, plotting alternative random fits.
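
The difference between the two kinds of error bar, as a small sketch with a made-up sample: the standard deviation describes the spread of the data, while the standard error of the mean shrinks as the sample grows.

```python
import math
import statistics

sample = [4.1, 5.0, 5.6, 4.8, 5.3, 4.9, 5.2, 5.1]  # illustrative measurements

# Standard deviation: how much individual observations vary.
sd = statistics.stdev(sample)

# Standard error of the mean: how uncertain the estimated mean is.
se = sd / math.sqrt(len(sample))

print(f"mean={statistics.mean(sample):.2f} sd={sd:.2f} se={se:.2f}")
```

An error bar drawn with `se` is shorter than one drawn with `sd` by a factor of the square root of the sample size, so the caption should say which one is shown.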

Part 2: Principles of Figure Design

Principle of Proportional Ink. Whenever a shaded region represents a value, the area of the region should be proportional to the value. Thus, you should always start shaded bar charts at zero. If this is awkward, can define zero at one data point and others as difference relative to that point. For log charts, use dots and not shaded bars because there is no zero; unless it’s a ratio in which case the neutral point with no area should be 1, not 0.
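
A quick way to see what a truncated axis does to bars (the values and the baseline are made up): compare the ratio of the drawn bar lengths with the ratio of the values.

```python
def ink_ratio(a, b, baseline=0.0):
    """Ratio of bar lengths for values a and b drawn from a given baseline."""
    return (b - baseline) / (a - baseline)

a, b = 50.0, 55.0  # illustrative values; b is 10% larger than a

print(ink_ratio(a, b))               # 1.1, bars start at zero
print(ink_ratio(a, b, baseline=45))  # 2.0, bars start at 45
```

Starting the bars at 45 makes a 10% difference look like a doubling.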

Handling overlapping points. Transparency can help you plot more points, but if they’re stacking on top of each other, they may produce different shades which is confusing. Can add some jitter to avoid overlap on discrete values. If too many points to plot all of them, can plot 2D histograms or contour density plot.
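
The counting step behind a 2D histogram can be sketched as below. The function name, the points, and the cell size are hypothetical, and a plotting library would then color each cell by its count.

```python
from collections import Counter

def bin_2d(points, cell):
    """Count points per square grid cell of side length `cell`."""
    counts = Counter()
    for x, y in points:
        # Floor division maps each coordinate to its cell index.
        counts[(int(x // cell), int(y // cell))] += 1
    return counts

points = [
    (0.1, 0.2), (0.4, 0.3), (0.9, 0.8),   # cluster in cell (0, 0)
    (1.2, 0.1), (1.4, 0.4), (1.3, 0.2),   # cluster in cell (1, 0)
    (2.7, 2.9),                           # lone point in cell (2, 2)
]
counts = bin_2d(points, cell=1.0)

print(counts[(0, 0)], counts[(1, 0)], counts[(2, 2)])  # 3 3 1
```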

Common pitfalls of color use. Don’t use too many distinct colors, best is 3-5 before matching color to label becomes burdensome. Don’t use overly saturated colors: the commonly-used rainbow scale is bad. Some people have difficulty distinguishing various pairs of colors, should use either a colorblind scale or test your figure with a colorblind simulator.

Redundant coding. Not enough to just distinguish points by color, helps to also make them different in some other way, like shape — this makes it easier for colorblind people and B/W printing. In line charts with a natural ordering of the lines, the legend should follow the same order, or alternatively, delete the legend and label lines directly. Can also label points on scatterplots by drawing circles around them of the matching color.

Multipanel figures. Small multiples is when you have a grid of the same type of plot, each differing by a variable. Arrange them in a logical order, and try to give them the same axis scales. Compound figures are several independent figures. Each subfigure should have an alphabetic label, but it should not be too prominent. The subfigures should have a common visual language (eg: the same coloring scheme for male vs female throughout) and axes should be aligned.

Titles, captions, tables. Every figure should have a title, either integrated as bold text at the top, or within the captions if it’s for a publication. Label the axes, including units, but don’t do it if it’s too obvious. Some rules for tables: never use vertical lines, don’t use horizontal lines to separate data rows, numbers should be right-aligned while text left-aligned.

Balance the data and context. Here, context is visual elements like axes and legends that aren’t data. Want to use minimal amount of ink for visual elements, within reason. Gridlines should be a bit lighter than the data points, not too bold. Having no gridlines is also bad because it creates the impression of floating in space. Shaded backgrounds make the plot read as a single visual entity, which is useful for grids of multiple panels but is otherwise ugly.

Use larger axis labels. Many people make their axis labels too small, they should be a similar size as the data points. This is particularly a problem when designing on a screen that’s bigger than print, so make sure you view it at the size it’s meant to be printed.

Avoid line drawings. Use shaded areas, and not outlines for histograms and density plots, or else they will appear hollow. Don’t use open points in scatterplots, solid points have better visual presence.

Don’t go 3D. Especially when the 3D aspect is extraneous, makes it harder for the visual system to read and compare values. The problem is the 3D figure ends up being printed to a 2D medium so there’s a two-step process for interpreting it. 3D is appropriate for when an actual object is 3D, like terrain maps; it helps to have interactive or slowly rotating plot if it’s 3D.

Part 3: Miscellaneous Topics

Image file formats. Prefer PDF when possible (vector format), then PNG (lossless bitmap), then JPEG as a last resort (lossy image). Vector formats can be scaled up without loss of detail, but may become very big if there are lots of points. Always save the highest resolution, and downscale when needed.

Visualization Software. Repeatability is when the identical figure can be generated from code, reproducibility means someone else can reproduce roughly the same results. Should try to minimize manual postprocessing and interactive plotting, since you often need to run it again with slightly different data, so make it so everything can be done through code. For exploratory analysis, should be able to quickly switch between different visualizations of the data.

Telling a story. Consider what point you are trying to make with your figures, don’t make them too complicated for the reader. Perhaps build up to a complex figure through a simplified version first. Use a variety of types of graphs to avoid being repetitive and be more memorable.
