Programming your Pictures in R

Aidan Delaney & Brent Yorgey

| @aidandelaney
| https://byorgey.wordpress.com/

Motivation

Bioinformatics

images/F2.large.jpg
images/F2.large.jpg

Repeatability

  • Fits into repeatable workflows: see Taverna.
  • Is a programming language, so:
    • supports standard software development processes such as source control eg: euleR.
    • component reuse
    • testing

Takeaways

  1. Declarative drawing of diagrams allows separation of concerns.
  2. Declarative drawing assists in rapid prototyping of visualisation.

Introduction

R History

  • Based on S, developed by John Chambers (Becker and Chambers 1984).
  • Developed by Ross Ihaka and Robert Gentleman at University of Auckland (Ihaka and Gentleman 1996).
  • 30 year old(ish) interpreted language for statistical computation.
  • Well established

GG

  • Leyland Wilkinson developed the Grammar of Graphics (Wilkinson 2005).
  • R impementation by Hadley Wickham in 2005 (Wickham 2010)

First Steps

R DataFrame

mpg cyl disp am gear
Mazda RX4 21 6 160 1 4
Mazda RX4 Wag 21 6 160 1 4
Datsun 710 22.8 4 108 1 4
Hornet 4 Drive 21.4 6 258 0 3
Hornet Sportabout 18.7 8 360 0 3
Valiant 18.1 6 225 0 3
Duster 360 14.3 8 360 0 3

R DataFrame

  • Usually rows are the observations
  • Columns are the variables
  • Often (normally?) imported from a CSV or similar file.
mtcars$mpg

Returns a vector of the values of the mpg variable in top-to-bottom row order.

Importing data

For a comma separated values file you can simply

mydata <- read.csv("patterns.csv")

Import functions exist for Excel

library(xlsx)
mydata <- read.xlsx("patterns.xlsx", sheetName = "all-data")

and SPSS, SAS and several other file formats.

DataFrame "Queries"

We can get dataframe information by column, row or create a slice.

# by column
mtcars$gear
mtcars[,"gear"]

# by row
mtcars[1,]
mtcars["Fiat 128",]

# dataframe slice
mtcars[c("gear", "mpg")]

Simple Charts

Types

  • bar chart
  • point chart
  • line chart
  • box & whisker chart
  • histogram

BarChart

Let's plot the number of car models by their cylinder count.

p <- ggplot(mtcars, aes(x=factor(cyl))) + geom_bar()
plot(p)

Will plot the count of cars that share a certain cylinder count.

Point Chart

Let's plot the number of car models by their cylinder count.

p <- ggplot(mtcars, aes(x=factor(cyl), y=mpg)) + geom_point()
plot(p)

Requires both x and y aesthetics.

Line Chart

Let's plot the number of car models by their cylinder count.

p <- ggplot(mtcars, aes(x=wt, y=mpg)) + geom_line()
plot(p)

BoxPlot

p <- ggplot(mtcars, aes(x=factor(cyl), y=mpg))
p + geom_boxplot()

Histogram

ggplot(mtcars, aes(x=mpg)) + geom_histogram(binwidth=1)

Complex Diagrams

PieChart

A piechart is a barchart with a single stacked bar

pie <- ggplot(mtcars, aes(x = 1, fill = factor(cyl))) + geom_bar(position="stack")

Plotted on a different coordinate space i.e. polar.

pie <- ggplot(mtcars, aes(x = 1, fill = factor(cyl))) + geom_bar(position="stack") + coord_polar(theta = "y")

Histogram

Plot a histogram with an overlaid normal distribution:


ggplot(mtcars, aes(x=mpg)) +
  geom_histogram(aes(y = ..density..), binwidth=1) +
  stat_function(fun=dnorm,
                aes(colour = "red"),
                args = with(mtcars, c(mean = mean(mpg), sd = sd(mpg)))
                ) +
  labs(x="Miles per gallon", legend.position = "bottom", legend.direction = "horizontal")

Complex BarChart

Or similarly, let's use mean mpg as our y aesthetic. First we have to reshape our data.

require(ggplot2)
require(reshape2)

plot.data <- melt(tapply(mtcars$mpg, factor(mtcars$cyl),mean), varnames="cyl", value.name="mean")
ggplot(plot.data, aes(x=factor(cyl),y=mean)) + geom_bar(stat="identity")

Or

ggplot(mtcars, aes(y=mpg, x=factor(cyl), group=factor(cyl))) + stat_summary(fun.y=mean, geom="bar")

Short example

ggplot(mtcars, aes(x=mpg, y=disp)) + geom_point()
ggplot(mtcars, aes(x=mpg, y=disp)) + geom_point() + geom_smooth()
ggplot(mtcars, aes(x=mpg, y=disp)) + geom_point() + geom_smooth() + coord_flip()
ggplot(mtcars, aes(x=mpg, y=disp)) + geom_point(aes(color=factor(am))) + geom_smooth() + coord_flip()

ggplot(mtcars, aes(x=mpg, y=disp)) + geom_point(aes(color=factor(am))) + stat_smooth(method="lm") + coord_flip()

Reusable Function

ggplotRegression <- function (fit) {

ggplot(fit$model, aes_string(x = names(fit$model)[2], y = names(fit$model)[1])) +
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  labs(title = paste("Adj R2 = ",signif(summary(fit)$adj.r.squared, 5),
                     "Intercept =",signif(fit$coef[[1]],5 ),
                     " Slope =",signif(fit$coef[[2]], 5),
                     " P =",signif(summary(fit)$coef[2,4], 5)))
}

fit <- lm(mpg~disp, data=mtcars)
ggplotRegression(fit)

ggplot2 Grammar

Overview

  • Diagrams are built up in layers
  • Each diagram has
    1. A data layer,
    2. a stat istics layer,
    3. a geom etry layer,
    4. a scale layer, and
    5. a theme layer.
  • We can write ggplot(mtcars, aes(x=mpg, y=disp)) + geom_point() because of sane defaults.

Data

  • Must be in a data frame.
  • We map data columns to aes thetics.
    • generally x is required
    • color can be useful
    • y is required by some
> p <- ggplot(mtcars, aes(x=cyl))
> summary(p)
data: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, x [32x12]
mapping:  x = cyl
faceting: facet_null()

Statistics

We can add count as a statistic layer to our plot

> q <- p + stat_count()
> summary(q)
data: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, x [32x12]
mapping:  x = cyl
faceting: facet_null()
-----------------------------------
geom_bar: na.rm = FALSE, width = NULL
stat_count: na.rm = FALSE, width = NULL
position_stack

Geometry

We can choose a geometry to combine with our data and statistic:

> q <- p + geom_bar()
> summary(q)
data: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb, x [32x12]
mapping:  x = cyl
faceting: facet_null()
-----------------------------------
geom_bar: na.rm = FALSE, width = NULL
stat_count: na.rm = FALSE, width = NULL
position_stack

We've made no difference yet, as adding stat_count gives a default of geom_bar.

Geom Summary

df <- data.frame(
  x = c(3, 1, 5),
  y = c(2, 4, 6),
  label = c("a","b","c")
)
p <- ggplot(df, aes(x, y, label = label)) + xlab(NULL) + ylab(NULL)
p + geom_point() + labs(title = "geom_point")
p + geom_bar(stat="identity") + labs(title = "geom_bar(stat=\"identity\")")
p + geom_line() + labs(title = "geom_line")
p + geom_area() + labs(title = "geom_area")
p + geom_path() + labs(title = "geom_path")
p + geom_text() + labs(title = "geom_text")
p + geom_tile() + labs(title = "geom_tile")
p + geom_polygon() + labs(title = "geom_polygon")

The "cheat sheet" has many more types.

Scales

Adding a scale modifies the axes or:

p + scale_fill_brewer()

Aesthetics

Implementing many of Tufte's guidelines becomes:

p + geom_bar(stat="identity") + labs(title = "geom_bar(stat=\"identity\")") + theme_minimal()
p + geom_bar(stat="identity") + labs(title = "geom_bar(stat=\"identity\")") + theme_bw()

or my favourite

p + xkcdrect(
+     aes(xmin = x, xmax = x+1, ymin = 0, ymax = y),
+     df
+ ) + theme_xkcd()

Conclusion

Takeaways

  1. Declarative drawing of diagrams allows separation of concerns.
  2. Declarative drawing assists in rapid prototyping of visualisation.

Exercises

There are three exercises for you to tackle, available at http://aidandelaney.github.io/handouts/2016DiagramsRTutorial-questions.pdf.

References

Becker, R, and J Chambers. 1984. S: An Interactive Environment for Data Analysis and Graphics. Wadsworth & Brooks/Cole.

Ihaka, R, and R Gentleman. 1996. “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics, no. 5: 299–314.

Wickham, Hadley. 2010. “Ggplot2: Elegant Graphics for Data Analysis.” Journal of Statistical Software 35 (1).

Wilkinson, Leland. 2005. The Grammar of Graphics. 2nd ed. Springer-Verlag New York.