ggplot2

Introduction

ggplot2 [CRAN, GitHub] is a package created by Hadley Wickham which defines a “grammar” for graphics based upon the work of Leland Wilkinson in The Grammar of Graphics.

The basic idea is that while base R’s plotting mechanics are pretty robust, knowledge of one plotting function does not necessarily transfer to another. For example, while plot can create scatter plots and line plots easily, the arguments that pass to it don’t all pass to hist. Some do (see ?par), but those aren’t always supported.

ggplot2 takes the emphasis away from understanding the specific quirks of each desired plot, and instead attempts to create a unified language of plots, such that the difference between a scatter plot and a histogram call is simply geom_point vs geom_histogram.

Graphic grammar template

The basic template of any plot in ggplot2 is as follows.

ggplot(data = <DATA>,
       mapping = aes(<MAPPING>)) +
  <GEOM_FUNCTION>(
     stat = <STAT>,
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

There are 7 arguments, wrapped in angle brackets, which can be defined. Rarely will all 7 be defined. Only the first three, <DATA>, <MAPPING> and <GEOM_FUNCTION>, need to be defined.[1]
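
As a concrete illustration of the template, here is a minimal sketch that fills in every slot for a bar chart of the mpg data set (bundled with ggplot2). The stat, position, coordinate and facet values chosen here are, to the best of my knowledge, the ones geom_bar falls back on anyway, so the long call and the short call should describe the same chart.

```r
library(ggplot2)

# Every slot of the template written out explicitly.
full <- ggplot(data = mpg,
               mapping = aes(x = class)) +
  geom_bar(
    stat = "count",       # count the rows in each class
    position = "stack"    # stack bars that share an x value
  ) +
  coord_cartesian() +     # ordinary x/y axes
  facet_null()            # a single panel, i.e. no faceting

# Only the required pieces; the defaults fill in the rest.
short <- ggplot(data = mpg,
                mapping = aes(x = class)) +
  geom_bar()

# Both plots compute the same bar heights.
identical(layer_data(full)$count, layer_data(short)$count)
```

layer_data() returns the data frame that ggplot2 computes for a layer, which makes it a handy way to compare two plots without looking at them.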

Data

The data is straightforward - the data to operate on.

Mapping

An aesthetic mapping defines how variables in the dataset are connected to visual properties or outputs. The terms “aesthetic” and “mapping” are often used interchangeably with the more formal “aesthetic mapping”. Just think of a mapping as defining properties of the output that depend upon variables. For example, coloring the points of a scatter plot based upon a categorical variable is a mapping, whereas coloring all points red is not.

The most basic (useful) mapping would be aes(x = var1, y = var2). This tells ggplot what variables are used on what axis.

geom_function

A ggplot call, even with a full mapping, will not display anything. We need to literally add a plot type to the call to define what to plot.

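Some examples, all used later in this document, include geom_point (scatter plots), geom_histogram (histograms), geom_density (density curves) and geom_smooth (smoothed trend lines).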

The others

The remaining arguments will be discussed after diving further into the basics.

Basic plots

Let’s start with the simplest[2] example:

library(ggplot2)
data(midwest)
ggplot(data = midwest,
       mapping = aes(x = percbelowpoverty))

The data argument tells ggplot that we are using the “midwest” data set which contains county-level data for 5 states (IL, IN, MI, OH, WI). The mapping argument passes a simple aes which defines the x-axis as the percent of the population below the poverty line. We see this axis properly defined, but because we did not pass any geom_function, no additional plot is created.

Let’s create a histogram plot using geom_histogram.

ggplot(data = midwest,
       mapping = aes(x = percbelowpoverty)) + 
  geom_histogram()

Now, geom_histogram knows everything that ggplot knows, so it requires no additional arguments. We could easily change this to a density plot:

ggplot(data = midwest,
       mapping = aes(x = percbelowpoverty)) + 
  geom_density()

Important: The + must end the previous line, not begin the following.
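
To make the rule about + concrete, here is a small sketch contrasting the two placements. The failing version is left commented out so the block runs; the variable name p is just for illustration.

```r
library(ggplot2)

# Works: each line ends with +, so R knows the expression continues.
p <- ggplot(data = midwest,
            mapping = aes(x = percbelowpoverty)) +
  geom_density()

# Does not work as intended: the first two lines already form a complete
# expression, so R draws an empty plot and then raises an error on the
# dangling "+ geom_density()" line.
# ggplot(data = midwest,
#        mapping = aes(x = percbelowpoverty))
# + geom_density()

length(p$layers)  # 1: the density layer was attached
```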

By changing from geom_histogram to geom_density, we’ve inherited the same mapping information and don’t need to change anything else. Contrast this with the built-in R functionality:

hist(midwest$percbelowpoverty)

plot(density(midwest$percbelowpoverty))

The base R syntax is more concise (though that benefit will vanish with more complex plots) but requires knowing two distinct commands, as well as knowing the trick to plotting density curves. On the other hand, the ggplot version only requires swapping geom_histogram for geom_density.

Scatter plots

Scatter plots are nice to play with, both because they’re such a staple of data analyses and because many more modifications can be done to them. The code is similar to before, using the geom_point function.

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_point()

There definitely seems to be a trend here. What about drawing a smoothed line instead?

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_smooth()

That helps, but I’d really like to see those two overlaid. In base R, I’d need to run two separate commands, telling the second to add to the first. In ggplot, I simply add another geom.

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_point() + 
  geom_smooth()

By default, for data sets with fewer than 1,000 observations such as this one, the smoothing is done by LOESS. To add a line of best fit, use method = 'lm'. You can also have both!

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_point() + 
  geom_smooth() + 
  geom_smooth(method = 'lm', color = 'seagreen')

Notice that since neither method nor color refers to any variable, they are not aesthetics.
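
As far as I understand it, the method = 'lm' line is the ordinary least squares fit of y on x. The sketch below fits that model directly with base R's lm() so the slope and intercept can be read off, and also uses se = FALSE, which drops the shaded confidence band if you only want the curves. The names fit and p are illustrative.

```r
library(ggplot2)

# Fit the straight line directly to inspect its coefficients.
fit <- lm(perchsd ~ percbelowpoverty, data = midwest)
coef(fit)  # intercept and slope of the line of best fit

# se = FALSE removes the confidence band around each smoother.
p <- ggplot(midwest,
            aes(x = percbelowpoverty,
                y = perchsd)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  geom_smooth(method = 'lm', se = FALSE, color = 'seagreen')
```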

Location of mapping argument

These two plots are identical:

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_point()

ggplot(midwest) + 
  geom_point(aes(x = percbelowpoverty,
                 y = perchsd))

The mapping can be given in either the ggplot command or individual geom_ commands. Mappings given in the ggplot are the default for all geom_ commands. Mappings given in individual geom_ commands are specific to that command and can override the defaults. For example,

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_point(aes(x = percblack))

Note that the label on the x-axis is percbelowpoverty not percblack, even though percblack is what’s actually plotted. This is because ggplot(midwest, aes(x = percbelowpoverty, y = perchsd)) generates the plotting space (including axes and labels) before the points are plotted.
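
If you do override a mapping like this, one way to keep the axis honest is to set the label yourself with labs(). This is only a sketch; the label text is whatever you choose.

```r
library(ggplot2)

p <- ggplot(midwest,
            aes(x = percbelowpoverty,
                y = perchsd)) +
  geom_point(aes(x = percblack)) +
  labs(x = "percblack")  # relabel the axis to match the plotted variable

p$labels$x
```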

Additional Aesthetic Mappings

We saw color used above. Let’s contrast how color is used as an aesthetic versus not.

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_point(color = 'green')

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_point(aes(color = 'green'))

Remember, arguments inside a mapping should apply to variables. Since 'green' doesn’t exist as a variable, a new variable is created which is constant, so every point gets the same default palette color (not green) and a legend appears. What if we pass a proper argument?

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_point(aes(color = state))

Now we see something! ggplot will automatically color as best it can. Consider what happens with a continuous variable:

ggplot(midwest,
       aes(x = percbelowpoverty,
           y = perchsd)) + 
  geom_point(aes(color = percollege))

We see a nice gradient, and note that higher values of perchsd, percent high school diplomas, tend to have higher percollege, percent college graduates.
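
Returning to the 'green' example, the difference between setting a color and mapping it can be checked without looking at the plot. This sketch uses layer_data(), which returns the data frame ggplot2 actually draws from; the names base, fixed and mapped are illustrative.

```r
library(ggplot2)

base <- ggplot(midwest,
               aes(x = percbelowpoverty,
                   y = perchsd))

fixed  <- base + geom_point(color = 'green')       # a fixed setting
mapped <- base + geom_point(aes(color = 'green'))  # a constant "variable"

unique(layer_data(fixed)$colour)   # the literal colour "green"
unique(layer_data(mapped)$colour)  # a single colour from the default palette
```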

The other arguments

These will be presented in brief. Extended discussions of them can be found online; see for example the Visualization chapter of Hadley’s R for Data Science book.

Stat

Behind the scenes, certain geom_functions are transforming the data before plotting. For example,

data(mpg)
ggplot(mpg, aes(x = class)) + geom_bar()

geom_bar is performing a count on the number of each type of class. If you already have summarized data (e.g. you have just the counts already), you may want to tell geom_bar to use the existing count variable instead of computing a new count. Passing stat = "identity" and aes(x = class, y = count) will accomplish this.

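# Collapse mpg to one row per class, with its count.
class_agg <- data.frame(table(mpg$class))
names(class_agg) <- c("class", "count")
# Without stat = "identity", geom_bar counts rows again: every bar has height 1.
ggplot(class_agg, aes(x = class)) + geom_bar()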

ggplot(class_agg, aes(x = class, y = count)) + 
  geom_bar(aes(fill = class), stat = "identity")

Note the additional mapping of fill = class just to make the plot look nicer.
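
ggplot2 also provides geom_col, which to my knowledge behaves like geom_bar with stat = "identity" and so saves a little typing when the data is already summarized. A minimal sketch, rebuilding class_agg so the block is self-contained:

```r
library(ggplot2)

class_agg <- data.frame(table(mpg$class))
names(class_agg) <- c("class", "count")

# geom_col() takes bar heights straight from the y aesthetic.
p <- ggplot(class_agg, aes(x = class, y = count)) +
  geom_col(aes(fill = class))

# The drawn bar heights match the precomputed counts.
all(layer_data(p)$y == class_agg$count)
```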

See Statistical Transformations for further notes.

Position

The position argument is used to tweak how certain aspects of the plots are displayed. Its use depends heavily on the type of plot. For each geom, some positions will work, some will do nothing, and some will produce nonsense. They are most commonly used when trying to create grouped plots.

For example, we can look at a histogram of car mpg (hwy).

ggplot(mpg, aes(x = hwy)) + geom_histogram()

If we add fill = class, it will group by class. The default is position = "stack"; let’s see what each does.

ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "stack")

ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "identity")

ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "dodge")

ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "fill")

ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "jitter")

stack looks good. identity has the histograms on top of each other, so not very useful (more useful for geom_density). dodge and fill can help bar charts, but not useful here. jitter is crazy and useless. However, note that all the positions are accepted.

In general:

  • identity is useful when you want to plot things exactly as they are.
  • stack is useful when you want to look both at overall values and per-group values.
  • dodge is useful for group comparisons.
  • fill is useful for considering percentages instead of counts.
  • jitter is useful for scatter plots (and similar) when multiple values may be placed at the same point.

Positions are generally a trial-and-error procedure for me. If the default isn’t sufficient, see if the others look/work better.
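
Since dodge and fill make more sense on bar charts than on histograms, here is a brief sketch of both on a bar chart of class split by drv. The choice of variables is mine, purely for illustration.

```r
library(ggplot2)

bars <- ggplot(mpg, aes(x = class, fill = drv))

side_by_side <- bars + geom_bar(position = "dodge")  # compare drv within each class
proportions  <- bars + geom_bar(position = "fill")   # each bar rescaled to height 1

# With "fill", the top of every stacked bar sits at 1.
max(layer_data(proportions)$ymax)
```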

Coordinate systems

The default coordinate system is coord_cartesian(). Beyond it, there are two generally useful and two niche coordinate systems.

  • coord_fixed forces the x and y axes to have a fixed ratio between units on each. (Default ratio is 1:1, it takes an argument to define the ratio)
  • coord_flip flips x and y. Most useful to get e.g. horizontal instead of vertical bar charts.
  • coord_map plots map data.
  • coord_polar plots on the polar coordinate system.

Demonstrations of coord_fixed and coord_flip:

ggplot(midwest, aes(x = percwhite, y = percblack)) + 
  geom_point()

ggplot(midwest, aes(x = percwhite, y = percblack)) + 
  geom_point() + 
  coord_fixed()

ggplot(midwest, aes(x = percwhite, y = percblack)) + 
  geom_point() +
  coord_flip()

coord_cartesian (as well as _fixed and _flip) takes xlim and ylim arguments, so use these if you want to restrict the axes.
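
A short sketch of restricting the axes this way; the particular limits are arbitrary. Limits set through the coordinate system act like a zoom: the plot shows a smaller window, but no rows are dropped from the data.

```r
library(ggplot2)

p <- ggplot(midwest, aes(x = percwhite, y = percblack)) +
  geom_point() +
  coord_cartesian(xlim = c(80, 100), ylim = c(0, 20))

# Zooming keeps every row of the data, even those outside the window.
nrow(layer_data(p)) == nrow(midwest)
```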

Facets

Facets create plots per grouping variable. This will produce a grid of plots. For most general cases, use facet_wrap, with an argument being a formula where the right hand side is variables to group on.

ggplot(midwest, aes(x = percwhite, y = percblack)) + 
  geom_point() + 
  facet_wrap(~ state)

You can pass two grouping variables (~ state + group2); however, the output isn’t guaranteed to make a “table”. For that, use facet_grid. The formula argument’s left hand side represents rows, and its right hand side represents columns. A . in either position says no faceting on that dimension.

ggplot(mpg) + 
  geom_histogram(aes(x = hwy)) + 
  facet_grid(class ~ cyl)
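
The . placeholder mentioned above can be sketched as follows, faceting on one dimension at a time. bins = 30 is set explicitly only to silence the default-bins message; the object names are illustrative.

```r
library(ggplot2)

base <- ggplot(mpg) +
  geom_histogram(aes(x = hwy), bins = 30)

one_row    <- base + facet_grid(. ~ cyl)    # columns only: one panel per cyl
one_column <- base + facet_grid(class ~ .)  # rows only: one panel per class

# Number of panels in each layout.
length(unique(layer_data(one_row)$PANEL))
length(unique(layer_data(one_column)$PANEL))
```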

Representing higher dimensions in scatter plots

Although scatter plots are inherently a two (or three) dimensional visualization, clever use of plot characteristics can represent higher dimensions. There is always a trade-off - the more information you represent in a single plot, the more likely you are to confuse the reader.

We’ll be using the mpg dataset from ggplot2. Our primary variables of interest will be city mileage (cty) versus highway mileage (hwy), and we’ll be adding other variables as we go.

library(ggplot2)
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame':    234 obs. of  11 variables:
##  $ manufacturer: chr  "audi" "audi" "audi" "audi" ...
##  $ model       : chr  "a4" "a4" "a4" "a4" ...
##  $ displ       : num  1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
##  $ year        : int  1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
##  $ cyl         : int  4 4 4 4 6 6 6 4 4 4 ...
##  $ trans       : chr  "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
##  $ drv         : chr  "f" "f" "f" "f" ...
##  $ cty         : int  18 21 20 21 16 18 18 18 16 20 ...
##  $ hwy         : int  29 29 31 30 26 26 27 26 25 28 ...
##  $ fl          : chr  "p" "p" "p" "p" ...
##  $ class       : chr  "compact" "compact" "compact" "compact" ...
ggplot(mpg) + 
  geom_point(aes(x = hwy, y = cty))

Before we add any higher dimensions, note that each point in that plot may represent multiple cars, e.g. there are a Jeep Grand Cherokee and a Toyota Tacoma that each have 17 cty and 22 hwy. To make that clear, we can either use jitter:

ggplot(mpg) + 
  geom_jitter(aes(x = hwy, y = cty))

or make the size of each point relative to the count:

ggplot(mpg) + 
  geom_count(aes(x = hwy, y = cty))

The geom_count version looks nicer, but changing the size of the points is a technique we’ll use to represent a 3rd dimension, so we’ll stick with the geom_jitter version.
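
To get a feel for how much overplotting there is, and to control how far the jitter moves points, something like the following sketch may help. The width and height values are arbitrary choices, given in data units.

```r
library(ggplot2)

# How many rows share their (hwy, cty) pair with an earlier row?
sum(duplicated(mpg[, c("hwy", "cty")]))

# Smaller width/height keep points closer to their true positions.
p <- ggplot(mpg) +
  geom_jitter(aes(x = hwy, y = cty), width = 0.3, height = 0.3)
```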

Displaying a 3rd (or higher) dimension of categorical variables is straightforward and you may have already used some or all of these methods. The most straightforward is faceting which we saw earlier:

ggplot(mpg) + 
  geom_jitter(aes(x = hwy, y = cty)) + 
  facet_grid(cyl ~ drv)

This allows us to represent four dimensions painlessly, as long as the two categorical dimensions have few unique values:

ggplot(mpg) + 
  geom_jitter(aes(x = hwy, y = cty)) + 
  facet_grid(class ~ manufacturer)

If we wanted a single plot, we can manipulate the points drawn to be distinct per group.

ggplot(mpg) + 
  geom_jitter(aes(x = hwy, y = cty, color = class, shape = drv, size = cyl)) +
  scale_radius()

(Note the + scale_radius(). When using the mapping size, I sometimes see issues with the smallest point being out of proportion with the rest. [Try it - run the above without the last line.] Adding scale_radius() [as opposed to scale_size()] can sometimes fix it - try both options and choose whichever one you like best.)

We’ve gotten messy here (probably too confusing for publication) but this is plotting 5 dimensions! From this plot we can see:

  • City and highway mileage are strongly correlated.
  • More cylinders = worse mileage.
  • 4-wheel drive is less efficient than rear-wheel, which is less efficient than front-wheel.
  • SUVs and pickups have the worst mileage; 2-seaters, minivans and midsize have moderate mileage; subcompacts and compacts have the best mileage.

Note the choice of which variable gets which modification. cyl, which is ordinal, is used for the size. There are more class levels than drv, so I feel class is a better choice for colors. Look how bad this could look with different choices:

ggplot(mpg) + 
  geom_jitter(aes(x = hwy, y = cty, color = cyl, shape = class, size = drv))
## Warning: Using size for a discrete variable is not advised.
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).

ggplot even complains… a lot.

Note that color functions differently for continuous data:

ggplot(mpg) + 
  geom_jitter(aes(x = hwy, y = cty, color = displ, shape = drv, size = cyl)) +
  scale_radius()

We see that larger displ (displacement, a.k.a. engine size) corresponds to worse mileage.
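
If a numeric variable is really a handful of categories, wrapping it in factor() inside aes switches color from a gradient to distinct hues. A small sketch using cyl; the object names are illustrative.

```r
library(ggplot2)

continuous <- ggplot(mpg) +
  geom_jitter(aes(x = hwy, y = cty, color = cyl))          # gradient legend

discrete <- ggplot(mpg) +
  geom_jitter(aes(x = hwy, y = cty, color = factor(cyl)))  # one colour per value

# The discrete version uses exactly one colour per distinct cyl value.
length(unique(layer_data(discrete)$colour))
```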

Simpler plots with qplot

One of the major downsides of ggplot2 is that simple plots (e.g. the equivalent of R’s plot(x,y)) require a bit more set-up: ggplot(data = d, aes(x = x, y = y)) + geom_point().

However, there is a function for quick plots, qplot. Basically you pass an x and/or a y variable as the first two arguments, followed by any number of named aesthetic arguments and ultimately a data argument. For example, the last plot above:

ggplot(mpg) + 
  geom_jitter(aes(x = hwy, y = cty, color = displ, shape = drv, size = cyl)) +
  scale_radius()

becomes (apart from the jitter and scale_radius(), which qplot does not apply by default)

qplot(x = hwy, y = cty, color = displ, shape = drv, size = cyl, data = mpg)

For very quick plots, the defaults are histogram (if only x is provided), scatter plot (if both x and y are provided), or a scatter plot of y against row number (if only y is provided).

plot(mpg$cty, mpg$hwy)

qplot(mpg$cty, mpg$hwy)

qplot(cty, hwy, data = mpg)

You can treat it just like any other ggplot call and add additional geoms or settings:

qplot(cty, hwy, data = mpg) + geom_smooth()
## `geom_smooth()` using method = 'loess'

Alternatively, qplot takes various arguments such as geom, facet, etc.

qplot(cty, hwy, data = mpg, geom = 'jitter', facets = drv ~ cyl)

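However, if you’re getting this complicated, you’re probably going to want to switch to full ggplot calls. As stated, qplot is more a replacement for quick plot(x, y) or hist(x) during preliminary data exploration. Note also that qplot is deprecated as of ggplot2 3.4.0, which is another reason to prefer full ggplot calls in code you intend to keep.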


  1. Really, only <DATA> needs to be defined to create a valid ggplot gg object, but a blank plot is boring.

  2. Technically the simplest example would be just ggplot(data = midwest) but that would create literally a blank plot.

Josh Errickson