ggplot2
Introduction
ggplot2 [CRAN, GitHub] is a package created by Hadley Wickham which defines a “grammar” for graphics based upon the work of Leland Wilkinson in The Grammar of Graphics.
The basic idea is that while base R’s plotting mechanics are pretty robust, knowledge of one plotting function does not necessarily transfer to another. For example, while plot can create scatter plots and line plots easily, the arguments that pass to it don’t all pass to hist. Some do (see ?par), but those aren’t always supported.
ggplot2 takes the emphasis away from understanding the specific quirks of each desired plot, and instead attempts to create a unified language of plots, such that the difference between a scatter plot call and a histogram call is simply geom_point vs geom_histogram.
Graphic grammar template
The basic template of any plot in ggplot2 is as follows.
ggplot(data = <DATA>,
mapping = aes(<MAPPING>)) +
<GEOM_FUNCTION>(
stat = <STAT>,
position = <POSITION>
) +
<COORDINATE_FUNCTION> +
<FACET_FUNCTION>
There are 7 parameters, wrapped in angle brackets, which can be defined. Rarely will all 7 be defined. Only the first three, <DATA>, <MAPPING> and <GEOM_FUNCTION>, need to be defined.
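As a rough illustration, here is one way the template might look with every slot filled in. The dataset (mpg, which ships with ggplot2) and the particular stat, position, coordinate and facet choices are my own picks for this sketch, not something the template prescribes. The stat and position shown are simply what geom_bar uses by default.

```r
library(ggplot2)

# Every slot of the template filled in explicitly (illustrative choices only)
p <- ggplot(data = mpg,                    # <DATA>
            mapping = aes(x = class)) +    # <MAPPING>
  geom_bar(                                # <GEOM_FUNCTION>
    stat = "count",                        # <STAT>: geom_bar's default
    position = "stack"                     # <POSITION>: geom_bar's default
  ) +
  coord_cartesian() +                      # <COORDINATE_FUNCTION>: the default
  facet_wrap(~ drv)                        # <FACET_FUNCTION>

# The minimal version needs only data, a mapping and a geom
p_min <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()

print(p)
print(p_min)
```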
Data
The data is straightforward - the data to operate on.
Mapping
An aesthetic mapping defines how variables in the dataset are connected to visual properties or outputs. The terms “aesthetic” and “mapping” are often used interchangeably with the more formal “aesthetic mapping”. Just think of a mapping as defining properties of the output that depend upon variables. For example, coloring the points of a scatter plot based upon a categorical variable is a mapping, whereas coloring all points red is not.
The most basic (useful) mapping would be aes(x = var1, y = var2)
. This tells ggplot what variables are used on what axis.
geom_function
A ggplot call, even with a full mapping, will not draw any data; it only sets up the axes. We need to literally add a plot type to the call to define what to plot.
Some examples include:
- geom_histogram for histograms
- geom_point for scatter plots
- geom_boxplot for box plots
- geom_line for line plots
The others
The remaining arguments will be discussed after diving further into the basics.
Basic plots
Let’s start with the simplest example:
library(ggplot2)
data(midwest)
ggplot(data = midwest,
mapping = aes(x = percbelowpoverty))
The data argument tells ggplot that we are using the “midwest” data set, which contains county-level data for 5 states (IL, IN, MI, OH, WI). The mapping argument passes a simple aes which defines the x-axis as the percent of the population below the poverty line. We see this axis properly defined, but because we did not pass any geom_function, nothing is drawn on it.
Let’s create a histogram plot using geom_histogram.
ggplot(data = midwest,
mapping = aes(x = percbelowpoverty)) +
geom_histogram()
Now, geom_histogram knows everything that ggplot knows, so it requires no additional arguments. We could easily change this to a density plot:
ggplot(data = midwest,
mapping = aes(x = percbelowpoverty)) +
geom_density()
Important: The + must end the previous line, not begin the following one.
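To see why the placement matters, here is a small sketch. R treats a line that is already a complete expression as finished, so a + at the start of the next line is evaluated on its own. The tryCatch wrapper is only there so that the sketch runs to completion; I am assuming ggplot2 rejects the lone +, and the exact error message will depend on the installed version.

```r
library(ggplot2)

# Correct: the trailing + tells R that the expression continues
p_ok <- ggplot(midwest, aes(x = percbelowpoverty)) +
  geom_density()

# Incorrect placement: this line is already a complete expression,
# so R stops here and the plot has no geom attached
p_bare <- ggplot(midwest, aes(x = percbelowpoverty))

# ... and the following line would be evaluated by itself, as a unary +
leading_plus <- tryCatch(
  + geom_density(),
  error = function(e) e
)

inherits(leading_plus, "error")  # TRUE if ggplot2 rejects the lone +
```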
By changing from geom_histogram to geom_density, we’ve inherited the same mapping information and don’t need to change anything else. Contrast this with the built-in R functionality:
hist(midwest$percbelowpoverty)
plot(density(midwest$percbelowpoverty))
The base R syntax is more concise (though that benefit will vanish with more complex plots) but requires knowing two unrelated commands, as well as knowing the trick of passing density() to plot() to draw density curves. On the other hand, the ggplot version only requires swapping geom_histogram for geom_density.
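One way to make the "swap the geom" idea concrete is to store the shared part of the call in an object and add different geoms to it. This is only a sketch: the object names are mine, and bins = 30 is passed just to silence the message geom_histogram prints when no bin count is given.

```r
library(ggplot2)

# Store the shared data and mapping once ...
base_plot <- ggplot(data = midwest,
                    mapping = aes(x = percbelowpoverty))

# ... then add whichever geom is wanted; each one inherits the mapping
p_hist <- base_plot + geom_histogram(bins = 30)
p_dens <- base_plot + geom_density()

print(p_hist)
print(p_dens)
```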
Scatter plots
Scatter plots are nice to play with, both because they’re such a staple of data analyses and because many more modifications can be made to them. The code is similar to before, using the geom_point function.
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_point()
There definitely seems to be a trend here. What about drawing a smoothed line instead?
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_smooth()
That helps but I’d really like to see those two overlaid. In base R, I’d need to plot two separate commands, telling the second to add to the first. In ggplot, I simply add another geom.
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_point() +
geom_smooth()
By default, the smoothing is done by LOESS (for data sets with fewer than 1,000 observations, as here). To add a line of best fit, use method = 'lm'. You can also have both!
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_point() +
geom_smooth() +
geom_smooth(method = 'lm', color = 'seagreen')
Notice that since neither method nor color refers to any variable, they are not aesthetics and do not go inside aes().
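If you want to convince yourself that method = 'lm' really is an ordinary least-squares line, a sketch along these lines might help. It fits the regression separately with lm() and draws it with geom_abline; the dashed line should sit on top of the green one. The formula = y ~ x and se = FALSE arguments are my additions, to make the fit explicit and to hide the confidence band.

```r
library(ggplot2)

# Fit the same simple linear regression outside of ggplot
fit <- lm(perchsd ~ percbelowpoverty, data = midwest)
coefs <- coef(fit)

p <- ggplot(midwest,
            aes(x = percbelowpoverty,
                y = perchsd)) +
  geom_point() +
  # Line of best fit computed by ggplot
  geom_smooth(method = 'lm', formula = y ~ x, se = FALSE, color = 'seagreen') +
  # The same line drawn from the lm() intercept and slope
  geom_abline(intercept = coefs[["(Intercept)"]],
              slope = coefs[["percbelowpoverty"]],
              linetype = "dashed")

print(p)
```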
Location of mapping argument
These two plots are identical:
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_point()
ggplot(midwest) +
geom_point(aes(x = percbelowpoverty,
y = perchsd))
The mapping can be given in either the ggplot command or individual geom_ commands. Mappings given in the ggplot command are the defaults for all geom_ commands. Mappings given in individual geom_ commands are specific to that command and can override the defaults. For example,
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_point(aes(x = percblack))
Note that the label on the x-axis is percbelowpoverty, not percblack, even though percblack is what’s actually plotted. This is because the axis label is taken from the default mapping in ggplot(midwest, aes(x = percbelowpoverty, y = perchsd)), while the mapping inside geom_point only changes which values are drawn.
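If you do override a mapping in a layer like this, one possible fix for the misleading label is to set it yourself with labs(). A minimal sketch, where the label text is just my choice:

```r
library(ggplot2)

# Same plot as above, with the x label set to match what is actually drawn
p <- ggplot(midwest,
            aes(x = percbelowpoverty,
                y = perchsd)) +
  geom_point(aes(x = percblack)) +
  labs(x = "percblack")

print(p)
```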
Additional Aesthetic Mappings
We saw color used above. Let’s contrast how color behaves when used as an aesthetic versus not.
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_point(color = 'green')
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_point(aes(color = 'green'))
Remember, arguments inside a mapping should refer to variables. Since 'green' doesn’t exist as a variable, a new constant variable is created, and ggplot assigns it a color from its default palette rather than drawing green points. What if we pass a proper variable?
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_point(aes(color = state))
Now we see something! ggplot will automatically color as best it can. Consider what happens with a continuous variable:
ggplot(midwest,
aes(x = percbelowpoverty,
y = perchsd)) +
geom_point(aes(color = percollege))
We see a nice gradient, and note that counties with higher values of perchsd (percent with a high school diploma) tend to have higher percollege (percent college educated).
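To check the difference between setting and mapping color without relying on your eyes, you can inspect what ggplot computed for each layer. The sketch below uses layer_data(), which returns the data frame a layer is drawn from; the object names are mine.

```r
library(ggplot2)

base_plot <- ggplot(midwest,
                    aes(x = percbelowpoverty,
                        y = perchsd))

# Setting: color is a fixed property of the layer
set_layer <- layer_data(base_plot + geom_point(color = 'green'))

# Mapping: 'green' is treated as a one-level variable and passed
# through the default color scale
mapped_layer <- layer_data(base_plot + geom_point(aes(color = 'green')))

unique(set_layer$colour)     # the literal value that was set
unique(mapped_layer$colour)  # whatever the default scale assigns to that level
```

Note that layer_data() names the column colour, ggplot2's internal spelling, regardless of which spelling you used in the call.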
The other arguments
These will be presented in brief. Extended discussions of them can be found online; see for example the Visualization chapter of Hadley’s R for Data Science book.
Stat
Behind the scenes, certain geom_functions are transforming the data before plotting. For example,
data(mpg)
ggplot(mpg, aes(x = class)) + geom_bar()
geom_bar is performing a count of the number of rows in each class. If you already have summarized data (e.g. you have just the counts already), you may want to tell geom_bar to use the existing count variable instead of computing a new count. Passing stat = "identity" and aes(x = class, y = count) will accomplish this.
class_agg <- data.frame(table(mpg$class))
names(class_agg) <- c("class", "count")
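# Default stat: counts rows, and class_agg has one row per class,
# so every bar has height 1
ggplot(class_agg, aes(x = class)) + geom_bar()
# stat = "identity": bar heights come from the count column
ggplot(class_agg, aes(x = class, y = count)) +
geom_bar(aes(fill = class), stat = "identity")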
Note the additional mapping of fill = class, added just to make the plot look nicer.
See Statistical Transformations for further notes.
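As a side note, ggplot2 also provides geom_col(), which uses the y values as given and so behaves like geom_bar(stat = "identity"). The sketch below rebuilds the summarized data and checks that both routes give the same bar heights; the plot object names are mine.

```r
library(ggplot2)

# Pre-summarized counts, built the same way as in the text
class_agg <- data.frame(table(mpg$class))
names(class_agg) <- c("class", "count")

# geom_col() takes bar heights directly from y
p_col <- ggplot(class_agg, aes(x = class, y = count)) +
  geom_col(aes(fill = class))

# geom_bar() computes the heights from the raw rows with its count stat
p_bar <- ggplot(mpg, aes(x = class)) + geom_bar()

# Both routes should give the same bar heights
sort(layer_data(p_col)$y)
sort(layer_data(p_bar)$count)
```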
Position
The position argument is used to tweak how certain aspects of the plots are displayed. Its use depends heavily on the type of plot. For each geom, some positions will work, some will do nothing, and some will produce nonsense. They are most commonly used when trying to create grouped plots.
For example, we can look at a histogram of highway mileage (hwy).
ggplot(mpg, aes(x = hwy)) + geom_histogram()
If we add fill = class, it will group by class. The default is position = "stack"; let’s see what each option does.
ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "stack")
ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "identity")
ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "dodge")
ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "fill")
ggplot(mpg, aes(x = hwy, fill = class)) + geom_histogram(position = "jitter")
stack looks good. identity draws the histograms on top of each other, so it is not very useful here (it is more useful for geom_density). dodge and fill can help bar charts, but are not useful here. jitter is crazy and useless. However, note that all the positions are accepted.
In general:
- identity is useful when you want to plot things exactly as they are.
- stack is useful when you want to look both at overall values and per-group values.
- dodge is useful for group comparisons.
- fill is useful for considering percentages instead of counts.
- jitter is useful for scatter plots (and similar) when multiple values may be placed at the same point.
Positions are generally a trial-and-error procedure for me. If the default isn’t sufficient, see if the others look/work better.
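Since dodge and fill make more sense on bar charts than on histograms, here is a sketch of the three grouped positions on a bar chart of class filled by drv. The choice of variables is mine, purely for illustration.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(x = class, fill = drv))

p_stack <- base_plot + geom_bar(position = "stack")  # totals plus per-group parts
p_dodge <- base_plot + geom_bar(position = "dodge")  # side-by-side comparison
p_fill  <- base_plot + geom_bar(position = "fill")   # proportions within each class

print(p_stack)
print(p_dodge)
print(p_fill)

# With "fill", every bar is rescaled to a total height of 1
max(layer_data(p_fill)$ymax)
```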
Coordinate systems
The default coordinate system is coord_cartesian(). There are four alternatives, two generally useful and two niche:
- coord_fixed forces the x and y axes to have a fixed ratio between units on each. (The default ratio is 1:1; it takes an argument to define the ratio.)
- coord_flip flips x and y. Most useful to get e.g. horizontal instead of vertical bar charts.
- coord_map plots map data.
- coord_polar plots on the polar coordinate system.
Demonstrations of coord_fixed and coord_flip:
ggplot(midwest, aes(x = percwhite, y = percblack)) +
geom_point()
ggplot(midwest, aes(x = percwhite, y = percblack)) +
geom_point() +
coord_fixed()
ggplot(midwest, aes(x = percwhite, y = percblack)) +
geom_point() +
coord_flip()
coord_cartesian (as well as coord_fixed and coord_flip) takes xlim and ylim arguments, so if you want to restrict axes, use these.
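A minimal sketch of restricting the axes through the coordinate system follows; the limits are arbitrary values I picked. The point to notice is that this only changes the viewing window, so every row is still part of the plot data.

```r
library(ggplot2)

p <- ggplot(midwest, aes(x = percwhite, y = percblack)) +
  geom_point()

# Zoom in on one corner of the plot; no rows are dropped
p_zoom <- p + coord_cartesian(xlim = c(80, 100), ylim = c(0, 20))

print(p_zoom)

# All counties are still present in the layer's data
nrow(layer_data(p_zoom)) == nrow(midwest)
```

By contrast, limits set on the scale itself (for example with xlim()) discard out-of-range values before any statistics are computed, which can change the result of something like geom_smooth.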
Facets
Facets create plots per grouping variable, producing a grid of plots. For most general cases, use facet_wrap, whose argument is a formula with the grouping variables on the right-hand side.
ggplot(midwest, aes(x = percwhite, y = percblack)) +
geom_point() +
facet_wrap(~ state)
You can pass two grouping variables (~ state + group2); however, the output isn’t guaranteed to make a “table”. For that, use facet_grid. The formula argument’s left-hand side represents rows and its right-hand side represents columns. A . in either position says no faceting on that dimension.
ggplot(mpg) +
geom_histogram(aes(x = hwy)) +
facet_grid(class ~ cyl)
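The . placeholder is easier to see in action. In this sketch the same histogram is faceted by columns only and by rows only; bins = 30 is passed just to silence the default-bins message.

```r
library(ggplot2)

p <- ggplot(mpg) +
  geom_histogram(aes(x = hwy), bins = 30)

p_cols <- p + facet_grid(. ~ cyl)  # one row, one column per cyl value
p_rows <- p + facet_grid(drv ~ .)  # one column, one row per drv value

print(p_cols)
print(p_rows)
```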
Representing higher dimensions in scatter plots
Although scatter plots are inherently a two (or three) dimensional visualization, clever use of plot characteristics can represent higher dimensions. There is always a trade-off - the more information you represent in a single plot, the more likely you are to confuse the reader.
We’ll be using the mpg dataset from ggplot2. Our primary variables of interest will be city mileage (cty) versus highway mileage (hwy), and we’ll be adding other variables as we go.
library(ggplot2)
str(mpg)
## Classes 'tbl_df', 'tbl' and 'data.frame': 234 obs. of 11 variables:
## $ manufacturer: chr "audi" "audi" "audi" "audi" ...
## $ model : chr "a4" "a4" "a4" "a4" ...
## $ displ : num 1.8 1.8 2 2 2.8 2.8 3.1 1.8 1.8 2 ...
## $ year : int 1999 1999 2008 2008 1999 1999 2008 1999 1999 2008 ...
## $ cyl : int 4 4 4 4 6 6 6 4 4 4 ...
## $ trans : chr "auto(l5)" "manual(m5)" "manual(m6)" "auto(av)" ...
## $ drv : chr "f" "f" "f" "f" ...
## $ cty : int 18 21 20 21 16 18 18 18 16 20 ...
## $ hwy : int 29 29 31 30 26 26 27 26 25 28 ...
## $ fl : chr "p" "p" "p" "p" ...
## $ class : chr "compact" "compact" "compact" "compact" ...
ggplot(mpg) +
geom_point(aes(x = hwy, y = cty))
Before we add any higher dimensions, note that each point in that plot may represent multiple cars, e.g. there are a Jeep Grand Cherokee and a Toyota Tacoma that each have 17 cty and 22 hwy. To make that clear, we can either use jitter:
ggplot(mpg) +
geom_jitter(aes(x = hwy, y = cty))
or make the size of each point relative to the count:
ggplot(mpg) +
geom_count(aes(x = hwy, y = cty))
The geom_count version looks nicer, but changing the size of the points is a technique we’ll use to represent a 3rd dimension, so we’ll stick with the geom_jitter version.
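One thing to be aware of is that jitter is random, so the plot changes slightly each time it is drawn. If that bothers you, a possible workaround is to use geom_point with an explicit position_jitter and a fixed seed. The width, height and seed values below are arbitrary choices of mine.

```r
library(ggplot2)

# A fixed seed makes the random offsets repeatable between runs
jit <- position_jitter(width = 0.3, height = 0.3, seed = 42)

p <- ggplot(mpg) +
  geom_point(aes(x = hwy, y = cty), position = jit)

print(p)

# Building the plot twice gives the same jittered positions
identical(layer_data(p)$x, layer_data(p)$x)
```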
Displaying a 3rd (or higher) dimension of categorical variables is straightforward and you may have already used some or all of these methods. The most straightforward is faceting which we saw earlier:
ggplot(mpg) +
geom_jitter(aes(x = hwy, y = cty)) +
facet_grid(cyl ~ drv)
This allows us to represent four dimensions painlessly, as long as the two categorical dimensions have few unique values. With many levels, the grid quickly becomes unreadable:
ggplot(mpg) +
geom_jitter(aes(x = hwy, y = cty)) +
facet_grid(class ~ manufacturer)
If we want a single plot, we can instead make the points visually distinct per group.
ggplot(mpg) +
geom_jitter(aes(x = hwy, y = cty, color = class, shape = drv, size = cyl)) +
scale_radius()
(Note the + scale_radius(). When using the mapping size, I sometimes see issues with the smallest point being out of proportion with the rest. [Try it - run the above without the last line.] Adding scale_radius() [as opposed to scale_size()] can sometimes fix it - try both options and choose whichever one you like best.)
We’ve gotten messy here (probably too confusing for publication) but this is plotting 5 dimensions! From this plot we can see:
- City and highway mileage are strongly correlated.
- More cylinders = worse mileage.
- 4-wheel drive is less efficient than rear-wheel, which is less efficient than front-wheel.
- SUVs and pickups have the worst mileage; 2-seaters, minivans and midsize have moderate mileage; subcompacts and compacts have the best mileage.
Note the choice of which variable gets which modification. cyl, which is ordinal, is used for the size. There are more class levels than drv levels, so I feel class is a better choice for colors. Look how bad this could look with different choices:
ggplot(mpg) +
geom_jitter(aes(x = hwy, y = cty, color = cyl, shape = class, size = drv))
## Warning: Using size for a discrete variable is not advised.
## Warning: The shape palette can deal with a maximum of 6 discrete values
## because more than 6 becomes difficult to discriminate; you have 7.
## Consider specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).
ggplot even complains… a lot.
Note that color functions differently for continuous data:
ggplot(mpg) +
geom_jitter(aes(x = hwy, y = cty, color = displ, shape = drv, size = cyl)) +
scale_radius()
We see that larger displ (engine displacement, i.e. engine size) corresponds to worse mileage.
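The continuous-versus-discrete behavior of color depends on how the variable is stored, not on what it means. As a sketch, cyl is stored as an integer and so gets a gradient, but wrapping it in factor() gives one distinct color per cylinder count. Using cyl for color here is just my choice of example.

```r
library(ggplot2)

# cyl as stored (integer): continuous gradient legend
p_cont <- ggplot(mpg) +
  geom_jitter(aes(x = hwy, y = cty, color = cyl))

# cyl wrapped in factor(): one distinct color per cylinder count
p_disc <- ggplot(mpg) +
  geom_jitter(aes(x = hwy, y = cty, color = factor(cyl)))

print(p_cont)
print(p_disc)
```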
Simpler plots with qplot
One of the major downsides of ggplot2 is that simple plots (e.g. the equivalent of R’s plot(x,y)) require a bit more set-up: ggplot(data = d, aes(x = x, y = y)) + geom_point().
However, there is a function for quick plots, qplot. Basically you pass an x and/or a y variable as the first two arguments, followed by any number of named aesthetic arguments and ultimately a data argument. For example, the last plot above:
ggplot(mpg) +
geom_jitter(aes(x = hwy, y = cty, color = displ, shape = drv, size = cyl)) +
scale_radius()
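becomes (approximately, since qplot draws plain rather than jittered points and no scale_radius() is applied)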
qplot(x = hwy, y = cty, color = displ, shape = drv, size = cyl, data = mpg)
For very quick plots, the defaults are a histogram (if only x is provided), a scatter plot (if both x and y are provided), or a scatter plot of y against row number (if only y is provided).
plot(mpg$cty, mpg$hwy)
qplot(mpg$cty, mpg$hwy)
qplot(cty, hwy, data = mpg)
You can treat it just like any other ggplot call and add additional geoms or settings:
qplot(cty, hwy, data = mpg) + geom_smooth()
## `geom_smooth()` using method = 'loess'
Alternatively, qplot takes various arguments such as geom, facets, etc.
qplot(cty, hwy, data = mpg, geom = 'jitter', facets = drv ~ cyl)
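However, if you’re getting this complicated, you’re probably going to want to switch to full ggplot calls. As stated, qplot is more a replacement for quick plot(x, y) or hist(x) during preliminary data exploration. Note also that qplot is marked as deprecated as of ggplot2 3.4.0, which is another reason to prefer full ggplot calls for anything you intend to keep.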
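For comparison, here is roughly what the faceted, jittered qplot call above looks like as a full ggplot call. It is a sketch of the equivalent plot, not a claim about how qplot builds it internally.

```r
library(ggplot2)

# Full ggplot version of:
# qplot(cty, hwy, data = mpg, geom = 'jitter', facets = drv ~ cyl)
p <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_jitter() +
  facet_grid(drv ~ cyl)

print(p)
```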