ggplot and the grammar of graphics
The “gg” in ggplot stands for “grammar of graphics”.
Grammar is what elevates single words to complex statements.
From OED (online),
“… basic or formal principles, elements, or rules of a particular subject or activity; the study of these…”
The advent of ggplot is somewhat of a philosophical shift (and divide) among relevant “scientists”.
From Hadley Wickham (2010),
The grammar of graphics takes us beyond a limited set of charts (words) to an almost unlimited world of graphical forms (statements).
The rules of graphics grammar are sometimes mathematical and sometimes aesthetic.
To develop grammar skills, you need the relevant vocabulary.
In my experience, ggplot has quite a bit of a “vocabulary overhead”.
Those with more ggplot experience than I sought out probably have the exact opposite feelings.
The idea of “layering” is central to the implementation.
In this framework, a plot is the
These ideas are nothing new, but they are emphasized and implemented a bit differently.
ggplot visualized
Shamelessly excerpted from H. Wickham (2010).
Personally I find it confusing to think of the points (and their shapes) as separate from, in fact in some real sense, prior to the consideration or construction of axes!
Both things can change in a flash: imagine sketching a quick graph by hand before realizing you should change the axes.
It’s not a big deal to stop and reassess, and nothing really changes, just the aesthetics.
You could almost as easily think of the difference between “a” and “b” values of the original variable \(D\) or the graphics “shape” parameter being represented as colors.
That said, shape is more easily distinguished than aspects of color, so it is a safer choice in general.
Additionally,
In fact, few such things are printed these days.
But here the emphasis on accessibility is most important.
Occasionally it is useful to break up a plot over some repeated unit for clarity or emphasis.
This is referred to as faceting. In addition to the choice of point symbol shown in Figure 5, the data for separate groups could be pulled into separate (but otherwise identical) axes, with the point symbol remaining distinct for emphasis.
Again, those figures were pulled from Hadley Wickham’s “A Layered Grammar of Graphics” (2010).
With possibly less work than it took to copy, paste, and edit those figures, as well as to make the acknowledgement in the slides and narration, we could have recreated most of that using “base R” (and soon ggplot()).
Give both a try.
People argue bitterly, and defensively, about their choice of tools.
Personally, I feel more powerful in “base R” (I can get away with this; maybe at the start of your careers you should embrace more cutting-edge tools?). At the end of the day, the more you know the better.
A strength that one user identifies in one system is often the biggest weakness identified by another.
Which is the butter knife, the screwdriver, the power drill?
ggplot
This is like learning a new language, just as the introduction to R has possibly been.
Through practice, I can think about and do more things (in base R), with
Moving to ggplot, our specific language is going to explode, specifically in terms of vocabulary.
ggplot
The main components (Wickham, 2010) of “layered grammar” (i.e., ggplot) are roughly,
Additionally,
Ensure that the copy of the data/ directory is beside this file before rendering.
We have used a few datasets, with combinations of different variable types.
In terms of aesthetics, restrict to thinking of only the choice of variable names from the data and possibly some distinguishing feature of the plot (e.g., shape, size, color, style) assigned from an additional variable.
We begin with ggplot() being given some data or possibly data and the aesthetics (variables and possibly some scheme for annotation). Any of the following work.
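As a minimal sketch of what "any of the following" can look like, the three calls below place the data and the aesthetics at different points and should produce the same scatterplot. The data frame `animals` is a hypothetical stand-in for the course data: the column names Animal, Mass, and MetabolicRate come from the text, but the values are invented for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the course data; the values are invented.
animals <- data.frame(
  Animal        = c("Mouse", "Rat", "Cat", "Dog", "Horse", "Elephant"),
  Mass          = c(0.02, 0.3, 4, 20, 500, 4000),
  MetabolicRate = c(0.2, 1.5, 8, 30, 300, 1500)
)

# 1. Data and aesthetics both given to ggplot().
p1 <- ggplot(animals, aes(x = Mass, y = MetabolicRate)) + geom_point()

# 2. Data given to ggplot(), aesthetics added as a separate piece.
p2 <- ggplot(animals) + aes(x = Mass, y = MetabolicRate) + geom_point()

# 3. Data given to ggplot(), aesthetics supplied inside the geometry.
p3 <- ggplot(animals) + geom_point(aes(x = Mass, y = MetabolicRate))

p1  # printing p1, p2, or p3 draws the plot
```

The base R counterpart for comparison would be something like `plot(MetabolicRate ~ Mass, data = animals)`.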
Verify these in ggplot then recreate them/it in base R.
Some rather helpful combinations are possible. Try some of the following,
Alter the aesthetics by using Animal then Mass to modify the appearance of the
shape = ...
color = ... (or col)
size = ...
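A sketch of these mappings, again using a hypothetical `animals` data frame with invented values (only the column names come from the text):

```r
library(ggplot2)

# Hypothetical stand-in for the course data; the values are invented.
animals <- data.frame(
  Animal        = c("Mouse", "Rat", "Cat", "Dog", "Horse", "Elephant"),
  Mass          = c(0.02, 0.3, 4, 20, 500, 4000),
  MetabolicRate = c(0.2, 1.5, 8, 30, 300, 1500)
)

base_plot <- ggplot(animals, aes(Mass, MetabolicRate))

# A categorical variable mapped to color: one color per level of Animal.
p_color <- base_plot + geom_point(aes(color = Animal))

# A numeric variable mapped to size: point area grows with Mass.
p_size <- base_plot + geom_point(aes(size = Mass))

# A categorical variable mapped to shape: fine for a handful of levels.
p_shape <- base_plot + geom_point(aes(shape = Animal))

p_color
```

Swapping Animal and Mass between these mappings is a quick way to see which combinations are informative and which are not.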
Some combinations simply do not make much sense, or at least don’t add much value.
Here, this is over the top, but specifying col = ... to emphasize values of a third variable can be useful in general.
You could specify shape = ... (\(\approx\) pch = ...), but here there are too many possible values for Animal (called levels), and we run out of shapes.
This refers to the types of graphs that can be produced from the given data.
base R syntax | ggplot analog |
---|---|
pch = ... | shape = (category) |
cex = ... | size = (category or number) |
col = ... | color = (category or number) |

base R syntax | ggplot analog |
---|---|
lty = ... | linetype = ... |
lwd = ... | linewidth = ... |
col = ... | color = (category or number) |
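To make the correspondence concrete, here is a small sketch that sets fixed (not data-driven) values for these options in both systems. The data frame `d` and its values are invented for illustration. Note that the numeric scales are not identical between the two systems (for example, cex is a multiplier while size is an absolute measure), so matching numbers will not give matching pictures.

```r
library(ggplot2)

# Hypothetical data; the values are invented for illustration.
d <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

# Base R: points, then a line on top.
plot(d$x, d$y, pch = 17, cex = 1.5, col = "blue")
lines(d$x, d$y, lty = 2, lwd = 2, col = "red")

# ggplot: the analogous fixed settings, given outside aes().
p_pts <- ggplot(d, aes(x, y)) +
  geom_point(shape = 17, size = 3, color = "blue")
p_both <- p_pts +
  geom_line(linetype = 2, linewidth = 1, color = "red")

p_both
```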
There are a variety of related commands geom_path(), geom_line(), and geom_step() which apply to specific situations and have sensible defaults.
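A short sketch of how the three differ, using an invented data frame whose rows are deliberately not sorted by x: geom_path() connects points in row order, geom_line() connects them in order of x, and geom_step() draws a staircase between them.

```r
library(ggplot2)

# Hypothetical data, deliberately unsorted in x; the values are invented.
d <- data.frame(x = c(1, 3, 2, 4), y = c(1, 2, 3, 4))

p <- ggplot(d, aes(x, y)) + geom_point()

p_path <- p + geom_path()  # follows the order of the rows
p_line <- p + geom_line()  # follows increasing x
p_step <- p + geom_step()  # staircase between successive points

p_path
p_line
p_step
```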
As you encounter the help files, you might see things like stat = "identity" appear.
Additionally, some plot types are the visuals most associated with certain statistical output - these pairings are often defaults.
To avoid option conflicts, it is probably best to keep aes() simple until the geometry is selected or to specify within the geometry itself.
Using values of Animal (or Mass) with some of the relevant mapping options like color =, shape =, or (new) linewidth = or linetype = throws (rather helpful) warnings since some of those don’t make sense for this type of graph or data.
Try it.
There is more to say about “graph types”, but before leaving this dataset, let’s improve the rendering of the “Mouse-Elephant Curve”.
Starting with the base plot,
append ... + scale_x_log10() + scale_y_log10() and ... + geom_smooth(method = lm).
Then move aes(Mass, MetabolicRate) inside geom_point(), as we did earlier. Repeat the experiment and think about what happens.
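A sketch of the first half of that experiment, with the aesthetics still in ggplot(). The `animals` data frame is a hypothetical stand-in with invented values. The second half (aesthetics moved inside geom_point()) is left as posed.

```r
library(ggplot2)

# Hypothetical stand-in for the course data; the values are invented.
animals <- data.frame(
  Animal        = c("Mouse", "Rat", "Cat", "Dog", "Horse", "Elephant"),
  Mass          = c(0.02, 0.3, 4, 20, 500, 4000),
  MetabolicRate = c(0.2, 1.5, 8, 30, 300, 1500)
)

# Base plot: aesthetics given to ggplot(), so every layer can see them.
p <- ggplot(animals, aes(Mass, MetabolicRate)) + geom_point()

# Log-transform both axes, then add a fitted straight line.
p_log <- p + scale_x_log10() + scale_y_log10() + geom_smooth(method = lm)

p_log
```

Because the scales are applied before the smoother, the line is fit to the log-transformed values, which is what makes the curve look straight.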
Let’s revisit the “milestones of adulthood”.
geom_’s)
Both linetype = milestone and color = milestone are nice options with geom_line().
Similarly, setting any of color = ..., shape = ..., or size = ... to milestone while using geom_point() generates a variety of nice results.
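A sketch of these options under loudly stated assumptions: only the column name milestone comes from the text. The data frame `ms_data`, the columns year and percent, the level names A to D, and all values are hypothetical.

```r
library(ggplot2)

# Hypothetical stand-in for the milestones data; names and values are invented.
ms_data <- data.frame(
  year      = rep(c(1980, 2000, 2020), times = 4),
  milestone = rep(c("A", "B", "C", "D"), each = 3),
  percent   = c(80, 70, 60, 75, 68, 55, 60, 50, 40, 45, 40, 30)
)

p <- ggplot(ms_data, aes(year, percent))

# One line per milestone, distinguished by line type or by color.
p_lty <- p + geom_line(aes(linetype = milestone))
p_col <- p + geom_line(aes(color = milestone))

# Points distinguished by both shape and color.
p_pts <- p + geom_point(aes(shape = milestone, color = milestone))

p_lty
```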
Try it. Make some mistakes and weird combinations.
Part of the goal is exploration. That takes practice and patience. Generate graphs using each of the following.
Exploratory graph in ggplot().
An implementation in base R.
Adding a legend only takes legend("topright", ms, lty = 1:4).
Plotting NULL is sometimes a nice trick, but note the axis labels.
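One way the base R version might look, as a sketch: plot NULL to set up empty axes, then add one line per milestone in a loop. The data frame `ms_data`, its columns year and percent, and all values are hypothetical. `ms` is assumed here to be the vector of milestone names used in the legend call.

```r
# Hypothetical stand-in for the milestones data; names and values are invented.
ms_data <- data.frame(
  year      = rep(c(1980, 2000, 2020), times = 4),
  milestone = rep(c("A", "B", "C", "D"), each = 3),
  percent   = c(80, 70, 60, 75, 68, 55, 60, 50, 40, 45, 40, 30)
)

# Assumed meaning of ms: the distinct milestone names.
ms <- unique(ms_data$milestone)

# Empty axes; xlab and ylab are set explicitly because the defaults
# would otherwise be derived from the NULL argument.
plot(NULL,
     xlim = range(ms_data$year), ylim = range(ms_data$percent),
     xlab = "year", ylab = "percent")

# One line per milestone, each with its own line type.
for (i in seq_along(ms)) {
  one <- ms_data[ms_data$milestone == ms[i], ]
  lines(one$year, one$percent, lty = i)
}

legend("topright", ms, lty = 1:4)
```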
Admitting my weakness (preference) for “base R”, I will say that I enjoy the fact that some of the more “hands-on” programming often helps me feel like I’ve gotten a better sense of the data itself.
In that sense I feel like I pay a bit more attention to the project, as opposed to perhaps racing through and feeling like I am done before I am.
We can separate panels by a given variable using facet_grid().
Interestingly, with slightly different syntax, facet_wrap() does something similar.
These are sometimes called “small multiples” where the structure of the graph is repeated but the content changes in a clear and consistent way from panel to panel.
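A sketch of both faceting functions on a hypothetical `ms_data` data frame (only the column name milestone comes from the text, everything else is invented). facet_grid() lays panels out along rows and/or columns given by a formula, while facet_wrap() wraps a sequence of panels into a grid.

```r
library(ggplot2)

# Hypothetical stand-in for the milestones data; names and values are invented.
ms_data <- data.frame(
  year      = rep(c(1980, 2000, 2020), times = 4),
  milestone = rep(c("A", "B", "C", "D"), each = 3),
  percent   = c(80, 70, 60, 75, 68, 55, 60, 50, 40, 45, 40, 30)
)

p <- ggplot(ms_data, aes(year, percent)) + geom_line()

# One row of panels, one column per milestone.
p_grid <- p + facet_grid(. ~ milestone)

# The same panels, wrapped into a roughly square arrangement.
p_wrap <- p + facet_wrap(~ milestone)

p_grid
p_wrap
```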
We can explore a few other plot types using our third small dataset.
There’s nothing wrong with using this small, “simple” dataset. The na.omit() line simply removes some incomplete rows from our raw data.
We can produce a boxplot.
Layer on related visualizations ... + geom_violin(color = "red") and ... + geom_jitter(color = rgb(0, 0, 1, alpha = 0.5)).
The “violin” plot shows some of the features of an empirical (kernel) density.
The density plot is essentially a smoothed histogram, which itself is a higher resolution version of a boxplot, which itself is a visual of the five-number summary.
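A sketch of the layering described above. The third dataset is not reproduced here, so `toy` is a hypothetical data frame of simulated values with two missing entries added so that na.omit() has something to remove.

```r
library(ggplot2)

# Hypothetical data: simulated values, not the course dataset.
set.seed(1)
toy <- data.frame(
  group = rep(c("g1", "g2"), each = 30),
  value = c(rnorm(30, mean = 10, sd = 2), rnorm(30, mean = 12, sd = 3))
)
toy$value[c(5, 40)] <- NA  # introduce two incomplete rows
toy <- na.omit(toy)        # remove them again, as in the text

# Boxplot first, then the violin and the jittered raw points on top.
p_box <- ggplot(toy, aes(group, value)) + geom_boxplot()
p_all <- p_box +
  geom_violin(color = "red") +
  geom_jitter(color = rgb(0, 0, 1, alpha = 0.5))

p_all
```

Layers are drawn in the order they are added, so the violin's default fill may hide part of the boxplot beneath it. Adding fill = NA to geom_violin() is one way to see through it.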
The only other helpful thing to consider at this point would be to use
as the analog to mtext() from base R.