ggplot and concepts — what’s right, and what’s wrong

A few months back I gave a presentation to the NYC R Meetup. (R is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on ggplot2, a popular package for generating graphs of data and statistics. In the talk (which you can see here, including both my slides and my patter!) I presented both the really great things about ggplot2 and some of its downsides. In this blog post, I want to expand a bit on my thinking on ggplot, the Grammar of Graphics, and how people’s conceptual representations of graphs, data, ggplot, and R all interact. ggplot is both incredibly elegant and unfortunately difficult to learn to use well, which I think is a consequence of the number of representations a user has to juggle.

The ggplot package, written by the overachieving and remarkable Hadley Wickham, is based on earlier more theoretical work by Leland Wilkinson. Wilkinson abstracted the process of putting data onto an image, and created a Grammar of Graphics, which describes how the data maps to the parts of a graph, rather than describing the final graph itself. For example, here’s how to create a pie chart, clipped from Wilkinson’s book:

Don’t worry about the details, but briefly, a pie chart is just a stacked bar graph (summary.proportion) plotted in polar coordinates (polar.theta). If you took the time to learn this grammar, you would realize that the hierarchical structure of a graph on a page (elements have positions and labels and visual properties like color, each of which have their own abstract structure) maps cleanly to the hierarchical structure of the grammar, and that variables in the grammar map cleanly to the linear structure of the data. As a user of this system, you would be able to see all three key representations at once: the data, the grammatical mapping from data to graph, and the graph itself.

Now consider ggplot, the implementation of the Grammar of Graphics in the R programming language. Does ggplot maintain three visible representations, all straightforwardly mappable to each other? Sadly, it does not. Instead, users of ggplot must map among four representations: the data (a standard data.frame object), the R syntax for ggplot2 (which has some quirks), an underlying ggplot object (similar to the Grammar of Graphics, but vastly more complex and impossible to examine directly), and the generated graph.

Consider the simple pie graph, below.

This chart is generated in ggplot2 by the following R code:

> zz <- data.frame(cat=c("a", "b"), val=c(5,3))
> zz
  cat val
1   a   5
2   b   3
> pp <- ggplot(zz, aes(x="", y=val, fill=cat)) + geom_bar(width=1) + coord_polar(theta="y")
> print(pp)

The print() function is optional within an interactive R session, but I include it because it illustrates a point that’s not initially obvious to many users. Unlike the built-in R plotting tools, the ggplot() function and its associated functions don’t plot anything on the screen; they just construct an object of class “ggplot”. Almost all of the actual work of mapping the data to stuff on your screen occurs when you render that object, using print() or ggsave().
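To make that concrete, here is a minimal sketch, assuming a recent ggplot2 release. In current versions geom_col() is the usual way to draw bars from precomputed heights, so it stands in here for the geom_bar(width=1) call in the session above; that substitution, and the temporary PDF path, are my assumptions for illustration only.

```r
library(ggplot2)

zz <- data.frame(cat = c("a", "b"), val = c(5, 3))

# Building the plot only constructs an object; nothing is drawn yet.
pp <- ggplot(zz, aes(x = "", y = val, fill = cat)) +
  geom_col(width = 1) +        # assumption: stands in for the older geom_bar() usage
  coord_polar(theta = "y")

# The result is an ordinary R object that can be stored and passed around.
is_plot_object <- inherits(pp, "ggplot")

# Rendering happens only here (print(pp) would do the same on the active device).
out_file <- tempfile(fileext = ".pdf")   # hypothetical output location
ggsave(out_file, plot = pp, width = 4, height = 4)
rendered <- file.exists(out_file)
```

Nothing appears on screen or on disk until the ggsave() line runs, which is the separation between constructing a plot and drawing it.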

So what does that object look like? If you type str(pp), you’ll get an answer, but it’s about a hundred lines of undecipherable hierarchical object and list structure, not intended to be examined by mere mortals. But there’s something critically important about that structure — like the original Grammar of Graphics, and unlike the R syntax above, it’s hierarchically structured.

In the R syntax, you create a base ggplot structure with the ggplot() call, then you abuse the “+” operator to make changes to that structure. The geom_bar() function adds a layer to the ggplot() object, where a layer is just what it sounds like, a set of information about one of potentially many overlaid layers of content that will be put on the graph. So you construct a ggplot object by first initializing everything about the basic plot, then tack on layers with +, right? Actually no, because the coord_polar() call doesn’t create or modify a layer at all, it modifies the base object! Even if you’ve acquired the nonobvious intuition that ggplot objects are hierarchical and are created by concatenating layers, you now have to break the analogy again to fully understand what + is doing!
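The difference between the two uses of + can be checked directly, at least in a rough way. The sketch below assumes a recent ggplot2 release in which the plot object exposes its layers and coordinate system through $layers and $coordinates; the names p0, p1, and p2 are mine, and geom_col() again stands in for the older geom_bar() usage.

```r
library(ggplot2)

zz <- data.frame(cat = c("a", "b"), val = c(5, 3))

p0 <- ggplot(zz, aes(x = "", y = val, fill = cat))  # base object, no layers
p1 <- p0 + geom_col(width = 1)                      # + appends a layer
p2 <- p1 + coord_polar(theta = "y")                 # + replaces the coordinate system

# Layer counts: only the geom call changes the number of layers.
n_layers <- c(length(p0$layers), length(p1$layers), length(p2$layers))

# Coordinate systems: only the coord call changes this part of the base object.
coord_before <- class(p1$coordinates)[1]
coord_after  <- class(p2$coordinates)[1]
```

The same operator appends to a list in one case and overwrites a single slot of the base object in the other, which is exactly where the "concatenating layers" analogy stops working.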

There is a way to partially see the structure directly, but it’s not well thought-out from the point of view of someone trying to learn how to use the package. The summary() method on ggplot objects tells you about things you didn’t specify (faceting?), it’s incomplete, and it doesn’t map well to the R syntax. If something in your plot isn’t working the way you want it to, summary() won’t help you.

> summary(pp)
mapping:  x = , y = val, fill = cat
faceting: facet_grid(. ~ ., FALSE)
stat_bin: width = 1
position_stack: (width = NULL, height = NULL)

Another shortcut that leads to conceptual problems for ggplot beginners is the use of qplot(). The qplot() function is a wrapper around ggplot(). Unlike ggplot(), you can give qplot() data that is not in the form of a data.frame, and the syntax is somewhat different. There’s nothing wrong with some syntactic sugar to make life easier, but in this case, learning ggplot by starting with qplot is like trying to learn a foreign language by starting with contractions and slang. You may be able to say a few essential things on your vacation, but you won’t be able to creatively construct new sentences as new situations arise. The brilliance of the Grammar of Graphics is exactly that it’s a grammar: you can construct new graphs and new types of graphs as new situations arise! But tutorials that start with qplot, with the ggplot book an unfortunate (but in other ways excellent) example, send their learners down a linguistic garden path. To fully use the power of the system requires unlearning the conceptual structures that map the slang to charts on a screen, and starting over with learning the new, more powerful ggplot() grammar and hierarchical representations.

I’d like to conclude this overlong rant with two notes. First, just today a new graphics package for R was introduced. jjplot uses many of the ideas of the Grammar of Graphics and ggplot2, but seems to avoid at least a few of the conceptual problems. The + operator is not overloaded in conceptually confusing ways, and there is no distracting qplot function to mislead new users. Additionally, a quick look at the source code finds it much, much simpler than ggplot2’s source, which will likely lead to a more active base of contributors. I look forward to trying jjplot and watching its continuing development, and hope the authors learn from both the remarkable successes and frustrating failures of ggplot. Second, I use ggplot extensively in my work. It’s simply the best available tool for quickly generating elegant graphs of data in R, especially if that generation needs to happen automatically in code. Hadley Wickham deserves extensive praise for the amount of effort he has put into developing and popularizing the Grammar of Graphics. If you want to be maximally effective when visualizing data in R, take the time to learn ggplot2, but do so while keeping in mind that the learning process will be easiest if you skip qplot and other shortcuts, think hierarchically, and prepare for some frustration. Fortunately, the support communities on the ggplot mailing list and Stack Overflow are extremely helpful, as is Hadley himself.

10 thoughts on “ggplot and concepts — what’s right, and what’s wrong

  1. Pingback: Somethink to Chew On » how to speak ggplot2 like a native, and Predictive Analytics World

  2. stat arb

    Yeah, this looks a little confusing conceptually … but it sounds like if one mastered it, then one has already abstracted presentation from data in a programmatic way, such that any implementation would already have the visualization taken care of.

    Kind of like LaTeX I guess. But it’s funny how the default is supposed to be all you need and yet people modify things all the time. All of my formulae have \, \; \! for example and I’m not a command-defining wizard like some.

  3. MartinInFrankfurtaM

    After trying for weeks to understand the gg-implementation of the grammar of graphics, after studying the book and the web-site and the forums, I feel enlightened after reading this post.
    The ggplot use of the overloaded “+” amounts to abuse, because addition is commutative and layering is not.
    Could you please write more about the structure of the ggplot() object and the print method?

  4. eric

    I’ve been struggling with ggplot and the Wickham book for a few weeks now. I find ggplot VERY difficult to learn. And the book is not really all that great in terms of explaining things. Concepts like “a pie chart is just a stacked bar chart”, “objects are hierarchical”, “the + operator is overloaded”, and “what we’re doing here is concatenating layers” make me think to myself, huh, what ????

    I sure hope jjplot takes off and someone writes a simple easy to understand book about it.

  5. LC

    Very nice post. I do think that there are many appealing concepts and decisions introduced in ggplot, but I get burned every time I learn it because it’s restrictive in the types of plots you can make with its native syntax (HW deliberately makes it difficult if not impossible to make types of plots of which he disapproves, and these mostly involve adding axes or customizing limits), and also because of the (lack of) speed. And you’re quite right about the “abuse” of the operator! I couldn’t quite put my finger on it but you are spot on in that it does break your intuitive understanding of what is advertised as a superposition of layers. Having complained about all this, HW’s packages (reshape, plyr) are still the main reason why I use R, but in the end I still go back to lattice graphics for the plotting. It’s warty, but fast and more amenable to further customization than ggplot2 (and there’s even a ggplot2like() theme setting available in latticeExtra).

  6. chango

    Interesting criticisms of ggplot2. I actually came to this page from R bloggers to see if Hadley had weighed in. He’s extremely open to criticism and outside ideas (there are many developers of ggplot and they are an amazing community, he’s not the only person driving it).

    In fact, ggplot2 is currently undergoing major work (under the guise of a separate layers package); see the thread in the ggplot2-dev group.

    Here he describes the what and the why of rewriting it. He’s aware of the speed issues, and of the nightmare of the proto objects (which I was expecting to be its major drawback). The appropriate use of operators for clarity is a great point (though I believe commutativity is a property of the object being operated on and not of the operator itself) and I too believe that qplot should be scrapped. Alas, I do not understand the need to build a separate, new jjplot… but it could be cool. There’s just so much awesome work going on in ggplot2.

    It’s actually why I switched to R and Hadley’s other packages are equally paradigm changing and save me a ton of work. When they defuddle me, the community is strongly supportive.

  7. Robert Weiss

    Hi all,
    I have been using ggplot2 for 2 years now … here and there in my research, mainly parsing large data files and trying to plot them.
    Conceptually this package is a mess … you start with qplot … you figure out ggplot is the root concept … you start with pre-configured geoms … you find out +layers() are the way … so far so good … but then … stats … what the f***? … so you can split mappings of variables to graphical elements and then stats … sounds cool … conceptually … totally wrong … as nothing is clearly separated; it is all mixed across everything.
    … but the worst … the absolutely worst thing ever in this package … are the error messages … I honestly think it would be better to just suppress them all … never ever have I seen, or coded myself, in the Linux kernel such unhelpful error message output as ggplot has … you need to always have Google open to find a working example … so, long story … sad story …

    I have hardly ever been forced to use such a piece of scrambled functionality to get a proper plot.

  8. Robert Weiss

    …and to provide one example where my brain did and always will fail after 2 weeks of not using ggplot, there is this weird concept -> citation:

    By default, geom_bar assumes that your data is unaggregated. There
    are two ways to do what you want.

    Supply weights:

    ggplot(d, aes(x = Gender, weight = Freq)) +
    geom_bar() +
    facet_wrap(~Dept)

    Or disable the default aggregation:

    ggplot(d, aes(x = Gender, y = Freq)) +
    geom_bar(stat = "identity") +
    facet_wrap(~Dept)

  9. Mike Williamson


    Yes, I would agree on both aspects of speed, and inability to do “unapproved” things. The speed doesn’t seem a big deal, until you start to incorporate automated quick & dirty exploratory graphs into part of your daily routine. I had to stop using ggplot2 for a while b/c of how slow it was, for a time when I was generating & studying hundreds of graphs (the easiest way for me to capture relationships that were not obvious in their form… typically non-linear).
    Yet, it is still a great package and SO deep. At my best, when I understood ggplot well, I could really make some beautiful — and varied — graphs that are really about the only ones I would feel comfortable publishing (generated from ‘R’, that is). But I have not used it for a while, and now I feel like a clueless small child once again.

    There is so much to love about ggplot, though. I think that the original publication of the Grammar of Graphics, and Hadley Wickham’s implementation of ggplot, have forged a bold path towards where we need to go. Thinking boldly leads to failure at times, but is also a necessary step towards the changes that are really needed.

    I am now curious about jjplot, but I am certain that things will keep improving, no matter the “winning” package.

