<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Somethink to Chew On &#187; ggplot2</title>
	<atom:link href="http://www.harlan.harris.name/tag/ggplot2/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.harlan.harris.name</link>
	<description>the blog of Harlan Harris</description>
	<lastBuildDate>Sun, 06 Nov 2011 20:57:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>intuitive visualizations of categorization for non-technical audiences</title>
		<link>http://www.harlan.harris.name/2011/04/visualizing-categorization-models/</link>
		<comments>http://www.harlan.harris.name/2011/04/visualizing-categorization-models/#comments</comments>
		<pubDate>Mon, 25 Apr 2011 12:45:48 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[dataviz]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[predictive]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=88</guid>
		<description><![CDATA[For a project I&#8217;m working on at work, I&#8217;m building a predictive model that categorizes something (I can&#8217;t tell you what) into two bins. There is a default bin that 95% of the things belong to and a bin that the business cares a lot about, containing 5% of the things. Some readers may be [...]]]></description>
			<content:encoded><![CDATA[<p>For a project I&#8217;m working on at work, I&#8217;m building a predictive model that categorizes something (I can&#8217;t tell you what) into two bins. There is a default bin that 95% of the things belong to and a bin that the business cares a lot about, containing 5% of the things. Some readers may be familiar with the use of predictive models to identify better sales leads, so that you can target the leads most likely to convert and minimize the amount of effort wasted on people who won&#8217;t purchase your product. Although my situation doesn&#8217;t have to do with sales leads, I&#8217;m going to pretend it does, as it&#8217;s a common domain.</p>
<p>My data is many thousands of &#8220;leads&#8221;, for which I&#8217;ve constructed hundreds of predictive features (mostly 1/0, a few numeric) each. I can plug this data into any number of common statistical and machine learning systems which will crunch the numbers and provide a black box that can do a pretty good job of separating more-valuable leads from less valuable leads. That&#8217;s great, but now I have to communicate what I&#8217;ve done, and how valuable it is, to an audience that struggles with relatively simple statistical concepts like correlation. What can I do?</p>
<p><span id="more-88"></span></p>
<p>I&#8217;m generally interested in finding better ways to build clean, intuitive, and informative visualizations of data, especially when the visualizations can leverage intuitions and skills that everyone has. For example, almost everyone has a surprisingly good <a href="http://www.nytimes.com/2008/09/16/science/16angi.html" target="_blank">approximate number sense</a>, the ability to quickly identify about how many items are in a largish group. For example, if shown a photo of 30 oranges and a photo of 20 oranges, you would be able to immediately say that there were more oranges in the first photo, and you would happily say that that photo had a few dozen oranges in it. This psychological skill can be used to make more effective visualizations of certain types of data. Instead of comparing two quantities by lines in a chart, or even a number in a table, it may be useful to compare <em>visual density</em>.</p>
<p>How can this be used to make better visualizations of prediction quality? Consider the standard ways that predictive model quality is reported. I have obfuscated the test set data from the problem I mentioned above, and placed it in a <a href="http://dl.dropbox.com/u/7644953/classifier-visualization.Rdata">public Dropbox</a> in Rdata format. I&#8217;ve also put together an <a href="http://www.r-project.org/" target="_blank">R</a> script to demonstrate various ways of looking at the predictions and put it in a <a href="http://gist.github.com/937821" target="_blank">Github gist</a>. Follow along if you&#8217;d like.</p>
<p>First, take a look at the data frame and some summary statistics:</p>
<pre>&gt; head(pred.df)
      predicted actual actual.bin
7379  0.6020833    yes          1
5357  0.5791667    yes          1
7894  0.5791667    yes          1
5893  0.5604167    yes          1
16093 0.5541667    yes          1
2883  0.5520833    yes          1

&gt; summary(pred.df)
   predicted        actual       actual.bin
 Min.   :0.000000   no :7785   Min.   :0.0000
 1st Qu.:0.004167   yes: 366   1st Qu.:0.0000
 Median :0.016667              Median :0.0000
 Mean   :0.040827              Mean   :0.0449
 3rd Qu.:0.041667              3rd Qu.:0.0000
 Max.   :0.602083              Max.   :1.0000</pre>
<p>The model predicts that about 4% of the items will be in the &#8220;yes&#8221; category, which is similar to the 4.5% that actually were. Using the very flexible <a href="http://cran.r-project.org/web/packages/ROCR/index.html" target="_blank">ROCR</a> package, I can quickly and easily convert this data frame into an object that can then be used to calculate any number of standard measures of predictiveness. First, I calculate the <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic" target="_blank">AUC</a> value, which has a very intuitive interpretation. Consider sorting the list of items from most-predicted-to-be-&#8220;yes&#8221; to least. If the predictions are good, most of the &#8220;yes&#8221; values will be relatively high in the list. The AUC is equivalent to asking, if I randomly pick a &#8220;yes&#8221; item and a &#8220;no&#8221; item out of the list, how likely is the &#8220;yes&#8221; item to be higher on the list? If the list were randomly shuffled, the AUC would be 0.5; if it were perfectly sorted with 20/20 hindsight, the AUC would be 1.0.</p>
<pre>&gt; library(ROCR)
&gt; # convert to their object type (labels should be some sort of ordered type)
&gt; pred.rocr &lt;- prediction(pred.df$predicted, pred.df$actual)
&gt; # Area Under the ROC Curve
&gt; performance(pred.rocr, 'auc')@y.values[[1]]
[1] 0.8237496</pre>
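<p>The pairwise interpretation can be checked by hand. The following is a minimal sketch on a made-up eight-item data frame (the name toy and its values are illustrative assumptions, not the post&#8217;s data). It computes the AUC once by comparing every &#8220;yes&#8221;/&#8220;no&#8221; pair and once with the equivalent rank-based (Mann-Whitney) shortcut.</p>

```r
# Hypothetical toy scores standing in for pred.df (not the post's data).
toy <- data.frame(
  predicted  = c(0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05),
  actual.bin = c(1,    1,    0,    1,    0,    0,    0,    0)
)

# Pairwise definition: fraction of (yes, no) pairs in which the "yes" item
# has the higher score; ties count as half a win.
auc_pairwise <- function(score, label) {
  yes <- score[label == 1]
  no  <- score[label == 0]
  mean(outer(yes, no, ">") + 0.5 * outer(yes, no, "=="))
}

# Rank-based shortcut (Mann-Whitney U), which avoids forming every pair.
auc_rank <- function(score, label) {
  r     <- rank(score)            # average ranks handle ties
  n_yes <- sum(label == 1)
  n_no  <- sum(label == 0)
  (sum(r[label == 1]) - n_yes * (n_yes + 1) / 2) / (n_yes * n_no)
}

auc_pairwise(toy$predicted, toy$actual.bin)  # 14 of 15 pairs, about 0.933
auc_rank(toy$predicted, toy$actual.bin)      # same value
```

<p>On the real data, either function applied to pred.df$predicted and pred.df$actual.bin should agree with the ROCR value above, up to floating-point error.</p>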
<p>In this case, it&#8217;s about .82, which is probably valuable but far from perfect. Another common way of looking at this type of prediction comes from business uses, where the goal is to identify leads (or whatever) that are likely to convert to purchases. From this point of view, the goal is to <em>lift</em> the leads higher in the list, so that you can focus on the top of the list and get more benefit from sales effort with less work. Two common ways of looking at lift are with a decile table, which shows how much value you get by focusing on the top 10%, 20%, etc. of the list, sorted by the predictive model, and the lift chart, which visualizes the same thing by showing how much benefit over random guessing you get by looking at more or less of the sorted list. Here they are for this data:</p>
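<pre># decile table; ldply() comes from the plyr package
library(plyr)
# the row slices below assume pred.df is sorted from highest to lowest prediction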
dec.table &lt;- ldply((1:10)/10, function(x) data.frame(
    decile=x,
    prop.yes=sum(pred.df$actual.bin[1:ceiling(nrow(pred.df)*x)])/sum(pred.df$actual.bin),
    lift=mean(pred.df$actual.bin[1:ceiling(nrow(pred.df)*x)])/mean(pred.df$actual.bin)))
print(dec.table, digits=2)

   decile prop.yes lift
1     0.1     0.61  6.1
2     0.2     0.69  3.4
3     0.3     0.76  2.5
4     0.4     0.80  2.0
5     0.5     0.84  1.7
6     0.6     0.90  1.5
7     0.7     0.92  1.3
8     0.8     0.95  1.2
9     0.9     0.99  1.1
10    1.0     1.00  1.0

# Lift Curve
plot(performance(pred.rocr, 'lift', 'rpp'))</pre>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/liftchart.png"><img class="aligncenter size-full wp-image-91" title="Lift Chart" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/liftchart.png" alt="" width="400" height="350" /></a><br />
This graph shows, not particularly intuitively in my view, that if you focus on the top 10% of the data, you get more than 5 times the bang for the buck than if you focus evenly on the whole set of items. The decile table shows the same thing &#8212; the top decile is lifted by a factor of 6.1, and in fact you get 61% of the &#8220;yes&#8221; items in that top 10% of the data. These are very useful numbers to know, but I think there are considerably more intuitive ways of showing how the predictive model pulls the &#8220;yes&#8221; values away from the 5% base rate.</p>
<p>These more intuitive ways are <em>not</em> the standard graphs used in statistics and machine learning, such as the sensitivity/specificity curve and the ROC curve. Those graphs, shown below, illustrate trade-offs between accepting false positives and false negatives. Useful, yes, but to understand them you have to think about the ways you could set a threshold and what effect that threshold would have on the nature of your predictions. That&#8217;s not particularly intuitive, and the visualization doesn&#8217;t visually contrast two things, so it&#8217;s difficult to get an intuitive understanding of what has been gained.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/specsens.png"><img class="aligncenter size-full wp-image-94" title="specsens" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/specsens.png" alt="" width="367" height="276" /></a><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/roc.png"><img class="aligncenter size-full wp-image-95" title="ROC" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/roc.png" alt="" width="376" height="270" /></a>I&#8217;ve put some thought and some tinkering into potentially better ways of visualizing the output of predictive models. The key, I think, is to use a visualization that builds on the scatter graph. Scatter graphs are great for less-technical audiences, because you can tell them that every individual dot is a customer (widget, whatever). They can immediately see the number of items in question, and if you can plot the points on axes that make sense to them, they can go from &#8220;that dot there represents one person with this level of X and this level of Y&#8221;, to &#8220;this set of dots represents a set of people with similar levels of X and Y&#8221;, to &#8220;this graph represents everyone, and their respective levels of X and Y.&#8221; And because of skills like the approximate number sense and the ability to quickly understand visual density, scatter graphs can give a vastly better understanding of the range of a data set than summary graphs that just plot a line and maybe some error bars.</p>
<p>Here are several versions of a graph that illustrates how the predictive model smears out the set of dots from the 5% base rate, disproportionately pulling the &#8220;yes&#8221; items to the right, separating at least some of them from the much larger set of &#8220;no&#8221; items. One key change from a basic scatter graph is to jitter the Y position of each point randomly, which I think makes these graphs look a little like a <a href="http://en.wikipedia.org/wiki/Agarose_gel_electrophoresis" target="_blank">PCR gel image</a>.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/dual.png"><img class="aligncenter size-full wp-image-97" title="dual" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/dual.png" alt="" width="330" height="400" /></a>This first approach is built around a basic scatter graph, where the X axis is the predicted likelihood of being a &#8220;yes&#8221;, and the Y axis is 0 for actual &#8220;no&#8221; and 1 for actual &#8220;yes&#8221; items. On top of that is an orange line representing the base rate of about 5%, a blue line showing the smoothed ratio between &#8220;yes&#8221; and &#8220;no&#8221; items at each level of prediction, and a thin grey line showing where the blue line ought to be. In this case, the model tends to underestimate the likelihood that some items are &#8220;yes&#8221; items. At a predicted likelihood of 50%, half of the items should be &#8220;yes&#8221; and half should be &#8220;no&#8221;, but the actual ratio is more like 3:1.</p>
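<p>For readers without access to the gist, a graph in this spirit can be roughed out as follows. This is only a sketch under stated assumptions, not the code behind the image: the data frame sim.df is simulated, the loess smoother stands in for whatever smoothing was actually used, and drawing the base rate as a vertical line is a guess at the orientation.</p>

```r
library(ggplot2)

# Simulated stand-in for pred.df (hypothetical, not the post's test set):
# roughly 5% "yes" items, with "yes" items tending to get higher scores.
set.seed(42)
n <- 2000
actual.bin <- rbinom(n, 1, 0.05)
predicted  <- ifelse(actual.bin == 1, rbeta(n, 2, 6), rbeta(n, 1, 20))
sim.df <- data.frame(predicted = predicted, actual.bin = actual.bin)

base.rate <- mean(sim.df$actual.bin)

p <- ggplot(sim.df, aes(x = predicted, y = actual.bin)) +
  # jitter only vertically, so each dot keeps its predicted value on the X axis
  geom_point(position = position_jitter(width = 0, height = 0.2),
             alpha = 0.3, size = 1) +
  # smoothed proportion of "yes" items at each level of prediction
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE, colour = "blue") +
  # where the smoothed line would sit if predictions were perfectly calibrated
  geom_abline(intercept = 0, slope = 1, colour = "grey50") +
  # base rate (orientation is an assumption)
  geom_vline(xintercept = base.rate, colour = "orange") +
  labs(x = "predicted likelihood of yes", y = "actual (0 = no, 1 = yes)")

print(p)
```

<p>The vertical-only jitter is the design choice that matters: it spreads the dots into two readable bands without moving any of them along the prediction axis.</p>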
<p>I like this graph as it intuitively lets people see the extent to which the predictive model is separating the categories, and how much better it does than just assuming the base rate. My second approach at this combines the &#8220;smears&#8221; with another way of visualizing lift.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/single.png"><img class="aligncenter size-full wp-image-96" title="single" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/single.png" alt="" width="400" height="400" /></a>In this graph, the smeared real data is at the bottom of the graph, and the black line represents the lift, or how much better you are at identifying &#8220;yes&#8221; items by using the predictions. It&#8217;s also an intuitive way of motivating the need to draw a boundary to focus effort. When trying to convert the points at the 25% level or above, you may be ineffective 75% of the time, but you&#8217;re also more than 10 times more efficient than you would be otherwise.</p>
<p>My final attempt worth sharing is this one, which combines the dual smear approach with the cumulative value numbers from the lift table.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/dualcum.png"><img class="aligncenter size-full wp-image-98" title="dualcum" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/dualcum.png" alt="" width="450" height="400" /></a>Now, in addition to being able to see the density of yes and no items for various levels of the prediction, you can see what proportion of the potential &#8220;yes&#8221; values exist to the <em>right </em>of each level of the prediction. For example, at a threshold of 25%, you capture 40 or 45% of the &#8220;yes&#8221; items. At a 5% threshold you capture more than 70% of the &#8220;yes&#8221; items.</p>
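<p>The cumulative number read off this graph is simply the share of all &#8220;yes&#8221; items whose prediction is at or above a threshold. A minimal sketch, using a made-up ten-row data frame (toy and capture_rate are hypothetical names, and the values are not the post&#8217;s data):</p>

```r
# Hypothetical toy data standing in for pred.df.
toy <- data.frame(
  predicted  = c(0.60, 0.45, 0.30, 0.26, 0.10, 0.06, 0.04, 0.02, 0.01, 0.00),
  actual.bin = c(1,    1,    0,    1,    0,    1,    0,    0,    1,    0)
)

# Proportion of all "yes" items whose prediction is at or above the threshold.
capture_rate <- function(df, threshold) {
  sum(df$actual.bin[df$predicted >= threshold]) / sum(df$actual.bin)
}

thresholds <- c(0.25, 0.05)
data.frame(threshold = thresholds,
           captured  = sapply(thresholds, function(t) capture_rate(toy, t)))
# In this toy data: 3 of 5 "yes" items at 0.25, 4 of 5 at 0.05
```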
<p>I&#8217;d love some feedback on these graphs! Do you agree with my assertion that scatter graphs are more visually intuitive and easier to motivate to non-technical audiences? Do these variations on lift charts seem clearer or more valuable than traditional alternatives to you? Have I re-invented something that should be cited?</p>
<p>The R code for these graphs is available in the <a href="https://gist.github.com/937821" target="_blank">Github gist</a>. I used <a href="http://www.harlan.harris.name/tag/ggplot2/" target="_blank">ggplot2</a>, naturally, which is an essential tool for exploring the space of possible visualizations without being tied down by traditional graph structures.</p>
<p>Incidentally, for people interested in building graphs that leverage people&#8217;s innate visual capabilities, I recommend Kosslyn&#8217;s book, <a href="http://www.amazon.com/Graph-Design-Mind-Stephen-Kosslyn/dp/0195311841" target="_blank">Graph Design for the Eye and Mind</a>.</p>
<p>Also incidentally, the question of how to communicate or visualize the potentially incredibly complex sets of rules/weights/whatever inside the categorization black box is another fascinating issue, the subject of ongoing research, and maybe something I&#8217;ll write about soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2011/04/visualizing-categorization-models/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>how to speak ggplot2 like a native, and Predictive Analytics World</title>
		<link>http://www.harlan.harris.name/2010/10/how-to-speak-ggplot2-like-a-native-and-predictive-analytics-world/</link>
		<comments>http://www.harlan.harris.name/2010/10/how-to-speak-ggplot2-like-a-native-and-predictive-analytics-world/#comments</comments>
		<pubDate>Sun, 24 Oct 2010 23:17:08 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[meetup]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=71</guid>
		<description><![CDATA[I was recently given the opportunity to re-present my ggplot2 talk, which I originally gave to the NYC R Meetup, to the DC R Meetup group. The Meetup was held co-located with the Predictive Analytics World conference in Alexandria, VA. (More on my thoughts on PAW below&#8230;) Contentwise, I made only small changes, changing a [...]]]></description>
			<content:encoded><![CDATA[<p>I was recently given the opportunity to re-present <a href="http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/">my ggplot2 talk, which I originally gave</a> to <a href="http://www.meetup.com/nyhackr/">the NYC R Meetup</a>, to <a href="http://www.meetup.com/R-users-DC/">the DC R Meetup</a> group. The Meetup was co-located with the <a href="http://www.predictiveanalyticsworld.com/">Predictive Analytics World</a> conference in Alexandria, VA. (More on my thoughts on PAW below&#8230;) Contentwise, I made only small changes, changing a bit of patter and adding more examples at the end. I still love ggplot, with some frustration at the way it is typically introduced. Some of the audience had no R experience at all, while others were experts. One person, a grad student at U. of Maryland, had run into difficulties very similar to mine when first learning ggplot2, and his enthusiastic nods during my presentation were very validating! For reference, <a href="http://www.meetup.com/R-users-DC/calendar/14236478/">the Meetup page is here</a>, and I stuck the current version of the slides in a public <a href="http://www.dropbox.com/">Dropbox</a>, <a href="http://dl.dropbox.com/u/7644953/ggplotIntro%20-%20PAW2010.pptx">located here</a>.</p>
<p>And a few thoughts about PAW. The conference was well-run (although I have my gripes with the hotel and its location!) and there was an interesting and eclectic lineup of speakers, from a variety of industries. Compared to academic conferences I&#8217;ve attended, I missed having all the grad students around. At PAW, I felt rather young, which had not been true at academic conferences in quite a long time! The content of the conference focused on people using predictive methods (statistics, data mining, machine learning) at the individual-customer level, for marketing or retention or other purposes. That&#8217;s not my primary interest right now &#8212; my work is focused at a slightly higher <a href="http://en.wikipedia.org/wiki/Operations_research">operations-research</a>-y level, trying to make sure that customers in the aggregate have good options. But I enjoyed learning about what other people are doing using somewhat similar methods. Next year, though, I think I&#8217;ll try to go to a different conference, perhaps <a href="http://www.warwick.ac.uk/statsdept/useR-2011/">UseR!</a> in the UK, or <a href="http://meetings2.informs.org/Practice2011/">INFORMS&#8217; applied conference</a>&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2010/10/how-to-speak-ggplot2-like-a-native-and-predictive-analytics-world/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>ggplot and concepts &#8212; what&#8217;s right, and what&#8217;s wrong</title>
		<link>http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/</link>
		<comments>http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/#comments</comments>
		<pubDate>Sun, 07 Mar 2010 21:52:16 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=47</guid>
		<description><![CDATA[A few months back I gave a presentation to the NYC R Meetup. (R is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on ggplot2, a popular package for generating graphs of data and statistics. In the talk (which you can see here, including [...]]]></description>
			<content:encoded><![CDATA[<p>A few months back I gave a presentation to the <a href="http://www.meetup.com/nyhackr/">NYC R Meetup</a>. (<a href="http://www.r-project.org/">R</a> is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on <a href="http://had.co.nz/ggplot2/">ggplot2</a>, a popular package for generating graphs of data and statistics. In the talk (<a href="http://www.vcasmo.com/video/drewconway/7017">which you can see here</a>, including both my slides and my patter!) I presented both the really great things about ggplot2 and some of its downsides. In this blog post, I wanted to expand a bit on my thinking on ggplot, the Grammar of Graphics, and how people&#8217;s conceptual representations of graphs, data, ggplot, and R all interact. ggplot is both incredibly elegant and unfortunately difficult to learn to use well, I think as a consequence of the variety of representations.<span id="more-47"></span></p>
<p>The ggplot package, written by the overachieving and remarkable <a href="http://had.co.nz/">Hadley Wickham</a>, is based on <a href="http://books.google.com/books?id=_kRX4LoFfGQC&amp;dq=grammar+of+graphics&amp;printsec=frontcover&amp;source=bn&amp;hl=en&amp;ei=7kZsS8-lDI_e8Qb4hcD2BQ&amp;sa=X&amp;oi=book_result&amp;ct=result&amp;resnum=4&amp;ved=0CB8Q6AEwAw#v=onepage&amp;q=&amp;f=false">earlier more theoretical work by Leland Wilkinson</a>. Wilkinson abstracted the process of putting data onto an image, and created a Grammar of Graphics, which describes <em>how</em> the data maps to the parts of a graph, rather than describing the final graph itself. For example, here&#8217;s how to create a pie chart, clipped from Wilkinson&#8217;s book:</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-11.33.36-AM.png"><img class="aligncenter size-full wp-image-50" title="Wilkinson Pie Graph Example" src="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-11.33.36-AM.png" alt="" width="607" height="393" /></a>Don&#8217;t worry about the details, but briefly, a pie chart is just a stacked bar graph (summary.proportion) plotted in polar coordinates (polar.theta). If you took the time to learn this grammar, you would realize that the hierarchical structure of a graph on a page (elements have positions and labels and visual properties like color, each of which have their own abstract structure) maps cleanly to the hierarchical structure of the grammar, and that variables in the grammar map cleanly to the linear structure of the data. As a user of this system, you would be able to see all three key representations at once: the <span style="text-decoration: underline;">data</span>, the <span style="text-decoration: underline;">grammatical mapping</span> from data to graph, and the <span style="text-decoration: underline;">graph</span> itself.</p>
<p>Now consider ggplot, the implementation of the Grammar of Graphics in the R programming language. Does ggplot maintain three visible representations, all straightforwardly mappable to each other? Sadly, it does not. Instead, users of ggplot must map among four representations: the <span style="text-decoration: underline;">data</span> (a standard data.frame object), the <span style="text-decoration: underline;">R syntax</span> for ggplot2 (which has some quirks), an <span style="text-decoration: underline;">underlying ggplot object</span> (similar to the Grammar of Graphics, but vastly more complex and impossible to examine directly), and the generated <span style="text-decoration: underline;">graph</span>.</p>
<p>Consider the simple pie graph, below.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-1.56.41-PM.png"><img class="aligncenter size-full wp-image-53" title="Simple Pie Chart" src="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-1.56.41-PM.png" alt="" width="249" height="183" /></a>This chart is generated in ggplot2 by the following R code:</p>
<pre>&gt; zz &lt;- data.frame(cat=c("a", "b"), val=c(5,3))

&gt; zz
  cat val
1   a   5
2   b   3
&gt; pp &lt;- ggplot(zz, aes(x="", y=val, fill=cat)) + geom_bar(width=1) +
        coord_polar("y")
&gt; print(pp)</pre>
<p>The print() function is optional within an R interpreter session, but I include it because it illustrates a point that&#8217;s not initially obvious to many users. Unlike the built-in R plotting tools, the ggplot() function and its associated functions don&#8217;t plot anything on the screen; they just construct an object of type &#8220;ggplot&#8221;. Almost all of the actual work of mapping the data to stuff on your screen occurs when you render that object, using print() or ggsave().</p>
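<p>To make the build-then-render split concrete, here is a small sketch written against a recent ggplot2 release (an assumption; the post predates it). In current versions geom_bar() counts rows by default and rejects a plot that maps both x and y, so the sketch swaps in geom_col(), which plots the values as given. The output path is a hypothetical temporary file.</p>

```r
library(ggplot2)

zz <- data.frame(cat = c("a", "b"), val = c(5, 3))

# Building the plot only creates an object; nothing is drawn yet.
pp <- ggplot(zz, aes(x = "", y = val, fill = cat)) +
  geom_col(width = 1) +          # plot the y values as given, no counting
  coord_polar(theta = "y")

class(pp)                        # the class vector includes "ggplot"

# Rendering happens here (or implicitly when pp is auto-printed at the console).
print(pp)

# ggsave() renders the same object to a file instead of the screen.
out <- file.path(tempdir(), "pie.png")   # hypothetical output location
ggsave(out, plot = pp, width = 3, height = 3)
```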
<p>So what does that object look like? If you type str(pp), you&#8217;ll get an answer, but it&#8217;s about a hundred lines of undecipherable hierarchical object and list structure, not intended to be examined by mere mortals. But there&#8217;s something critically important about that structure &#8212; like the original Grammar of Graphics, and unlike the R syntax above, it&#8217;s hierarchically structured.</p>
<p>In the R syntax, you create a base ggplot structure with the ggplot() call, then you abuse the &#8220;+&#8221; operator to make changes to that structure. The geom_bar() function adds a layer to the ggplot() object, where a layer is just what it sounds like, a set of information about one of potentially many overlaid layers of content that will be put on the graph. So you construct a ggplot object by first initializing everything about the basic plot, then tack on layers with +, right? Actually no, because the coord_polar() call doesn&#8217;t create or modify a layer at all, it modifies the base object! Even if you&#8217;ve acquired the nonobvious intuition that ggplot objects are hierarchical and are created by concatenating layers, you now have to break the analogy again to fully understand what + is doing!</p>
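<p>The two different meanings of + can be observed by peeking at the object after each step. This sketch assumes a recent ggplot2 release and relies on the internal fields layers and coordinates, which are not a documented interface and may be named or accessed differently in other versions. It also uses geom_col() rather than geom_bar(), as current releases require when y values are supplied.</p>

```r
library(ggplot2)

zz <- data.frame(cat = c("a", "b"), val = c(5, 3))

base       <- ggplot(zz, aes(x = "", y = val, fill = cat))
with_bar   <- base + geom_col(width = 1)
with_polar <- with_bar + coord_polar(theta = "y")

# Adding a geom appends to the plot's list of layers...
length(base$layers)
length(with_bar$layers)

# ...whereas adding a coordinate system leaves the layers untouched and
# replaces a plot-level component instead.
length(with_polar$layers)
inherits(with_bar$coordinates, "CoordPolar")
inherits(with_polar$coordinates, "CoordPolar")
```

<p>The layer count goes from zero to one when the geom is added and stays at one after coord_polar(), while the coordinates component is what changes in the last step.</p>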
<p>There is a way to partially see the structure directly, but it&#8217;s not well thought-out from the point of view of someone trying to learn how to use the package. The summary() method on ggplot objects tells you about things you didn&#8217;t specify (faceting?), it&#8217;s incomplete, and it doesn&#8217;t map well to the R syntax. If something in your plot isn&#8217;t working the way you want it to, summary() won&#8217;t help you.</p>
<pre>&gt; summary(pp)
mapping:  x = , y = val, fill = cat
faceting: facet_grid(. ~ ., FALSE)
-----------------------------------
geom_bar:
stat_bin: width = 1
position_stack: (width = NULL, height = NULL)</pre>
<p>Another shortcut that leads to conceptual problems for ggplot beginners is the use of qplot(). The qplot() function is a wrapper around ggplot(). Unlike ggplot(), you can give qplot() data that is not in the form of a data.frame, and the syntax is somewhat different. There&#8217;s nothing wrong with some syntactic sugar to make life easier, but in this case, learning ggplot by starting with qplot is like trying to learn a foreign language by starting with contractions and slang. You may be able to say a few essential things on your vacation, but you won&#8217;t be able to creatively construct new sentences as new situations arise. The brilliance of the Grammar of Graphics is exactly that it&#8217;s a grammar &#8212; you can construct new graphs and new types of graphs as new situations arise! But tutorials that start with qplot, with <a href="http://had.co.nz/ggplot2/book/" target="_blank">the ggplot book</a> an unfortunate (but in other ways excellent) example, send their learners down a linguistic garden path. To fully use the power of the system requires unlearning the conceptual structures that map the slang to charts on a screen, and starting over with learning the new, more powerful ggplot() grammar and hierarchical representations.</p>
<p>I&#8217;d like to conclude this overlong rant with two notes. First, just today <a href="http://pleasescoopme.com/2010/03/07/jjplot-yet-another-plotting-library-for-r/" target="_blank">a new graphics package for R was introduced</a>. <a href="http://code.google.com/p/jjplot/" target="_blank">jjplot</a> uses many of the ideas of the Grammar of Graphics and ggplot2, but seems to avoid at least a few of the conceptual problems. The + operator is not overloaded in conceptually confusing ways, and there is no distracting qplot function to mislead new users. Additionally, a quick look at the source code finds it much, much simpler than ggplot2&#8217;s source, which will likely lead to a more active base of contributors. I look forward to trying jjplot and watching its continuing development, and hope the authors learn from both the remarkable successes and frustrating failures of ggplot. Second, I use ggplot extensively in my work. It&#8217;s simply the best available tool for quickly generating elegant graphs of data in R, especially if that generation needs to happen automatically in code. Hadley Wickham deserves extensive praise for the amount of effort he has put into developing and popularizing the Grammar of Graphics. If you want to be maximally effective when visualizing data in R, take the time to learn ggplot2, but do so while keeping in mind that the learning process will be easiest if you skip qplot and other shortcuts, think hierarchically, and prepare for some frustration. Fortunately, the support communities on the <a href="http://groups.google.com/group/ggplot2" target="_blank">ggplot mailing list</a> and <a href="http://stackoverflow.com/questions/tagged/ggplot2" target="_blank">Stack Overflow</a> are extremely helpful, as is Hadley himself.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

