<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Somethink to Chew On &#187; programming</title>
	<atom:link href="http://www.harlan.harris.name/tag/programming/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.harlan.harris.name</link>
	<description>the blog of Harlan Harris</description>
	<lastBuildDate>Sun, 06 Nov 2011 20:57:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>making meat shares more efficient with R and Symphony</title>
		<link>http://www.harlan.harris.name/2011/05/optimizing-meat-shares-details/</link>
		<comments>http://www.harlan.harris.name/2011/05/optimizing-meat-shares-details/#comments</comments>
		<pubDate>Mon, 09 May 2011 18:07:42 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[csa]]></category>
		<category><![CDATA[operations research]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=110</guid>
		<description><![CDATA[In my previous post, I motivated a web application that would allow small-scale sustainable meat producers to sell directly to consumers using a meat share approach, using constrained optimization techniques to maximize utility for everyone involved. In this post, I&#8217;ll walk through some R code that I wrote to demonstrate the technique on a small [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://www.harlan.harris.name/2011/05/optimizing-meat-shares" target="_blank">previous post</a>, I motivated a web application that would allow small-scale sustainable meat producers to sell directly to consumers using a meat share approach, using constrained optimization techniques to maximize utility for everyone involved. In this post, I&#8217;ll walk through some R code that I wrote to demonstrate the technique on a small scale.</p>
<p>Although the problem is set up in R, the actual mathematical optimization is done by <a href="http://www.coin-or.org/SYMPHONY/" target="_blank">Symphony</a>, an open-source mixed-integer solver that&#8217;s part of the <a href="http://www.coin-or.org/" target="_blank">COIN-OR project</a>. (The problem of optimizing assignments, in this case of cuts of meat to people, is an integer programming problem, because the solution involves assigning either 0 or 1 of each cut to each person. More generally, linear programming and related optimization frameworks allow solving for real-numbered variables.) The Rsymphony package allows problems set up in R to be solved by the C/C++ Symphony code with little hassle.</p>
<p>My code is in a public github repository called <a href="https://github.com/HarlanH/groupmeat-demo/" target="_blank">groupmeat-demo</a>, and the demo code discussed here is in the <a href="https://github.com/HarlanH/groupmeat-demo/blob/master/subset_test.R" target="_blank">subset_test.R</a> file. (The other stuff in the repo is an unfinished version of a larger-scale demo with slightly more realistic data.)</p>
<p>For this toy problem, we want to optimally assign 6 items to 3 people, each of whom has a different utility (value) for each item. In this case, I&#8217;m ignoring any fixed utility, such as cost in dollars, but that could be added into the formulation. Additionally, assume that items #1 and #2 cannot both be assigned, as with pork loin and pork chops.</p>
<p>This sort of problem is fairly simple to define mathematically. To set up the problem in code, I&#8217;ll need to create some matrices that are used in the computation. Briefly, the goal is to maximize an objective expression, <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bc%7D%5ET%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{c}^T\mathbf{x}' title='\mathbf{c}^T\mathbf{x}' class='latex' />, where the elements of <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' /> are variables that will be 0 or 1, indicating an assignment or non-assignment, and <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bc%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{c}' title='\mathbf{c}' class='latex' /> is a coefficient vector representing the utilities of assigning each item to each person. Here, there are 6 items for 3 people, so I&#8217;ll have a 6&#215;3 matrix, flattened to an 18-vector. The goal will be to find 0s and 1s for <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' /> that maximize the whole expression.</p>
<p>Here&#8217;s what the <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bc%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{c}' title='\mathbf{c}' class='latex' /> matrix looks like:</p>
<pre>      pers1 pers2  pers3
item1 0.467 0.221 0.2151
item2 0.030 0.252 0.4979
item3 0.019 0.033 0.0304
item4 0.043 0.348 0.0158
item5 0.414 0.050 0.0096
item6 0.029 0.095 0.2311</pre>
<p>It appears as if everyone likes item1, but only person1 likes item5.</p>
<p>Additionally, I need to define some constraints. For starters, it makes no sense to assign an item to more than one person. So, for each row of that matrix, the sum of the variables (not the utilities) must be 1, or maybe 0 (if that item is not assigned). I&#8217;ll create a constraint matrix, where each row contains 18 columns, and the pattern of 0s and 1s picks out one row of the assignment matrix. Since there are 6 items, there are 6 rows (for now). The variables picked out by each row must sum to at most one (I&#8217;ll tell the solver to use integers only later), so I also define vectors of inequality symbols and right-hand-sides.</p>
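<p>As a minimal sketch of the flattening step, assuming the utilities are held in a plain R matrix (the names <code>utility</code> and <code>flat.index</code> are illustrative and not taken from the repository), the 6&#215;3 matrix becomes an 18-vector like this:</p>
<pre># Utilities copied from the table above (rows = items, columns = people).
utility &lt;- matrix(c(0.467,  0.030,  0.019,  0.043,  0.414,  0.029,
                    0.221,  0.252,  0.033,  0.348,  0.050,  0.095,
                    0.2151, 0.4979, 0.0304, 0.0158, 0.0096, 0.2311),
                  nrow = 6, ncol = 3,
                  dimnames = list(paste0("item", 1:6), paste0("pers", 1:3)))

# R stores matrices column by column, so flattening yields person 1's six
# utilities, then person 2's, then person 3's.
obj &lt;- as.double(utility)

# Hypothetical helper: position of (item i, person p) in the flattened vector.
flat.index &lt;- function(i, p, num.items = nrow(utility)) (p - 1) * num.items + i

obj[flat.index(5, 1)]  # utility of item5 for pers1
obj[flat.index(2, 3)]  # utility of item2 for pers3</pre>
<p>The column-major order matters later, because every constraint row has to use the same variable ordering as this vector.</p>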

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code7'); return false;">View Code</a> RSLANG</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p1107"><td class="code" id="p110code7"><pre class="rslang" style="font-family:monospace;"># for each item/row, enforce that the sum of indicators for its assignment are &lt;= 1
mat &lt;- laply(1:num.items, function(ii) { x &lt;- mat.0; x[ii, ] &lt;- 1; as.double(x) })
dir &lt;- rep('&lt;=', num.items)
rhs &lt;- rep(1, num.items)</pre></td></tr></table></div>
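<p>The snippet above relies on <code>laply</code> from the plyr package and on a <code>mat.0</code> object that is not shown here. A rough base-R equivalent, assuming <code>mat.0</code> is simply a 6&#215;3 matrix of zeros, might look like this:</p>
<pre>num.items &lt;- 6
num.pers  &lt;- 3
mat.0 &lt;- matrix(0, nrow = num.items, ncol = num.pers)  # assumed definition

# One constraint row per item: 1s mark that item's slot for every person.
mat &lt;- t(sapply(1:num.items, function(ii) {
  x &lt;- mat.0
  x[ii, ] &lt;- 1
  as.double(x)   # flatten column by column, matching the objective vector
}))
dir &lt;- rep("&lt;=", num.items)
rhs &lt;- rep(1, num.items)

dim(mat)               # 6 rows, 18 columns
which(mat[1, ] == 1)   # columns 1, 7, 13: item 1 for each of the three people</pre>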

<p>To add the loin/chops constraint, I need to add another row, specifying that the sum of the indicators for <em>both </em>rows now must be 1 or less as well.</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code8'); return false;">View Code</a> RSLANG</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p1108"><td class="code" id="p110code8"><pre class="rslang" style="font-family:monospace;"># for rows 1 and 2, enforce that the sum of indicators for their assignments are &lt;= 1
mat &lt;- rbind(mat, matrix(matrix(c(1, 1, rep(0, num.items-2)), nrow=num.items, ncol=num.pers), nrow=1))
dir &lt;- c(dir, '&lt;=')
rhs &lt;- c(rhs, 1)</pre></td></tr></table></div>
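<p>The nested <code>matrix()</code> calls above lean on R's recycling rules, which can be hard to read at a glance. Here is the same row built step by step, as a sketch with illustrative variable names:</p>
<pre>num.items &lt;- 6
num.pers  &lt;- 3

# Indicator over items: 1 for the two conflicting cuts (items 1 and 2).
conflict &lt;- c(1, 1, rep(0, num.items - 2))

# matrix() recycles the 6-vector to fill all three person columns ...
conflict.mat &lt;- matrix(conflict, nrow = num.items, ncol = num.pers)

# ... and flattening column by column gives the 18-element constraint row.
conflict.row &lt;- as.double(conflict.mat)
which(conflict.row == 1)  # columns 1, 2, 7, 8, 13, 14</pre>
<p>Requiring this row to sum to at most 1 means that at most one of items 1 and 2 is handed out, to at most one person.</p>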

<p>Here&#8217;s what those matrices and vectors look like:</p>
<pre>
> mat
     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[1,] 1 0 0 0 0 0 1 0 0  0  0  0  1  0  0  0  0  0
[2,] 0 1 0 0 0 0 0 1 0  0  0  0  0  1  0  0  0  0
[3,] 0 0 1 0 0 0 0 0 1  0  0  0  0  0  1  0  0  0
[4,] 0 0 0 1 0 0 0 0 0  1  0  0  0  0  0  1  0  0
[5,] 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0  0  1  0
[6,] 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0  0  1
[7,] 1 1 0 0 0 0 1 1 0  0  0  0  1  1  0  0  0  0
> dir
[1] "<=" "<=" "<=" "<=" "<=" "<=" "<="
> rhs
[1] 1 1 1 1 1 1 1
</pre>
<p>Finally, specify that the variables must be binary (0 or 1), and call SYMPHONY to solve the problem:</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code9'); return false;">View Code</a> RSLANG</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p1109"><td class="code" id="p110code9"><pre class="rslang" style="font-family:monospace;"># this is an IP problem, for now
types &lt;- rep('B', num.items * num.pers)
max &lt;- TRUE # maximizing utility
&nbsp;
soln &lt;- Rsymphony_solve_LP(obj, mat, dir, rhs, types=types, max=max)</pre></td></tr></table></div>

<p>And, with a bit of post-processing to recover matrices from vectors, here&#8217;s the result:</p>
<pre>
$solution
 [1] 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1

$objval
[1] 1.52

$status
TM_OPTIMAL_SOLUTION_FOUND
                        0 

Person #1 got Items 5 worth 0.41
Person #2 got Items 3, 4 worth 0.38
Person #3 got Items 2, 6 worth 0.73</pre>
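<p>The post-processing is not shown in this post. One way it could be done, as a sketch that hard-codes the utilities and the <code>$solution</code> vector printed above (the variable names are illustrative), is:</p>
<pre>utility &lt;- matrix(c(0.467,  0.030,  0.019,  0.043,  0.414,  0.029,
                    0.221,  0.252,  0.033,  0.348,  0.050,  0.095,
                    0.2151, 0.4979, 0.0304, 0.0158, 0.0096, 0.2311),
                  nrow = 6, ncol = 3)
solution &lt;- c(0, 0, 0, 0, 1, 0,  0, 0, 1, 1, 0, 0,  0, 1, 0, 0, 0, 1)

# Undo the flattening: one column of 0/1 indicators per person.
assigned &lt;- matrix(solution, nrow = nrow(utility), ncol = ncol(utility))

for (p in seq_len(ncol(assigned))) {
  items &lt;- which(assigned[, p] == 1)
  cat(sprintf("Person #%d got Items %s worth %.2f\n",
              p, paste(items, collapse = ", "), sum(utility[items, p])))
}

total &lt;- sum(utility * assigned)  # should agree with $objval after rounding</pre>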
<p>So that&#8217;s great. It found an optimal solution worth more than 50% more than the expected value of a random assignment. But there&#8217;s a problem. There&#8217;s no guarantee that everyone gets something, and in this case, person #3 gets almost twice as much utility as person #2. Unfair! We need to enforce an additional constraint, that the difference between the maximum utility that any one person gets and the minimum utility that any one person gets is not too high. This is sometimes called a parity constraint. Adding parity constraints is a little tricky, but the basic idea here is to add two more variables to the 18 I&#8217;ve already defined. These variables are positive real numbers, and they are forced by constraints to be the maximum and minimum total utilities per person. In the objective function, then, they are weighted so that their difference is not too big. So, that expression becomes: <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bc%7D%5ET%5Cmathbf%7Bx%7D%20-%20%5Clambda%20x_%7B19%7D%20%2B%20%5Clambda%20x_%7B20%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{c}^T\mathbf{x} - \lambda x_{19} + \lambda x_{20}' title='\mathbf{c}^T\mathbf{x} - \lambda x_{19} + \lambda x_{20}' class='latex' />. The first variable (the maximum utility of any person) is minimized, while the second variable is maximized. The <img src='http://s.wordpress.com/latex.php?latex=%5Clambda&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\lambda' title='\lambda' class='latex' /> free parameter defines how much to trade off parity with total utility, and I&#8217;ll set it to 1 for now.</p>
<p>For the existing rows of the constraint matrix, these new variables get 0s. But two more rows need to be added, per person, so that the upper variable is no smaller than any person&#8217;s assigned utility and the lower variable is no larger. Combined with the objective weights, this forces them to equal the maximum and minimum of the per-person utilities.</p>
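<p>In code, extending the objective amounts to appending two coefficients and two variable types. This is a sketch, with the utilities hard-coded from the table above and variables 19 and 20 taken to be the upper and lower bounds in that order:</p>
<pre>utility &lt;- matrix(c(0.467,  0.030,  0.019,  0.043,  0.414,  0.029,
                    0.221,  0.252,  0.033,  0.348,  0.050,  0.095,
                    0.2151, 0.4979, 0.0304, 0.0158, 0.0096, 0.2311),
                  nrow = 6, ncol = 3)
lambda &lt;- 1

# 18 binary assignment variables, then x19 (upper bound) and x20 (lower bound).
# Because the problem is a maximization, a weight of -lambda pushes x19 down
# and a weight of +lambda pushes x20 up.
obj   &lt;- c(as.double(utility), -lambda, lambda)

# "B" = binary, "C" = continuous, as accepted by Rsymphony_solve_LP.
types &lt;- c(rep("B", length(utility)), "C", "C")</pre>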

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code10'); return false;">View Code</a> RSLANG</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p11010"><td class="code" id="p110code10"><pre class="rslang" style="font-family:monospace;"># now for those upper and lower variables
# \forall p, \sum_i u_i x_{i,p} - d.upper \le 0
# \forall p, \sum_i u_i x_{i,p} - d.lower \ge 0
# so, two more rows per person
d.constraint &lt;- function(iperson, ul) { # ul = 1 for upper, 0 for lower
  x &lt;- mat.utility.0
  x[, iperson ] &lt;- 1
  x &lt;- x * obj.utility
  c(as.double(x), (if (ul) c(-1,0) else c(0,-1)))
}
mat &lt;- rbind(mat, maply(expand.grid(iperson=1:num.pers, ul=c(1,0)), d.constraint, .expand=FALSE))
dir &lt;- c(dir, c(rep('&lt;=', num.pers), rep('&gt;=', num.pers)))
rhs &lt;- c(rhs, rep(0, num.pers*2))</pre></td></tr></table></div>
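<p>The helper objects <code>mat.utility.0</code> and <code>obj.utility</code> are defined elsewhere in the script. As a rough base-R sketch of the same six rows, assuming they are a zero matrix and the utility matrix respectively (the names below are illustrative):</p>
<pre>utility &lt;- matrix(c(0.467,  0.030,  0.019,  0.043,  0.414,  0.029,
                    0.221,  0.252,  0.033,  0.348,  0.050,  0.095,
                    0.2151, 0.4979, 0.0304, 0.0158, 0.0096, 0.2311),
                  nrow = 6, ncol = 3)
num.items &lt;- nrow(utility)
num.pers  &lt;- ncol(utility)

# Row for one person: that person's utilities in their own six columns,
# zeros elsewhere, then -1 on x19 (upper) or on x20 (lower).
parity.row &lt;- function(iperson, upper) {
  x &lt;- matrix(0, num.items, num.pers)
  x[, iperson] &lt;- utility[, iperson]
  c(as.double(x), if (upper) c(-1, 0) else c(0, -1))
}

# Three "upper" rows followed by three "lower" rows.
parity &lt;- rbind(t(sapply(1:num.pers, parity.row, upper = TRUE)),
                t(sapply(1:num.pers, parity.row, upper = FALSE)))
parity.dir &lt;- c(rep("&lt;=", num.pers), rep("&gt;=", num.pers))
parity.rhs &lt;- rep(0, 2 * num.pers)

dim(parity)  # 6 rows, 20 columns</pre>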

<p>The constraint inequalities then become as follows:</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code11'); return false;">View Code</a> TEXT</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p11011"><td class="code" id="p110code11"><pre class="text" style="font-family:monospace;">&gt; print(mat, digits=2)
     1     2     3     4    5     6    7    8     9   10   11    12   13  14    15    16     17   18  19 20
  1.00 0.000 0.000 0.000 0.00 0.000 1.00 0.00 0.000 0.00 0.00 0.000 1.00 0.0 0.000 0.000 0.0000 0.00  0  0
  0.00 1.000 0.000 0.000 0.00 0.000 0.00 1.00 0.000 0.00 0.00 0.000 0.00 1.0 0.000 0.000 0.0000 0.00  0  0
  0.00 0.000 1.000 0.000 0.00 0.000 0.00 0.00 1.000 0.00 0.00 0.000 0.00 0.0 1.000 0.000 0.0000 0.00  0  0
  0.00 0.000 0.000 1.000 0.00 0.000 0.00 0.00 0.000 1.00 0.00 0.000 0.00 0.0 0.000 1.000 0.0000 0.00  0  0
  0.00 0.000 0.000 0.000 1.00 0.000 0.00 0.00 0.000 0.00 1.00 0.000 0.00 0.0 0.000 0.000 1.0000 0.00  0  0
  0.00 0.000 0.000 0.000 0.00 1.000 0.00 0.00 0.000 0.00 0.00 1.000 0.00 0.0 0.000 0.000 0.0000 1.00  0  0
  1.00 1.000 0.000 0.000 0.00 0.000 1.00 1.00 0.000 0.00 0.00 0.000 1.00 1.0 0.000 0.000 0.0000 0.00  0  0
  0.47 0.030 0.019 0.043 0.41 0.029 0.00 0.00 0.000 0.00 0.00 0.000 0.00 0.0 0.000 0.000 0.0000 0.00 -1  0
  0.00 0.000 0.000 0.000 0.00 0.000 0.22 0.25 0.033 0.35 0.05 0.095 0.00 0.0 0.000 0.000 0.0000 0.00 -1  0
  0.00 0.000 0.000 0.000 0.00 0.000 0.00 0.00 0.000 0.00 0.00 0.000 0.22 0.5 0.030 0.016 0.0096 0.23 -1  0
  0.47 0.030 0.019 0.043 0.41 0.029 0.00 0.00 0.000 0.00 0.00 0.000 0.00 0.0 0.000 0.000 0.0000 0.00  0 -1
  0.00 0.000 0.000 0.000 0.00 0.000 0.22 0.25 0.033 0.35 0.05 0.095 0.00 0.0 0.000 0.000 0.0000 0.00  0 -1
  0.00 0.000 0.000 0.000 0.00 0.000 0.00 0.00 0.000 0.00 0.00 0.000 0.22 0.5 0.030 0.016 0.0096 0.23  0 -1
&gt; dir
 [1] &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&gt;=&quot; &quot;&gt;=&quot; &quot;&gt;=&quot;
&gt; rhs
 [1] 1 1 1 1 1 1 1 0 0 0 0 0 0</pre></td></tr></table></div>

<p>Looking at just the last row, this constraint says that the sum of the utilities of any assigned items for person #3, minus the lower limit, must be at least 0. Because the same constraint holds for all three people, the lower limit can be no larger than the smallest per-person total, and the objective pushes it up to exactly that value. Similar logic applies for the upper limit.</p>
<p>Running the solver with this set of inputs gives the following:</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code12'); return false;">View Code</a> TEXT</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p11012"><td class="code" id="p110code12"><pre class="text" style="font-family:monospace;">$solution
 [1] 0.000 0.000 1.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 0.000 0.000
[18] 0.000 0.498 0.433
&nbsp;
$objval
[1] 1.31
&nbsp;
$status
TM_OPTIMAL_SOLUTION_FOUND
                        0 
&nbsp;
Person #1 got Items 3, 5 worth 0.43
Person #2 got Items 4, 6 worth 0.44
Person #3 got Items 2 worth 0.50</pre></td></tr></table></div>

<p>The last two numbers in the solution are the values of the upper and lower bounds. Note that the objective value is only 41% higher than a random assignment, but the utilities assigned to each person are much closer. Dropping the <img src='http://s.wordpress.com/latex.php?latex=%5Clambda&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\lambda' title='\lambda' class='latex' /> value to something closer to 0 causes the weights of the parity bounds to be less important, and the solution tends to be closer to the initial result.</p>
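<p>To see why the solver prefers the second assignment once parity is penalized, both solutions can be scored by hand. This sketch hard-codes the utilities and the two assignment vectors printed above; <code>score</code> is an illustrative helper, not part of the repository:</p>
<pre>utility &lt;- matrix(c(0.467,  0.030,  0.019,  0.043,  0.414,  0.029,
                    0.221,  0.252,  0.033,  0.348,  0.050,  0.095,
                    0.2151, 0.4979, 0.0304, 0.0158, 0.0096, 0.2311),
                  nrow = 6, ncol = 3)
lambda &lt;- 1

# Per-person utility, total, and penalized objective for a 0/1 assignment.
score &lt;- function(solution) {
  assigned   &lt;- matrix(solution, nrow = nrow(utility))
  per.person &lt;- colSums(utility * assigned)
  list(per.person = per.person,
       total      = sum(per.person),
       objective  = sum(per.person) -
                    lambda * (max(per.person) - min(per.person)))
}

first  &lt;- score(c(0, 0, 0, 0, 1, 0,  0, 0, 1, 1, 0, 0,  0, 1, 0, 0, 0, 1))  # no parity
second &lt;- score(c(0, 0, 1, 0, 1, 0,  0, 0, 0, 1, 0, 1,  0, 1, 0, 0, 0, 0))  # with parity

round(c(first$objective, second$objective), 2)</pre>
<p>With the weight set to 1, the first assignment scores about 1.18 and the second about 1.31, which agrees with the reported objective value. The second gives up some total utility (about 1.37 versus 1.52) in exchange for a much smaller spread.</p>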
<p>Scaling this up to include constraints in pricing, farm preferences, price vs. preference meta-preferences, etc., is not conceptually difficult, but would just entail careful programming. It is left as an exercise for the well-motivated reader!</p>
<p>If you&#8217;ve made it this far, I&#8217;d definitely appreciate any feedback about this idea, corrections to my formulation or code or terminology, etc!</p>
<p>(Thanks to Paul Ruben and others on <a href="http://www.or-exchange.com/" target="_blank">OR-Exchange</a>, who helped me <a href="http://www.or-exchange.com/questions/2750/assignment-problem-maximizing-utility-equitably" target="_blank">figure out how to think about the parity problem</a>, and to the authors of <a href="http://wordpress.org/extend/plugins/wp-codebox/" target="_blank">WP-codebox</a> and <a href="http://wordpress.org/extend/plugins/wp-latex/" target="_blank">WP LaTeX</a> for giving me tools to put nice scrollable R code and math in this post!)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2011/05/optimizing-meat-shares-details/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>ggplot and concepts &#8212; what&#8217;s right, and what&#8217;s wrong</title>
		<link>http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/</link>
		<comments>http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/#comments</comments>
		<pubDate>Sun, 07 Mar 2010 21:52:16 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=47</guid>
		<description><![CDATA[A few months back I gave a presentation to the NYC R Meetup. (R is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on ggplot2, a popular package for generating graphs of data and statistics. In the talk (which you can see here, including [...]]]></description>
			<content:encoded><![CDATA[<p>A few months back I gave a presentation to the <a href="http://www.meetup.com/nyhackr/">NYC R Meetup</a>. (<a href="http://www.r-project.org/">R</a> is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on <a href="http://had.co.nz/ggplot2/">ggplot2</a>, a popular package for generating graphs of data and statistics. In the talk (<a href="http://www.vcasmo.com/video/drewconway/7017">which you can see here</a>, including both my slides and my patter!) I presented both the really great things about ggplot2 and some of its downsides. In this blog post, I wanted to expand a bit on my thinking on ggplot, the Grammar of Graphics, and how peoples&#8217; conceptual representations of graphs, data, ggplot, and R all interact. ggplot is both incredibly elegant and unfortunately difficult to learn to use well, I think as a consequence of the variety of representations.<span id="more-47"></span></p>
<p>The ggplot package, written by the overachieving and remarkable <a href="http://had.co.nz/">Hadley Wickham</a>, is based on <a href="http://books.google.com/books?id=_kRX4LoFfGQC&amp;dq=grammar+of+graphics&amp;printsec=frontcover&amp;source=bn&amp;hl=en&amp;ei=7kZsS8-lDI_e8Qb4hcD2BQ&amp;sa=X&amp;oi=book_result&amp;ct=result&amp;resnum=4&amp;ved=0CB8Q6AEwAw#v=onepage&amp;q=&amp;f=false">earlier more theoretical work by Leland Wilkinson</a>. Wilkinson abstracted the process of putting data onto an image, and created a Grammar of Graphics, which describes <em>how</em> the data maps to the parts of a graph, rather than describing the final graph itself. For example, here&#8217;s how to create a pie chart, clipped from Wilkinson&#8217;s book:</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-11.33.36-AM.png"><img class="aligncenter size-full wp-image-50" title="Wilkinson Pie Graph Example" src="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-11.33.36-AM.png" alt="" width="607" height="393" /></a>Don&#8217;t worry about the details, but briefly, a pie chart is just a stacked bar graph (summary.proportion) plotted in polar coordinates (polar.theta). If you took the time to learn this grammar, you would realize that the hierarchical structure of a graph on a page (elements have positions and labels and visual properties like color, each of which have their own abstract structure) maps cleanly to the hierarchical structure of the grammar, and that variables in the grammar map cleanly to the linear structure of the data. As a user of this system, you would be able to see all three key representations at once: the <span style="text-decoration: underline;">data</span>, the <span style="text-decoration: underline;">grammatical mapping</span> from data to graph, and the <span style="text-decoration: underline;">graph</span> itself.</p>
<p>Now consider ggplot, the implementation of the Grammar of Graphics in the R programming language. Does ggplot maintain three visible representations, all straightforwardly mappable to each other? Sadly, it does not. Instead, users of ggplot must map among four representations: the <span style="text-decoration: underline;">data</span> (a standard data.frame object), the <span style="text-decoration: underline;">R syntax</span> for ggplot2 (which has some quirks), an <span style="text-decoration: underline;">underlying ggplot object</span> (similar to the Grammar of Graphics, but vastly more complex and impossible to examine directly), and the generated <span style="text-decoration: underline;">graph</span>.</p>
<p>Consider the simple pie graph, below.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-1.56.41-PM.png"><img class="aligncenter size-full wp-image-53" title="Simple Pie Chart" src="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-1.56.41-PM.png" alt="" width="249" height="183" /></a>This chart is generated in ggplot2 by the following R code:</p>
<pre>&gt; zz &lt;- data.frame(cat=c("a", "b"), val=c(5,3))

&gt; zz
 cat val
1   a   5
2   b   3
&gt; pp &lt;- ggplot(zz, aes(x="", y=val, fill=cat)) + geom_bar(width=1) +
        coord_polar("y")
&gt; print(pp)</pre>
<p>The print() function is optional within an R interpreter session, but I include it because it illustrates a point that&#8217;s not initially obvious to many users. Unlike the built-in R plotting tools, the ggplot() function and its associated functions don&#8217;t plot anything on the screen; they just construct an object of type &#8220;ggplot&#8221;. Almost all of the actual work of mapping the data to stuff on your screen occurs when you print that object, using print() or ggsave().</p>
<p>So what does that object look like? If you type str(pp), you&#8217;ll get an answer, but it&#8217;s about a hundred lines of undecipherable hierarchical object and list structure, not intended to be examined by mere mortals. But there&#8217;s something critically important about that structure &#8212; like the original Grammar of Graphics, and unlike the R syntax above, it&#8217;s hierarchically structured.</p>
<p>In the R syntax, you create a base ggplot structure with the ggplot() call, then you abuse the &#8220;+&#8221; operator to make changes to that structure. The geom_bar() function adds a layer to the ggplot() object, where a layer is just what it sounds like, a set of information about one of potentially many overlaid layers of content that will be put on the graph. So you construct a ggplot object by first initializing everything about the basic plot, then tack on layers with +, right? Actually no, because the coord_polar() call doesn&#8217;t create or modify a layer at all, it modifies the base object! Even if you&#8217;ve acquired the nonobvious intuition that ggplot objects are hierarchical and are created by concatenating layers, you now have to break the analogy again to fully understand what + is doing!</p>
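<p>The difference between the two kinds of + can be seen by poking at the object after each step. This is only a sketch: it assumes a reasonably recent ggplot2 (it uses <code>geom_col()</code>, which is not in the version this post was written against), and the <code>layers</code> and <code>coordinates</code> fields are internals that may change between releases.</p>
<pre>library(ggplot2)

zz &lt;- data.frame(cat = c("a", "b"), val = c(5, 3))

p0 &lt;- ggplot(zz, aes(x = "", y = val, fill = cat))  # base object only
p1 &lt;- p0 + geom_col(width = 1)                      # + adds a layer
p2 &lt;- p1 + coord_polar("y")                         # + changes the base object

length(p0$layers)  # no layers yet
length(p1$layers)  # one layer after adding the geom
length(p2$layers)  # still one layer: the coord call added none

class(p1$coordinates)[1]  # default Cartesian coordinate system
class(p2$coordinates)[1]  # replaced by the polar coordinate system</pre>
<p>Nothing is drawn by any of these lines; the plot only appears when <code>p2</code> is printed.</p>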
<p>There is a way to partially see the structure directly, but it&#8217;s not well thought-out from the point of view of someone trying to learn how to use the package. The summary() method on ggplot objects tells you about things you didn&#8217;t specify (faceting?), it&#8217;s incomplete, and it doesn&#8217;t map well to the R syntax. If something in your plot isn&#8217;t working the way you want it to, summary() won&#8217;t help you.</p>
<pre>&gt; summary(pp)
mapping:  x = , y = val, fill = cat
faceting: facet_grid(. ~ ., FALSE)
-----------------------------------
geom_bar:
stat_bin: width = 1
position_stack: (width = NULL, height = NULL)</pre>
<p>Another shortcut that leads to conceptual problems for ggplot beginners is the use of qplot(). The qplot() function is a wrapper around ggplot(). Unlike ggplot(), you can give qplot() data that is not in the form of a data.frame, and the syntax is somewhat different. There&#8217;s nothing wrong with some syntactic sugar to make life easier, but in this case, learning ggplot by starting with qplot is like trying to learn a foreign language by starting with contractions and slang. You may be able to say a few essential things on your vacation, but you won&#8217;t be able to creatively construct new sentences as new situations arise. The brilliance of the Grammar of Graphics is exactly that it&#8217;s a grammar &#8212; you can construct new graphs and new types of graphs as new situations arise! But tutorials that start with qplot, with <a href="http://had.co.nz/ggplot2/book/" target="_blank">the ggplot book </a>an unfortunate (but in other ways excellent) example, send their learners down a linguistic garden path. To fully use the power of the system requires unlearning the conceptual structures that map the slang to charts on a screen, and starting over with learning the new, more powerful ggplot() grammar and hierarchical representations.</p>
<p>I&#8217;d like to conclude this overlong rant with two notes. First, just today <a href="http://pleasescoopme.com/2010/03/07/jjplot-yet-another-plotting-library-for-r/" target="_blank">a new graphics package for R was introduced</a>. <a href="http://code.google.com/p/jjplot/" target="_blank">jjplot</a> uses many of the ideas of the Grammar of Graphics and ggplot2, but seems to avoid at least a few of the conceptual problems. The + operator is not overloaded in conceptually confusing ways, and there is no distracting qplot function to mislead new users. Additionally, a quick look at the source code finds it much, much simpler than ggplot2&#8217;s source, which will likely lead to a more active base of contributors. I look forward to trying jjplot and watching its continuing development, and hope the authors learn from both the remarkable successes and frustrating failures of ggplot. Second, I use ggplot extensively in my work. It&#8217;s simply the best available tool for quickly generating elegant graphs of data in R, especially if that generation needs to happen automatically in code. Hadley Wickham deserves extensive praise for the amount of effort he has put into developing and popularizing the Grammar of Graphics. If you want to be maximally effective when visualizing data in R, take the time to learn ggplot2, but do so while keeping in mind that the learning process will be easiest if you skip qplot and other shortcuts, think hierarchically, and prepare for some frustration. Fortunately, the support communities on the <a href="http://groups.google.com/group/ggplot2" target="_blank">ggplot mailing list </a>and <a href="http://stackoverflow.com/questions/tagged/ggplot2" target="_blank">Stack Overflow </a>are extremely helpful, as is Hadley himself.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

