<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Somethink to Chew On &#187; Professional</title>
	<atom:link href="http://www.harlan.harris.name/category/professional/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.harlan.harris.name</link>
	<description>the blog of Harlan Harris</description>
	<lastBuildDate>Sun, 06 Nov 2011 20:57:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Data Science, Moore&#8217;s Law, and Moneyball</title>
		<link>http://www.harlan.harris.name/2011/09/data-science-moores-law-and-moneyball/</link>
		<comments>http://www.harlan.harris.name/2011/09/data-science-moores-law-and-moneyball/#comments</comments>
		<pubDate>Wed, 28 Sep 2011 00:24:04 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[meetup]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=149</guid>
		<description><![CDATA[I&#8217;m fond of navel gazing, meta discussions, and so forth. I&#8217;ve recently written about inferring navel gazing from link data, and about the meaning of the &#8220;Analytics&#8221; buzzword. This post will be my second on that other infectious buzzword, &#8220;Data Science&#8221;. When I moved to Washington DC in July, I was struck by the fact [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m fond of navel gazing, meta discussions, and so forth. I&#8217;ve recently written about <a title="hacking .gov shortened links" href="http://www.harlan.harris.name/2011/07/hacking-gov-shortened-links/">inferring navel gazing from link data</a>, and about <a title="On “Analytics” and related fields" href="http://www.harlan.harris.name/2011/04/on-analytics-and-related-fields/">the meaning of the &#8220;Analytics&#8221; buzzword</a>. This post will be <a title="“Data Scientist” and other titles" href="http://www.harlan.harris.name/2011/02/data-scientist-and-other-titles/">my second</a> on that other infectious buzzword, &#8220;Data Science&#8221;.</p>
<p>When I moved to Washington DC in July, I was struck by the fact that there was no <a href="http://www.meetup.com" target="_blank">Meetup</a> for analytics/applied statistics/machine learning/data science. There&#8217;s a great <a href="http://www.meetup.com/DC-Tech-Meetup/" target="_blank">DC Tech Meetup</a>, a great <a href="http://www.meetup.com/bigdatadc/" target="_blank">Big Data Meetup</a>, and a great <a href="http://www.meetup.com/R-users-DC/" target="_blank">R Meetup</a>, but nothing like the <a href="http://www.meetup.com/NYC-Predictive-Analytics/" target="_blank">NYC Predictive Analytics Meetup</a>. So, I and a couple of others I talked to about this (<a title="Marck's LinkedIn page" href="http://www.linkedin.com/in/marckvaisman" target="_blank">Marck Vaisman</a>, who I first met through the <a href="http://www.meetup.com/nyhackr/" target="_blank">NYC R Meetup</a> a couple of years ago, and <a title="Matt's LinkedIn profile" href="http://www.linkedin.com/pub/matthew-bryan/26/210/2a4" target="_blank">Matt Bryan</a>, who I met just after moving to town) started a new Meetup, which we decided to call &#8220;<a href="http://www.meetup.com/Data-Science-DC/" target="_blank">Data Science DC</a>&#8221;.</p>
<p>For our second meetup, we thought we should address some aspect of our name, and so I presented a little bit about the term and the controversies around its definition and its recent dramatic upsurge in popularity. Here are the slides (note that you should be able to click through the links on the slide to the source documents):</p>
<p><iframe src="https://docs.google.com/present/embed?id=dgxc9gbd_58c2j8vnrh&amp;size=m" frameborder="0" width="555" height="451"></iframe></p>
<p>I mostly didn&#8217;t present a personal opinion about what I thought the term means, or what it should mean, but instead wanted to present a bunch of other people&#8217;s points of view to kick off an interesting discussion. And in that sense I succeeded. We had an exceedingly interesting conversation following my slides, and I think a couple of the most interesting ideas from the evening came out of that discussion.</p>
<p>Here are three theses I&#8217;d like to propose.</p>
<ol>
<li><strong>&#8220;Data Science&#8221; is defined as what &#8220;Data Scientists&#8221; do.</strong> <em>What</em> Data Scientists do has been <a href="http://www.dataists.com/2010/09/a-taxonomy-of-data-science/" target="_blank">very</a> <a href="http://radar.oreilly.com/2010/06/what-is-data-science.html" target="_blank">well</a> <a href="http://www.drewconway.com/zia/?p=2378" target="_blank">covered</a>, and it runs the gamut from data collection and munging, through application of statistics and machine learning and related techniques, to interpretation, communication, and visualization of the results. <em>Who</em> Data Scientists are may be the more fundamental question.</li>
<li>One reason Data Science is a big thing <em>now</em> is because <strong>advances in technology have made it easy for Data Scientists to develop wide-ranging expertise</strong>. Even 10 years ago, the idea that the same person could integrate several databases, run a multilevel regression, and generate elegant visualizations would be seen as incredibly rare.</li>
<li>The other reason Data Science is a big thing <em>now</em> is because <strong>sabermetrics demonstrated that number-crunching brings results</strong>. There&#8217;s nothing business leaders love more than a sports analogy, and the analytic revolution in professional sports immediately drew attention to the ways that numbers beat intuition.</li>
</ol>
<p>I tend to like the idea that Data Science is defined by its practitioners, that it&#8217;s a <a href="http://www.johndcook.com/blog/2011/08/18/jack-of-all-trades/" target="_blank">career path </a>rather than a category of activities. In my conversations with people, it seems that people who consider themselves Data Scientists typically have eclectic career paths, that might in some ways seem <a href="http://www.ribbonfarm.com/2011/08/19/the-calculus-of-grit/" target="_blank">not to make much sense</a>. A typical path might be someone who started out learning to program, then spent some time in a scientific field, then hopped around a variety of different roles, collecting a wide variety of different skills, all of which related to using analytical techniques to make sense of data.</p>
<p>This sort of career path isn&#8217;t particularly new, but what is new is that it&#8217;s now possible to get started relatively quickly and cheaply in all of the processes involved in Data Science. (Thanks to Taylor Horton for suggesting this at the Meetup!) Fast computers, open source tools, and some programming skills allow someone to try a new data management approach or a new machine learning technology incredibly quickly, and to iterate on approaches until a solution to a particular problem is found. This has two consequences. First of all, the productivity of a modern Data Scientist is remarkable. Projects that a few decades ago would have taken teams of people literally years can now be done in a few days. Second of all, this amazing productivity allows people to spend their <a title="PDF" href="http://www.coachingmanagement.nl/The%20Making%20of%20an%20Expert.pdf" target="_blank">10,000 hours developing expertise</a> in the now vertically integrated process of Data Science, rather than having to spend all of that time focusing on developing skills on just a single aspect of the task. There is a huge number of things that need to be learned to be an effective Data Scientist, but it is now possible to learn those skills quickly enough to make a career out of being a Jack of all Trades and a near-master at many of them.</p>
<p>So now there&#8217;s a supply of people who could be Data Scientists. But what about the type of demand that drives an <a href="http://blogs.oreilly.com/cgi-bin/mt/mt-search.cgi?blog_id=57&amp;tag=data%20science&amp;limit=20&amp;IncludeBlogs=57" target="_blank">incessant stream of O&#8217;Reilly articles</a> and <a href="http://www.nytimes.com/2009/08/06/technology/06stats.html" target="_blank">job postings</a>? Where does the demand come from? <a href="http://twitter.com/#!/justgrimes" target="_blank">Justin Grimes</a> had an intriguing idea that resonated with me &#8212; analytics in sports, which I propose as the other reason why analytics and data science have become buzzwords. Although <a href="http://en.wikipedia.org/wiki/Scientific_management" target="_blank">business has used mathematical methods for 100 years</a> (thousands if you include finance and insurance), the idea that you could hire a very small number of people to analyze data and beat gut instincts in many aspects of decision making is much newer. The idea that <a href="http://en.wikipedia.org/wiki/Moneyball" target="_blank">a statistician could turn around the Oakland A&#8217;s</a> by radically overturning longstanding recruiting practice was a powerful analogy. Even now, <a href="http://www.amazon.com/Super-Crunchers-Thinking-Numbers-Smart/dp/0553805401" target="_blank">business books about analytics almost always have sports examples in the first chapters</a>. I made a point at work the other day by noting that <a title="Freakonomics Radio on Prediction" href="http://www.wnyc.org/shows/freakonomics-radio/2011/jun/24/" target="_blank">most professional sports prognosticators predict NFL playoff outcomes wrong because they over-weight last year&#8217;s results</a>. Sports analogies get attention.</p>
<p>Does this make sense? Data Science is a buzzword now because a group of people with eclectic talents match a growing demand for and recognition of the value of those talents. I&#8217;d love feedback on these thoughts!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2011/09/data-science-moores-law-and-moneyball/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>hacking .gov shortened links</title>
		<link>http://www.harlan.harris.name/2011/07/hacking-gov-shortened-links/</link>
		<comments>http://www.harlan.harris.name/2011/07/hacking-gov-shortened-links/#comments</comments>
		<pubDate>Sat, 30 Jul 2011 21:40:25 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[dataviz]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[hackathon]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=136</guid>
		<description><![CDATA[This past Friday, the web portal to the US Federal government, USA.gov, organized hackathons across the US for programmers and data scientists to work with and analyze the data from their link-shortening service. It turns out that if you shorten a web link with bit.ly, the shortened link looks like 1.usa.gov/V6NpL (that one goes to [...]]]></description>
			<content:encoded><![CDATA[<p>This past Friday, the web portal to the US Federal government, USA.gov, <a href="http://blog.usa.gov/post/7054661537/1-usa-gov-open-data-and-hack-day">organized hackathons</a> across the US for programmers and data scientists to work with and analyze the data from their link-shortening service. It turns out that if you shorten a web link with <a href="http://bit.ly/">bit.ly</a>, the shortened link looks like <a href="http://1.usa.gov/V6NpL">1.usa.gov/V6NpL</a> (that one goes to a NASA page). And because this service was paid for by taxpayer money, the data about each clickthrough is freely available.</p>
<p>Shortened-link click-through data is interesting. It tells you the time and approximate geographic location of each click-through, and the web page or service that the link was on (assuming someone didn&#8217;t type the URL in by hand). You also know when the shortened link was created, which tells you a little bit about the way links are shared. Bit.ly themselves have several full time data scientists on staff whose job is to learn about what shortened-link data can say about web traffic patterns and link sharing, potentially very lucrative information.</p>
<p>For my part, I just wanted to do some fun visualizations. Along with friends in NYC, I joined the hackathon remotely, following along on Twitter and listening to dance music in their <a href="http://turntable.fm/">turntable.fm</a> room. I managed to get rough drafts of two somewhat non-trivial graphs done during the official hackathon, and I re-built them with larger and more random data later.</p>
<p>This first graph looks at the difference in time between when a link was created (the first time someone tried to shorten the target URL) and when the clickthroughs happened. For each of the 25 most frequently visited target domains (mostly US government agencies), I built a density plot, or smoothed histogram, of the timings.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/07/plot_link_age.png"><img src="http://www.harlan.harris.name/wp-content/uploads/2011/07/plot_link_age_5-300x300.png" alt="" title="Link Age Faceted Density Plot" width="300" height="300" class="aligncenter size-medium wp-image-138" /></a></p>
<p>(click for a larger image) There are some interesting differences. Links from senate.gov are mostly clicked through within a few hours of their creation, and links from the NY Courts are clicked through in less than an hour. There appear to be links to NOAA and the State of California pages that are frequently clicked through hundreds of days after their creation. It would be interesting to dive into the content of the target pages, categorize them, and learn what causes these differences.</p>
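<p>The link-age calculation behind that plot can be sketched in base R. This is only an illustration on simulated clicks, not the code from the hackathon repository: the column names <code>domain</code>, <code>created_at</code>, and <code>clicked_at</code> are made up for the example, and the delays are random draws rather than real 1.usa.gov data.</p>

```r
set.seed(1)

# Simulated click records. Column names and values are illustrative assumptions,
# not the field names or contents of the real click feed.
clicks <- data.frame(
  domain     = rep(c("senate.gov", "noaa.gov"), each = 200),
  created_at = as.POSIXct("2011-07-01 00:00:00", tz = "UTC"),
  stringsAsFactors = FALSE
)

# Made-up delays: a few hours for one domain, roughly a hundred days for the other.
delay_hours <- c(rlnorm(200, meanlog = log(3), sdlog = 1),
                 rlnorm(200, meanlog = log(24 * 100), sdlog = 1))
clicks$clicked_at <- clicks$created_at + delay_hours * 3600

# Age of the link at the moment of each click, in hours.
clicks$age_hours <- as.numeric(
  difftime(clicks$clicked_at, clicks$created_at, units = "hours")
)

# One smoothed histogram per domain, on a log10 scale because link ages
# range from minutes to months.
dens <- lapply(split(log10(clicks$age_hours), clicks$domain), density)

plot(dens[["noaa.gov"]], xlim = c(-2, 5), main = "Link age by domain",
     xlab = "log10(hours since link creation)")
lines(dens[["senate.gov"]], lty = 2)
legend("topleft", legend = names(dens), lty = c(1, 2))
```

<p>Working on the log scale is a design choice for this sketch: it lets a domain whose links die within hours and one whose links live for months share a single axis.</p>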
<p>Speaking of diving into the content, I did a very simple version of that next. When clicking a link to a government web page, are people looking for information about their hometown? Fortunately, clickthrough data includes geocode information for the clicker&#8217;s IP address, which includes the nearest city. I decided to find out by scraping the text content of the 100 most frequently accessed web pages and detecting whether or not each city&#8217;s name appeared in each page.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/07/plot_navelgazers1.png"><img src="http://www.harlan.harris.name/wp-content/uploads/2011/07/plot_navelgazers_51-300x300.png" alt="" title="Navelgazers" width="300" height="300" class="aligncenter size-medium wp-image-142" /></a></p>
<p>(again, click for larger image) This &#8220;navel-gazers&#8221; plot shows the summarized results. For each city in the data set with more than 5 clickthroughs, I plotted the raw number of clickthroughs from that city (the X axis) against the proportion of clickthroughs that ended up on a web page with the name of the city in it (the Y axis). Many cities are clustered in the lower-left, with few clicks and no instances of their city on the target page. Large cities like New York and London are far to the right, as expected from their population, and they show up in target web pages occasionally. Washington (DC) is both a frequent clicker of shortened links and a city that tends to show up on web pages, which is unsurprising given that it is the seat of the Federal government. The exceptions are the most interesting. People in Bangalore clicked through more than 15 times in this sample, and about 12% of their clicks were to pages with the name of their city. In Boulder, a quarter of the 12 or so clicks went to pages that mentioned their town!</p>
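<p>The two axes of that plot come down to a per-city count and a per-city proportion. Here is a small base-R sketch of that summary on invented data. The <code>city</code> and <code>page_text</code> columns and their contents are hypothetical stand-ins for the geocoded clicks and scraped pages, and plain substring matching is an assumption about how a city is detected.</p>

```r
# Invented clicks: the clicker's nearest city and the text of the page they landed on.
clicks <- data.frame(
  city = c("Boulder", "Boulder", "Boulder", "Boulder",
           "Bangalore", "Bangalore", "Washington"),
  page_text = c("NOAA laboratory in Boulder, Colorado",
                "Tax forms and instructions",
                "Weather forecast for Boulder",
                "Passport renewal",
                "Visa services",
                "Visa appointments in Bangalore",
                "Hearing schedule, Washington DC"),
  stringsAsFactors = FALSE
)

# TRUE when the city name appears verbatim in the page text.
clicks$mentions_city <- mapply(grepl, clicks$city, clicks$page_text,
                               MoreArgs = list(fixed = TRUE), USE.NAMES = FALSE)

# X axis: number of clicks per city. Y axis: share of those clicks that
# landed on a page mentioning the city.
n_clicks <- tapply(clicks$mentions_city, clicks$city, length)
prop     <- tapply(clicks$mentions_city, clicks$city, mean)

summary_by_city <- data.frame(
  city          = names(n_clicks),
  clicks        = as.integer(n_clicks),
  prop_mentions = as.numeric(prop),
  stringsAsFactors = FALSE
)
print(summary_by_city)
```

<p>On real data one would also drop cities below a minimum click count, as the plot does, since a proportion computed from one or two clicks is mostly noise.</p>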
<p>Deeper analysis would be needed to explain these results, but they were fun to put together! For those interested in checking out my work, including R code to pull a sample of 1.usa.gov data from the archives, please check out my repository on GitHub: <a href="https://github.com/HarlanH/hackathon-1usagov">https://github.com/HarlanH/hackathon-1usagov</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2011/07/hacking-gov-shortened-links/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>making meat shares more efficient with R and Symphony</title>
		<link>http://www.harlan.harris.name/2011/05/optimizing-meat-shares-details/</link>
		<comments>http://www.harlan.harris.name/2011/05/optimizing-meat-shares-details/#comments</comments>
		<pubDate>Mon, 09 May 2011 18:07:42 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[csa]]></category>
		<category><![CDATA[operations research]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=110</guid>
		<description><![CDATA[In my previous post, I motivated a web application that would allow small-scale sustainable meat producers to sell directly to consumers using a meat share approach, using constrained optimization techniques to maximize utility for everyone involved. In this post, I&#8217;ll walk through some R code that I wrote to demonstrate the technique on a small [...]]]></description>
			<content:encoded><![CDATA[<p>In my <a href="http://www.harlan.harris.name/2011/05/optimizing-meat-shares" target="_blank">previous post</a>, I motivated a web application that would allow small-scale sustainable meat producers to sell directly to consumers using a meat share approach, using constrained optimization techniques to maximize utility for everyone involved. In this post, I&#8217;ll walk through some R code that I wrote to demonstrate the technique on a small scale.</p>
<p>Although the problem is set up in R, the actual mathematical optimization is done by <a href="http://www.coin-or.org/SYMPHONY/" target="_blank">Symphony</a>, an open-source mixed-integer solver that&#8217;s part of the <a href="http://www.coin-or.org/" target="_blank">COIN-OR project</a>. (The problem of optimizing assignments, in this case of cuts of meat to people, is an integer programming problem, because the solution involves assigning either 0 or 1 of each cut to each person. More generally, linear programming and related optimization frameworks allow solving for real-numbered variables.) The Rsymphony package allows problems set up in R to be solved by the C/C++ Symphony code with little hassle.</p>
<p>My code is in a public github repository called <a href="https://github.com/HarlanH/groupmeat-demo/" target="_blank">groupmeat-demo</a>, and the demo code discussed here is in the <a href="https://github.com/HarlanH/groupmeat-demo/blob/master/subset_test.R" target="_blank">subset_test.R</a> file. (The other stuff in the repo is an unfinished version of a larger-scale demo with slightly more realistic data.)</p>
<p>For this toy problem, we want to optimally assign 6 items to 3 people, each of whom has a different utility (value) for each item. In this case, I&#8217;m ignoring any fixed utility, such as cost in dollars, but that could be added into the formulation. Additionally, assume that items #1 and #2 cannot both be assigned, as with pork loin and pork chops.</p>
<p>This sort of problem is fairly simple to define mathematically. To set up the problem in code, I&#8217;ll need to create some matrices that are used in the computation. Briefly, the goal is to maximize an objective expression, <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bc%7D%5ET%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{c}^T\mathbf{x}' title='\mathbf{c}^T\mathbf{x}' class='latex' />, where the <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' /> are variables that will be 0 or 1, indicating an assignment or non-assignment, and the <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bc%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{c}' title='\mathbf{c}' class='latex' /> is a coefficient vector representing the utilities of assigning each item to each person. Here, there are 6 items for 3 people, so I&#8217;ll have a 6&#215;3 matrix, flattened to an 18-vector. The goal will be to find 0&#8217;s and 1&#8217;s for <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bx%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{x}' title='\mathbf{x}' class='latex' /> that maximize the whole expression.</p>
<p>Here&#8217;s what the <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bc%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{c}' title='\mathbf{c}' class='latex' /> matrix looks like:</p>
<pre>      pers1 pers2  pers3
item1 0.467 0.221 0.2151
item2 0.030 0.252 0.4979
item3 0.019 0.033 0.0304
item4 0.043 0.348 0.0158
item5 0.414 0.050 0.0096
item6 0.029 0.095 0.2311</pre>
<p>It appears as if everyone likes item1, but only person1 likes item5.</p>
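<p>To make the flattening concrete, here is a short base-R sketch using the utilities from the table above. The helper <code>flat_index</code> is a hypothetical name introduced only for this illustration. It shows where each (item, person) pair lands in the 18-vector.</p>

```r
# Utilities copied from the table above (rows = items, columns = people).
util <- matrix(
  c(0.467, 0.221, 0.2151,
    0.030, 0.252, 0.4979,
    0.019, 0.033, 0.0304,
    0.043, 0.348, 0.0158,
    0.414, 0.050, 0.0096,
    0.029, 0.095, 0.2311),
  nrow = 6, byrow = TRUE,
  dimnames = list(paste0("item", 1:6), paste0("pers", 1:3))
)

# as.double() drops the dimensions and reads the matrix column by column,
# so entries 1-6 belong to person 1, 7-12 to person 2, and 13-18 to person 3.
obj <- as.double(util)

# Hypothetical helper: position of an (item, person) pair in the flattened vector.
flat_index <- function(item, person, num.items = nrow(util)) {
  item + num.items * (person - 1)
}

flat_index(4, 2)       # item 4 for person 2 sits at position 10
obj[flat_index(4, 2)]  # and its utility is 0.348
```

<p>Every constraint row has to use this same column-by-column ordering, otherwise the coefficients and the variables will not line up.</p>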
<p>Additionally, I need to define some constraints. For starters, it makes no sense to assign an item to more than one person. So, for each row of that matrix, the sum of the variables (not the utilities) must be 1, or maybe 0 (if that item is not assigned). I&#8217;ll create a constraint matrix, where each row contains 18 columns, and the pattern of 0&#8217;s and 1&#8217;s defines a row of the assignment matrix. Since there are 6 items, there are 6 rows (for now). Each row needs to be less than or equal to one (I&#8217;ll tell the solver to use integers only later), so I also define vectors of inequality symbols and right-hand-sides.</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code7'); return false;">View Code</a> RSLANG</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p1107"><td class="code" id="p110code7"><pre class="rslang" style="font-family:monospace;"># for each item/row, enforce that the sum of indicators for its assignment are &lt;= 1
mat &lt;- laply(1:num.items, function(ii) { x &lt;- mat.0; x[ii, ] &lt;- 1; as.double(x) })
dir &lt;- rep('&lt;=', num.items)
rhs &lt;- rep(1, num.items)</pre></td></tr></table></div>

<p>To add the loin/chops constraint, I need to add another row, specifying that the sum of the indicators for <em>both </em>rows now must be 1 or less as well.</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code8'); return false;">View Code</a> RSLANG</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p1108"><td class="code" id="p110code8"><pre class="rslang" style="font-family:monospace;"># for rows 1 and 2, enforce that the sum of indicators for their assignments are &lt;= 1
mat &lt;- rbind(mat, matrix(matrix(c(1, 1, rep(0, num.items-2)), nrow=num.items, ncol=num.pers), nrow=1))
dir &lt;- c(dir, '&lt;=')
rhs &lt;- c(rhs, 1)</pre></td></tr></table></div>

<p>Here&#8217;s what those matrices and vectors look like:</p>
<pre>
> mat
     1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
[1,] 1 0 0 0 0 0 1 0 0  0  0  0  1  0  0  0  0  0
[2,] 0 1 0 0 0 0 0 1 0  0  0  0  0  1  0  0  0  0
[3,] 0 0 1 0 0 0 0 0 1  0  0  0  0  0  1  0  0  0
[4,] 0 0 0 1 0 0 0 0 0  1  0  0  0  0  0  1  0  0
[5,] 0 0 0 0 1 0 0 0 0  0  1  0  0  0  0  0  1  0
[6,] 0 0 0 0 0 1 0 0 0  0  0  1  0  0  0  0  0  1
[7,] 1 1 0 0 0 0 1 1 0  0  0  0  1  1  0  0  0  0
> dir
[1] "<=" "<=" "<=" "<=" "<=" "<=" "<="
> rhs
[1] 1 1 1 1 1 1 1
</pre>
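<p>For readers who want to rebuild the printed matrix without the helper objects from the repository, the following base-R sketch gives the same seven rows. It swaps <code>sapply</code> in for the plyr call used in the post and creates the blank matrix inline, so treat it as an equivalent construction rather than the original code.</p>

```r
num.items <- 6
num.pers  <- 3

# One row per item: put 1s across that item's row of a blank items-by-people
# matrix, then flatten it column by column into 18 entries.
item_row <- function(ii) {
  x <- matrix(0, nrow = num.items, ncol = num.pers)
  x[ii, ] <- 1
  as.double(x)
}
mat <- t(sapply(1:num.items, item_row))

# Loin/chops row: the indicators for items 1 and 2, over all people, sum to <= 1.
excl <- matrix(0, nrow = num.items, ncol = num.pers)
excl[1:2, ] <- 1
mat <- rbind(mat, as.double(excl))

dir <- rep("<=", nrow(mat))
rhs <- rep(1, nrow(mat))

which(mat[1, ] == 1)  # columns 1, 7, 13: item 1 for persons 1, 2, 3
which(mat[7, ] == 1)  # columns 1, 2, 7, 8, 13, 14: items 1 and 2 for everyone
```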
<p>Finally, specify that the variables must be binary (0 or 1), and call SYMPHONY to solve the problem:</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code9'); return false;">View Code</a> RSLANG</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p1109"><td class="code" id="p110code9"><pre class="rslang" style="font-family:monospace;"># this is an IP problem, for now
types &lt;- rep('B', num.items * num.pers)
max &lt;- TRUE # maximizing utility
&nbsp;
soln &lt;- Rsymphony_solve_LP(obj, mat, dir, rhs, types=types, max=max)</pre></td></tr></table></div>

<p>And, with a bit of post-processing to recover matrices from vectors, here&#8217;s the result:</p>
<pre>
$solution
 [1] 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 1

$objval
[1] 1.52

$status
TM_OPTIMAL_SOLUTION_FOUND
                        0 

Person #1 got Items 5 worth 0.41
Person #2 got Items 3, 4 worth 0.38
Person #3 got Items 2, 6 worth 0.73</pre>
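<p>The post-processing step is brief. As a sketch, assuming the solution vector printed above and the utilities from the earlier table, the 18 indicators can be folded back into a 6&#215;3 matrix and summed per person. The variable names here are illustrative and not taken from the repository.</p>

```r
# Utilities from the table earlier in the post (rows = items, columns = people).
util <- matrix(
  c(0.467, 0.221, 0.2151,
    0.030, 0.252, 0.4979,
    0.019, 0.033, 0.0304,
    0.043, 0.348, 0.0158,
    0.414, 0.050, 0.0096,
    0.029, 0.095, 0.2311),
  nrow = 6, byrow = TRUE
)

# The $solution vector reported by the solver above.
solution <- c(0, 0, 0, 0, 1, 0,  0, 0, 1, 1, 0, 0,  0, 1, 0, 0, 0, 1)

# matrix() refills column by column, undoing the flattening:
# rows are items again and columns are people.
alloc <- matrix(solution, nrow = 6)
per_person <- colSums(alloc * util)

# Sanity checks: no item goes to two people, and items 1 and 2 are not both used.
stopifnot(all(rowSums(alloc) <= 1), sum(alloc[1:2, ]) <= 1)

for (p in 1:3) {
  cat(sprintf("Person #%d got Items %s worth %.2f\n", p,
              paste(which(alloc[, p] == 1), collapse = ", "), per_person[p]))
}
sum(per_person)  # total utility, which rounds to the reported objval
```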
<p>So that&#8217;s great. It found an optimal solution worth more than 50% more than the expected value of a random assignment. But there&#8217;s a problem. There&#8217;s no guarantee that everyone gets something, and in this case, person #3 gets almost twice as much utility as person #2. Unfair! We need to enforce an additional constraint, that the difference between the maximum utility that any one person gets and the minimum utility that any one person gets is not too high. This is sometimes called a parity constraint. Adding parity constraints is a little tricky, but the basic idea here is to add two more variables to the 18 I&#8217;ve already defined. These variables are positive real numbers, and they are forced by constraints to be the maximum and minimum total utilities per person. In the objective function, then, they are weighted so that their difference is not too big. So, that expression becomes: <img src='http://s.wordpress.com/latex.php?latex=%5Cmathbf%7Bc%7D%5ET%5Cmathbf%7Bx%7D%20-%20%5Clambda%20x_%7B19%7D%20%2B%20%5Clambda%20x_%7B20%7D&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\mathbf{c}^T\mathbf{x} - \lambda x_{19} + \lambda x_{20}' title='\mathbf{c}^T\mathbf{x} - \lambda x_{19} + \lambda x_{20}' class='latex' />. The first variable (the maximum utility of any person) enters with a negative sign and so is minimized, while the second variable (the minimum utility of any person) enters with a positive sign and so is maximized. The <img src='http://s.wordpress.com/latex.php?latex=%5Clambda&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\lambda' title='\lambda' class='latex' /> free parameter defines how much to trade off parity with total utility, and I&#8217;ll set it to 1 for now.</p>
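<p>For the existing rows of the constraint matrix, these new variables get 0&#8217;s. But two more rows need to be added, per person, to force the upper variable to be no smaller than, and the lower variable no bigger than, that person&#8217;s assigned utility. Because the objective pushes the upper variable down and the lower variable up, they end up equal to the maximum and minimum of any person&#8217;s assigned utility.</p>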

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code10'); return false;">View Code</a> RSLANG</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p11010"><td class="code" id="p110code10"><pre class="rslang" style="font-family:monospace;"># now for those upper and lower variables
# \forall p, \sum_i u_i x_{i,p} - d.upper \le 0
# \forall p, \sum_i u_i x_{i,p} - d.lower \ge 0
# so, two more rows per person
d.constraint &lt;- function(iperson, ul) { # ul = 1 for upper, 0 for lower
  x &lt;- mat.utility.0
  x[, iperson ] &lt;- 1
  x &lt;- x * obj.utility
  c(as.double(x), (if (ul) c(-1,0) else c(0,-1)))
}
mat &lt;- rbind(mat, maply(expand.grid(iperson=1:num.pers, ul=c(1,0)), d.constraint, .expand=FALSE))
dir &lt;- c(dir, c(rep('&lt;=', num.pers), rep('&gt;=', num.pers)))
rhs &lt;- c(rhs, rep(0, num.pers*2))</pre></td></tr></table></div>

<p>The constraint inequalities then become as follows:</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code11'); return false;">View Code</a> TEXT</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p11011"><td class="code" id="p110code11"><pre class="text" style="font-family:monospace;">&gt; print(mat, digits=2)
     1     2     3     4    5     6    7    8     9   10   11    12   13  14    15    16     17   18  19 20
  1.00 0.000 0.000 0.000 0.00 0.000 1.00 0.00 0.000 0.00 0.00 0.000 1.00 0.0 0.000 0.000 0.0000 0.00  0  0
  0.00 1.000 0.000 0.000 0.00 0.000 0.00 1.00 0.000 0.00 0.00 0.000 0.00 1.0 0.000 0.000 0.0000 0.00  0  0
  0.00 0.000 1.000 0.000 0.00 0.000 0.00 0.00 1.000 0.00 0.00 0.000 0.00 0.0 1.000 0.000 0.0000 0.00  0  0
  0.00 0.000 0.000 1.000 0.00 0.000 0.00 0.00 0.000 1.00 0.00 0.000 0.00 0.0 0.000 1.000 0.0000 0.00  0  0
  0.00 0.000 0.000 0.000 1.00 0.000 0.00 0.00 0.000 0.00 1.00 0.000 0.00 0.0 0.000 0.000 1.0000 0.00  0  0
  0.00 0.000 0.000 0.000 0.00 1.000 0.00 0.00 0.000 0.00 0.00 1.000 0.00 0.0 0.000 0.000 0.0000 1.00  0  0
  1.00 1.000 0.000 0.000 0.00 0.000 1.00 1.00 0.000 0.00 0.00 0.000 1.00 1.0 0.000 0.000 0.0000 0.00  0  0
  0.47 0.030 0.019 0.043 0.41 0.029 0.00 0.00 0.000 0.00 0.00 0.000 0.00 0.0 0.000 0.000 0.0000 0.00 -1  0
  0.00 0.000 0.000 0.000 0.00 0.000 0.22 0.25 0.033 0.35 0.05 0.095 0.00 0.0 0.000 0.000 0.0000 0.00 -1  0
  0.00 0.000 0.000 0.000 0.00 0.000 0.00 0.00 0.000 0.00 0.00 0.000 0.22 0.5 0.030 0.016 0.0096 0.23 -1  0
  0.47 0.030 0.019 0.043 0.41 0.029 0.00 0.00 0.000 0.00 0.00 0.000 0.00 0.0 0.000 0.000 0.0000 0.00  0 -1
  0.00 0.000 0.000 0.000 0.00 0.000 0.22 0.25 0.033 0.35 0.05 0.095 0.00 0.0 0.000 0.000 0.0000 0.00  0 -1
  0.00 0.000 0.000 0.000 0.00 0.000 0.00 0.00 0.000 0.00 0.00 0.000 0.22 0.5 0.030 0.016 0.0096 0.23  0 -1
&gt; dir
 [1] &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot; &quot;&gt;=&quot; &quot;&lt;=&quot; &quot;&lt;=&quot;
&gt; rhs
 [1] 1 1 1 1 1 1 1 0 0 0 0 0 0</pre></td></tr></table></div>

<p>Looking at just the last row, this constraint says that the sum of the utilities of any assigned items for person #3, minus the lower limit, must be at least 0. That is essentially the definition of the lower limit: the same constraint must hold for all three people in this problem, so the lower limit can be no larger than the smallest utility anyone receives. Similar logic applies for the upper limit.</p>
<p>Running the solver with this set of inputs gives the following:</p>

<div class="wp_codebox_msgheader"><span class="right"><sup><a href="http://www.ericbess.com/ericblog/2008/03/03/wp-codebox/#examples" target="_blank" title="WP-CodeBox HowTo?"><span style="color: #99cc00">?</span></a></sup></span><span class="left"><a href="javascript:;" onclick="javascript:showCodeTxt('p110code12'); return false;">View Code</a> TEXT</span><div class="codebox_clear"></div></div><div class="wp_codebox"><table><tr id="p11012"><td class="code" id="p110code12"><pre class="text" style="font-family:monospace;">$solution
 [1] 0.000 0.000 1.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 0.000 1.000 0.000 1.000 0.000 0.000 0.000
[18] 0.000 0.498 0.433
&nbsp;
$objval
[1] 1.31
&nbsp;
$status
TM_OPTIMAL_SOLUTION_FOUND
                        0 
&nbsp;
Person #1 got Items 3, 5 worth 0.43
Person #2 got Items 4, 6 worth 0.44
Person #3 got Items 2 worth 0.50</pre></td></tr></table></div>

<p>The last two numbers in the solution are the values of the upper and lower bounds. Note that the objective value is only 41% higher than a random assignment, but the utilities assigned to each person are much closer. Dropping the <img src='http://s.wordpress.com/latex.php?latex=%5Clambda&#038;bg=ffffff&#038;fg=000000&#038;s=0' alt='\lambda' title='\lambda' class='latex' /> value to something closer to 0 causes the weights of the parity bounds to be less important, and the solution tends to be closer to the initial result.</p>
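<p>As a rough sanity check, the per-person utilities can be recomputed in R from the solution vector. This is only a sketch: the utility rows are copied from the rounded constraint matrix printed above, the variable names (u, sol, x) are mine, and I am assuming the first 18 solution entries are ordered person by person, as the item assignments suggest. The totals therefore differ slightly from the solver&#8217;s internal values.</p>
<pre># utilities of persons 1-3 for items 1-6, as printed (rounded) in the constraint matrix
u &lt;- rbind(c(0.47, 0.030, 0.019, 0.043, 0.41,   0.029),
           c(0.22, 0.25,  0.033, 0.35,  0.05,   0.095),
           c(0.22, 0.5,   0.030, 0.016, 0.0096, 0.23))
# the $solution vector: 18 assignment variables, then the upper and lower bounds
sol &lt;- c(0, 0, 1, 0, 1, 0,  0, 0, 0, 1, 0, 1,  0, 1, 0, 0, 0, 0,  0.498, 0.433)
x &lt;- matrix(sol[1:18], nrow = 3, byrow = TRUE)  # one row per person (assumed ordering)
per.person &lt;- rowSums(u * x)                     # utility each person receives
spread &lt;- sol[19] - sol[20]                      # upper bound minus lower bound
print(round(per.person, 3))
print(spread)</pre>
<p>The recomputed utilities sit within rounding error of the interval between the two bounds, and the spread between the bounds is small compared with any one person&#8217;s total, which is the parity effect described above.</p>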
<p>Scaling this up to include constraints in pricing, farm preferences, price vs. preference meta-preferences, etc., is not conceptually difficult, but would just entail careful programming. It is left as an exercise for the well-motivated reader!</p>
<p>If you&#8217;ve made it this far, I&#8217;d definitely appreciate any feedback about this idea, corrections to my formulation or code or terminology, etc!</p>
<p>(Thanks to Paul Ruben and others on <a href="http://www.or-exchange.com/" target="_blank">OR-Exchange</a>, who helped me <a href="http://www.or-exchange.com/questions/2750/assignment-problem-maximizing-utility-equitably" target="_blank">figure out how to think about the parity problem</a>, and to the authors of <a href="http://wordpress.org/extend/plugins/wp-codebox/" target="_blank">WP-codebox</a> and <a href="http://wordpress.org/extend/plugins/wp-latex/" target="_blank">WP LaTeX</a> for giving me tools to put nice scrollable R code and math in this post!)</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2011/05/optimizing-meat-shares-details/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>making meat shares more efficient</title>
		<link>http://www.harlan.harris.name/2011/05/meat-share-optimization/</link>
		<comments>http://www.harlan.harris.name/2011/05/meat-share-optimization/#comments</comments>
		<pubDate>Mon, 09 May 2011 18:07:04 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Personal]]></category>
		<category><![CDATA[Professional]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[csa]]></category>
		<category><![CDATA[food]]></category>
		<category><![CDATA[operations research]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=102</guid>
		<description><![CDATA[A personal interest I have is the ethical and sustainable production of food. I&#8217;ve been a member of and helped run Community Supported Agriculture groups, and my wife and I currently purchase the majority of our meat from a group of upstate NY pastured-livestock producers who sell their products through CSAs. It&#8217;s an &#224; la carte business [...]]]></description>
			<content:encoded><![CDATA[<p>A personal interest I have is the ethical and sustainable production of food. I&#8217;ve been a <a title="Prairieland CSA" href="http://www.prairielandcsa.org/" target="_blank">member of</a> and <a title="Hellgate CSA" href="http://hellgatecsa.net/" target="_blank">helped run</a> <a title="Just Food on CSAs" href="http://www.justfood.org/csa" target="_blank">Community Supported Agriculture</a> groups, and my wife and I currently purchase the majority of our meat from a <a title="Lewis Waite Farm CSA" href="http://www.csalewiswaitefarm.com/" target="_blank">group of upstate NY pastured-livestock producers</a> who sell their products through CSAs. It&#8217;s an &#224; la carte business model, where I place an order on a website, and the next week I pick up the frozen products cut and packaged as if for retail.</p>
<p>A related way to get meat has become fairly popular recently &#8212; the meat CSA or meat share. As the <a href="http://www.meatshare.com/" target="_blank">NYC Meatshare</a> group describes it, &#8220;Looking for healthy meat raised on pasture by small local farms? It&#8217;s  expensive, but by banding together to buy whole animals we can support  farmers and save money.&#8221; Members of a meatshare all pitch in to buy a whole animal, which is then butchered and split among the members. Here&#8217;s how a meatshare event described the 10th of a hog each member got: &#8220;Each person will get an equal amount of bacon and sausage (about 2 lbs  each), chops (center &amp; butt), and will divide the other cuts up as  equally as possible (including ham steak, loin, organs, etc.)  If you  have preferences please let me know, I will do my best to accommodate.   Or, you can swap with other members at my place.&#8221;</p>
<p>These two business models put a substantial burden on either the farmer (in the first case) or the consumer (in the second case). The retail model requires the farmer, or a collective of farmers, to put together a retail-ordering web site, a butchery and inventory system, and a delivery and distribution system. The meat share model takes these burdens off the farmers, but requires the consumers to set up and organize the purchase and payment system, meet at a common location, and either take what is available or perform ad hoc swaps. In a more traditional producer-consumer relationship, the supply chain, payment, inventory, and preference-matching processes are taken care of by the commodification of the animals (all cows are the same) and the services provided by a retail grocery store.</p>
<p>One could argue that that&#8217;s the third option &#8212; Whole Foods &#8212; but it sorta defeats the purpose of non-commodified, high-quality meat, and it tends to defeat the pocketbook too. No connection with the farm, just a promise of ethical standards (probably including the pointless &#8220;organic&#8221; label), and a substantial cut by middlemen. Not really an option at all.</p>
<p>So what else could be done to build sustainable relationships between animal producers and people who value high-quality, ethically produced meat? Why not leverage technology? <a href="http://www.harlan.harris.name/wp-content/uploads/2011/05/meatshare.jpg"><img class="alignright size-medium wp-image-112" title="meatshare" src="http://www.harlan.harris.name/wp-content/uploads/2011/05/meatshare-300x222.jpg" alt="" width="300" height="222" /></a>And not just selling via web sites, but the kind of logistics technology that allows Whole Foods (and <a href="http://www.youtube.com/watch?v=mRAHa_Po0Kg" target="_blank">UPS</a>, and Walmart) to efficiently get huge varieties of goods from place to place? A group at a recent <a href="http://foodhack.eventbrite.com/" target="_blank">food-tech hack-a-thon</a> had the start of this idea. They put together <a href="http://groupme.at/" target="_blank">a quick demo of a front-end web site</a> (&#8220;groupme.at&#8221; &#8212; clever!) that would allow consumers to choose smaller sets of cuts in such a way that the orders neatly add up to a whole animal. By setting up a platform that can be easily connected with many small producers all over the country, the problem of every producer needing to be a webmaster is eliminated. And the system to get all of the pieces to add up to whole animals reduces risk for the farmer. It&#8217;s a great start. But by leveraging additional open-source tools and some ingenuity, I think it should be possible to do even more.</p>
<p>Imagine a similar web site, but instead of selecting a pre-selected package of cuts, you instead indicate your preferences and price range. As animals become available, you get an emailed notification of a delivery with a set of products that are very similar to the preferences you specified. You might love pork belly and boneless loin. Your neighbor might love cured pork belly (bacon) and chops. You might hate liver, but you&#8217;d accept some pig ears every once in a while for your dog. And your neighbor might really like the fatback to render for lard, while you&#8217;d find that useless. Everyone who might be sharing in an animal indicates their preferences, and the web site would automatically give everyone as much as possible of what they like the most. Equally important, all of the parts add up to whole animals, so the farmer is not stuck with the risk of unsold inventory.</p>
<p>Now imagine that after a few months, you&#8217;ve ranked the cuts of meat from Alice&#8217;s Farm 5 stars, but the ones from Bob&#8217;s Ranch only 3 stars. And you&#8217;ve told the system that you&#8217;re willing to pay more to get more of what you really want, but your neighbor tells the web site that he&#8217;s willing to make trade-offs to spend less money. You&#8217;ve essentially added other constraints that, if balanced well, will make everyone as happy as possible. Also, notice that I mentioned both boneless loin and pork chops? They&#8217;re more-or-less the same part of the animal cut different ways, so you can&#8217;t sell them both off the same half of the same animal. Now you have exclusive constraints to add into the mix. Maybe everyone&#8217;s better off if you get the boneless loin, or maybe everyone&#8217;s better off if your neighbor gets the chops. It&#8217;s easy to imagine collecting all of this information, but how do you combine it all and optimize the outcome in a <a href="http://en.wikipedia.org/wiki/Utilitarianism" target="_blank">utilitarian</a> way?</p>
<p>Why, operations research and computational optimization! Write some software that plugs everyone&#8217;s constraints into a set of equations, push a button, let the computer think for a second or two, and wham, you get a solution that balances the constraints as fairly as possible! Send the cut list to the slaughterhouse and email the product lists and bills to the customers, and you&#8217;re basically done.</p>
<p>In the past, this sort of supply-chain optimization required massive computing power and complex software design. But now, there are <a href="http://www.coin-or.org/" target="_blank">open-source code bases</a> for solving this sort of problem, at least at the scale needed to balance the preferences of a few farmers and a few dozen or hundred customers at a time.</p>
<p>This is the next step in leveraging technology to make at least some aspects of the supply chain for small-scale meat operations as efficient as what Perdue does, while maintaining the high quality and personal connection to the farm that many people want now. All that&#8217;s needed are some enterprising hackers to write the code and set up a scalable, configurable web platform for preference-based meatshares.</p>
<p>In my <a href="http://www.harlan.harris.name/2011/05/optimizing-meat-shares-details/">next post</a>, I&#8217;ll demonstrate how to write code that uses one of those open-source optimization libraries to solve a small version of this problem. If you&#8217;re interested in reading R code, stay tuned!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2011/05/meat-share-optimization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>intuitive visualizations of categorization for non-technical audiences</title>
		<link>http://www.harlan.harris.name/2011/04/visualizing-categorization-models/</link>
		<comments>http://www.harlan.harris.name/2011/04/visualizing-categorization-models/#comments</comments>
		<pubDate>Mon, 25 Apr 2011 12:45:48 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[analytics]]></category>
		<category><![CDATA[dataviz]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[predictive]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=88</guid>
		<description><![CDATA[For a project I&#8217;m working on at work, I&#8217;m building a predictive model that categorizes something (I can&#8217;t tell you what) into two bins. There is a default bin that 95% of the things belong to and a bin that the business cares a lot about, containing 5% of the things. Some readers may be [...]]]></description>
			<content:encoded><![CDATA[<p>For a project I&#8217;m working on at work, I&#8217;m building a predictive model that categorizes something (I can&#8217;t tell you what) into two bins. There is a default bin that 95% of the things belong to and a bin that the business cares a lot about, containing 5% of the things. Some readers may be familiar with the use of predictive models to identify better sales leads, so that you can target the leads most likely to convert and minimize the amount of effort wasted on people who won&#8217;t purchase your product. Although my situation doesn&#8217;t have to do with sales leads, I&#8217;m going to pretend it does, as it&#8217;s a common domain.</p>
<p>My data is many thousands of &#8220;leads&#8221;, for which I&#8217;ve constructed hundreds of predictive features (mostly 1/0, a few numeric) each. I can plug this data into any number of common statistical and machine learning systems which will crunch the numbers and provide a black box that can do a pretty good job of separating more-valuable leads from less valuable leads. That&#8217;s great, but now I have to communicate what I&#8217;ve done, and how valuable it is, to an audience that struggles with relatively simple statistical concepts like correlation. What can I do?</p>
<p><span id="more-88"></span></p>
<p>I&#8217;m generally interested in finding better ways to build clean, intuitive, and informative visualizations of data, especially when the visualizations can leverage intuitions and skills that everyone has. For example, almost everyone has a surprisingly good <a href="http://www.nytimes.com/2008/09/16/science/16angi.html" target="_blank">approximate number sense</a>, the ability to quickly identify about how many items are in a largish group. For example, if shown a photo of 30 oranges and a photo of 20 oranges, you would be able to immediately say that there were more oranges in the first photo, and you would happily say that that photo had a few dozen oranges in it. This psychological skill can be used to make more effective visualizations of certain types of data. Instead of comparing two quantities by lines in a chart, or even a number in a table, it may be useful to compare <em>visual density</em>.</p>
<p>How can this be used to make better visualizations of prediction quality? Consider the standard ways that predictive model quality is reported. I have obfuscated the test set data from the problem I mentioned above, and placed it in a <a href="http://dl.dropbox.com/u/7644953/classifier-visualization.Rdata">public Dropbox</a> in Rdata format. I&#8217;ve also put together an <a href="http://www.r-project.org/" target="_blank">R</a> script to demonstrate various ways of looking at the predictions and put it in a <a href="http://gist.github.com/937821" target="_blank">Github gist</a>. Follow along if you&#8217;d like.</p>
<p>First, take a look at the data frame and some summary statistics:</p>
<pre>&gt; head(pred.df)
      predicted actual actual.bin
7379  0.6020833    yes          1
5357  0.5791667    yes          1
7894  0.5791667    yes          1
5893  0.5604167    yes          1
16093 0.5541667    yes          1
2883  0.5520833    yes          1

&gt; summary(pred.df)
   predicted        actual       actual.bin
 Min.   :0.000000   no :7785   Min.   :0.0000
 1st Qu.:0.004167   yes: 366   1st Qu.:0.0000
 Median :0.016667              Median :0.0000
 Mean   :0.040827              Mean   :0.0449
 3rd Qu.:0.041667              3rd Qu.:0.0000
 Max.   :0.602083              Max.   :1.0000</pre>
<p>The model predicts about 4% of the items will be in the &#8220;yes&#8221; category, which is similar to the 4.5% that actually were. Using the very flexible <a href="http://cran.r-project.org/web/packages/ROCR/index.html" target="_blank">ROCR</a> package, I can quickly and easily convert this data frame into an object that can then be used to calculate any number of standard measures of predictiveness. First, I calculate the <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic" target="_blank">AUC</a> value, which has a very intuitive interpretation. Consider sorting the list of items from most-predicted-to-be-&#8220;yes&#8221; to least. If the predictions are good, most of the &#8220;yes&#8221; values will be relatively high in the list. The AUC is equivalent to asking, if I randomly pick a &#8220;yes&#8221; item and a &#8220;no&#8221; item out of the list, how likely is the &#8220;yes&#8221; item to be higher on the list? If the list were randomly shuffled, the AUC would be 0.5; if it were perfectly sorted with 20/20 hindsight, the AUC would be 1.0.</p>
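<p>To make that pairwise reading concrete, here is a small sketch in base R that computes the AUC by brute force. The eight scores and labels are made up for illustration and are not from my data set.</p>
<pre># toy scores and labels (hypothetical, for illustration only)
predicted &lt;- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
actual.bin &lt;- c(1, 1, 0, 1, 0, 0, 1, 0)
yes &lt;- predicted[actual.bin == 1]
no &lt;- predicted[actual.bin == 0]
# compare every yes score with every no score; ties count as half a win
wins &lt;- outer(yes, no, function(y, n) (y &gt; n) + 0.5 * (y == n))
auc.pairs &lt;- mean(wins)
# the same number from the rank-sum (Mann-Whitney) formula
n.yes &lt;- length(yes)
auc.ranks &lt;- (sum(rank(predicted)[actual.bin == 1]) - n.yes * (n.yes + 1) / 2) / (n.yes * length(no))
print(c(auc.pairs, auc.ranks))</pre>
<p>Here 12 of the 16 yes/no pairs are ordered correctly, so both calculations give 0.75. On the real data, ROCR does the same job:</p>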
<pre>&gt; library(ROCR)
&gt; # convert to their object type (labels should be some sort of ordered type)
&gt; pred.rocr &lt;- prediction(pred.df$predicted, pred.df$actual)
&gt; # Area Under the ROC Curve
&gt; performance(pred.rocr, 'auc')@y.values[[1]]
[1] 0.8237496</pre>
<p>In this case, it&#8217;s about .82, which is probably valuable but far from perfect. Another common way of looking at this type of prediction comes from business uses, where the goal is to identify leads (or whatever) that are likely to convert to purchases. From this point of view, the goal is to <em>lift</em> the leads higher in the list, so that you can focus on the top of the list and get more benefit from sales effort with less work. Two common ways of looking at lift are with a decile table, which shows how much value you get by focusing on the top 10%, 20%, etc. of the list, sorted by the predictive model, and the lift chart, which visualizes the same thing by showing how much benefit over random guessing you get by looking at more or less of the sorted list. Here they are for this data:</p>
<pre># decile table (assumes pred.df is sorted by predicted, highest first)
library(plyr)  # for ldply
dec.table &lt;- ldply((1:10)/10, function(x) data.frame(
    decile=x,
    prop.yes=sum(pred.df$actual.bin[1:ceiling(nrow(pred.df)*x)])/sum(pred.df$actual.bin),
    lift=mean(pred.df$actual.bin[1:ceiling(nrow(pred.df)*x)])/mean(pred.df$actual.bin)))
print(dec.table, digits=2)

   decile prop.yes lift
1     0.1     0.61  6.1
2     0.2     0.69  3.4
3     0.3     0.76  2.5
4     0.4     0.80  2.0
5     0.5     0.84  1.7
6     0.6     0.90  1.5
7     0.7     0.92  1.3
8     0.8     0.95  1.2
9     0.9     0.99  1.1
10    1.0     1.00  1.0

# Lift Curve
plot(performance(pred.rocr, 'lift', 'rpp'))</pre>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/liftchart.png"><img class="aligncenter size-full wp-image-91" title="Lift Chart" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/liftchart.png" alt="" width="400" height="350" /></a><br />
This graph shows, not particularly intuitively in my view, that if you focus on the top 10% of the data, you get more than 5 times the bang for the buck than if you focus evenly on the whole set of items. The decile table shows the same thing &#8212; the top decile is lifted by a factor of 6.1, and in fact you get 61% of the &#8220;yes&#8221; items in that top 10% of the data. These are very useful numbers to know, but I think there are considerably more intuitive ways of showing how the predictive model pulls the &#8220;yes&#8221; values away from the 5% base rate.</p>
<p>These more intuitive ways are <em>not</em> the standard graphs used in statistics and machine learning, such as the sensitivity/specificity curve and the ROC curve. Those graphs, shown below, illustrate trade-offs between accepting false positives and false negatives. Useful, yes, but to understand them you have to think about the ways you could set a threshold and what effect that threshold would have on the nature of your predictions. That&#8217;s not particularly intuitive, and the visualization doesn&#8217;t visually contrast two things, so it&#8217;s difficult to get an intuitive understanding of what has been gained.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/specsens.png"><img class="aligncenter size-full wp-image-94" title="specsens" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/specsens.png" alt="" width="367" height="276" /></a><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/roc.png"><img class="aligncenter size-full wp-image-95" title="ROC" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/roc.png" alt="" width="376" height="270" /></a>I&#8217;ve put some thought and some tinkering into potentially better ways of visualizing the output of predictive models. The key, I think, is to use a visualization that builds on the scatter graph. Scatter graphs are great for less-technical audiences, because you can tell them that every individual dot is a customer (widget, whatever). They can immediately see the number of items in question, and if you can plot the points on axes that make sense to them, they can go from &#8220;that dot there represents one person with this level of X and this level of Y&#8221;, to &#8220;this set of dots represents a set of people with similar levels of X and Y&#8221;, to &#8220;this graph represents everyone, and their respective levels of X and Y.&#8221; And because of skills like the approximate number sense and the ability to quickly understand visual density, scatter graphs can give a vastly better understanding of the range of a data set than summary graphs that just plot a line and maybe some error bars.</p>
<p>Here are several versions of a graph that illustrates how the predictive model smears out the set of dots from the 5% base rate, disproportionately pulling the &#8220;yes&#8221; items to the right, separating at least some of them from the much larger set of &#8220;no&#8221; items. One key change from a basic scatter graph is to jitter the Y position of each point randomly, which I think makes these graphs look a little like a <a href="http://en.wikipedia.org/wiki/Agarose_gel_electrophoresis" target="_blank">PCR gel image</a>.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/dual.png"><img class="aligncenter size-full wp-image-97" title="dual" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/dual.png" alt="" width="330" height="400" /></a>This first approach is built around a basic scatter graph, where the X axis is the predicted likelihood of being a &#8220;yes&#8221;, and the Y axis is 0 for actual &#8220;no&#8221; and 1 for actual &#8220;yes&#8221; items. On top of that is an orange line representing the base rate of about 5%, a blue line showing the smoothed ratio between &#8220;yes&#8221; and &#8220;no&#8221; items at each level of prediction, and a thin grey line showing where the blue line ought to be. In this case, the model tends to underestimate the likelihood that some items are to be &#8220;yes&#8221; items. At 50%, half of the items should be &#8220;yes&#8221; and half should be &#8220;no&#8221;, but it&#8217;s more like 3:1.</p>
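<p>The comparison behind the blue and grey lines can also be sketched numerically, by binning the predictions and comparing the mean prediction in each bin with the observed rate of &#8220;yes&#8221; items. This is a toy version with made-up numbers and arbitrarily chosen bin edges. It uses simple binning rather than the smoothing used for the graph.</p>
<pre># toy predictions and outcomes (hypothetical, for illustration only)
predicted &lt;- c(0.05, 0.10, 0.15, 0.30, 0.35, 0.45, 0.55, 0.60)
actual.bin &lt;- c(0, 0, 1, 0, 1, 1, 1, 1)
# group the predictions into three bins (edges chosen arbitrarily)
bins &lt;- cut(predicted, breaks = c(0, 0.25, 0.5, 0.75), include.lowest = TRUE)
# per bin: what the model predicted on average vs. what actually happened
calib &lt;- data.frame(
    bin = levels(bins),
    mean.predicted = as.vector(tapply(predicted, bins, mean)),
    observed.rate = as.vector(tapply(actual.bin, bins, mean)))
print(calib, digits = 2)</pre>
<p>In this toy example the observed rate is higher than the mean prediction in every bin, which is the same kind of underestimation that shows up as the blue line sitting above the grey one.</p>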
<p>I like this graph as it intuitively lets people see the extent to which the predictive model is separating the categories, and how much better it does than just assuming the base rate. My second approach at this combines the &#8220;smears&#8221; with another way of visualizing lift.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/single.png"><img class="aligncenter size-full wp-image-96" title="single" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/single.png" alt="" width="400" height="400" /></a>In this graph, the smeared real data is at the bottom of the graph, and the black line represents the lift, or how much better you are at identifying &#8220;yes&#8221; items by using the predictions. It&#8217;s also an intuitive way of motivating the need to draw a boundary to focus effort. When trying to convert the points at the 25% level or above, you may be ineffective 75% of the time, but you&#8217;re also more than 10 times more efficient than you would be otherwise.</p>
<p>My final attempt worth sharing is this one, which combines the dual smear approach with the cumulative value numbers from the lift table.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2011/04/dualcum.png"><img class="aligncenter size-full wp-image-98" title="dualcum" src="http://www.harlan.harris.name/wp-content/uploads/2011/04/dualcum.png" alt="" width="450" height="400" /></a>Now, in addition to being able to see the density of yes and no items for various levels of the prediction, you can see what proportion of the potential &#8220;yes&#8221; values exist to the <em>right </em>of each level of the prediction. For example, at a threshold of 25%, you capture 40 or 45% of the &#8220;yes&#8221; items. At a 5% threshold you capture more than 70% of the &#8220;yes&#8221; items.</p>
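<p>The cumulative number at each threshold is simple to compute directly. Here is a minimal sketch in base R. The function name capture.rate and the toy data are mine, for illustration only, and are not taken from the gist.</p>
<pre># proportion of all yes items whose prediction is at or above a threshold
capture.rate &lt;- function(predicted, actual.bin, threshold) {
    sum(actual.bin[predicted &gt;= threshold]) / sum(actual.bin)
}
# toy predictions and outcomes (hypothetical)
predicted &lt;- c(0.60, 0.40, 0.30, 0.20, 0.10, 0.04, 0.03, 0.01)
actual.bin &lt;- c(1, 1, 0, 1, 0, 1, 0, 0)
print(capture.rate(predicted, actual.bin, 0.25))
print(capture.rate(predicted, actual.bin, 0.05))</pre>
<p>With these toy numbers, a 25% threshold captures 2 of the 4 &#8220;yes&#8221; items and a 5% threshold captures 3 of the 4. Lowering the threshold always captures at least as many, at the cost of working through more of the list.</p>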
<p>I&#8217;d love some feedback on these graphs! Do you agree with my assertion that scatter graphs are more visually intuitive and  easier to motivate to non-technical audiences? Do these variations on lift charts seem clearer or more valuable than traditional alternatives to you? Have I re-invented something that should be cited?</p>
<p>The R code for these graphs is available in the <a href="https://gist.github.com/937821" target="_blank">Github gist</a>. I used <a href="http://www.harlan.harris.name/tag/ggplot2/" target="_blank">ggplot2</a>, naturally, which is an essential tool for exploring the space of possible visualizations without being tied down by traditional graph structures.</p>
<p>Incidentally, for people interested in building graphs that leverage people&#8217;s innate visual capabilities, I recommend Kosslyn&#8217;s book, <a href="http://www.amazon.com/Graph-Design-Mind-Stephen-Kosslyn/dp/0195311841" target="_blank">Graph Design for the Eye and Mind</a>.</p>
<p>Also incidentally, the question of how to communicate or visualize the potentially incredibly complex sets of rules/weights/whatever inside the categorization black box is another fascinating issue, the subject of ongoing research, and maybe something I&#8217;ll write about soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2011/04/visualizing-categorization-models/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On &#8220;Analytics&#8221; and related fields</title>
		<link>http://www.harlan.harris.name/2011/04/on-analytics-and-related-fields/</link>
		<comments>http://www.harlan.harris.name/2011/04/on-analytics-and-related-fields/#comments</comments>
		<pubDate>Fri, 15 Apr 2011 21:30:30 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[conferences]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[job titles]]></category>
		<category><![CDATA[operations research]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=83</guid>
		<description><![CDATA[I recently attended the INFORMS Conference on Business Analytics and Operations Research, aka &#8220;INFORMS Analytics 2011&#8221;, in Chicago. This deserves a little bit of an explanation. INFORMS is the professional organization for Operations Research (OR) and Management Science (MS), which are terms describing approaches to improving business efficiency by use of mathematical optimization and [...]]]></description>
			<content:encoded><![CDATA[<p>I recently attended the <a href="http://meetings2.informs.org/Analytics2011/" target="_blank">INFORMS Conference on Business Analytics and Operations Research</a>, aka &#8220;INFORMS Analytics 2011&#8221;, in Chicago. This deserves a little bit of an explanation. <a href="http://www.informs.org/" target="_blank">INFORMS</a> is the professional organization for Operations Research (OR) and Management Science (MS), which are terms describing approaches to improving business efficiency by use of mathematical optimization and simulation tools. OR is perhaps best known for the technique of Linear Programming (read &#8220;Programming&#8221; as &#8220;Planning&#8221;), which is a method for optimizing a useful class of mathematical expressions under various constraints extremely efficiently. You can, for example, solve scheduling, assignment, transportation, factory layout, and similar problems with millions of variables in seconds. These techniques came out of large-scale government and especially military logistics and decision-making needs of the mid-20th century, and have now been applied extensively in many industries. Have you seen the <a href="http://www.youtube.com/watch?v=mRAHa_Po0Kg" target="_blank">UPS &#8220;We (heart) Logistics&#8221; ad</a>? That&#8217;s OR.</p>
<p>OR is useful, but it&#8217;s not sexy, despite UPS&#8217; best efforts. Interest in OR programs in universities (often specialties of Industrial Engineering departments) has been down in recent years, as has been attendance at INFORMS conferences. On the other hand, if you ignore the part about &#8220;optimization&#8221; and just see OR as &#8220;improving business efficiency by use of mathematical processes,&#8221; this makes no sense at all! Hasn&#8217;t <a href="http://ngrams.googlelabs.com/graph?content=analytics%2Coperations+research&amp;year_start=1988&amp;year_end=2008&amp;corpus=0&amp;smoothing=3" target="_blank">Analytics</a> been a buzzword for the past few years? (&#8220;<a href="http://www.google.com/search?q=analytics+buzzword" target="_blank">analytics buzzword</a>&#8221; gets 2.4 million results on Google.) Haven&#8217;t there been <a href="http://www.amazon.com/Super-Crunchers-Thinking-Numbers-Smart/dp/0553805401" target="_blank">bestselling</a> <a href="http://www.amazon.com/Competing-Analytics-New-Science-Winning/dp/1422103323" target="_blank">business</a> <a href="http://www.amazon.com/Moneyball-Art-Winning-Unfair-Game/dp/0393324818" target="_blank">books</a> about mathematical tools being used in all sorts of industries? (That last link is about baseball.) Hasn&#8217;t the use of statistical and mathematical techniques in business been called &#8220;<a href="http://www.mckinseyquarterly.com/Hal_Varian_on_how_the_Web_challenges_managers_2286">sexy</a>&#8221; by Google&#8217;s Chief Economist? How could a field and an industry that at some level seems to be the very <em>definition</em> of what&#8217;s cool in business and technology right now be seen as a relic of <a href="http://www.imdb.com/title/tt0317910/" target="_blank">McNamara&#8217;s vision of the world</a>?</p>
<p>To answer that rhetorical question, I think it&#8217;s worth considering the many ways that organizations can use data about their operations to improve their effectiveness. SAS has a really useful hierarchy, which it calls the <a href="http://www.sas.com/news/sascom/2008q4/column_8levels.html" target="_blank">Eight levels of analytics</a>.</p>
<ol>
<li>Standard Reports &#8211; pre-processed, regular summaries of historical data</li>
<li>Ad Hoc Reports &#8211; the ability for analysts to ask new questions and get new answers</li>
<li>Query Drilldown &#8211; the ability for non-technical users to slice and dice data to see results interactively</li>
<li>Alerts &#8211; systems that detect atypical conditions and notify people</li>
<li>Statistical Analysis &#8211; use of regressions and similar to find trends and correlations in historical data</li>
<li>Forecasting &#8211; ability to extrapolate from historical data to estimate future business</li>
<li>Predictive Analytics &#8211; advanced forecasting, using statistical and machine-learning tools and large data sets</li>
<li>Optimization &#8211; balance competing goals to maximize results</li>
</ol>
<p>I like this hierarchy because it distinguishes among a bunch of different disciplines and technologies that tend to run together. For example, what&#8217;s often called &#8220;Business Intelligence&#8221; is a set of tools for doing items #1-#4. No statistics per se are involved, just the ability to provide useful summaries of data, in various ways, to the people who need them. At its most statistically advanced, BI includes tools for data visualization that are informed by research, and at its most technologically advanced, BI includes sophisticated database and data management systems to keep everything running quickly and reliably. These are not small accomplishments, and this is a substantial and useful thing to be able to do.</p>
<p>But it&#8217;s not what &#8220;data scientists&#8221; in industry do, or at least, it&#8217;s not what makes them sexy and valuable. When you apply the tools of scientific inquiry, statistical analysis, and machine learning to data, you get the abilities in levels #5-#7. Real causality can be separated from random noise. Eclectic data sources, including unstructured documents, can be processed for valuable predictive features. Models can <a href="http://www.gladwell.com/2006/2006_10_16_a_formula.html" target="_blank">predict movie revenue</a> or <a href="http://www.netflixprize.com/" target="_blank">recommend movies you want to see</a> or any number of other fascinating things. Great stuff. Not BI.</p>
<p>And not really OR either, unless you redefine OR. OR is definitely #8, the ability to build sophisticated mathematical models that can be used not just to predict the future, but to find a way to get to the future you want.</p>
<p>So why did I go to an INFORMS conference with the word &#8220;Analytics&#8221; in its title? This same conference used to be called &#8220;The INFORMS Conference on OR Practice&#8221;. Why the change? This has been the topic of constant conversation recently, among the leaders of the society, as well as among the attendees of the conference. There are a number of possible answers, from jumping on a bandwagon, to trying to protect academic turf, to trying to let &#8220;data geeks&#8221; know that there&#8217;s a whole world of &#8220;advanced&#8221; analytics beyond &#8220;just&#8221; predictive modeling.</p>
<p>I think all of those are right, and justifiable, despite the pejorative slant. SAS&#8217; hierarchy does define a useful progression among useful analytic skills. INFORMS recently hired <a href="http://www.us.capgemini.com/" target="_blank">consultants</a> to help them figure out how to place themselves, and identified a similar set of overlapping distinctions:</p>
<ul>
<li>Descriptive Analytics &#8212; Analysis and reporting of patterns in historical data</li>
<li>Predictive Analytics &#8212; Predicts future trends, finds complex relationships in data</li>
<li>Prescriptive Analytics &#8212; Determines better procedures and strategies, balances constraints</li>
</ul>
<p>They also have been using &#8220;Advanced Analytics&#8221; for the Predictive and Prescriptive categories.</p>
<p>I do like these definitions. But do I like the OR professional society trying to add Predictive Analytics to the scope of their domain, or at least of their Business-focused conference? I&#8217;m on the fence. It&#8217;s clearly valuable to link optimization to prediction, in business as well as other sorts of domains. (In fact, I have a recent Powerpoint slide that says &#8220;You can&#8217;t optimize what you can&#8217;t predict&#8221;!) And crosstalk among practitioners of these fields can be nothing but positive. I certainly have learned a lot about appropriate technologies from my membership in a variety of professional organizations.</p>
<p>But the whole scope of &#8220;analytics&#8221; is a lot of ground, and the underlying research and technology spans several very different fields. I&#8217;d be surprised if there were more than a dozen people at INFORMS with substantial expertise in <a href="http://en.wikipedia.org/wiki/Text_mining" target="_blank">text mining</a>, for example. There almost needs to be a <em>new</em> business-focused advanced analytics conference, sponsored jointly by the professional societies of the <a href="http://www.machinelearning.org/" target="_blank">machine learning</a>, <a href="http://www.amstat.org/" target="_blank">statistics</a>, and <a href="http://www.informs.org/" target="_blank">OR</a> fields, covering everything that businesses large and small do with data that is more mathematically sophisticated (though not necessarily more useful) than the material covered by the <a href="http://www.google.com/search?q=business+intelligence+conferences" target="_blank">many business intelligence conferences and trade shows</a>. Would that address the problem of advanced analytics better than trying to expand the definition of OR?</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2011/04/on-analytics-and-related-fields/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>&#8220;Data Scientist&#8221; and other titles</title>
		<link>http://www.harlan.harris.name/2011/02/data-scientist-and-other-titles/</link>
		<comments>http://www.harlan.harris.name/2011/02/data-scientist-and-other-titles/#comments</comments>
		<pubDate>Sun, 13 Feb 2011 20:20:07 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[data science]]></category>
		<category><![CDATA[job titles]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=81</guid>
		<description><![CDATA[Neil Saunders has an interesting (to me) blog post up this morning, with the title &#8220;Dumped on by data scientists.&#8221; He uses the use of &#8220;data scientist&#8221; in a Chronicle of Higher Ed article to rant a little bit about the term. For Neil, it&#8217;s redundant, as the act of doing science necessarily requires data; [...]]]></description>
			<content:encoded><![CDATA[<p>Neil Saunders has an interesting (to me) blog post up this morning, with the title &#8220;<a href="http://nsaunders.wordpress.com/2011/02/13/dumped-on-by-data-scientists/">Dumped on by data scientists</a>.&#8221; He uses the use of &#8220;data scientist&#8221; in a <a href="http://chronicle.com/article/Dumped-On-by-Data-Scientists/126324/">Chronicle of Higher Ed article</a> to rant a little bit about the term. For Neil, it&#8217;s redundant, as the act of doing science necessarily requires data; it&#8217;s insulting, as if &#8220;scientist&#8221; wasn&#8217;t cool enough and you have to add &#8220;data&#8221;; and it&#8217;s misleading, as many people who call themselves &#8220;data scientists&#8221; are actually dealing with business data rather than scientific data.</p>
<p>Without disagreeing that there&#8217;s a terminological sprawl going on, I did want to address the use of the term, and partially disagree with Neil.</p>
<p>As someone with scientific training who uses those tools to solve business problems, I certainly struggle with a description of my role. “Data Scientist” or &#8220;Statistical Data Scientist&#8221; is actually pretty good, as it correctly indicates that I use scientific techniques (controlled experiments, sophisticated statistics) to understand our company’s data. I often describe myself as a “Statistician”, too, which gets across some of the same ideas without people having to do a double take and parse a new phrase. I also sometimes describe myself as doing “Operations Research” (aka “Management Science”, although I don’t use that term), since I use some of the tools of that field, as well as of Artificial Intelligence/Machine Learning, to optimize certain objective functions. </p>
<p>&#8220;Business Intelligence&#8221; actually is not that good a term for what I do, as most of what is usually called BI is about tools for better/more relevant/faster access to data for business people to use. This is not a bad thing to be doing, at all, but it&#8217;s different from the predictive and inferential statistical methods that I use in my job.</p>
<p>I don’t know what the right answer is. It might depend on the precise person and their precise role. My title, for instance, is the result of a back-and-forth with my boss, HR, and others, trying to find words that have both appropriate internal and external meanings. &#8220;Technical Lead&#8221; is a rank, indicating that I run technical projects without (formally) managing people. &#8220;Inventory Optimization and Research&#8221; covers a variety of areas. &#8220;Inventory&#8221; here means &#8220;sellable units&#8221;, like boxes on a shelf, or in this case, like scheduled airline flights. Probably baffling for an external audience without an explanation, but extremely clear inside the company. &#8220;Optimization&#8221; means what it sounds like, both in a technical and a non-technical sense, and for both internal and external audiences. &#8220;Research&#8221; indicates a focus on the development of long-term and cutting-edge systems. &#8220;Data Scientist&#8221; didn&#8217;t end up in there, but it could have.</p>
<p>For people using Big Data tools and scientific methods to study topics inside academia, the right answer seems to me to put the field of study first. You&#8217;re not a &#8220;Data Scientist&#8221;, you&#8217;re an astrophysicist, or a bioinformatician, or a neuroscientist, with a specialization in statistical methods. If you&#8217;re a generalist inside the academy, you&#8217;re probably a statistician. Perhaps &#8220;Data Scientist&#8221; should be restricted to people applying scientific tools and techniques to problems of non-academic interest? That might work, as long as it included people who do things like apply predictive analytic tools to hospital admissions data. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2011/02/data-scientist-and-other-titles/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>how to speak ggplot2 like a native, and Predictive Analytics World</title>
		<link>http://www.harlan.harris.name/2010/10/how-to-speak-ggplot2-like-a-native-and-predictive-analytics-world/</link>
		<comments>http://www.harlan.harris.name/2010/10/how-to-speak-ggplot2-like-a-native-and-predictive-analytics-world/#comments</comments>
		<pubDate>Sun, 24 Oct 2010 23:17:08 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[meetup]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=71</guid>
		<description><![CDATA[I was recently given the opportunity to re-present my ggplot2 talk, which I originally gave to the NYC R Meetup, to the DC R Meetup group. The Meetup was held co-located with the Predictive Analytics World conference in Alexandria, VA. (More on my thoughts on PAW below&#8230;) Contentwise, I made only small changes, changing a [...]]]></description>
			<content:encoded><![CDATA[<p>I was recently given the opportunity to re-present <a href="http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/">my ggplot2 talk, which I originally gave</a> to <a href="http://www.meetup.com/nyhackr/">the NYC R Meetup</a>, to <a href="http://www.meetup.com/R-users-DC/">the DC R Meetup </a>group. The Meetup was held co-located with the <a href="http://www.predictiveanalyticsworld.com/">Predictive Analytics World </a>conference in Alexandria, VA. (More on my thoughts on PAW below&#8230;) Contentwise, I made only small changes, changing a bit of patter and adding more examples at the end. I still love ggplot, with some frustration at the way it is typically introduced. Some of the audience had no R experience at all, while others were experts. One person, a grad student at U. of Maryland, had had very similar difficulty as I had when originally learning ggplot2, and his enthusiastic nods during my presentation were very validating! For reference, <a href="http://www.meetup.com/R-users-DC/calendar/14236478/">the Meetup page is here</a>, and I stuck the current version of the slides in a public <a href="http://www.dropbox.com/">Dropbox</a>, <a href="http://dl.dropbox.com/u/7644953/ggplotIntro%20-%20PAW2010.pptx">located here</a>.</p>
<p>And a few thoughts about PAW. The conference was well-run (although I have my gripes with the hotel and its location!) and there were an interesting and eclectic lineup of speakers, from a variety of industries. Compared to academic conferences I&#8217;ve attended, I missed having all the grad students around. At PAW, I felt rather young, which had not been true at academic conferences in quite a long time! The content of the conference focused on people using predictive methods (statistics, data mining, machine learning) at the individual-customer level, for marketing or retainment or other purposes. That&#8217;s not my primary interest right now &#8212; my work is focused at a slightly higher <a href="http://en.wikipedia.org/wiki/Operations_research">operations-research</a>-y level, trying to make sure that customers in the aggregate have good options. But I enjoyed learning about what other people are doing using somewhat similar methods. Next year, though, I think I&#8217;ll try to go to a different conference, perhaps <a href="http://www.warwick.ac.uk/statsdept/useR-2011/">UseR!</a> in the UK, or <a href="http://meetings2.informs.org/Practice2011/">INFORMS&#8217; applied conference</a>&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2010/10/how-to-speak-ggplot2-like-a-native-and-predictive-analytics-world/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Prediction with Multilevel Regression Models, and Pizza</title>
		<link>http://www.harlan.harris.name/2010/10/prediction-with-multilevel-regression-models-and-pizza/</link>
		<comments>http://www.harlan.harris.name/2010/10/prediction-with-multilevel-regression-models-and-pizza/#comments</comments>
		<pubDate>Fri, 15 Oct 2010 15:45:08 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[meetup]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=67</guid>
		<description><![CDATA[The Meetup phenomenon, which is now substantial and longstanding enough to be more of a cultural change than a flash in the pan, continues to impress me. Even more so than tools like LinkedIn, Meetups have changed the nature of professional networking, making it more informal, diverse, and decentralized. Last night, statistics consultant (and cheap [...]]]></description>
			<content:encoded><![CDATA[<p>The <a href="http://www.meetup.com/" target="_blank">Meetup</a> phenomenon, which is now substantial and longstanding enough to be more of a cultural change than a <a href="http://www.foursquare.com/" target="_blank">flash in the pan</a>, continues to impress me. Even more so than tools like <a href="http://www.linkedin.com/">LinkedIn</a>, Meetups have changed the nature of professional networking, making it more informal, diverse, and decentralized. Last night, statistics consultant (and cheap eats guru) <a href="http://www.jaredlander.com/" target="_blank">Jared Lander</a> and I presented a talk on a statistical technique tangentially related to my professional work (more closely associated with Jared&#8217;s). The origin of this presentation is worth noting. On Meetup&#8217;s web site, members of a group can suggest topics for meetings. Before even attending a single <a href="http://www.meetup.com/NYC-Predictive-Analytics/">NYC Predictive Analytics</a> event, I posted several <a href="http://www.meetup.com/NYC-Predictive-Analytics/ideas/">topics</a> that I thought might be interesting for the group. A bit later, the organizers (<a href="http://www.meetup.com/NYC-Predictive-Analytics/members/9260862/">Bruno</a> and <a href="http://www.meetup.com/NYC-Predictive-Analytics/members/9260860/">Alex</a>) contacted me to see if I&#8217;d be willing to present on prediction with Multilevel models. I said that I would, but only if I could co-present with <span style="text-decoration: line-through;">someone who actually knew something about the topic</span> a complementary set of skills and experiences. 
Knowing Jared from the <a href="http://www.meetup.com/nyhackr/">NYC R Meetup</a> group, and knowing that he learned about multilevel models from the <a href="http://www.stat.columbia.edu/~gelman/blog/">professor</a> who wrote <a href="http://www.stat.columbia.edu/~gelman/arm/">the best book on the topic</a>, and knowing that he&#8217;s pretty good in front of an audience, I suggested we collaborate.</p>
<p>Despite requiring a lot of work, and a lot of learning of details on my part, we managed to throw together a pretty decent talk. (As of this morning, there are four ratings of <a href="http://www.meetup.com/NYC-Predictive-Analytics/calendar/14476011/">the event on Meetup</a>, and we got 5/5 stars! Yay us! Not statistically conclusive, though&#8230;) We used as an example topic for data analysis the difficult and critically important problem of predicting reviews of pizza restaurants in downtown NYC. Jared is actually an expert on this topic, having written his Masters thesis on ratings from <a href="http://www.menupages.com/" target="_blank">Menupages.com</a>. For the talk, Jared would present a few slides, then I&#8217;d present a few. In a few cases we&#8217;d both try to explain topics from slightly different points of view. I&#8217;d repeatedly try to use the keyboard instead of the remote-control gadget to control Powerpoint, causing the computer to melt down into a pile of slag and refuse to change the slide. Jared would send me withering glares when I started to move towards the keyboard. It ended up OK, though: we got through everything, and even answered about half of the (excellent) questions! Oh, and shout-out to the AV guy at AOL HQ. I don&#8217;t know how they pay his salary, but he rocked.</p>
<p>Jared has posted the slides from the talk <a href="http://www.jaredlander.com/wordpress/wordpress-2.9.2/wordpress/wp-content/uploads/2010/10/NYC-PA-Meetup-Multilevel-Models.ppt" target="_blank">here</a> (ppt), and I&#8217;ve put the data we made up (for pedagogical purposes) and the code we used to analyze it and generate graphs for the talk <a href="http://github.com/HarlanH/nyc-pa-meetup-multilevel-pizza">here on Github</a>. Alex video-recorded the presentation, and I&#8217;ll update this sentence to link to the video once it&#8217;s posted somewhere. Hope folks find it valuable!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2010/10/prediction-with-multilevel-regression-models-and-pizza/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>ggplot and concepts &#8212; what&#8217;s right, and what&#8217;s wrong</title>
		<link>http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/</link>
		<comments>http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/#comments</comments>
		<pubDate>Sun, 07 Mar 2010 21:52:16 +0000</pubDate>
		<dc:creator>Harlan</dc:creator>
				<category><![CDATA[Professional]]></category>
		<category><![CDATA[ggplot]]></category>
		<category><![CDATA[ggplot2]]></category>
		<category><![CDATA[graphics]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.harlan.harris.name/?p=47</guid>
		<description><![CDATA[A few months back I gave a presentation to the NYC R Meetup. (R is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on ggplot2, a popular package for generating graphs of data and statistics. In the talk (which you can see here, including [...]]]></description>
			<content:encoded><![CDATA[<p>A few months back I gave a presentation to the <a href="http://www.meetup.com/nyhackr/">NYC R Meetup</a>. (<a href="http://www.r-project.org/">R</a> is a statistical programming language. If this means nothing to you, feel free to stop reading now.) The presentation was on <a href="http://had.co.nz/ggplot2/">ggplot2</a>, a popular package for generating graphs of data and statistics. In the talk (<a href="http://www.vcasmo.com/video/drewconway/7017">which you can see here</a>, including both my slides and my patter!) I presented both the really great things about ggplot2 and some of its downsides. In this blog post, I wanted to expand a bit on my thinking on ggplot, the Grammar of Graphics, and how peoples&#8217; conceptual representations of graphs, data, ggplot, and R all interact. ggplot is both incredibly elegant and unfortunately difficult to learn to use well, I think as a consequence of the variety of representations.<span id="more-47"></span></p>
<p>The ggplot package, written by the overachieving and remarkable <a href="http://had.co.nz/">Hadley Wickham</a>, is based on <a href="http://books.google.com/books?id=_kRX4LoFfGQC&amp;dq=grammar+of+graphics&amp;printsec=frontcover&amp;source=bn&amp;hl=en&amp;ei=7kZsS8-lDI_e8Qb4hcD2BQ&amp;sa=X&amp;oi=book_result&amp;ct=result&amp;resnum=4&amp;ved=0CB8Q6AEwAw#v=onepage&amp;q=&amp;f=false">earlier more theoretical work by Leland Wilkinson</a>. Wilkinson abstracted the process of putting data onto an image, and created a Grammar of Graphics, which describes <em>how</em> the data maps to the parts of a graph, rather than describing the final graph itself. For example, here&#8217;s how to create a pie chart, clipped from Wilkinson&#8217;s book:</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-11.33.36-AM.png"><img class="aligncenter size-full wp-image-50" title="Wilkinson Pie Graph Example" src="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-11.33.36-AM.png" alt="" width="607" height="393" /></a>Don&#8217;t worry about the details, but briefly, a pie chart is just a stacked bar graph (summary.proportion) plotted in polar coordinates (polar.theta). If you took the time to learn this grammar, you would realize that the hierarchical structure of a graph on a page (elements have positions and labels and visual properties like color, each of which has its own abstract structure) maps cleanly to the hierarchical structure of the grammar, and that variables in the grammar map cleanly to the linear structure of the data. As a user of this system, you would be able to see all three key representations at once: the <span style="text-decoration: underline;">data</span>, the <span style="text-decoration: underline;">grammatical mapping</span> from data to graph, and the <span style="text-decoration: underline;">graph</span> itself.</p>
<p>Now consider ggplot, the implementation of the Grammar of Graphics in the R programming language. Does ggplot maintain three visible representations, all straightforwardly mappable to each other? Sadly, it does not. Instead, users of ggplot must map among four representations: the <span style="text-decoration: underline;">data</span> (a standard data.frame object), the <span style="text-decoration: underline;">R syntax</span> for ggplot2 (which has some quirks), an <span style="text-decoration: underline;">underlying ggplot object</span> (similar to the Grammar of Graphics, but vastly more complex and impossible to examine directly), and the generated <span style="text-decoration: underline;">graph</span>.</p>
<p>Consider the simple pie graph, below.</p>
<p><a href="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-1.56.41-PM.png"><img class="aligncenter size-full wp-image-53" title="Simple Pie Chart" src="http://www.harlan.harris.name/wp-content/uploads/2010/02/Screen-shot-2010-02-05-at-1.56.41-PM.png" alt="" width="249" height="183" /></a>This chart is generated in ggplot2 by the following R code:</p>
<pre>&gt; library(ggplot2)
&gt; zz &lt;- data.frame(cat=c("a", "b"), val=c(5,3))
&gt; zz
  cat val
1   a   5
2   b   3
&gt; pp &lt;- ggplot(zz, aes(x="", y=val, fill=cat)) + geom_bar(width=1) +
        coord_polar("y")
&gt; print(pp)</pre>
<p>The print() function is optional within an R interpreter session, but I include it because it illustrates a point that&#8217;s not initially obvious to many users. Unlike the built-in R plotting tools, the ggplot() function and its associated functions don&#8217;t plot anything on the screen; they just construct an object of type &#8220;ggplot&#8221;. Almost all of the actual work of mapping the data to stuff on your screen occurs when you print that object, using print() or ggsave().</p>
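<p>To make that build-then-render split concrete, here is a minimal sketch. It assumes a current ggplot2 release, where, as far as I can tell, geom_bar() rejects a y mapping unless you ask for stat="identity", so I use geom_col() instead. That substitution and the file name pie.png are my own assumptions, not part of the session above.</p>
<pre>library(ggplot2)

zz &lt;- data.frame(cat = c("a", "b"), val = c(5, 3))

# Building the plot only constructs an object; nothing is drawn yet.
# geom_col() plots the values as given (like geom_bar(stat = "identity")).
pp &lt;- ggplot(zz, aes(x = "", y = val, fill = cat)) +
  geom_col(width = 1) +
  coord_polar("y")

# Check that what we hold is a ggplot object rather than a picture.
inherits(pp, "ggplot")

# Rendering happens here, on the active graphics device.
print(pp)

# Or render straight to a file ("pie.png" is a hypothetical name).
ggsave("pie.png", plot = pp, width = 4, height = 3)</pre>
<p>Inside a function or a loop there is no automatic printing, so the explicit print() or ggsave() call is what actually produces the chart.</p>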
<p>So what does that object look like? If you type str(pp), you&#8217;ll get an answer, but it&#8217;s about a hundred lines of undecipherable hierarchical object and list structure, not intended to be examined by mere mortals. But there&#8217;s something critically important about that structure &#8212; like the original Grammar of Graphics, and unlike the R syntax above, it&#8217;s hierarchically structured.</p>
<p>In the R syntax, you create a base ggplot structure with the ggplot() call, then you abuse the &#8220;+&#8221; operator to make changes to that structure. The geom_bar() function adds a layer to the ggplot() object, where a layer is just what it sounds like, a set of information about one of potentially many overlaid layers of content that will be put on the graph. So you construct a ggplot object by first initializing everything about the basic plot, then tack on layers with +, right? Actually no, because the coord_polar() call doesn&#8217;t create or modify a layer at all, it modifies the base object! Even if you&#8217;ve acquired the nonobvious intuition that ggplot objects are hierarchical and are created by concatenating layers, you now have to break the analogy again to fully understand what + is doing!</p>
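<p>One rough way to see that asymmetry is to count layers before and after each +. This sketch assumes a current ggplot2 release and pokes at the layers and coordinates fields of the plot object, which are internal details and may change between versions. As before, geom_col() stands in for the bar layer.</p>
<pre>library(ggplot2)

zz &lt;- data.frame(cat = c("a", "b"), val = c(5, 3))

base       &lt;- ggplot(zz, aes(x = "", y = val, fill = cat))
with_layer &lt;- base + geom_col(width = 1)
with_coord &lt;- with_layer + coord_polar("y")

length(base$layers)        # 0: no layers yet
length(with_layer$layers)  # 1: "+ geom_col()" appended a layer
length(with_coord$layers)  # 1: "+ coord_polar()" did not add a layer

# Instead, the coordinate system on the plot object itself was replaced.
inherits(with_layer$coordinates, "CoordPolar")
inherits(with_coord$coordinates, "CoordPolar")</pre>
<p>So the same operator sometimes appends to a list of layers and sometimes overwrites a property of the plot as a whole, which is exactly the part that trips up the layers-only mental model.</p>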
<p>There is a way to partially see the structure directly, but it&#8217;s not well thought-out from the point of view of someone trying to learn how to use the package. The summary() method on ggplot objects tells you about things you didn&#8217;t specify (faceting?), it&#8217;s incomplete, and it doesn&#8217;t map well to the R syntax. If something in your plot isn&#8217;t working the way you want it to, summary() won&#8217;t help you.</p>
<pre>&gt; summary(pp)
mapping:  x = , y = val, fill = cat
faceting: facet_grid(. ~ ., FALSE)
-----------------------------------
geom_bar:
stat_bin: width = 1
position_stack: (width = NULL, height = NULL)</pre>
<p>Another shortcut that leads to conceptual problems for ggplot beginners is the use of qplot(). The qplot() function is a wrapper around ggplot(). Unlike ggplot(), you can give qplot() data that is not in the form of a data.frame, and the syntax is somewhat different. There&#8217;s nothing wrong with some syntactic sugar to make life easier, but in this case, learning ggplot by starting with qplot is like trying to learn a foreign language by starting with contractions and slang. You may be able to say a few essential things on your vacation, but you won&#8217;t be able to creatively construct new sentences as new situations arise. The brilliance of the Grammar of Graphics is exactly that it&#8217;s a grammar &#8212; you can construct new graphs and new types of graphs as new situations arise! But tutorials that start with qplot, with <a href="http://had.co.nz/ggplot2/book/" target="_blank">the ggplot book</a> an unfortunate (but in other ways excellent) example, send their learners down a linguistic garden path. To fully use the power of the system requires unlearning the conceptual structures that map the slang to charts on a screen, and starting over with learning the new, more powerful ggplot() grammar and hierarchical representations.</p>
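<p>As a small illustration of the difference, here is the same scatterplot written both ways, using R&#8217;s built-in mtcars data (my choice of example, not one taken from the book). The qplot() call is left commented out because, as far as I know, it is deprecated in recent ggplot2 releases.</p>
<pre>library(ggplot2)

# The shortcut: terse, but it hides the mapping and the layer.
# qplot(wt, mpg, data = mtcars)

# The full grammar: data, aesthetic mapping, and one explicit point layer.
p &lt;- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()

# New "sentences" come from adding pieces of the grammar:
# a fitted line in each panel, and one panel per cylinder count.
p_ext &lt;- p +
  geom_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(~ cyl)

print(p_ext)</pre>
<p>The second plot is the kind of thing that is awkward to reach from the shortcut syntax, but falls out naturally once the data, mapping, and layers are spelled out.</p>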
<p>I&#8217;d like to conclude this overlong rant with two notes. First, just today <a href="http://pleasescoopme.com/2010/03/07/jjplot-yet-another-plotting-library-for-r/" target="_blank">a new graphics package for R was introduced</a>. <a href="http://code.google.com/p/jjplot/" target="_blank">jjplot</a> uses many of the ideas of the Grammar of Graphics and ggplot2, but seems to avoid at least a few of the conceptual problems. The + operator is not overloaded in conceptually confusing ways, and there is no distracting qplot function to mislead new users. Additionally, a quick look at the source code finds it much, much simpler than ggplot2&#8217;s source, which will likely lead to a more active base of contributors. I look forward to trying jjplot and watching its continuing development, and hope the authors learn from both the remarkable successes and frustrating failures of ggplot. Second, I use ggplot extensively in my work. It&#8217;s simply the best available tool for quickly generating elegant graphs of data in R, especially if that generation needs to happen automatically in code. Hadley Wickham deserves extensive praise for the amount of effort he has put into developing and popularizing the Grammar of Graphics. If you want to be maximally effective when visualizing data in R, take the time to learn ggplot2, but do so while keeping in mind that the learning process will be easiest if you skip qplot and other shortcuts, think hierarchically, and prepare for some frustration. Fortunately, the support communities on the <a href="http://groups.google.com/group/ggplot2" target="_blank">ggplot mailing list</a> and <a href="http://stackoverflow.com/questions/tagged/ggplot2" target="_blank">Stack Overflow</a> are extremely helpful, as is Hadley himself.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.harlan.harris.name/2010/03/ggplot-and-concepts-whats-right-and-whats-wrong/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
	</channel>
</rss>

