Parameterizable Reproducible Research

November 20th, 2014

This is a public version of a post originally published on an internal blog at the Education Advisory Board (EAB), my current employer. We don’t yet have a public tech blog, but I got permission to edit and post it here, along with the referenced code.

Data Science teams get asked to do a lot of different sorts of things. Some of what the team I’m part of builds is enterprise-scale predictive analytics, such as the Student Risk Model that’s part of the Student Success Collaborative. That’s basically software development with a statistical twist and a machine-learning core. Sometimes we get asked to do quick-and-dirty, one-off work to answer a research question, and we have a variety of tools and processes for that task. But there’s a third category that I want to focus on: frequently requested but slightly different reports.

What is it?

There’s a relatively new theme in the scientific research community called reproducible research. Briefly, the idea is that it should be possible to re-do all steps after data collection automatically, including data cleaning and reformatting, statistical analyses, and even the actual generation of a camera-ready report with charts, graphs, and tables. This means that if you realized that, say, one data point in your analysis was bogus and needed to be removed, you could remove that data point, press a button, and in a minute or two have a shiny new PDF with all of the results automatically updated.

This type of reproducible research has been around for a while, although it’s having a recent resurgence, in part due to the so-called “statistical crisis”. The R (and S) statistical programming languages have supported LaTeX, the scientific document creation/typesetting tool, for many years. Using a tool called Sweave, a researcher “weaves” chunks of text and chunks of R code together. The document is then “executed”: the R code chunks run, and their results are merged into a single LaTeX document, which is then compiled into a PDF or similar. The code can generate charts and tables, so no manual effort is needed to rebuild a camera-ready document.
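
For a flavor of how this works, here’s a minimal sketch of an Sweave document (using R’s built-in mtcars data as a stand-in for real data):

    \documentclass{article}
    \begin{document}

    Our data set has \Sexpr{nrow(mtcars)} observations.

    <<mpg-plot, fig=TRUE, echo=FALSE>>=
    # this chunk runs when the document is woven; the resulting
    # figure is embedded in the final PDF automatically
    plot(mtcars$wt, mtcars$mpg,
         xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
    @

    \end{document}

Running R CMD Sweave on the file executes the chunks and produces a .tex file; pdflatex then turns that into the finished PDF. Fix a bogus data point, re-run those two commands, and every number and figure updates.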

This is great: a huge step toward validating often tricky and complex statistical analyses. If you’re writing a conference paper on, say, a biomedical experiment, a reproducible process can drastically improve your confidence in your work. But data scientists often have to generate this sort of thing repeatedly, from different sources of data or with different parameters. And they have to do so efficiently.

Parameterizable reproducible research, then, is a variant of reproducible research tools and workflows where it is easy to specify data sources, options, and parameters for a standardized analytical report, even one that includes statistical or predictive analyses, data manipulation, and graph generation. The report can be emailed or otherwise sent to people, and doesn’t feel as public as, say, a web-based app developed in Shiny or another technology. This isn’t a huge breakthrough or anything, but it’s a useful pattern that seems worth sharing.
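
Here’s a minimal sketch of the pattern in R, using knitr; all the file, column, and parameter names are invented for illustration, not the actual EAB code. A small driver script sets the parameters and then knits the same template once per data source:

    # driver.R -- hypothetical sketch: knit one standardized .Rnw
    # template repeatedly, with different data sources and parameters
    library(knitr)

    run_report <- function(csv_path, cutoff, out_stem) {
      # the template's chunks read `dat` and `cutoff` from this environment
      params <- new.env()
      params$dat    <- read.csv(csv_path)
      params$cutoff <- cutoff
      knit2pdf("report_template.Rnw",
               output = paste0(out_stem, ".tex"),
               envir  = params)
    }

    # one template, several camera-ready PDFs
    run_report("school_a.csv", cutoff = 0.5, out_stem = "report_a")
    run_report("school_b.csv", cutoff = 0.7, out_stem = "report_b")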


Inauthenticity

September 7th, 2014

Let me unpack that a bit…

Hugh and Crye t-shirt

Recently, Hugh & Crye, a DC-based men’s clothing firm with a novel take on sizing, ran a Kickstarter campaign for their new line of fitted t-shirts. What the hell? H&C has been around for about 5 years, and based on their product growth and hiring seems to be doing quite well. I like their stuff. Why do they need a Kickstarter? The original goal of Kickstarter was to “kickstart” new products by providing crowdsourced seed funding so that you (you!) can ensure that a great idea gets off the ground. And if a project doesn’t make its goals, no harm done, and no money wasted. A fantastic example is the Oculus Rift, a Kickstarted virtual reality rig that is now a subsidiary of Facebook. Kickstarting a project is a rather labor-intensive alternative to trying to get a bank loan, or maxing out your credit cards, but with much less risk. It’s a very community-driven, authentic way of getting support for a new venture, moving it from the prototype phase to the initial manufacturing round.


INFORMS Business Analytics 2014 Blog Posts

August 2nd, 2014

Earlier this year, I attended the INFORMS Conference on Business Analytics & Operations Research, in Boston. I was asked beforehand if I wanted to be a conference blogger, and for some reason I said I would. This meant I was able to publish posts on the conference’s WordPress web site, and was also obliged to do so!

Here are the five posts that I wrote, along with an excerpt from each. Please click through to read the full pieces:

Operations Research, from the point of view of Data Science

  • more insight, less action — deliverables tend towards predictions and storytelling, versus formal optimization
  • more openness, less big iron — open source software leads to a low-cost, highly flexible approach
  • more scruffy, less neat — data science technologies often come from black-box statistical models, vs. domain-based theory
  • more velocity, smaller projects — a hundred $10K projects beats one $1M project
  • more science, less engineering — both practitioners and methods have different backgrounds
  • more hipsters, less suits — stronger connections to the tech industry than to the boardroom
  • more rockstars, less teams — one person can now (roughly) do everything, in simple cases, for better or worse

What is a “Data Product”?

DJ Patil says “a data product is a product that facilitates an end goal through the use of data.” So, it’s not just an analysis, or a recommendation to executives, or an insight that leads to an improvement to a business process. It’s a visible component of a system. LinkedIn’s People You May Know is viewed by many millions of customers, and it’s based on the complex interactions of the customers themselves.

Healthcare (and not Education) at INFORMS Analytics

[A]s a DC resident, we often hear of “Healthcare and Education” as a linked pair of industries. Both are systems focused on social good, with intertwined government, nonprofit, and for-profit entities, highly distributed management, and (reportedly) huge opportunities for improvement. Aside from MIT Leaders for Global Operations winning the Smith Prize (and a number of shoutouts to academic partners and mentors), there was not a peep from the education sector at tonight’s awards ceremony. Is education, and particularly K-12 and postsecondary education, not amenable to OR techniques or solutions?

What’s Changed at the Practice/Analytics Conference?

In 2011, almost every talk seemed to me to be from a Fortune 500 company, or a large nonprofit, or a consulting firm advising a Fortune 500 company or a large nonprofit. Entrepreneurship around analytics was barely to be seen. This year, there are at least a few talks about Hadoop and iPhone apps and more. Has the cost of deploying advanced analytics substantially dropped?

Why OR/Analytics People Need to Know About Database Technology

It’s worthwhile learning a bit about databases, even if you have no decision-making authority in your organization, and don’t feel like becoming a database administrator (good call). But by getting involved early in the data-collection process, when IT folks are sitting around a table arguing about platform questions, you can get a word in occasionally about the things that matter for analytics — collecting all the data, storing it in a way friendly to later analytics, and so forth.

All in all, I enjoyed blogging the conference, and recommend the practice to others! It’s a great way to organize your thoughts and to summarize and synthesize your experiences.

Why a Data Community is Like a Music Scene — Resources

October 26th, 2013

On Monday, October 28th, 2013, I gave a 5-minute Ignite talk entitled “Why a Data Community is Like a Music Scene” at an event associated with the Strata conference. Here’s the video:

And here are the acknowledgements and references for the talk:

  • Data Community DC
  • How Music Works, by David Byrne
  • my slides for the Ignite talk
  • my blog post (written first)


Bikeshare hills, incentives, and rewards

June 23rd, 2013

A topographic map of Washington in 1791 by Don Alexander Hawkins. I live on the top edge of the map, on one of those hills.

I’m a generally happy user of DC’s Capital Bikeshare system — just renewed my annual membership today, in fact. But I don’t use it as much as I’d like to, for one critical reason: I live on top of a hill. Riders are happy to take bikes from the neighborhood to their jobs downhill, but are much less likely to ride them uphill. As a result, the bike racks in my neighborhood are frequently completely empty by 8:00 or 8:30am, despite the many stores and businesses in the area. The only days I can reliably take a bike into work are when I leave at 7:15 for 8:00 meetings, which is thankfully not too often. On several occasions I have looked at the handy real-time map of bikeshare bikes, only to observe that there are no bikes available within a 15-minute walk of my home!

What should Bikeshare do to solve this problem? Well, they already do one thing, which is that they hire people to put bicycles in the back of a big van, then drive them up the hill to rebalance the system. This works, but it’s expensive for the system, and it’s not very timely or efficient. In other transportation problems, incentives are used to balance demand. For instance, airlines and Amtrak use pricing to incentivize people with flexible schedules to take off-peak trips. But that won’t work for Bikeshare, as most rides are free. (I pay $75/year, but all trips of 30 minutes or less are free. My rides are mostly 15-25 minutes long.) So people happily ride downhill to their downtown jobs in the mornings, but don’t ride uphill to their reverse-commute jobs, and don’t ride uphill home in the evenings as often, either. The end result is unhappy customers and excessive costs for the Bikeshare system.


Rough map of a possible incentive line for North-Central DC.

So if you can’t give people the usual financial incentives to drop off bikes in the Columbia Heights rack at 8am, what can you do to reduce the need for rebalancing and provide reasons for people to want to help solve Bikeshare’s problem? I think the answer is swag. Imagine that there were lines on the Bikeshare map. Every time you crossed the line going in an uphill direction (reducing the need for rebalancing), you’d earn some points. If you earned enough points, you could redeem them for Bikeshare-branded, limited-edition swag. Imagine a t-shirt in official CaBi colors that said “I bike up hills”, available only through this point system. Who wouldn’t want that?

It’s easy for Bikeshare to figure this out, as they know exactly where you picked up each bike, and where you dropped it off. Determining whether you crossed a line, and thus biked uphill, is easy. And in addition to making people excited about biking up hills, you get them wearing branded items of clothing, which can only help market the system more broadly. They already sell swag through a Cafepress shop, so much of the infrastructure is in place. It’s a win-win.
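
Here’s a hypothetical sketch of that bookkeeping in R (the station names and the line itself are invented): tag each station as above or below the incentive line, and a trip earns a point only when it starts below the line and ends above it.

    # hypothetical sketch: award a point when a trip crosses the
    # incentive line in the uphill direction
    stations <- data.frame(
      id         = c("dupont", "u_st", "columbia_heights"),
      above_line = c(FALSE, FALSE, TRUE)
    )

    uphill_points <- function(trips, stations) {
      start <- stations$above_line[match(trips$start_id, stations$id)]
      end   <- stations$above_line[match(trips$end_id,   stations$id)]
      # TRUE only for trips that start below the line and end above it
      as.integer(!start & end)
    }

    trips <- data.frame(start_id = c("dupont", "columbia_heights"),
                        end_id   = c("columbia_heights", "dupont"))
    uphill_points(trips, stations)  # 1 0: only the uphill trip earns a point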

Bikeshare people, if you read this and think it’s a good idea, please run with it!


What the swag might look like.

 

 

On .name and third-level domains

May 15th, 2013

And, we’re back! After being off-line for several weeks, this site is now live again! I can’t imagine you missed it.

Here’s what happened. Let’s start at the beginning. In 2003, ICANN added .name to the list of top-level domains (like .com, .edu, etc.). The idea was that individuals would use it for personal sites and email addresses. You can still do this, but (in case you haven’t noticed) it’s not very popular, and most domain name registrars don’t even sell .name addresses.

I purchased harlan.harris.name in 2003. Unlike .com addresses, you don’t generally buy second-level domains in .name, you buy third-level domains. (.name is the top-level domain, harris.name is the second-level domain, which you can’t buy, and harlan.harris.name is the third-level domain.) A cool feature is that if you buy a.b.name, you can get the email address a@b.name, not something like me@a.b.name (although you can set that up too). So my email address has been harlan@harris.name for ten years.

Fast forward to April 2013. I noticed that my personal web site (where you are now) had been replaced by a generic sales screen. You know, with a bunch of random keywords, a stock photo, and “buy this domain!” in big red print. Not good. At first I thought that my WordPress site (which hosts this blog) had been hacked, but no such luck. It turned out to be a convoluted mess of broken technology and confused customer support reps. The fortunate thing is that I don’t use this site extensively, and the problem with the web forwarding didn’t seem to affect my email forwarding, so I didn’t lose any email.

The simplified version of what happened is that the company I bought the domain from in 2003, PersonalNames, merged with a company called Dotster a year or two ago. They presumably merged their technical systems together, which makes sense. But for some reason they failed to properly set up a system for administering third-level .name domains. So my account never got properly transferred into their systems, and they stopped sending me notices about problems.

Although I still technically owned harlan.harris.name, I could no longer log in and administer it, and the redirection to this web site (at another company, HostGator) was reset at some point for still-unknown reasons.

It took a week and a dozen email messages and several hours on the phone for Dotster to figure out that yes, they owned this domain, but no, they didn’t have the technical chops to administer it.

I then set up an account with another company, eNom (nom, nom…), that does support third-level .name domains. Transferring the domain took another week and three attempts, due to errors on both sides. Add 48 hours for the DNS changes to propagate around the Internet, and as of yesterday I’m finally back online!

Except that although my email forwarding still works, I don’t yet have control over that, because Dotster seemingly neglected to transfer email forwarding rights at the same time as the rest of the domain. So if you need me tomorrow, I’ll be back on the phone with tech support.



More posts on the Data Community DC blog

February 21st, 2013

For those people (or, more likely, 0 or 1 persons) who follow this blog to catch up on my professional thoughts: I’ve been doing a little bit of writing on the Data Community DC blog. Here are all my posts over there: http://datacommunitydc.org/blog/author/harlan/. I’d definitely encourage you to read everyone else’s work on the DC2 blog too!

Among them are two titles of my own, and three of others’. There are also weekly round-up posts on data topics generally, and on data visualization specifically, as well as event previews and reviews.

 


Pretzel Whoopie Pies with Vanilla Stout Filling

October 22nd, 2012

My newish cooking club had a dinner yesterday with the theme American Beer. I was tasked with dessert, and came up with this recipe for Pretzel Whoopie Pies. They turned out extremely well, so I thought I’d share the recipe here.


Ingredients:

  • 2 egg yolks
  • 1/2 c minus 1 T sugar
  • 1 T light corn syrup
  • 1/2 c finely ground unsalted mini pretzels
  • 1/2 c cake flour
  • 1 t baking powder
  • 1/8 t salt
  • 4 T butter, softened
  • 1/3 c milk
  • kosher salt
  • 1 c stout beer (Breckenridge Vanilla Porter is excellent)
  • 4 T butter, softened
  • 8 oz powdered sugar

Recipe:

  1. Beat egg yolks, sugar, and corn syrup until lightened.
  2. Mix pretzel flour, cake flour, baking powder, and salt. Beat in butter and milk. Add to liquid mixture and mix thoroughly.
  3. Refrigerate dough for 30 minutes to hydrate evenly. Preheat oven to 350 F.
  4. Drop 12 evenly-shaped cookies onto a silpat-covered pan. A ring mold is helpful. Lightly sprinkle kosher salt over the tops.
  5. Bake about 14 minutes, until starting to brown around the edges, but still soft.
  6. Cool thoroughly on wire racks.
  7. Boil beer in a saucepan large enough to deal with foaming up, and reduce to 1/2 c. Cool to room temperature.
  8. Mix butter, powdered sugar, and reduced beer into a creamy, delicious frosting.
  9. Make sandwiches out of cookies and frosting.

Makes 6 whoopie pies.


Communication and the Data Scientist

September 23rd, 2012

I recently gave a presentation on communication issues around the terms “Data Science” and “Data Scientist”, based in part on a survey that I did with my Meetup colleagues Marck and Sean. The basic idea is that these new, extremely broad buzzwords have caused confusion, making it harder for people with skills and people with data to find each other and to communicate effectively about who does what and what appropriate expectations should be. The survey was an attempt to bring some clarity to the questions of who the people in this newly-reformulated community are, and how they view themselves and their skills. For more on the survey, see our post on the Data Community DC blog. Here’s the video of my presentation at DataGotham:

Integrating R with other systems

June 16th, 2012

I just returned from the useR! 2012 conference for developers and users of R. A common theme across many of the presentations was the integration of R-based statistical systems with other systems, be they other programming languages, web systems, or enterprise data systems. Some highlights for me were an update to Rserve that includes one-stop web services, and a presentation on ESB integration. Although I didn’t see it discussed, the new httr package for easier access to web services is another outstanding development in integrating R into large-scale systems.
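
As a quick taste of that last one (the endpoint here is entirely hypothetical), httr lets R act as a client of a JSON web service in a few lines:

    # minimal httr sketch: R as a client of a (hypothetical) web service
    library(httr)

    resp <- GET("http://example.com/api/v1/scores",
                query = list(student_id = 12345))
    stop_for_status(resp)                # fail loudly on HTTP errors
    dat <- content(resp, as = "parsed")  # parsed JSON as an R list
    str(dat)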

Coincidentally, just a week or so earlier I had given a short presentation to the local R Meetup entitled “Annotating Enterprise Data from an R Server.” The topic for the evening was “R in the Enterprise,” and others talked about generating large, automated reports with knitr, and using RPy2 to integrate R into a Python-based web system. I talked about my experiences building and deploying a predictive system, using the corporate database as the common link. Here are the slides:
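
To make that database-as-common-link pattern concrete, here’s a minimal, hypothetical sketch (the table, column, and file names are all invented, and SQLite stands in for the real enterprise database): an R process reads unscored records, applies a previously trained model, and writes predictions back for other systems to pick up.

    # hypothetical sketch: the corporate database as the integration point
    library(DBI)
    library(RSQLite)  # SQLite as a stand-in for an enterprise database

    con <- dbConnect(SQLite(), "enterprise.db")

    # pull records that other systems have inserted but R hasn't scored yet
    new_rows <- dbGetQuery(con, "SELECT id, x1, x2 FROM students
                                 WHERE id NOT IN (SELECT id FROM scores)")
    if (nrow(new_rows) > 0) {
      fit <- readRDS("model.rds")          # previously trained model
      new_rows$score <- predict(fit, newdata = new_rows)
      dbWriteTable(con, "scores", new_rows[, c("id", "score")],
                   append = TRUE)
    }
    dbDisconnect(con)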