The below is a public version of a post originally posted on an internal blog at the Education Advisory Board (EAB), my current employer. We don’t yet have a public tech blog, but I got permission to edit and post it here, along with the referenced code.
Data Science teams get asked to do a lot of different sorts of things. Some of what the team that I’m part of builds is enterprise-scale predictive analytics, such as the Student Risk Model that’s part of the Student Success Collaborative. That’s basically software development with a statistical twist and machine-learning core. Sometimes we get asked to do quick-and-dirty, one-off sorts of things, to answer a research question. We have a variety of tools and processes for that task. But there’s a third category that I want to focus on – frequently requested but slightly-different reports.
what is it
There’s a relatively new theme in the scientific research community called reproducible research. Briefly, the idea is that it should be possible to re-do all steps after data collection automatically, including data cleaning and reformatting, statistical analyses, and even the actual generation of a camera-ready report with charts, graphs, and tables. This means that if you realized that, say, one data point in your analysis was bogus and needed to be removed, you could remove that data point, press a button, and in a minute or two have a shiny new PDF with all of the results automatically updated.
This type of reproducible research has been around for a while, although it’s having a recent resurgence in part due to the so-called “statistical crisis“. The R (and S) statistical programming languages have supported LaTeX, the scientific document creation/typesetting tool, for many years. Using a tool called Sweave, a researcher “weaves” chunks of text and chunks of R code together. The document is then “executed”, where the R code chunks are executed and the results are converted into a single LaTeX document, which is then compiled into a PDF or similar. The code can generate charts and tables, so no manual effort is needed to rebuild a camera-ready document.
This is great, a huge step forward towards validation of often tricky and complex statistical analyses. If you’re writing a conference paper on, say, a biomedical experiment, a reproducible process can drastically improve your ability to be confident in your work. But data scientists often have to generate this sort of thing repeatedly, from different sources of data or with different parameters. And they have to do so efficiently.
Parameterizable reproducible research, then, is a variant of reproducible research tools and workflows where it is easy to specify data sources, options, and parameters to a standardized analytical report, even one that includes statistical or predictive analyses, data manipulation, and graph generation. The report can be emailed or otherwise sent to people, and doesn’t seem as public as, say, a web-based app developed in Shiny or another technology. This isn’t a huge breakthrough or anything, but it’s a useful pattern that seems worth sharing.