The below is a public version of a post originally posted on an internal blog at the Education Advisory Board (EAB), my current employer. We don’t yet have a public tech blog, but I got permission to edit and post it here, along with the referenced code.
Data Science teams get asked to do a lot of different sorts of things. Some of what the team I’m part of builds is enterprise-scale predictive analytics, such as the Student Risk Model that’s part of the Student Success Collaborative. That’s essentially software development with a statistical twist and a machine-learning core. Sometimes we get asked to do quick-and-dirty, one-off work to answer a research question; we have a variety of tools and processes for that. But there’s a third category that I want to focus on: frequently requested, slightly different reports.
what is it
There’s a relatively new theme in the scientific research community called reproducible research. Briefly, the idea is that it should be possible to re-do all steps after data collection automatically, including data cleaning and reformatting, statistical analyses, and even the actual generation of a camera-ready report with charts, graphs, and tables. This means that if you realized that, say, one data point in your analysis was bogus and needed to be removed, you could remove that data point, press a button, and in a minute or two have a shiny new PDF with all of the results automatically updated.
This type of reproducible research has been around for a while, although it’s having a recent resurgence in part due to the so-called “statistical crisis”. The R (and S) statistical programming languages have supported LaTeX, the scientific document creation/typesetting tool, for many years. Using a tool called Sweave, a researcher “weaves” chunks of text and chunks of R code together. The document is then “executed”: the R code chunks are run and the results are converted into a single LaTeX document, which is then compiled into a PDF or similar. The code can generate charts and tables, so no manual effort is needed to rebuild a camera-ready document.
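To make this concrete, here’s a minimal Sweave (.Rnw) document as a sketch; the chunk name, data, and text are illustrative, not taken from our reports:

```latex
\documentclass{article}
\begin{document}

<<histogram, fig=TRUE, echo=FALSE>>=
x <- rnorm(100)  # stand-in for real measurements from data collection
hist(x, main = "Distribution of measurements")
@

We collected \Sexpr{length(x)} measurements with a mean of
\Sexpr{round(mean(x), 2)}.

\end{document}
```

Running `Sweave("report.Rnw")` in R executes the chunk, writes the figure to disk, and substitutes the `\Sexpr{}` values, producing a plain LaTeX file ready to compile to PDF.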
This is great: a huge step forward toward validating often tricky and complex statistical analyses. If you’re writing a conference paper on, say, a biomedical experiment, a reproducible process can drastically improve your confidence in your work. But data scientists often have to generate this sort of thing repeatedly, from different sources of data or with different parameters, and they have to do so efficiently.
Parameterizable reproducible research, then, is a variant of reproducible research tools and workflows where it is easy to specify data sources, options, and parameters to a standardized analytical report, even one that includes statistical or predictive analyses, data manipulation, and graph generation. The report can be emailed or otherwise sent to people, and doesn’t seem as public as, say, a web-based app developed in Shiny or another technology. This isn’t a huge breakthrough or anything, but it’s a useful pattern that seems worth sharing.
why do it
Our current best example of this structure is a report we generate that goes into per-customer technical details about their customized statistical model. It compares the version of the model in production with several “toy” models, along with a lot of explanation and education. The process of generating the report includes pulling data from remote servers, running statistical analyses, computing various metrics on the statistical models, then interleaving standard text, parameterized text (e.g., the name of the customer), and the results of computation, as well as auto-generated charts and graphs.
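As a sketch of that interleaving, a parameterized Rmarkdown document declares its parameters in the YAML header and references them in both text and code. The field names and the `fetch_customer_data` helper here are hypothetical, not our actual report:

````markdown
---
title: "Risk Model Report"
output: html_document
params:
  cust_name: "State College"
  cust_id: "9999"
---

## Production model overview for `r params$cust_name`

```{r pull-data, echo=FALSE}
# Hypothetical helper: pull this customer's data from the remote source
dat <- fetch_customer_data(params$cust_id)
```

This report compares the production model for `r params$cust_name`
(customer `r params$cust_id`) against several simplified "toy" models.
````

Everything after the header is ordinary Rmarkdown: the inline `` `r ...` `` expressions become parameterized text, and the chunks become tables and charts.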
Doing this manually would mean following a 10-to-20-step checklist, would take a substantial amount of effort (at least 30 minutes per report), and would be error-prone.
Our team now generates these reports in about 3 minutes per member, based on a simple configuration file. Here’s what a configuration file might look like:
user = "hharris"
data_source = "SOURCENAMEHERE"
data_location = "where in the source to pull data from"
cust_id = "9999"
cust_name = "State College"
That’s it. And here’s the command that generates the report:
./risk-model-report.R --verbose --config myconfig.R
That’s it! The result is a standalone HTML file customized for that specific customer, ready to be sent.
Should you use this pattern? If you use R to create reports, you should definitely be using Rmarkdown. And if you have to generate reports repeatedly, with subtle variations each time, you should strongly consider this pattern or an equivalent!
how we do it
We use a relatively straightforward pattern available in the R programming language. As mentioned above, the standard reproducible research workflow is to create one document per analysis. We wrap that document in a separate script which is responsible for reading and validating a configuration file, then building the parameterized document with the appropriate configuration variables. The build process is multi-step under the hood, but most of the heavy lifting is performed by the Rmarkdown package, which runs Pandoc, a cross-platform system for converting document formats.
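A minimal sketch of such a wrapper in R follows. The file names, config variable names, and validation step are hypothetical; the released code on GitHub is the real reference:

```r
#!/usr/bin/env Rscript
# Sketch: read a config file, validate it, then render the report.
library(rmarkdown)

args <- commandArgs(trailingOnly = TRUE)
config_file <- args[[length(args)]]            # e.g. "myconfig.R"

# Source the config into its own environment so its variables stay contained.
config <- new.env()
sys.source(config_file, envir = config)

# Minimal validation: fail fast if a required field is missing.
required <- c("user", "data_source", "data_location", "cust_id", "cust_name")
missing <- setdiff(required, ls(config))
if (length(missing) > 0) {
  stop("Missing config fields: ", paste(missing, collapse = ", "))
}

# rmarkdown::render() hands the values to the .Rmd via its `params` YAML
# block, then runs Pandoc to build the final standalone HTML file.
render("risk-model-report.Rmd",
       output_file = paste0("report-", get("cust_id", envir = config), ".html"),
       params = as.list(config))
```

The key design choice is that the report document itself never reads the config: it only sees `params`, so the same `.Rmd` works unchanged for every customer.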
Here’s what the process looks like:
For an example, see this HTML document (it should work in any web browser). Note that the image is embedded, meaning that the HTML document stands alone. It’s relatively easy to generate other formats, such as PDF or even Word, instead.
The document was generated by R and Rmarkdown code we’ve released into the public domain, hosted on GitHub. If this pattern is useful to you, please make use of it and adapt it!
Over time, we expect to use variants of this pattern to standardize a variety of reports, internally-facing as well as customer-facing. If you do something similar, or better, I’d love to hear about it!