Sansan Tech Blog

Sharing the technology, design, and product management insights of the members who support Sansan's product development

Economics Meets Data Science: Reproducible Research with R


Hi, I’m Juan. Econ researcher and web development geek from the Social Sciences Group at Sansan’s R&D Department.

If you have followed us recently on social media, you may know that we have been thinking a lot about coding practices and reproducible research in Economics. In fact, we released a three-part series on the subject, covering version control with Git, computational reproducibility with Docker, and agile management of research projects with tools such as Jira (available in Japanese).


note.com

The Why

You might wonder why we have been thinking about these topics. First of all, we're not the only ones talking about coding practices and reproducibility lately.

A widely shared tweet on the subject spawned many discussions among economists about how the field could benefit (or not) from better coding practices. But the reproducibility discussion has become more prevalent across all social sciences in recent years, as data analysis tools become more accessible outside Web Development and Data Science, and more complex data becomes available for analysis. In our case, we thought we needed to sit down and think of a way to improve our productivity, the quality of our reviews, and the reproducibility of our reports.

Papers and taiyaki aren't too different if you think about it. Leave the shape to the template, concentrate on cooking the perfect filling. (Licensed under GFDL because taiyaki.)

So we made some Guidelines

Although we employ Python for many tasks, we rely on R when it comes to statistical analysis. So last quarter, a few of us, mostly heavy R users from several academic fields, set a goal to come up with guidelines for improving our research code. Here's what we came up with:

Coding Practices

This was easy to decide: we agreed to adopt the tidyverse Style Guide, augmented by Google's R Style Guide. It was harder to decide whether to enforce the style or just recommend it. Enforcing it could involve anything from automatically fixing the code with a library such as styler in a pre-commit hook to checking it at the GitHub Actions level. However, that felt too strict while we are still getting used to our new research process, so we opted to just recommend the coding style for now. We encourage the use of the lintr package to highlight coding style violations. In the future, we're thinking about including it in our Continuous Integration (CI) pipeline, so that the style check can serve as one more aspect for the reviewer to consider when evaluating someone's code.

style.tidyverse.org
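
In practice, checking and fixing style boils down to two calls. Here is a minimal sketch using lintr and styler (both functions are part of each package's actual API; the directory path is illustrative):

# Flag tidyverse style violations in every .R file under the
# given directory, without modifying anything.
lintr::lint_dir("myproject/R")

# Optionally, rewrite the files in place to follow the tidyverse style.
styler::style_dir("myproject/R")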

Dependency Management

Another big pain point when trying to reproduce someone else's code is installing dependencies. It's the part of code review that nobody wants to deal with. It simply sucks. We address it in two ways:

Docker

We encourage the use of Docker on each project by including the corresponding Dockerfile and a docker-compose.yml file. This means that whatever your research project is, it will most likely run on anyone's computer, as long as they can use Docker. They don't even need to have R installed!
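
To give you an idea, a project's Dockerfile can be as small as the following sketch. This is not our actual template; it just assumes the common rocker/r-ver base image and a renv lockfile in the project root:

# Base image with a pinned R version (illustrative tag).
FROM rocker/r-ver:4.2.0

# Install renv so the container can restore the locked dependencies.
RUN R -e "install.packages('renv')"

WORKDIR /home/myproject
COPY . .

# Install the exact package versions recorded in renv.lock.
RUN R -e "renv::restore()"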

renv

We require the use of renv, a popular dependency management package for R. This has two benefits. First, it caches dependencies in your local environment while keeping it clean: by mapping the renv cache directory into the project's Docker container through a volume, we don't need to install dependencies every time the image is rebuilt, which can dramatically reduce build time and image size. Second, renv's lockfile allows us to manage the dependencies transparently with version control.

rstudio.github.io

If you need more information about using renv with Docker, the renv documentation covers the topic in detail.
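
Day to day, the renv workflow comes down to three calls (all part of renv's actual API; the comments describe the typical flow):

# Set up a project-local library and create renv.lock.
renv::init()

# After installing or upgrading packages, record the exact versions.
renv::snapshot()

# On another machine (or inside the container), reinstall the locked
# versions. The cache location can be redirected with the
# RENV_PATHS_CACHE environment variable, which is what the Docker
# volume mapping relies on.
renv::restore()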

Pipeline Management

Most research projects require processing big data. Often, this task is computationally expensive and may take many hours to complete. In Python, there are tools such as luigi and gokart that allow you to split pipelines into smaller tasks, run them, and cache the results, so that you can restart the whole thing from the last successful task if it fails or if something in the pipeline changes. In R, a great alternative is the targets package, which allows you to do exactly the same, and even visualize your pipeline.
books.ropensci.org
This is great because reproducing the results is reduced to simply running the pipeline using targets, which is the same task regardless of the project.
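
To make this concrete, here is a minimal sketch of a _targets.R file. tar_option_set(), tar_target(), and the list-of-targets convention are real parts of the targets API; the helper functions and file names are hypothetical:

# _targets.R
library(targets)

# Helper functions such as read_raw_data() and fit_model()
# live in R/functions.R.
source("R/functions.R")

# Packages available to every target.
tar_option_set(packages = c("dplyr"))

list(
  tar_target(raw_data, read_raw_data("data/raw.csv")),  # hypothetical helper
  tar_target(model, fit_model(raw_data)),               # hypothetical helper
  tar_target(results, summarize_model(model))           # hypothetical helper
)

Running targets::tar_make() then executes only the targets whose inputs have changed, and targets::tar_visnetwork() draws the dependency graph.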

Project Structure

If our strategy depended on everyone creating the right project structure by hand, we would certainly end up with a lot of slightly different research projects, which amounts to no standardization at all. That's why we provide a way to create a new research project with minimal effort: we created a cookiecutter template which can be used to generate new, empty research projects that are fully compliant with our guidelines.
cookiecutter.readthedocs.io

A new project called myproject has the following structure:

myproject
├── Dockerfile
├── docker-compose.yml
└── myproject
    ├── R
    │   └── functions.R
    ├── _targets.R
    ├── data
    ├── renv.lock
    ├── report.Rmd
    ├── reproduce.R
    ├── sql
    ├── test
    └── myproject.Rproj

This scaffold contains everything needed to start working on the project right away. So, to summarize, the process is very simple:

  1. Use cookiecutter to set up your research project.
  2. Load the project with RStudio or Visual Studio Code.
  3. Write the necessary code.
  4. Create a targets pipeline in the _targets.R file.
  5. Run the reproduce.R file.

That should run the whole pipeline and knit your report. To reproduce someone else's report, run the reproduce.R file and check that the results match your expectations. Very simple.
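
For illustration, reproduce.R can be as small as the following sketch (the file generated by our template may differ):

# reproduce.R: restore dependencies, then run the full pipeline.
renv::restore()      # install the package versions pinned in renv.lock
targets::tar_make()  # build every outdated target, including the report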

Of course, this assumes the necessary .gitignore files exist; we already handle that through the way we organize the code in version control.

If you like RStudio, you can use docker-compose up -d to start a Docker container in the background with your project's code already loaded. You can then use the browser to start working on your code. Alternatively, you can open the project with Visual Studio Code. We haven't done this yet, but it is possible to include a devcontainer.json file that allows you to work on your code within a Docker container based on your project's image. In both cases, changes to your code are reflected in your local environment, so you can easily commit them to version control as usual.

code.visualstudio.com

When writing the code to make your pipeline work, you have two options. You can create R files in the R directory and source them when building your pipeline. Or, you can turn your project into an R package using usethis::create_package('.'), which gives you access to the testing and checking functions included in devtools.
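
Sketched out, the package route looks like this (all three functions are real usethis and devtools API):

# Add a DESCRIPTION file and package scaffolding to the current project.
usethis::create_package(".")

# Run the project's unit tests.
devtools::test()

# Run the full R CMD check suite on the package.
devtools::check()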

Note that this workflow is inspired by MLOps practices, and most components have direct equivalents in typical Python pipelines. We tried to make it so that Python users can easily understand the R workflow without much knowledge of R. This facilitates collaboration with researchers outside the Social Sciences.

It’s not much, but it’s honest work

The whole focus of creating our research guidelines for R was to reduce the marginal cost of an additional project. We are now in the phase of evaluating this new setup and seeing how much we can improve our productivity. After having reviewed a couple of projects using the new workflow, I'm convinced that we're heading in the right direction. I no longer have to spend time figuring out the project's structure, so I can spend more time checking the important parts.
