9 Reflections from research compendium and test-run
After completing the reproduction and evaluation, the final stage involved organising the reproduction/ folder into a mini “research compendium”, and having a second STARS team member perform a test-run of it.
9.1 Reflections related to compendium
Organisation:
- There is no one right way to organise a compendium. If the existing structure was already pretty good, I left it as it was (e.g. Kim et al. 2021). However, for lots of them, I did reorganise into folders such as `scripts/`, `inputs/` and `outputs/`.
Tests:
- When creating the tests, since many of the models had long run times, my intention was often just to create a “mini-run” of the model that allows someone to quickly and easily check whether they can run some aspect of the model (without needing to attempt a full run with a long run time).
- The tests created provide examples of how to set up tests in Python and R, doing something basic like comparing dataframes or CSV files between runs (see the sketch after this list). It took a little bit of time initially to get the tests set up and working as I wanted, but once figured out for one study, I could reuse the set-up like a template for the next.
- Sometimes I had to amend the model to be able to run the tests - e.g. for Wood et al. 2021, the test wouldn’t work if the model was run with parallel processing.
- There was also an example (Johnson et al. 2021) where I couldn’t get the model to work correctly within a test (despite no visible errors), and so I went with the solution of making a normal .R script that operates like a test but without the `testthat` package.
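To illustrate the general pattern, below is a minimal sketch of this kind of “mini-run” back-test in Python with pytest and pandas. The module name, function, parameters and file paths are hypothetical stand-ins, not taken from any specific study’s code.

```python
"""Minimal sketch of a "mini-run" back-test using pytest and pandas.

The model import, parameters and file paths are hypothetical - the idea is just
to run a small, quick version of the model and compare its output against
results saved from a previous run.
"""
import pandas as pd
from pandas.testing import assert_frame_equal

from scripts.model import run_model  # hypothetical import of the study's model code


def test_mini_run():
    """Run a shortened version of the model and compare against saved results."""
    # A short run length and a fixed seed keep the test quick and reproducible
    results = run_model(run_length=10, seed=42)

    # Expected results generated previously with the same "mini-run" settings
    expected = pd.read_csv("tests/exp_results/mini_run.csv")

    assert_frame_equal(results.reset_index(drop=True),
                       expected.reset_index(drop=True))
```

The same idea carries over to R, where `testthat::expect_equal()` can compare a data frame from a short run against expected results read back in from a CSV.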
GitHub Actions:
- I wasn’t able to successfully build a Quarto book using GitHub Actions when it required both Python and R (in which case, I had to build the book locally and push it from there).
Docker:
- This sometimes took quite a lot of troubleshooting and could be fairly tricky. However, once resolved for each set-up (Python, R, and later an older version of Python), I could then reuse previous Dockerfiles like a template (although some troubleshooting was still sometimes needed).
9.2 Reflections related to test-run
The test-run stage after making the compendium helped us to spot run issues that I hadn’t noticed, mainly because I had not completely tested absolutely everything before pushing. For example:
- Shoaib and Ramamohan 2022 - there was an authentication error when trying to pull the Docker image from the GitHub Container Registry, as I was missing some of the steps related to personal access tokens.
- Hernandez et al. 2015 - though the test worked locally, it did not work in Docker. This was related to imports, and was resolved by adding `__init__.py` files (see the sketch after this list), but I hadn’t noticed as I hadn’t checked the test in Docker.
- Lim et al. 2020 - one of the files failed to run as I hadn’t uploaded some of the required data.
This shows the importance of having someone else check things - it is easy to leave in small mistakes that mean the code doesn’t run. In general, having someone else check your code like this is a really handy practice, if possible.
The test-run stage also helped highlight things that weren’t clear. For example, I had a test that took a few minutes to run, and whilst it runs the console is blank. The second team member (Tom) was unsure if this was an error - so I then added a section to the README that explained what to expect to see when running those tests.
It also helped highlight likely errors people will encounter - for example, not having a particular environment active, or not being in the right folder when trying to run something - and so I made sure to clarify these in the README.
When running the test-runs on a fresh machine, Tom often found there were operating system dependencies he had to install for R. There were also examples like Huang et al. 2019, where Tom tried to run the model on his virtual machine, but that only allocates 4GB RAM while the model used 8GB RAM, so he had to use a different machine - hence also helping to highlight memory requirements.