Day 10 – Reproducing Huang et al. 2019

Note

Working on research compendium stage.

Untimed: Research compendium

Parallel processing

Tried adding parallel processing in model.R to speed it up

Add future.apply to the environment
plan(multisession, workers=max(availableCores()-5, 1))
future_lapply()
However, it took longer than usual! So I removed it

Reorganising

Moved scripts into a scripts/ folder
Moved help functions from reproduction.Rmd into seperate R script (primarily so can reuse in tests more easily)

Fix image size

Set ggsave() image width as realised it otherwise varied with window size when running

Tests

Create tests to check model results are consistent

Started with creating a basic test saving tempfile csv and loading it to compare to another dataframe
Then made a test with two example models being run for 3 replications and comparing results
Then, set up with two files, as testthat can run files in parallel, and configured parallel processing. This involved:
- Adding Config/testthat/parallel: true to DESCRIPTION
- Create project-specific environment file with nano reproduction/.Renviron and setting TESTTHAT_CPUS=4
Ran testthat::test_dir("tests"), although seemed to just run sequentially. Confirmed by checking testthat::isparallel() which returned FALSE.
Tried adding Config/testthat/start-first: shifts, model to DESCRIPTION and it ignored the order, so it appears the issue is it is not using info from the DESCRIPTION file
Checked version and it is correct for running in parallel (testthat>=3.0.0)
Tried instead running testthat::test_local(), and moving tests into a folder testthat/, and this returned an error Could not find a root 'DESCRIPTION' file that starts with '^Package' in /home/amy/Documents/stars/stars-reproduce-huang-2019/reproduction.
Changed DESCRIPTION to add Package and re-run - but this had error that installation of renv failed. Same error occurs if run testthat::test_dir(). It says to Try removing ‘/home/amy/.cache/R/renv/library/reproduction-0912b448/linux-ubuntu-jammy/R-4.4/x86_64-pc-linux-gnu/00LOCK-renv’. I deleted this file (navigated there than rm -r 00LOCK-renv) then re-ran. However, this kept getting the same error message with that same file being created.
Tried removing Package from DESCRIPTION and running testthat::test_dir("tests/testthat", load_package="none") - but that ignores the order in DESCRIPTION
Tried testthat::test_dir("tests/testthat", load_package="source") which had error that Field 'Version' not found. Once I had this and re-ran, it ran the tests in the specified order! From Config/testthat/start-first: shifts, model
I then add in Config/testthat/parallel: true and Config/testthat/edition: 3 but it had the same renv error as before
Then decided to just run without parallel for now, so removed those lines from DESCRIPTION, deleted the .Renviron file, and put tests in a single file

Package: huang2019
Version: 0.1
Config/testthat/start-first: shifts, model
Config/testthat/parallel: true
Config/testthat/edition: 3

Created function to simplify testing, then wrote tests fora selection of scenarios (not all scenarios, to minimise run time).
Test was failing with error of Length mismatch: comparison on first 2 components. I tried changing from expect_equal to using all.equal() and then expect_true(is_true()) on result. But this returned the same error!
I tried running everything manually in the console so I could inspect the dataframes myself.

file = "tests/testthat/expected_results/fig2_baseline.csv.gz"
exp <- as.data.frame(data.table::fread(file))
inputs=list(seed=200)
result <- do.call(run_model, inputs)

I realised the issue was that the expected result included a column shift where value throughout was 5pm. This was likely due to changing it at some point but not having re-run the whole script since, so I did that (and timed it!). I removed some of the model variants that aren’t to produce results from the paper (E.g. varying seeds)
- It takes a while to run and, midway through, the R session encountered a fatal error and aborted. Tried again, and it failed again on exclusive_f5 <- run_model(exclusive_use = TRUE, seed = SEED, fig5=TRUE).
- I’m suspecting this might be due to the size of the dataframes produced? So tried removing them from the environment after saving and ran again - but it still crashed, this time on the next run_model() statement
- I considered trying again with parallelisation but, given I hadn’t had much luck with that before, and given that the issue here is with R crashing (and so parallelisation actually may not help), I decided to instead split up reproduction.rmd into a few smaller files.
- I re-ran each of these in full, recording the run times.

Docker

Used the RStudio documentation and this tutorial to write a Dockerfile.