3  Methods

For work package 1, the methods were described in our protocol, Heather et al. (2024).

These are summarised below, and deviations from the protocol are also described.

3.1 Protocol summary

For this work, six published healthcare DES models were selected. These were models with publicly available code under an open license (either already in place, or added at the request of the STARS team). For each model, the following stages of work were conducted:

Stage 1: Reproduction - assessing the computational reproducibility of each study

  • Informed the authors about the study and, if an open license was not already in place, asked if they would be happy to add one to their code
  • Set up a repository for the reproduction containing the article and code
  • Read the article and defined the scope of the reproduction, archiving the scope (and repository) on Zenodo before proceeding
  • Looked over the model code, created a suitable environment with the required software and packages, and then ran the model. For each study, troubleshooting was performed for any issues faced in running the model, such as modifying or writing code. If troubleshooting was exhausted and there were still issues or discrepancies in the results, the original study authors were informed and given the opportunity to advise on the reason for this (although with no pressure or requirement to do so)
  • For each item in the scope, a decision was made as to whether it had been successfully reproduced or not. This was a subjective decision which allowed some expected deviation due to model stochasticity (for example, where the original model lacked seed control; a generic illustration of seed control is sketched after this list)
  • The reproduction was timed (including the time taken to produce each item in the scope) and limited to a maximum of 40 hours
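
As an aside on seed control, the sketch below illustrates the general pattern of tying a model's random number stream to a seed so that repeated runs return identical results. This is a minimal, generic example: the function, parameter names and distribution are placeholders, not taken from any of the studied models.

    # Generic pattern for seed control in a stochastic simulation run.
    # All names here are illustrative placeholders.
    import numpy as np

    def run_model(run_number: int, base_seed: int = 42) -> float:
        """Run one replication with a reproducible random number stream."""
        rng = np.random.default_rng(base_seed + run_number)
        # ...model logic would go here; as a stand-in, sample one result...
        return float(rng.exponential(scale=10.0))

    # Two calls with the same run number return the same value.
    assert run_model(run_number=1) == run_model(run_number=1)

Deriving each replication's seed from the run number (base_seed + run_number) keeps replications distinct from one another while still being repeatable.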

Stage 2: Evaluation - evaluating the publication, code and associated artefacts against sharing and reporting standards

  • The publication was evaluated against reporting guidelines for DES:
    • Monks et al. (2019) - STRESS-DES: Strengthening The Reporting of Empirical Simulation Studies (Discrete-Event Simulation) (Version 1.0).
    • Zhang, Lhachimi, and Rogowski (2020) - The generic reporting checklist for healthcare-related discrete event simulation studies derived from the International Society for Pharmacoeconomics and Outcomes Research Society for Medical Decision Making (ISPOR-SDM) Modeling Good Research Practices Task Force reports.
  • The model code and associated artefacts (e.g. the GitHub repository shared by the authors) were evaluated against:
    • The criteria of reproducibility-related badges from various organisations and journals, namely:
      • National Information Standards Organisation (NISO) (NISO Reproducibility Badging and Definitions Working Group (2021))
      • Association for Computing Machinery (ACM) (2020)
      • Center for Open Science (COS) (Blohowiak et al. (2023))
      • Institute of Electrical and Electronics Engineers (IEEE) (n.d.)
      • Psychological Science (Hardwicke and Vazire (2023) and Association for Psychological Science (APS) (2023))
    • Recommendations from the pilot STARS framework for the sharing of code and associated materials from discrete-event simulation models (Monks, Harper, and Mustafee (2024)).
  • This stage was also timed

Stage 3: Report and research compendium - summary report and organised repository

  • Wrote a report summarising the computational reproducibility assessment and evaluation
  • Restructured the repository into a “research compendium”, which essentially consisted of organising the repository so that it is clear and easy for someone else to re-run. Steps included:
    • Adding run times to the model notebooks
    • Writing a README for the reproduction folder
    • Moving data, methods and outputs into separate folders
    • Creating tests which check whether a user can get the same results from the model as we did during the reproduction (a sketch of this kind of test is given after this list)
    • Creating a Dockerfile and publishing a Docker image on the GitHub container registry (likewise sketched after this list)
  • A second researcher from the STARS team then attempted to use the repository and confirm whether they were able to reproduce the results of the first researcher
  • Finally, the repository was archived on Zenodo, and the authors were informed.
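
To give a flavour of the tests and Dockerfile added during this stage, two sketches follow. They are minimal and hypothetical: the file paths, file names and environment name are placeholders rather than details from any specific compendium. The tests simply compare outputs from a fresh model run against the results saved during our reproduction.

    # Hypothetical test sketch: compare freshly generated outputs against the
    # results saved during the reproduction. Paths and file names are
    # placeholders, and the model is assumed to have already been re-run to
    # populate outputs/.
    from pathlib import Path

    import pandas as pd
    import pytest

    EXPECTED_DIR = Path("tests/expected_results")
    OUTPUT_DIR = Path("outputs")

    @pytest.mark.parametrize("filename", ["table2.csv", "table3.csv"])
    def test_results_match(filename):
        """Outputs should match the results saved during the reproduction."""
        expected = pd.read_csv(EXPECTED_DIR / filename)
        observed = pd.read_csv(OUTPUT_DIR / filename)
        pd.testing.assert_frame_equal(observed, expected, check_exact=False)

A Dockerfile for this purpose might follow a simple pattern: build the pinned environment, copy in the repository, and set a default command. Again, the environment name and script path below are illustrative only, not those from any of the actual compendia.

    # Hypothetical Dockerfile sketch for a Python-based reproduction.
    FROM continuumio/miniconda3
    COPY environment.yml /tmp/environment.yml
    RUN conda env create -f /tmp/environment.yml
    COPY . /home/repro
    WORKDIR /home/repro
    # "repro-env" and the script path are placeholders.
    CMD ["conda", "run", "-n", "repro-env", "python", "scripts/run_model.py"]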

For each study, a Quarto site was produced which shared the results from the reproduction and evaluation, along with the summary report. Throughout the work, a detailed logbook was kept to track timings and to record work on each stage, such as detailing troubleshooting steps during the reproduction, or uncertainties discussed with another STARS team member during the evaluation.
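
For illustration, a Quarto website of this kind can be configured with a small _quarto.yml file. The sketch below is not the configuration used for these sites; the title and file names are placeholders.

    # Hypothetical _quarto.yml for a simple Quarto website.
    project:
      type: website

    website:
      title: "Study reproduction"   # placeholder title
      navbar:
        left:
          - href: index.qmd
            text: Home
          - href: reproduction.qmd
            text: Reproduction
          - href: evaluation.qmd
            text: Evaluation

    format:
      html:
        theme: cosmo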

Summary diagram

This process is captured in the diagram below:

Figure: Workflow for STARS work package 1

3.2 Minor deviations from the protocol

There were some minor deviations from the protocol, which are explained below…

Deviation: Using the latest software packages
Description and reason for the change: In the protocol, we had planned that, if no versions were provided, we would select a version of the software and of each package that is closest to but still prior to the date of the code archive or paper publication. I kept to this for the Python models (easily set using a conda/mamba environment). However, I had great difficulty attempting to do this in R and could not successfully backdate both R and the packages, so I used the latest versions of R and each package for those studies. (A sketch of the kind of pinned environment used for the Python models is given after this table.)

Deviation: Using percentage difference in results to help decide reproduction success
Description and reason for the change: This is not strictly a deviation, as I did explore this, but I ultimately found it unhelpful because percentage differences can be greatly affected by scale. For example, 0.1 versus 0.2 appears a much larger difference than 3 versus 4, yet, depending on the scale used and what is being compared, both might be considered small differences, whilst in another context with a different scale, 0.1 versus 0.2 might actually reflect a huge difference.

Deviation: Moving on to the evaluation stage before receiving an author response regarding a reproduction discrepancy, or before reaching consensus on the reproduction
Description and reason for the change: In the protocol, we required that authors are contacted if there are any remaining difficulties in running the code or items in the scope that were not reproduced. The authors were given a total of four weeks to respond if they chose to. We had implied that we must wait for this time to pass before continuing to the evaluation stage (since the three stages were presented as being completed one after another). The rationale was that the timings for the reproduction would be influenced by whether the evaluation had been completed or not, and vice versa. However, given the many possible influences on the timings, this was considered negligible.
We also required consensus on whether items had been successfully reproduced or not before moving on. In two cases this consensus was not obtained in advance, due to a two-week period of annual leave during which it would not have been possible to get a second opinion, so I progressed with the evaluation and got consensus on the reproductions afterwards.

Deviation: Organisation of the repository for the research compendium
Description and reason for the change: In the protocol, we had planned that separate folders would be created for data, methods and outputs. This was generally followed but, if an alternative structure seemed more suitable (for example, if the original study already divided items well, perhaps with different folder names or with multiple scripts folders), we used that original structure, as it still served the purpose of being clear and easy to re-run whilst reducing the number of differences from the original study.
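
As referenced in the first deviation above, the Python environments were pinned using conda/mamba. The sketch below shows the general form of such an environment file; the environment name and the package versions are illustrative placeholders, not those used for any specific study.

    # Hypothetical environment.yml pinning Python and package versions that
    # pre-date a study's code archive or publication.
    # Names and versions are illustrative only.
    name: repro-env
    channels:
      - conda-forge
    dependencies:
      - python=3.9
      - numpy=1.22.4
      - pandas=1.4.2
      - simpy=4.0.1
      - jupyterlab=3.4.3

Such an environment can then be created with, for example, "mamba env create -f environment.yml".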

3.3 Timings

As per the protocol, the reproduction and evaluation stages were timed. Although this was conducted carefully and thoroughly, it will not be perfect, and we recognise some of the potential sources of variation in timings between studies, such as:

  • Whether we consistently included amendments to the Quarto site and repository, and time spent on GitHub commits and so on, within the timings for the reproduction.
  • For the first R study (Huang et al. (2019)), I initially tried to create an environment with R and package versions prior to the article publication date, but had great difficulties with this and ended up using the latest versions. This contributed to the set-up time for that reproduction, whereas for the later R models (Kim et al. (2021) and Johnson et al. (2021)), based on that experience, I did not attempt to backdate the environments when getting started.
  • Any estimated times (for example, if I were partway through working and someone in the office came to talk to me but I forgot to note the time, I might estimate whether the conversation lasted about five or ten minutes and adjust the recorded time accordingly).
  • Timings from consensus discussions regarding uncertainties in the evaluation or reproduction (these might take longer if done in person rather than over email, or vice versa, and I sometimes spent longer sorting and tidying these for some studies than others, which would have been included in the time)
  • Whether I subjectively felt the need to add random seeds during the reproduction stage; if results varied considerably between runs, one seed could produce results much more similar to the original than another

At an estimate, this uncertainty between study timings would lead me to conclude that the timings are approximately correct, give or take up to about four hours. However, this is just an estimate, and it is worth noting that Krafczyk et al. (2021), who also conducted computational reproducibility assessments in a different context, estimated that human error introduced a maximum of 8 hours ambiguity in the timings, due to the “non-precise nature of starting and stopping the watch consistently”.

3.4 References

Association for Computing Machinery (ACM). 2020. “Artifact Review and Badging Version 1.1.” ACM. https://www.acm.org/publications/policies/artifact-review-and-badging-current.
Association for Psychological Science (APS). 2023. “Psychological Science Submission Guidelines.” APS. https://www.psychologicalscience.org/publications/psychological_science/ps-submissions.
Blohowiak, Ben B., Johanna Cohoon, Lee de-Wit, Eric Eich, Frank J. Farach, Fred Hasselman, Alex O. Holcombe, Macartan Humphreys, Melissa Lewis, and Brian A. Nosek. 2023. “Badges to Acknowledge Open Practices.” https://osf.io/tvyxz/.
Hardwicke, Tom E., and Simine Vazire. 2023. “Transparency Is Now the Default at Psychological Science.” Psychological Science 0 (0). https://doi.org/10.1177/09567976231221573.
Heather, Amy, Thomas Monks, Alison Harper, Navonil Mustafee, and Andrew Mayne. 2024. “Protocol for Assessing the Computational Reproducibility of Discrete-Event Simulation Models on STARS,” June. https://zenodo.org/records/12179846.
Huang, Shiwei, Julian Maingard, Hong Kuan Kok, Christen D. Barras, Vincent Thijs, Ronil V. Chandra, Duncan Mark Brooks, and Hamed Asadi. 2019. “Optimizing Resources for Endovascular Clot Retrieval for Acute Ischemic Stroke, a Discrete Event Simulation.” Frontiers in Neurology 10 (June). https://doi.org/10.3389/fneur.2019.00653.
Institute of Electrical and Electronics Engineers (IEEE). n.d. “About Content in IEEE Xplore.” IEEE Xplore. Accessed May 20, 2024. https://ieeexplore.ieee.org/Xplorehelp/overview-of-ieee-xplore/about-content.
Johnson, Kate M., Mohsen Sadatsafavi, Amin Adibi, Larry Lynd, Mark Harrison, Hamid Tavakoli, Don D. Sin, and Stirling Bryan. 2021. “Cost Effectiveness of Case Detection Strategies for the Early Detection of COPD.” Applied Health Economics and Health Policy 19 (2): 203–15. https://doi.org/10.1007/s40258-020-00616-2.
Kim, Lois G., Michael J. Sweeting, Morag Armer, Jo Jacomelli, Akhtar Nasim, and Seamus C. Harrison. 2021. “Modelling the Impact of Changes to Abdominal Aortic Aneurysm Screening and Treatment Services in England During the COVID-19 Pandemic.” PLOS ONE 16 (6): e0253327. https://doi.org/10.1371/journal.pone.0253327.
Krafczyk, M. S., A. Shi, A. Bhaskar, D. Marinov, and V. Stodden. 2021. “Learning from Reproducing Computational Results: Introducing Three Principles and the Reproduction Package.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 379 (2197): 20200069. https://doi.org/10.1098/rsta.2020.0069.
Monks, Thomas, Christine S. M. Currie, Bhakti Stephan Onggo, Stewart Robinson, Martin Kunc, and Simon J. E. Taylor. 2019. “Strengthening the Reporting of Empirical Simulation Studies: Introducing the STRESS Guidelines.” Journal of Simulation 13 (1): 55–67. https://doi.org/10.1080/17477778.2018.1442155.
Monks, Thomas, Alison Harper, and Navonil Mustafee. 2024. “Towards Sharing Tools and Artefacts for Reusable Simulations in Healthcare.” Journal of Simulation 0 (0): 1–20. https://doi.org/10.1080/17477778.2024.2347882.
NISO Reproducibility Badging and Definitions Working Group. 2021. “Reproducibility Badging and Definitions.” https://doi.org/10.3789/niso-rp-31-2021.
Zhang, Xiange, Stefan K. Lhachimi, and Wolf H. Rogowski. 2020. “Reporting Quality of Discrete Event Simulations in Healthcare: Results From a Generic Reporting Checklist.” Value in Health 23 (4): 506–14. https://doi.org/10.1016/j.jval.2020.01.005.