7 Evaluation of the repository

The code and related research artefacts in the original code repositories were evaluated against:

The criteria of badges related to reproducibility from various organisations and journals.
Recommendations from the STARS framework for the sharing of code and associated materials from discrete-event simulation models (Monks, Harper, and Mustafee (2024)).

Between each journal badge, there was often alot of overlap in criteria. Hence, a list of unique criteria was produced. The repositories are evaluated against this criteria, and then depending on which criteria they met, against the badges themselves.

Caveat: Please note that these criteria are based on available information about each badge online. Moreover, we focus only on reproduction of the discrete-event simulation, and not on other aspects of the article. We cannot guarantee that the badges below would have been awarded in practice by these journals.

Consider: What criteria are people struggling to meet from the guidelines?

7.1 Summary

Unique badge criteria:

Reflections

No clear relationship. I think it is more meaningful to actually look at what criteria were and were not met.

Badges:

Reflections

Not certain how meaningful these numbers are, as we have imbalanced numbers of different types of badge, and meeting certain popular criteria will weight what is met v.s. not.

Feel that looking at the criteria met is a bit more meaningful? And then specific examples of how that translates into badges - e.g.

None meeting ACM “Artifacts Evaluated - Functional” as requires xyz and these are commonly not met.
For several, they meet three badges, but those three badges have one criteria: reproducing results.

Essential components of STARS framework:

Reflections

Similar to journal criteria (unsurprisingly, as looking at similar things) - most studies meet very few and have wide range of reproduction success, from 12.5% to 100%. Three met more, and these were 80% to 100% reproduced.

I think, if we were to draw anything from this, it would be to reflect on exactly what criteria were and were not met, and why/how that impacted reproduction, in any way (either success or time).

Note: Just considers those fully met, in plot

Optional components of STARS framework:

Reflections

This highlights how Huang meets the most criteria, but is only partially reproduced - but I think it is most interesting to consider why this is.

Note: Just considers those fully met, in plot

Table with proportion of applicable STARS criteria that were fully met

This is part of a table used in the journal article:

	reproduction	stars_essential	stars_optional
Kim et al. 2021 (10/10)	100.0% (10/10)	50%	0%
Lim et al. 2020 (9/9)	100.0% (9/9)	25%	0%
Wood et al. 2021 (5/5)	100.0% (5/5)	25%	0%
Anagnostou et al. 2022 (1/1)	100.0% (1/1)	88%	20%
Shoaib and Ramamohan<br>2021 (16/17)	94.1% (16/17)	25%	0%
Johnson et al. 2021 (4/5)	80.0% (4/5)	50%	0%
Huang et al. 2019 (3/8)	37.5% (3/8)	25%	40%
Hernandez et al. 2015 (1/8)	12.5% (1/8)	25%	0%

7.2 Journal badges

Key:

S: Shoaib and Ramamohan (2021) - link to evaluation
Hu: Huang et al. (2019) - link to evaluation
L: Lim et al. (2020) - link to evaluation
K: Kim et al. (2021) - link to evaluation
A: Anagnostou et al. (2022) - link to evaluation
J: Johnson et al. (2021) - link to evaluation
He: Hernandez et al. (2015) - link to evaluation
W: Wood et al. (2021) - link to evaluation

In this section and below, the criteria for each study are marked as either being fully met (✅), partially met (🟡), not met (❌) or not applicable (N/A).

Unique criteria:

Item	S	Hu	L	K	A	J	He	W
Criteria related to how artefacts are shared
Artefacts are archived in a repository that is: (a) public (b) guarantees persistence (c) gives a unique identifier (e.g. DOI)	❌	❌	❌	❌	✅	❌	❌	❌
Open licence	❌	✅	❌	✅	✅	✅	❌	❌
Criteria related to what artefacts are shared
Complete (all relevant artefacts available)	❌	❌	❌	❌	❌	❌	❌	✅
Artefacts relevant to paper	✅	✅	✅	✅	✅	✅	✅	✅
Criteria related to the structure and documentation of the artefacts
Documents (a) how code is used (b) how it relates to article (c) software, systems, packages and versions	❌	❌	❌	❌	✅	❌	❌	❌
Documents (a) inventory of artefacts (b) sufficient description for artefacts to be exercised	❌	❌	❌	✅	❌	❌	❌	❌
Artefacts are carefully documented and well-structured to the extent that reuse and repurposing is facilitated, adhering to norms and standards	❌	❌	❌	✅	✅	❌	❌	❌
README file with step-by-step instructions to run analysis	❌	❌	❌	❌	✅	✅	❌	❌
Dependencies (e.g. package versions) stated	❌	❌	❌	❌	✅	❌	❌	❌
Clear how output of analysis corresponds to article	❌	❌	❌	❌	❌	❌	❌	✅
Criteria related to running and reproducing results
Scripts can be successfully executed	✅	✅	✅	✅	✅	✅	✅	✅
Reproduced results (assuming (a) acceptably similar (b) reasonable time frame (c) only minor troubleshooting)	❌	❌	❌	❌	❌	❌	❌	✅

Reflections

Artefacts are archived in a repository that is: (a) public (b) guarantees persistence (c) gives a unique identifier (e.g. DOI): Fulfillment doesn’t impact reproduction as I was able to get everything needed from the remote code repository (GitHub or GitLab). However, if these had been deleted from GitHub, it would have become invaluable.

Open licence: This had a big impact on our ability to complete reproductions, as we had to ask authors to add an open licence to their work, to enable us to use it. Gladly, all authors we contacted kindly add these on request. However, it’s worth noting that this was a relatively common issue, and one of the most important, since it completely prevents reuse if excluded.

Complete (all relevant artefacts available): This had a really big impact on the reproduction. The main reason for longer times in reproduction was (a) code for scenarios not provided, and (b) code to process results into figures and tables not provided.

Artefacts relevant to paper: All met (if not met, this would be a massive hindrance).

Documents (a) how code is used (b) how it relates to article (c) software, systems, packages and versions / Documents (a) inventory of artefacts (b) sufficient description for artefacts to be exercised / Artefacts are carefully documented and well-structured to the extent that reuse and repurposing is facilitated, adhering to norms and standards / README file with step-by-step instructions to run analysis:

All really handy. In cases where this was not met, this was often related to having quite a busy/cluttered repository which was confusing to navigate, with minimal documentation.
In Anagnostou et al. (2022), they include a file CHARM_INFO.md alongside their README which walks through the input parameters for the model. I didn’t need to change any of these for the reproduction, but would imagine this is to be very helpful if someone were to reuse the model.
Only three studies had any documentation - READMEs for Kim et al. (2021), Anagnostou et al. (2022) and Johnson et al. (2021). However, focusing on the READMEs, in each of these cases it was great to have these, guiding on how to run the scripts, or on what each folder/file in the repository is - although one didn’t have step-by-step instructions, as requested.
Whilst Anagnostou et al. (2022) did meet criteria, it should be noted that this was a very simple example, just requiring to run one script which quickly reproduces everything! I had been a bit uncertain on it, since the README doesn’t explicitly say how to make the figure, but it does provide instructions that lead you to regenerate the exact model results from the paper, and so I feel that it does provide instructions to reproduce results sufficiently (although would be more complete to include instructions for figure too - so if it weren’t a yes/no decision for badges, I would’ve said this was partially met). Ideally, studies would clearly outline how to reproduce results in full.

Dependencies (e.g. package versions) stated: Important and impacts analysis, takes a while to work out otherwise.

Clear how output of analysis corresponds to article: This is handy - clear link between analysis and items in paper.

Scripts can be successfully executed: This is true, though I did allow troubleshooting. Hence, the importance of e.g. environments and scripts being provided in a runnable format (both covered on the reflections page), since these are the hurdles to successfully executing scripts.

Reproduced results (assuming (a) acceptably similar (b) reasonable time frame (c) only minor troubleshooting): On the reproduction page, I reflected (where possible) on what I thought the primary reasons were, for cases where I didn’t manage to reproduce results despite troubleshooting. It is worth noting however that there were two studies that were quite quick to run, which I reflect about on the reproduction page.

Badges:

The badges are grouped into three categories:

“Open objects” badges: These badges relate to research artefacts being made openly available.
“Object review” badges: These badges relate to the research artefacts being reviewed against criteria of the badge issuer.
“Reproduced” badges: These badges relate to an independent party regenerating the reuslts of the article using the author objects.

Item	S	Hu	L	K	A	J	He	W
“Open objects” badges
ACM “Artifacts Available” • Artefacts are archived in a repository that is: (a) public (b) guarantees persistence (c) gives a unique identifier (e.g. DOI)	❌	❌	❌	❌	✅	❌	❌	❌
NISO “Open Research Objects (ORO)” • Artefacts are archived in a repository that is: (a) public (b) guarantees persistence (c) gives a unique identifier (e.g. DOI) • Open licence	❌	❌	❌	❌	✅	❌	❌	❌
NISO “Open Research Objects - All (ORO-A)” • Artefacts are archived in a repository that is: (a) public (b) guarantees persistence (c) gives a unique identifier (e.g. DOI) • Open licence • Complete (all relevant artefacts available)	❌	❌	❌	❌	❌	❌	❌	❌
COS “Open Code” • Artefacts are archived in a repository that is: (a) public (b) guarantees persistence (c) gives a unique identifier (e.g. DOI) • Open licence • Documents (a) how code is used (b) how it relates to article (c) software, systems, packages and versions	❌	❌	❌	❌	✅	❌	❌	❌
IEEE “Code Available” • Complete (all relevant artefacts available)	❌	❌	❌	❌	❌	❌	❌	✅
“Object review” badges
ACM “Artifacts Evaluated - Functional” • Documents (a) inventory of artefacts (b) sufficient description for artefacts to be exercised • Artefacts relevant to paper • Complete (all relevant artefacts available) • Scripts can be successfully executed	❌	❌	❌	❌	❌	❌	❌	❌
ACM “Artifacts Evaluated - Reusable” • Documents (a) inventory of artefacts (b) sufficient description for artefacts to be exercised • Artefacts relevant to paper • Complete (all relevant artefacts available) • Scripts can be successfully executed • Artefacts are carefully documented and well-structured to the extent that reuse and repurposing is facilitated, adhering to norms and standards	❌	❌	❌	❌	❌	❌	❌	❌
IEEE “Code Reviewed” • Complete (all relevant artefacts available) • Scripts can be successfully executed	❌	❌	❌	❌	❌	❌	❌	✅
“Reproduced” badges
ACM “Results Reproduced” • Reproduced results (assuming (a) acceptably similar (b) reasonable time frame (c) only minor troubleshooting)	❌	❌	❌	❌	❌	❌	❌	✅
NISO “Results Reproduced (ROR-R)” • Reproduced results (assuming (a) acceptably similar (b) reasonable time frame (c) only minor troubleshooting)	❌	❌	❌	❌	❌	❌	❌	✅
IEEE “Code Reproducible” • Reproduced results (assuming (a) acceptably similar (b) reasonable time frame (c) only minor troubleshooting)	❌	❌	❌	❌	❌	❌	❌	✅
Psychological Science “Computational Reproducibility” • Reproduced results (assuming (a) acceptably similar (b) reasonable time frame (c) only minor troubleshooting) • README file with step-by-step instructions to run analysis • Dependencies (e.g. package versions) stated • Clear how output of analysis corresponds to article	❌	❌	❌	❌	❌	❌	❌	❌

Reflections

Only one study had permanent archive (with persistent identifier), hence one being awarded NISO “Open Research Objects (ORO)”, ACM “Artifacts Available” and COS “Open Code”. However, that study did not receive NISO “Open Research Objects - All (ORO-A)” as artefacts were not complete.

A complete set of materials was required by IEEE “Code Available” and IEEE “Code Reviewed” - but this was only met by one study, as studies commonly did not include code for scenarios or creation of figures and tables. It was also required by ACM “Artifacts Evaluated - Functional and Reusable” badges, but since that one study didn’t meet their documentation requirements, none were awarded those badges.

Three badges had one criteria: reproduction of results - but with assumptions - and only one study met this (with 100% reproduction PLUS meeting the assumptions).

Badges summary tables for the article

	Criteria	Studies that met criteria
Badge
ACM “Artifacts<br>Available”	• Artefacts are archived in a repository that ...	1/8 (12.5%)
NISO “Open Research<br>Objects (ORO)”	• Artefacts are archived in a repository that ...	1/8 (12.5%)
NISO “Open Research<br>Objects - All (ORO-A)”	• Artefacts are archived in a repository that ...	0/8 (0.0%)
COS “Open Code”	• Artefacts are archived in a repository that ...	1/8 (12.5%)
IEEE “Code Available”	• Complete (all relevant artefacts available)	1/8 (12.5%)
ACM “Artifacts<br>Evaluated - Functional”	• Documents (a) inventory of artefacts (b) suf...	0/8 (0.0%)
ACM “Artifacts<br>Evaluated - Reusable”	• Documents (a) inventory of artefacts (b) suf...	0/8 (0.0%)
IEEE “Code Reviewed”	• Complete (all relevant artefacts available)<...	1/8 (12.5%)
ACM “Results Reproduced”	• Reproduced results (assuming (a) acceptably ...	1/8 (12.5%)
NISO “Results<br>Reproduced (ROR-R)”	• Reproduced results (assuming (a) acceptably ...	1/8 (12.5%)
IEEE “Code Reproducible”	• Reproduced results (assuming (a) acceptably ...	1/8 (12.5%)
Psychological Science<br>“Computational Reproducibility”	• Reproduced results (assuming (a) acceptably ...	0/8 (0.0%)

7.3 STARS framework

Key:

S: Shoaib and Ramamohan (2021) - link to evaluation
Hu: Huang et al. (2019) - link to evaluation
L: Lim et al. (2020) - link to evaluation
K: Kim et al. (2021) - link to evaluation
A: Anagnostou et al. (2022) - link to evaluation
J: Johnson et al. (2021) - link to evaluation
He: Hernandez et al. (2015) - link to evaluation
W: Wood et al. (2021) - link to evaluation

Item	S	Hu	L	K	A	J	He	W
Essential components
Open licence Free and open-source software (FOSS) licence (e.g. MIT, GNU Public Licence (GPL))	❌	✅	❌	✅	✅	✅	❌	❌
Dependency management Specify software libraries, version numbers and sources (e.g. dependency management tools like virtualenv, conda, poetry)	❌	❌	❌	🟡	✅	🟡	❌	❌
FOSS model Coded in FOSS language (e.g. R, Julia, Python)	✅	✅	✅	✅	✅	✅	✅	✅
Minimum documentation Minimal instructions (e.g. in README) that overview (a) what model does, (b) how to install and run model to obtain results, and (c) how to vary parameters to run new experiments	❌	❌	❌	✅	✅	🟡	❌	❌
ORCID ORCID for each study author	❌	❌	❌	❌	❌	❌	❌	❌
Citation information Instructions on how to cite the research artefact (e.g. CITATION.cff file)	❌	❌	❌	❌	✅	✅	❌	❌
Remote code repository Code available in a remote code repository (e.g. GitHub, GitLab, BitBucket)	✅	✅	✅	✅	✅	✅	✅	✅
Open science archive Code stored in an open science archive with FORCE11 compliant citation and guaranteed persistance of digital artefacts (e.g. Figshare, Zenodo, the Open Science Framework (OSF), and the Computational Modeling in the Social and Ecological Sciences Network (CoMSES Net))	❌	❌	❌	❌	✅	❌	❌	❌
Optional components
Enhanced documentation Open and high quality documentation on how the model is implemented and works (e.g. via notebooks and markdown files, brought together using software like Quarto and Jupyter Book). Suggested content includes: • Plain english summary of project and model • Clarifying licence • Citation instructions • Contribution instructions • Model installation instructions • Structured code walk through of model • Documentation of modelling cycle using TRACE • Annotated simulation reporting guidelines • Clear description of model validation including its intended purpose	❌	❌	❌	❌	❌	❌	❌	❌
Documentation hosting Host documentation (e.g. with GitHub pages, GitLab pages, BitBucket Cloud, Quarto Pub)	❌	❌	❌	❌	❌	❌	❌	❌
Online coding environment Provide an online environment where users can run and change code (e.g. BinderHub, Google Colaboratory, Deepnote)	❌	❌	❌	❌	❌	❌	❌	❌
Model interface Provide web application interface to the model so it is accessible to less technical simulation users	❌	✅	❌	❌	✅	❌	❌	❌
Web app hosting Host web app online (e.g. Streamlit Community Cloud, ShinyApps hosting)	❌	✅	❌	❌	🟡	❌	❌	❌

Reflections

These topics were covered in the badge criteria reflections: open licence, minimum documentation, and open science archive.

Dependency management: This was pretty uncommon, and often took some troubleshooting at the start, to figure out which packages were needed, and certain versions.

FOSS model: All met as requirement of our reproduction.

ORCID and citation information: Doesn’t impact reproduction in this case - but:

We do go to these from having found an article. I was choosing repositories that I had found from papers, so I already at least knew who the paper authors were.
In all cases, I emailed the authors, which requires finding contact information (generally via paper, sometimes from googling them to find new emails).
Any attempted citation of the repository itself would’ve necessarily been correct, depending on whether the author list would be the same as in the paper, if you relied on the paper without citation information.

Remote code repository: All met, most common way to share code.

Enhanced documentation: Only three studies had any documentation, and neither met these extensive requirements. I anticipate - if any had met this - it would’ve made the reproduction very quick and easy!

Documentation hosting: Not applicable, given only basic documentation.

Online coding environment: None provided. I always intended to run on my own machine, so this might not have had much bearing in my case if provided, but would moreso for people who perhaps didn’t have Python or R installed, and hopefully would have bypassed environment troubleshooting issues.

Model interface: Two studies had applications, although in both cases, these weren’t “outcomes” in scope of reproduction, nor did they produce them.

Web app hosting: This was quite important. Both apps had been hosted, but one was hosted with a site that is no longer operational. In both cases, the app wasn’t in “scope” although I did still view it and look into it for one as it was hosted and so could very easily - but for the other, I didn’t view it, as I didn’t go through the steps of running it locally, since it wasn’t the focus.

7.4 Timings

Shoaib and Ramamohan (2021) - 30m
Huang et al. (2019) - 17m
Lim et al. (2020) - 18m
Kim et al. (2021) - 18m
Anagnostou et al. (2022) - 19m
Johnson et al. (2021) - 20m
Hernandez et al. (2015) - 13m
Wood et al. (2021) - 14m

Revisiting and redoing evaluation for badges was only 2-3 minutes per study, and have just stuck with the original evaluation timings above.

Reflections

No particular comments, don’t think we learn much from the timings here.

7.5 Badge sources

National Information Standards Organisation (NISO) (NISO Reproducibility Badging and Definitions Working Group (2021))

“Open Research Objects (ORO)”
“Open Research Objects - All (ORO-A)”
“Results Reproduced (ROR-R)”

Association for Computing Machinery (ACM) (Association for Computing Machinery (ACM) (2020))

“Artifacts Available”
“Artifacts Evaluated - Functional”
“Artifacts Evaluated - Resuable”
“Results Reproduced”

Center for Open Science (COS) (Blohowiak et al. (2023))

“Open Code”

Institute of Electrical and Electronics Engineers (IEEE) (Institute of Electrical and Electronics Engineers (IEEE) (2024))

“Code Available”
“Code Reviewed”
“Code Reproducible”

Psychological Science (Hardwicke and Vazire (2024) and Association for Psychological Science (APS) (2024))

“Computational Reproducibility”

7.6 References

Anagnostou, Anastasia, Derek Groen, Simon J. E. Taylor, Diana Suleimenova, Nura Abubakar, Arindam Saha, Kate Mintram, et al. 2022. “FACS-CHARM: A Hybrid Agent-Based and Discrete-Event Simulation Approach for Covid-19 Management at Regional Level.” In 2022 Winter Simulation Conference (WSC), 1223–34. https://doi.org/10.1109/WSC57314.2022.10015462.

Association for Computing Machinery (ACM). 2020. “Artifact Review and Badging Version 1.1.” ACM. https://www.acm.org/publications/policies/artifact-review-and-badging-current.

Association for Psychological Science (APS). 2024. “Psychological Science Submission Guidelines.” APS. https://www.psychologicalscience.org/publications/psychological_science/ps-submissions.

Blohowiak, Ben B., Johanna Cohoon, Lee de-Wit, Eric Eich, Frank J. Farach, Fred Hasselman, Alex O. Holcombe, Macartan Humphreys, Melissa Lewis, and Brian A. Nosek. 2023. “Badges to Acknowledge Open Practices.” https://osf.io/tvyxz/.

Hardwicke, Tom E., and Simine Vazire. 2024. “Transparency Is Now the Default at Psychological Science.” Psychological Science 35 (7): 708–11. https://doi.org/10.1177/09567976231221573.

Hernandez, Ivan, Jose E. Ramirez-Marquez, David Starr, Ryan McKay, Seth Guthartz, Matt Motherwell, and Jessica Barcellona. 2015. “Optimal Staffing Strategies for Points of Dispensing.” Computers & Industrial Engineering 83 (May): 172–83. https://doi.org/10.1016/j.cie.2015.02.015.

Huang, Shiwei, Julian Maingard, Hong Kuan Kok, Christen D. Barras, Vincent Thijs, Ronil V. Chandra, Duncan Mark Brooks, and Hamed Asadi. 2019. “Optimizing Resources for Endovascular Clot Retrieval for Acute Ischemic Stroke, a Discrete Event Simulation.” Frontiers in Neurology 10 (June). https://doi.org/10.3389/fneur.2019.00653.

Institute of Electrical and Electronics Engineers (IEEE). 2024. “About Content in IEEE Xplore.” IEEE Explore. https://ieeexplore.ieee.org/Xplorehelp/overview-of-ieee-xplore/about-content.

Johnson, Kate M., Mohsen Sadatsafavi, Amin Adibi, Larry Lynd, Mark Harrison, Hamid Tavakoli, Don D. Sin, and Stirling Bryan. 2021. “Cost Effectiveness of Case Detection Strategies for the Early Detection of COPD.” Applied Health Economics and Health Policy 19 (2): 203–15. https://doi.org/10.1007/s40258-020-00616-2.

Kim, Lois G., Michael J. Sweeting, Morag Armer, Jo Jacomelli, Akhtar Nasim, and Seamus C. Harrison. 2021. “Modelling the Impact of Changes to Abdominal Aortic Aneurysm Screening and Treatment Services in England During the COVID-19 Pandemic.” PLOS ONE 16 (6): e0253327. https://doi.org/10.1371/journal.pone.0253327.

Lim, Chun Yee, Mary Kathryn Bohn, Giuseppe Lippi, Maurizio Ferrari, Tze Ping Loh, Kwok-Yung Yuen, Khosrow Adeli, and Andrea Rita Horvath. 2020. “Staff Rostering, Split Team Arrangement, Social Distancing (Physical Distancing) and Use of Personal Protective Equipment to Minimize Risk of Workplace Transmission During the COVID-19 Pandemic: A Simulation Study.” Clinical Biochemistry 86 (December): 15–22. https://doi.org/10.1016/j.clinbiochem.2020.09.003.

Monks, Thomas, Alison Harper, and Navonil Mustafee. 2024. “Towards Sharing Tools and Artefacts for Reusable Simulations in Healthcare.” Journal of Simulation 0 (0): 1–20. https://doi.org/10.1080/17477778.2024.2347882.

NISO Reproducibility Badging and Definitions Working Group. 2021. “Reproducibility Badging and Definitions.” https://doi.org/10.3789/niso-rp-31-2021.

Shoaib, Mohd, and Varun Ramamohan. 2021. “Simulation Modelling and Analysis of Primary Health Centre Operations.” arXiv, June. https://doi.org/10.48550/arXiv.2104.12492.

Wood, Richard M., Adrian C. Pratt, Charlie Kenward, Christopher J. McWilliams, Ross D. Booton, Matthew J. Thomas, Christopher P. Bourdeaux, and Christos Vasilakis. 2021. “The Value of Triage During Periods of Intense COVID-19 Demand: Simulation Modeling Study.” Medical Decision Making 41 (4): 393–407. https://doi.org/10.1177/0272989X21994035.