Input data management


When managing input data in your RAP, there are three key files: raw data, input modelling code, and parameters.


1 What is included in a RAP?

Your reproducible analytical pipeline (RAP) should begin with the earliest data you access. This could be:

  • Raw data (if you estimate parameters yourself), or
  • Pre-defined parameters (if these are already supplied).

Keep in mind that, especially in sensitive areas like healthcare, you may not be able to share your full RAP outside your team or organisation. Even so, it’s crucial to maintain a complete RAP internally so your work remains fully reproducible.

Why is this important? By starting at the source, you make your work transparent and easy to repeat. For instance, if new raw data becomes available, you will need your input modelling code to check that your distributions are still appropriate, re-estimate your model parameters, and re-run your analysis.


2 Raw data

This is data that reflects the system you will be simulating. It is used to estimate parameters and fit distributions for your simulation model. For example:

ARRIVAL_DATE  ARRIVAL_TIME  SERVICE_DATE  SERVICE_TIME  DEPARTURE_DATE  DEPARTURE_TIME
2025-01-01    0001          2025-01-01    0007          2025-01-01      0012
2025-01-01    0002          2025-01-01    0004          2025-01-01      0007
2025-01-01    0003          2025-01-01    0010          2025-01-01      0030
2025-01-01    0007          2025-01-01    0014          2025-01-01      0022

2.1 📋 Checklist: Managing your raw data

🗂️ Always

  • Keep copies of your raw data
    Or, if you can’t export it, document how to access it (e.g. database location, required permissions).

  • Record metadata
    Include: data source, date obtained, time period covered, number of records, and any known issues.

  • Keep a copy of the data dictionary
    If none exists, create one to explain your data’s structure and variables.


🔓 If you can share the data:

  • Make the data openly available
    Follow the FAIR principles: Findable, Accessible, Interoperable, Reusable.

  • Deposit in a trusted archive
    Use platforms like Zenodo, Figshare, GitHub or GitLab.

  • Add an open data licence
    Examples: CC0, CC-BY.

  • Provide a citation or DOI
    Make it easy for others to reference your dataset.


🔒 If you cannot share the data:

  • Describe the dataset
    Include details in your documentation.

  • Share the data dictionary
    If allowed, to help others understand the data structure.

  • Consider providing a synthetic dataset
    Create a sample with the same structure (but no sensitive information) so that others can understand the data layout and test run code.

  • Explain access restrictions
    State why sharing isn’t possible and provide contact information for access requests.
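One way to provide a synthetic dataset is to generate random records that share the column names and formats of the real data. The sketch below assumes the six-column layout from the raw data example in section 2. The function name `make_synthetic_data` and the distributions used are hypothetical and arbitrary: they are not fitted to any real data, so the output is only suitable for showing the layout and test-running code.

```python
import numpy as np
import pandas as pd


def make_synthetic_data(n_records=100, seed=42):
    """Create artificial records with the same columns as the raw data."""
    rng = np.random.default_rng(seed)
    start = pd.Timestamp("2025-01-01 00:00")

    # Arbitrary distributions (assumption): NOT estimated from real data.
    gaps = rng.exponential(5, n_records)
    arrival = start + pd.to_timedelta(np.cumsum(gaps).round(), unit="min")
    service = arrival + pd.to_timedelta(
        rng.integers(0, 15, n_records), unit="min")
    departure = service + pd.to_timedelta(
        rng.integers(1, 30, n_records), unit="min")

    # Split each timestamp into the DATE and TIME columns used in the raw data
    events = {"ARRIVAL": arrival, "SERVICE": service, "DEPARTURE": departure}
    data = {}
    for name, times in events.items():
        data[f"{name}_DATE"] = list(times.strftime("%Y-%m-%d"))
        data[f"{name}_TIME"] = list(times.strftime("%H%M"))
    return pd.DataFrame(data)


synthetic = make_synthetic_data(n_records=10)
print(synthetic.head())
```

Fixing the seed means the same synthetic file is produced on every run, which keeps the public repository reproducible.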

Example data description:

“Data sourced from the XYZ database. Copies are available in this repository, or, to access directly, log in to the XYZ database and navigate to [path/to/data].

Data covers January 2012 to December 2017, with [number] records. Note: [details on missing data, known issues, etc.].

A copy of the data dictionary is available in the repository or online at [URL].”

Example access restriction statement:

“Access to the dataset is restricted due to patient confidentiality. Researchers interested in accessing the data must submit a data access request to the XYZ Data Governance Committee. For more information, contact data.manager@xyz.org.”

A data dictionary describes each field, its format, units, and any coding schemes used. Example data dictionary:

Field           Field name                Format            Description
ARRIVAL_DATE    CLINIC ARRIVAL DATE       Date(CCYY-MM-DD)  The date on which the patient arrived at the clinic
ARRIVAL_TIME    CLINIC ARRIVAL TIME       Time(HH:MM)       The time at which the patient arrived at the clinic
DEPARTURE_DATE  CLINIC DEPARTURE DATE     Date(CCYY-MM-DD)  The date on which the patient left the clinic
DEPARTURE_TIME  CLINIC DEPARTURE TIME     Time(HH:MM)       The time at which the patient left the clinic
SERVICE_DATE    NURSE SERVICE START DATE  Date(CCYY-MM-DD)  The date on which the nurse consultation began
SERVICE_TIME    NURSE SERVICE START TIME  Time(HH:MM)       The time at which the nurse consultation began
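A data dictionary can also be kept in a machine-readable form, which makes it possible to check that a dataset and its documentation agree. The sketch below is one possible approach, not a standard: the `DATA_DICTIONARY` structure and the `check_columns` helper are hypothetical names, and the check only compares column names (it does not validate formats).

```python
# Machine-readable copy of the example data dictionary above
DATA_DICTIONARY = {
    "ARRIVAL_DATE": {"name": "CLINIC ARRIVAL DATE",
                     "format": "Date(CCYY-MM-DD)"},
    "ARRIVAL_TIME": {"name": "CLINIC ARRIVAL TIME",
                     "format": "Time(HH:MM)"},
    "DEPARTURE_DATE": {"name": "CLINIC DEPARTURE DATE",
                       "format": "Date(CCYY-MM-DD)"},
    "DEPARTURE_TIME": {"name": "CLINIC DEPARTURE TIME",
                       "format": "Time(HH:MM)"},
    "SERVICE_DATE": {"name": "NURSE SERVICE START DATE",
                     "format": "Date(CCYY-MM-DD)"},
    "SERVICE_TIME": {"name": "NURSE SERVICE START TIME",
                     "format": "Time(HH:MM)"},
}


def check_columns(columns, dictionary=DATA_DICTIONARY):
    """Compare dataset column names against the data dictionary."""
    columns = set(columns)
    documented = set(dictionary)
    return {
        # Fields described in the dictionary but absent from the data
        "missing": sorted(documented - columns),
        # Columns present in the data but not described anywhere
        "undocumented": sorted(columns - documented),
    }


report = check_columns(["ARRIVAL_DATE", "ARRIVAL_TIME", "PATIENT_ID"])
print(report["undocumented"])  # ['PATIENT_ID']
```

Running a check like this whenever new data arrives is a cheap way to notice that the data structure has changed before re-estimating parameters.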


3 Input modelling code

Input modelling code refers to the scripts used to define and fit the statistical distributions that represent the uncertain inputs for a simulation model.

These scripts are often not shared, but are an essential part of your simulation RAP. Sharing them ensures transparency in how distributions were chosen and allows you (or others) to re-run the process if new data or assumptions arise.
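As an illustration, a very small input modelling script might look like the sketch below. It is written under several assumptions: the four example records from section 2 are stored as comma-separated text with times in `HHMM` form, the consultation length is taken as departure time minus service start time, and only sample means are estimated (a real script would normally also fit and compare candidate distributions). The function names `load_raw` and `estimate_parameters` are hypothetical.

```python
import io

import pandas as pd

# The example raw data from section 2, written here as CSV (assumption)
RAW_CSV = """\
ARRIVAL_DATE,ARRIVAL_TIME,SERVICE_DATE,SERVICE_TIME,DEPARTURE_DATE,DEPARTURE_TIME
2025-01-01,0001,2025-01-01,0007,2025-01-01,0012
2025-01-01,0002,2025-01-01,0004,2025-01-01,0007
2025-01-01,0003,2025-01-01,0010,2025-01-01,0030
2025-01-01,0007,2025-01-01,0014,2025-01-01,0022
"""


def load_raw(source):
    """Read the raw data and combine each DATE/TIME pair into a timestamp."""
    # Read everything as text so leading zeros in times ("0001") are kept
    df = pd.read_csv(source, dtype=str)
    for event in ["ARRIVAL", "SERVICE", "DEPARTURE"]:
        df[event] = pd.to_datetime(
            df[f"{event}_DATE"] + " " + df[f"{event}_TIME"],
            format="%Y-%m-%d %H%M",
        )
    return df


def estimate_parameters(df):
    """Estimate mean inter-arrival and consultation times (in minutes)."""
    arrivals = df["ARRIVAL"].sort_values()
    iat = arrivals.diff().dropna().dt.total_seconds() / 60
    # Assumption: the consultation ends when the patient leaves the clinic
    consult = (df["DEPARTURE"] - df["SERVICE"]).dt.total_seconds() / 60
    return {
        "mean_interarrival_time": float(iat.mean()),
        "mean_consultation_time": float(consult.mean()),
    }


params = estimate_parameters(load_raw(io.StringIO(RAW_CSV)))
print(params)
```

Keeping this step as a script, rather than a one-off calculation, means it can be re-run unchanged when new raw data arrives.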

3.1 📋 Checklist: Managing your input modelling code

🔓 If you can share the code:

  • Include the input modelling code in your repository
    Store it alongside your simulation code and other relevant scripts.


🔒 If you cannot share the code:

  • For internal use:
    • Store the code securely and ensure it is accessible to your team or organisation - avoid saving it only on a personal device.
    • Use version control (e.g. a private GitHub repository) to track changes and maintain access.
  • For external documentation:
    • Clearly describe the input modelling process.
    • Explain why the code cannot be shared (e.g. it contains sensitive or proprietary logic).


4 Parameters

Parameters are the numerical values used in your model, such as arrival rates, service times, or probabilities.

4.1 📋 Checklist: Managing your parameters

🗂️ Always

  • Keep a structured parameter file
    Store all model parameters in a clearly structured format like a CSV file or a script.

  • Document each parameter
    Include a data dictionary or documentation describing each parameter, its meaning, units, and any abbreviations or codes used.

  • Be clear about how parameters were determined
    If you calculated them, link to the input modelling code or describe the calculation steps (as above). If they were supplied to you, then clearly state the source of the parameters and any known processing or transformation.
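The checklist above could be implemented in many ways; one option is a JSON parameter file in which every parameter carries its own documentation. In the sketch below, the parameter names, values, and the `load_parameters` helper are all hypothetical and purely illustrative. The file content is embedded as a string so the example is self-contained; in practice it would be read from a file in the repository.

```python
import json

# Illustrative parameter file content (values are made up for this example)
PARAMETER_FILE = """
{
  "mean_interarrival_time": {
    "value": 2.0,
    "unit": "minutes",
    "description": "Mean time between patient arrivals",
    "source": "Estimated by the input modelling script"
  },
  "mean_consultation_time": {
    "value": 9.0,
    "unit": "minutes",
    "description": "Mean length of a nurse consultation",
    "source": "Estimated by the input modelling script"
  }
}
"""

# Every parameter must state its value, unit, meaning, and origin
REQUIRED_KEYS = {"value", "unit", "description", "source"}


def load_parameters(text):
    """Parse the parameter file and check each entry is fully documented."""
    params = json.loads(text)
    for name, entry in params.items():
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"Parameter '{name}' is missing: {sorted(missing)}")
    return params


params = load_parameters(PARAMETER_FILE)
for name, entry in params.items():
    print(f"{name} = {entry['value']} {entry['unit']}")
```

Rejecting undocumented entries at load time keeps the parameter file and its documentation from drifting apart.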


🔓 If you can share the parameters:

  • Include parameter files in your repository
    Store parameter files alongside your model code and documentation.


You must share some parameters with your model so that others can run it. Parameters are often less sensitive than raw data, so sharing is usually possible, but this is not always the case.

🔒 If you cannot share the parameters:

  • Provide synthetic parameters
    Supply artificial values for each parameter, clearly labelled as synthetic.

  • Describe how synthetic parameters were generated
    Document the process or basis for generating synthetic values (e.g. entirely artificial, based on published ranges, expert opinion).

  • Explain access restrictions
    State why real parameters cannot be shared and provide contact information for requests, if appropriate.


5 Maintaining a private and public version of your model

You will likely have some data and/or code that you need to keep private and cannot share along with the simulation model. It’s important that both the private and public components are version controlled. One way of managing this is to have two separate repositories: a private repository and a public repository.

If the public repository contains the real parameters and results, it’s quite simple: use the private repository for processing input data, then switch to the public repository for running the model.

If the public repository only contains synthetic parameters, you’ll need to be able to run the simulation in the private repository with the real parameters and results, and also in the public repository with the synthetic parameters and results. To avoid duplicating the simulation code across both repositories, a good strategy is to develop your simulation code as a package. This package can be published on GitHub, PyPI, or simply installed locally. Your private repository can then import and use this package, allowing you to maintain a single version of the simulation code while keeping sensitive parameters and data private.
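The sketch below illustrates the idea with a deliberately simple single-server queue. Everything in it is an assumption made for illustration: the package name `clinic_sim`, the function `run_simulation`, the parameter names, and the synthetic values. The function is defined inline so the example runs on its own; in practice it would live in the shared package and be imported by both repositories.

```python
import random


# In practice this function would live in the shared package
# (hypothetical name: clinic_sim) and be imported by both repositories.
def run_simulation(params, n_patients=1000, seed=0):
    """Simulate a single-server FIFO queue; return mean wait in minutes."""
    rng = random.Random(seed)
    clock = 0.0            # arrival time of the current patient
    server_free_at = 0.0   # time at which the nurse next becomes free
    total_wait = 0.0
    for _ in range(n_patients):
        clock += rng.expovariate(1 / params["mean_interarrival_time"])
        start = max(clock, server_free_at)
        total_wait += start - clock
        service = rng.expovariate(1 / params["mean_service_time"])
        server_free_at = start + service
    return total_wait / n_patients


# Public repository: run with synthetic parameters (made-up values)
synthetic_params = {"mean_interarrival_time": 5.0, "mean_service_time": 4.0}
public_result = run_simulation(synthetic_params)
print(f"Mean wait with synthetic parameters: {public_result:.2f} minutes")

# Private repository: the same call, but with the real parameters loaded
# from a file that is never committed to the public repository.
```

Because only the parameter dictionary differs between the two repositories, a bug fix in the package reaches both sets of results without copying code.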


6 Further information