HDR UK Futures Testing in Research Workflows

This site contains materials for the testing module on HDR UK’s RSE001 Research Software Engineering training course. It was developed as part of the STARS project.


Case study

Our tutorial uses a simple example of importing a small patient dataset and carrying out basic descriptive analysis.

In the example we will:

  1. Import patient-level data from a CSV file and check that it has the expected columns.
  2. Derive waiting times from arrival and service datetimes.
  3. Compute simple summary statistics (mean, standard deviation, and 95% confidence interval) for waiting time.

Packages

import json
from pathlib import Path

import numpy as np
import pandas as pd
import scipy.stats as st

pd.set_option("display.max_columns", 8)

Import patient data

Our dataset is a small synthetic set of patient-level event times that could come from a healthcare setting (e.g., emergency department, outpatient clinic, doctor’s appointments). Each row records when a patient arrived and when service began.

Our function import_patient_data() reads the data from a CSV file and checks that it contains exactly the expected columns, in the expected order. It returns a pandas DataFrame, or raises a ValueError if the columns do not match.

def import_patient_data(path):
    """
    Import raw patient data and check that required columns are present.

    Parameters
    ----------
    path : str or pathlib.Path
        Path to the CSV file containing the patient data.

    Returns
    -------
    pandas.DataFrame
        Dataframe containing the raw patient-level data.

    Raises
    ------
    ValueError
        If the CSV file does not contain exactly the expected columns
        in the expected order.
    """
    df = pd.read_csv(Path(path))

    # Expected columns in the raw data (names and order must match)
    expected = [
        "PATIENT_ID",
        "ARRIVAL_DATE", "ARRIVAL_TIME",
        "SERVICE_DATE", "SERVICE_TIME"
    ]
    if list(df.columns) != expected:
        raise ValueError(
            f"Unexpected columns: {list(df.columns)} (expected {expected})"
        )

    return df

We can run this function on our example dataset patient_data.csv.

raw_data = import_patient_data("data/patient_data.csv")
raw_data
   PATIENT_ID ARRIVAL_DATE  ARRIVAL_TIME SERVICE_DATE  SERVICE_TIME
0           1   2025-01-01             1   2025-01-01             7
1           2   2025-01-01             2   2025-01-01             4
2           3   2025-01-01             3   2025-01-01            10
3           4   2025-01-01             7   2025-01-01            14
4           5   2025-01-01            10   2025-01-01            12
5           6   2025-01-01            10   2025-01-01            11
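
The column check in import_patient_data() compares lists, so it is sensitive to column order as well as column names. The sketch below illustrates that design choice on its own, without calling the function. The in-memory CSV text and the variable names are hypothetical and are not part of the tutorial's dataset.

```python
import io

import pandas as pd

# Hypothetical CSV text: the right column names, but PATIENT_ID is last
csv_text = (
    "ARRIVAL_DATE,ARRIVAL_TIME,SERVICE_DATE,SERVICE_TIME,PATIENT_ID\n"
    "2025-01-01,1,2025-01-01,7,1\n"
)
df = pd.read_csv(io.StringIO(csv_text))

expected = [
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME",
]

# List comparison is order-sensitive, so this reordered file does not match
order_matches = list(df.columns) == expected

# Set comparison ignores order, so the same file would match
names_match = set(df.columns) == set(expected)

print(order_matches, names_match)
```

With the list comparison used in import_patient_data(), a file like this one would trigger the ValueError. A set comparison would be a looser alternative if column order did not matter.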

Calculate waiting times

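Next, we convert the date and time fields into datetime columns and calculate each patient’s waiting time in minutes. The time columns store clock times as integers in HHMM form (for example, 7 means 00:07), so the function pads them to four digits before parsing.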

def calculate_wait_times(df):
    """
    Add arrival/service datetimes and waiting time in minutes.

    Parameters
    ----------
    df : pandas.DataFrame
        Patient-level data containing `ARRIVAL_DATE`, `ARRIVAL_TIME`,
        `SERVICE_DATE`, and `SERVICE_TIME` columns.

    Returns
    -------
    pandas.DataFrame
        Copy of the input DataFrame with additional columns:
        `arrival_datetime`, `service_datetime`, and `wait_minutes`.
    """
    df = df.copy()

    # Combine date and time columns into datetime columns
    for prefix in ("ARRIVAL", "SERVICE"):
        df[f"{prefix.lower()}_datetime"] = pd.to_datetime(
            df[f"{prefix}_DATE"].astype(str) +
            " " +
            df[f"{prefix}_TIME"].astype(str).str.zfill(4),
            format="%Y-%m-%d %H%M",
        )

    # Waiting time in minutes
    df["wait_minutes"] = (
        df["service_datetime"] - df["arrival_datetime"]
    ) / pd.Timedelta(minutes=1)

    return df

We then apply this function to the raw data.

processed_data = calculate_wait_times(raw_data)
processed_data
   PATIENT_ID ARRIVAL_DATE  ARRIVAL_TIME SERVICE_DATE  SERVICE_TIME  \
0           1   2025-01-01             1   2025-01-01             7   
1           2   2025-01-01             2   2025-01-01             4   
2           3   2025-01-01             3   2025-01-01            10   
3           4   2025-01-01             7   2025-01-01            14   
4           5   2025-01-01            10   2025-01-01            12   
5           6   2025-01-01            10   2025-01-01            11   

     arrival_datetime    service_datetime  wait_minutes  
0 2025-01-01 00:01:00 2025-01-01 00:07:00           6.0  
1 2025-01-01 00:02:00 2025-01-01 00:04:00           2.0  
2 2025-01-01 00:03:00 2025-01-01 00:10:00           7.0  
3 2025-01-01 00:07:00 2025-01-01 00:14:00           7.0  
4 2025-01-01 00:10:00 2025-01-01 00:12:00           2.0  
5 2025-01-01 00:10:00 2025-01-01 00:11:00           1.0  
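
To see the two key steps of calculate_wait_times() in isolation, here is a minimal sketch of the zero-padding and the timedelta division. The series `times` and `dates` hold made-up values chosen for illustration; they are not taken from patient_data.csv.

```python
import pandas as pd

# Hypothetical clock times stored as integers (7 means 00:07, 930 means 09:30)
times = pd.Series([7, 930, 1415])
dates = pd.Series(["2025-01-01"] * 3)

# Pad to four digits so every value has the HHMM shape that "%H%M" expects
padded = times.astype(str).str.zfill(4)
stamps = pd.to_datetime(dates + " " + padded, format="%Y-%m-%d %H%M")

# Dividing a timedelta by one minute gives the elapsed minutes as floats
minutes_since_first = (stamps - stamps.iloc[0]) / pd.Timedelta(minutes=1)

print(padded.tolist())               # ['0007', '0930', '1415']
print(minutes_since_first.tolist())  # [0.0, 563.0, 848.0]
```

Dividing by pd.Timedelta(minutes=1) returns floats rather than timedelta objects, which is convenient for the summary statistics that follow.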

Calculate summary statistics

Finally, we define a small utility function that calculates the mean, standard deviation, and a 95% confidence interval for a numeric series. The function handles small-sample edge cases explicitly and uses the t-distribution, which is appropriate when the sample size is modest.

def summary_stats(data):
    """
    Calculate mean, standard deviation and 95% confidence interval (CI).

    CI is calculated using the t-distribution, which is appropriate for
    small samples and converges to the normal distribution as the sample
    size increases.

    Parameters
    ----------
    data : pandas.Series
        Data to use in the calculation.

    Returns
    -------
    dict[str, float]
        A dictionary with keys `mean`, `std_dev`, `ci_lower` and `ci_upper`.
        Each value is a float, or `numpy.nan` if it can't be computed.
    """
    # Drop missing values
    data = data.dropna()

    # Find number of observations
    count = len(data)

    # If there are no observations, then set all to NaN
    if count == 0:
        mean, std_dev, ci_lower, ci_upper = np.nan, np.nan, np.nan, np.nan

    # With only 1 or 2 observations, report the mean and leave the rest as NaN
    elif count < 3:
        mean = data.mean()
        std_dev, ci_lower, ci_upper = np.nan, np.nan, np.nan

    # With more than two observations, can calculate all...
    else:
        mean = data.mean()
        std_dev = data.std()

        # If there is no variation, then CI is equal to the mean
        if np.var(data) == 0:
            ci_lower, ci_upper = mean, mean
        else:
            # 95% CI based on the t-distribution
            ci_lower, ci_upper = st.t.interval(
                confidence=0.95,
                df=count-1,
                loc=mean,
                scale=st.sem(data)
            )

    return {
        "mean": mean,
        "std_dev": std_dev,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper
    }

We apply summary_stats() to the waiting times.

results = summary_stats(processed_data["wait_minutes"])

# Format dictionary for display
formatted_results = json.dumps(results, indent=4)
print(formatted_results)
{
    "mean": 4.166666666666667,
    "std_dev": 2.786873995477131,
    "ci_lower": 1.2420217719136457,
    "ci_upper": 7.091311561419689
}
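
As a cross-check on these figures, the interval can be rebuilt by hand as the mean plus or minus a t critical value times the standard error of the mean. The sketch below assumes the six waiting times shown in the processed data and uses `st.t.ppf` in place of `st.t.interval`. The variable names are illustrative only.

```python
import numpy as np
import scipy.stats as st

# Waiting times in minutes, copied from the wait_minutes column above
waits = np.array([6.0, 2.0, 7.0, 7.0, 2.0, 1.0])

n = len(waits)
mean = waits.mean()
std_dev = waits.std(ddof=1)   # sample standard deviation, as pandas computes it
sem = std_dev / np.sqrt(n)    # standard error of the mean

# Two-sided 95% critical value from the t-distribution with n - 1 degrees
# of freedom
t_crit = st.t.ppf(0.975, df=n - 1)
ci_lower = mean - t_crit * sem
ci_upper = mean + t_crit * sem

print(round(mean, 4), round(std_dev, 4))
print(round(ci_lower, 4), round(ci_upper, 4))
```

The rounded values should agree with the output of summary_stats() above (approximately 4.1667, 2.7869, 1.2420 and 7.0913).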
 
  • Code licence: MIT. Text licence: CC-BY-SA 4.0.