HDR UK Futures Testing in Research Workflows

This site contains materials for the testing module on HDR UK’s RSE001 Research Software Engineering training course. It was developed as part of the STARS project.

Case study

The code examples are given in both Python and R; where both appear, Python comes first.


Our tutorial uses a simple example of importing a small patient dataset and carrying out basic descriptive analysis.

In the example we will:

  1. Import patient-level data from a CSV file and check that it has the expected columns.
  2. Derive waiting times from arrival and service datetimes.
  3. Compute simple summary statistics (mean, standard deviation, and 95% confidence interval) for waiting time.

Packages

import json
from pathlib import Path

import numpy as np
import pandas as pd
import scipy.stats as st

pd.set_option("display.max_columns", 8)

library(readr)
library(dplyr)
library(lubridate)

Import patient data

Our dataset is a small synthetic set of patient-level event times that could come from a healthcare setting (e.g., emergency department, outpatient clinic, doctor’s appointments). Each row records when a patient arrived and when service began.

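Our function import_patient_data() reads the data from a CSV file and checks that it contains exactly the expected columns, in the expected order. If the check passes, it returns the data as a data frame (a pandas DataFrame in Python, a tibble in R); otherwise it raises an error.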

def import_patient_data(path):
    """
    Import raw patient data and check that required columns are present.

    Parameters
    ----------
    path : str or pathlib.Path
        Path to the CSV file containing the patient data.

    Returns
    -------
    pandas.DataFrame
        Dataframe containing the raw patient-level data.

    Raises
    ------
    ValueError
        If the CSV file does not contain exactly the expected columns
        in the expected order.
    """
    df = pd.read_csv(Path(path))

    # Expected columns in the raw data (names and order must match)
    expected = [
        "PATIENT_ID",
        "ARRIVAL_DATE", "ARRIVAL_TIME",
        "SERVICE_DATE", "SERVICE_TIME"
    ]
    if list(df.columns) != expected:
        raise ValueError(
            f"Unexpected columns: {list(df.columns)} (expected {expected})"
        )

    return df

#' Import raw patient data and check that required columns are present.
#'
#' Raises an error if the CSV file does not contain exactly the expected 
#' columns in the expected order.
#'
#' @param path Character string giving path to the CSV file containing the 
#'   patient data.
#'
#' @return A data frame containing the raw patient-level data.
#'
#' @export
import_patient_data <- function(path) {
  df <- readr::read_csv(path, show_col_types = FALSE)

  # Expected columns in the raw data (names and order must match)
  expected <- c(
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME"
  )
  if (!identical(colnames(df), expected)) {
    stop(
      sprintf(
        "Unexpected columns: %s (expected %s)",
        paste(colnames(df), collapse = ", "),
        paste(expected, collapse = ", ")
      )
    )
  }

  return(df)
}

We can run this function on our example dataset patient_data.csv.

raw_data = import_patient_data(
    "../examples/python_package/data/patient_data.csv"
)
raw_data
   PATIENT_ID ARRIVAL_DATE  ARRIVAL_TIME SERVICE_DATE  SERVICE_TIME
0           1   2025-01-01             1   2025-01-01             7
1           2   2025-01-01             2   2025-01-01             4
2           3   2025-01-01             3   2025-01-01            10
3           4   2025-01-01             7   2025-01-01            14
4           5   2025-01-01            10   2025-01-01            12
5           6   2025-01-01            10   2025-01-01            11

raw_data <- import_patient_data(
  file.path("..", "examples", "r_package", "inst", "extdata", "patient_data.csv")
)
raw_data
# A tibble: 6 × 5
  PATIENT_ID ARRIVAL_DATE ARRIVAL_TIME SERVICE_DATE SERVICE_TIME
       <dbl> <date>       <chr>        <date>       <chr>       
1          1 2025-01-01   0001         2025-01-01   0007        
2          2 2025-01-01   0002         2025-01-01   0004        
3          3 2025-01-01   0003         2025-01-01   0010        
4          4 2025-01-01   0007         2025-01-01   0014        
5          5 2025-01-01   0010         2025-01-01   0012        
6          6 2025-01-01   0010         2025-01-01   0011        
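
To see the column check fail, the sketch below writes a small CSV that is missing SERVICE_TIME and passes it to a condensed copy of import_patient_data(). The file name bad_patient_data.csv and the helper check_rejects_bad_columns() are hypothetical names used only for this illustration; the check itself is the one defined above.

```python
import tempfile
from pathlib import Path

import pandas as pd

EXPECTED = [
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME",
]


def import_patient_data(path):
    """Condensed copy of the function above (same column check)."""
    df = pd.read_csv(Path(path))
    if list(df.columns) != EXPECTED:
        raise ValueError(
            f"Unexpected columns: {list(df.columns)} (expected {EXPECTED})"
        )
    return df


def check_rejects_bad_columns():
    """Write a CSV with a missing column and return the error message."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical file: SERVICE_TIME has been left out on purpose
        bad_csv = Path(tmp) / "bad_patient_data.csv"
        bad_csv.write_text(
            "PATIENT_ID,ARRIVAL_DATE,ARRIVAL_TIME,SERVICE_DATE\n"
            "1,2025-01-01,0001,2025-01-01\n"
        )
        try:
            import_patient_data(bad_csv)
        except ValueError as err:
            return str(err)
    return None


message = check_rejects_bad_columns()
print(message)
```

The message lists the columns that were found next to the columns that were expected, which makes a renamed, missing, or reordered column easy to spot.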

Calculate waiting times

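Next, we combine the date and time fields into datetime columns and calculate each patient’s waiting time in minutes. The time fields hold 24-hour HHMM values, which pandas reads as integers (so 0007 becomes 7); both implementations therefore zero-pad them to four digits before parsing.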

def calculate_wait_times(df):
    """
    Add arrival/service datetimes and waiting time in minutes.

    Parameters
    ----------
    df : pandas.DataFrame
        Patient-level data containing `ARRIVAL_DATE`, `ARRIVAL_TIME`,
        `SERVICE_DATE`, and `SERVICE_TIME` columns.

    Returns
    -------
    pandas.DataFrame
        Copy of the input DataFrame with additional columns:
        `arrival_datetime`, `service_datetime`, and `waittime`.
    """
    df = df.copy()

    # Combine date and time columns into datetime columns
    for prefix in ("ARRIVAL", "SERVICE"):
        df[f"{prefix.lower()}_datetime"] = pd.to_datetime(
            df[f"{prefix}_DATE"].astype(str) +
            " " +
            df[f"{prefix}_TIME"].astype(str).str.zfill(4),
            format="%Y-%m-%d %H%M"
        )

    # Waiting time in minutes
    df["waittime"] = (
        df["service_datetime"] - df["arrival_datetime"]
    ) / pd.Timedelta(minutes=1)

    return df

#' Add arrival/service datetimes and waiting time in minutes.
#'
#' @param df Data frame with patient-level data containing `ARRIVAL_DATE`, 
#'   `ARRIVAL_TIME`, `SERVICE_DATE`, and `SERVICE_TIME` columns.
#'
#' @return A copy of the input data frame with additional columns:
#'   `arrival_datetime`, `service_datetime`, and `waittime`.
#'
#' @export
calculate_wait_times <- function(df) {
  df <- df |>
    dplyr::mutate(
      arrival_datetime = lubridate::ymd_hm(
        paste(
          as.character(ARRIVAL_DATE),
          sprintf("%04d", as.integer(ARRIVAL_TIME))
        )
      ),
      service_datetime = lubridate::ymd_hm(
        paste(
          as.character(SERVICE_DATE),
          sprintf("%04d", as.integer(SERVICE_TIME))
        )
      )
    )

  if (any(is.na(df$arrival_datetime) | is.na(df$service_datetime))) {
    stop(
      "Failed to parse arrival or service datetimes; ",
      "check for missing or invalid dates/times."
    )
  }

  df <- df |>
    dplyr::mutate(
      waittime = as.numeric(
        difftime(service_datetime, arrival_datetime, units = "mins")
      )
    )

  df
}

We then apply this function to the raw data.

processed_data = calculate_wait_times(raw_data)
processed_data
   PATIENT_ID ARRIVAL_DATE  ARRIVAL_TIME SERVICE_DATE  SERVICE_TIME  \
0           1   2025-01-01             1   2025-01-01             7   
1           2   2025-01-01             2   2025-01-01             4   
2           3   2025-01-01             3   2025-01-01            10   
3           4   2025-01-01             7   2025-01-01            14   
4           5   2025-01-01            10   2025-01-01            12   
5           6   2025-01-01            10   2025-01-01            11   

     arrival_datetime    service_datetime  waittime  
0 2025-01-01 00:01:00 2025-01-01 00:07:00       6.0  
1 2025-01-01 00:02:00 2025-01-01 00:04:00       2.0  
2 2025-01-01 00:03:00 2025-01-01 00:10:00       7.0  
3 2025-01-01 00:07:00 2025-01-01 00:14:00       7.0  
4 2025-01-01 00:10:00 2025-01-01 00:12:00       2.0  
5 2025-01-01 00:10:00 2025-01-01 00:11:00       1.0  

processed_data <- calculate_wait_times(raw_data)
processed_data
# A tibble: 6 × 8
  PATIENT_ID ARRIVAL_DATE ARRIVAL_TIME SERVICE_DATE SERVICE_TIME
       <dbl> <date>       <chr>        <date>       <chr>       
1          1 2025-01-01   0001         2025-01-01   0007        
2          2 2025-01-01   0002         2025-01-01   0004        
3          3 2025-01-01   0003         2025-01-01   0010        
4          4 2025-01-01   0007         2025-01-01   0014        
5          5 2025-01-01   0010         2025-01-01   0012        
6          6 2025-01-01   0010         2025-01-01   0011        
# ℹ 3 more variables: arrival_datetime <dttm>, service_datetime <dttm>,
#   waittime <dbl>
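
The short sketch below isolates the two details that matter in this step: zero-padding the HHMM values, and using full datetimes so that a wait spanning midnight is still positive. The two rows are made up for illustration and are not part of patient_data.csv.

```python
import pandas as pd

# Hypothetical rows: the second patient arrives before midnight and is
# seen after it, so the service date is one day later.
df = pd.DataFrame({
    "ARRIVAL_DATE": ["2025-01-01", "2025-01-01"],
    "ARRIVAL_TIME": [7, 2350],
    "SERVICE_DATE": ["2025-01-01", "2025-01-02"],
    "SERVICE_TIME": [14, 10],
})

# Zero-pad so that 7 becomes "0007" and matches the %H%M format
padded = df["ARRIVAL_TIME"].astype(str).str.zfill(4)

arrival = pd.to_datetime(
    df["ARRIVAL_DATE"] + " " + padded, format="%Y-%m-%d %H%M"
)
service = pd.to_datetime(
    df["SERVICE_DATE"] + " " + df["SERVICE_TIME"].astype(str).str.zfill(4),
    format="%Y-%m-%d %H%M",
)

# Dividing a timedelta by one minute gives the wait as a float in minutes
waittime = (service - arrival) / pd.Timedelta(minutes=1)

print(padded.tolist())    # ['0007', '2350']
print(waittime.tolist())  # [7.0, 20.0]
```

Subtracting the time fields alone would give a negative wait for the second row; combining each time with its date avoids that.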

Calculate summary statistics

Finally, we define a small utility function that calculates the mean, standard deviation, and a 95% confidence interval for a numeric series. The function handles small‑sample edge cases explicitly and uses the t‑distribution, which is appropriate when the sample size is modest.

def summary_stats(data):
    """
    Calculate mean, standard deviation and 95% confidence interval (CI).

    CI is calculated using the t-distribution, which is appropriate for
    small samples and converges to the normal distribution as the sample
    size increases.

    Parameters
    ----------
    data : pandas.Series
        Data to use in the calculation.

    Returns
    -------
    dict[str, float]
        A dictionary with keys `mean`, `std_dev`, `ci_lower` and `ci_upper`.
        Each value is a float, or `numpy.nan` if it can't be computed.
    """
    # Drop missing values
    data = data.dropna()

    # Find number of observations
    count = len(data)

    # If there are no observations, then set all to NaN
    if count == 0:
        mean, std_dev, ci_lower, ci_upper = np.nan, np.nan, np.nan, np.nan

    # If there are 1 or 2 observations, can do mean but not other statistics
    elif count < 3:
        mean = data.mean()
        std_dev, ci_lower, ci_upper = np.nan, np.nan, np.nan

    # With more than two observations, can calculate all...
    else:
        mean = data.mean()
        std_dev = data.std()

        # If there is no variation, then CI is equal to the mean
        if np.var(data) == 0:
            ci_lower, ci_upper = mean, mean
        else:
            # 95% CI based on the t-distribution
            ci_lower, ci_upper = st.t.interval(
                confidence=0.95,
                df=count-1,
                loc=mean,
                scale=st.sem(data)
            )

    return {
        "mean": mean,
        "std_dev": std_dev,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper
    }

#' Calculate mean, standard deviation and 95% confidence interval (CI).
#'
#' CI is calculated using the t-distribution, which is appropriate for
#' small samples and converges to the normal distribution as the sample
#' size increases.
#'
#' @param data Numeric vector of data to use in the calculation.
#'
#' @return A named list with elements `mean`, `std_dev`, `ci_lower` and 
#'   `ci_upper`. Each value is a numeric, or `NA` if it can't be computed.
#'
#' @export
summary_stats <- function(data) {
  tibble::tibble(value = data) |>
    dplyr::reframe(
      n_complete = sum(!is.na(value)),
      mean = mean(value, na.rm = TRUE),
      std_dev = stats::sd(value, na.rm = TRUE),
      ci_lower   = {
        if (n_complete < 2L) {
          NA_real_
        } else if (std_dev == 0 || is.na(std_dev)) {
          mean       # CI collapses to mean when no variation
        } else {
          stats::t.test(value)$conf.int[1L]
        }
      },
      ci_upper   = {
        if (n_complete < 2L) {
          NA_real_
        } else if (std_dev == 0 || is.na(std_dev)) {
          mean       # CI collapses to mean when no variation
        } else {
          stats::t.test(value)$conf.int[2L]
        }
      }
    ) |>
    dplyr::select(-n_complete) |>
    as.list()
}

We apply summary_stats() to the waiting times.

results = summary_stats(processed_data["waittime"])

# Format dictionary for display
formatted_results = json.dumps(results, indent=4)
print(formatted_results)
{
    "mean": 4.166666666666667,
    "std_dev": 2.786873995477131,
    "ci_lower": 1.2420217719136457,
    "ci_upper": 7.091311561419689
}

results <- summary_stats(processed_data[["waittime"]])
results
$mean
[1] 4.166667

$std_dev
[1] 2.786874

$ci_lower
[1] 1.242022

$ci_upper
[1] 7.091312
 
  • Code licence: MIT. Text licence: CC-BY-SA 4.0.