Back tests

Examples on this page are shown in both Python and R.
What is a back test?

A back test involves running your workflow on historical data and confirming that results are consistent over time.

It’s not focused on whether results are theoretically correct. It’s about consistency and reproducibility - confirming that code changes, environment updates, or data pipeline tweaks have not silently changed results.

This overlaps with functional testing but is distinct: functional tests ask “is this output correct?”, whereas back tests ask “is this output the same as before?”.
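For instance, here is a minimal sketch of the difference, using a toy mean_wait function that is not part of our case study:

def mean_wait(waits):
    """Toy calculation used only to illustrate the two kinds of test."""
    return sum(waits) / len(waits)


def test_mean_wait_is_correct():
    # Functional test: the expected value is worked out by hand
    assert mean_wait([10, 20]) == 15


def test_mean_wait_is_unchanged():
    # Back test: the expected value is simply whatever a previous trusted run produced
    historical_waits = [12, 7, 30, 5]           # stands in for a historical dataset
    assert mean_wait(historical_waits) == 13.5  # value recorded from an earlier run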

Example: waiting times case study

We will run back tests using the dataset we introduced for our waiting times case study.

In our Python test file, we need the following imports:

from pathlib import Path
import numpy as np
from waitingtimes.patient_analysis import (
    import_patient_data, calculate_wait_times, summary_stats
)

Back test

def test_reproduction():
    """Re-running on historical data should produce consistent results."""
    # Specify path to historical data
    csv_path = Path(__file__).parent.joinpath("data/patient_data.csv")

    # Run functions
    df = import_patient_data(csv_path)
    df = calculate_wait_times(df)
    stats = summary_stats(df["waittime"])

    # Verify the workflow produces consistent results
    assert np.isclose(stats["mean"], 4.1666, rtol=0.0001)
    assert np.isclose(stats["std_dev"], 2.7869, rtol=0.0001)
    assert np.isclose(stats["ci_lower"], 1.2420, rtol=0.0001)
    assert np.isclose(stats["ci_upper"], 7.0913, rtol=0.0001)
test_that("re-running on historical data produces consistent results", {
  # Re-running on historical data should produce consistent results.

  # Specify path to historical data
  csv_path <- testthat::test_path("data", "patient_data.csv")

  # Run functions
  df <- import_patient_data(csv_path)
  df <- calculate_wait_times(df)
  stats <- summary_stats(df$waittime)

  # Verify the workflow produces consistent results
  expect_equal(stats$mean,     4.1666, tolerance = 1e-4)
  expect_equal(stats$std_dev,  2.7869, tolerance = 1e-4)
  expect_equal(stats$ci_lower, 1.2420, tolerance = 1e-4)
  expect_equal(stats$ci_upper, 7.0913, tolerance = 1e-4)
})

Running our example test
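The output below comes from running the test file with pytest. One way to invoke it from Python is shown here (the path to the test file is an assumption about your project layout):

import pytest

# Run only the back test file; pytest.main returns an ExitCode (OK == 0 on success)
pytest.main(["tests/test_back.py"])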

Note: test output
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /__w/hdruk_tests/hdruk_tests/examples/python_package
configfile: pyproject.toml
plugins: cov-7.0.0
collected 1 item

../examples/python_package/tests/test_back.py .                          [100%]

============================== 1 passed in 0.92s ===============================
<ExitCode.OK: 0>

══ Testing test_back.R ═════════════════════════════════════════════════════════

[ FAIL 0 | WARN 0 | SKIP 0 | PASS 0 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 1 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 2 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 3 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 4 ] Done!

When should you update your back tests?

Errors: If you identify an error in your pipeline, you first fix the code and then deliberately update the back test in isolation, so you know the only change in behaviour is the error fix and not something unintended elsewhere.

Changes over time: As your research evolves, you may update the workflow (e.g., improve the wait time calculation method) or use more recent datasets. You can keep the old back test running alongside new ones - this verifies that changes to the workflow don’t accidentally alter results on historical data, while new back tests validate that updated methods work correctly on current data.
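As a sketch, the two back tests can sit side by side in the same test file (the newer dataset file name and its expected mean below are placeholders, not real results):

def test_reproduction_historical():
    # Existing back test, unchanged: historical data with previously recorded results
    csv_path = Path(__file__).parent.joinpath("data/patient_data.csv")
    df = calculate_wait_times(import_patient_data(csv_path))
    stats = summary_stats(df["waittime"])
    assert np.isclose(stats["mean"], 4.1666, rtol=0.0001)


def test_reproduction_current():
    # New back test: the updated workflow run on a more recent dataset
    # (file name and expected value below are placeholders)
    csv_path = Path(__file__).parent.joinpath("data/patient_data_2024.csv")
    df = calculate_wait_times(import_patient_data(csv_path))
    stats = summary_stats(df["waittime"])
    assert np.isclose(stats["mean"], 3.9, rtol=0.0001)

If an updated calculation method deliberately changes results on the historical data, update the old back test's expected values in the same deliberate, isolated way as described for error fixes above.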
