System tests
What is a system test?
System tests are broader than unit tests. They focus on a whole feature or workflow, often involving several functions or classes working together. They check whether this end-to-end behaviour gives the correct outputs for given inputs, matching our requirements or expected results.
Another useful type of test that sits between unit and system tests is the integration test.
Integration tests focus on how two or more components work together (e.g., how a data import function hands data to a processing function), without necessarily running the whole workflow end-to-end.
In this small case study, adding separate integration tests would not add much beyond our unit and system tests, but in larger projects they are very helpful for checking the interactions between parts of your code.
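For illustration, an integration test for this case study might check just the hand-off between the import and calculation steps, without asserting on the final statistics. This is a minimal Python sketch, assuming the same fixtures and column names used in the system tests below:

```python
def test_import_feeds_calculate(tmp_path):
    """Integration test: import output is accepted by calculate_wait_times."""
    test_data = pd.DataFrame({
        "PATIENT_ID": ["p1"],
        "ARRIVAL_DATE": ["2024-01-01"],
        "ARRIVAL_TIME": ["0800"],
        "SERVICE_DATE": ["2024-01-01"],
        "SERVICE_TIME": ["0830"],
    })
    csv_path = tmp_path / "patients.csv"
    test_data.to_csv(csv_path, index=False)

    # Exercise only the first two steps: does the imported frame flow
    # into the calculation step and gain a waittime column?
    df = calculate_wait_times(import_patient_data(csv_path))
    assert "waittime" in df.columns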
Example: waiting times case study
We will return to our waiting times case study, which involves three functions:
- `import_patient_data()` - imports raw patient data and checks that the required columns are present.
- `calculate_wait_times()` - adds arrival and service datetimes, and the waiting time in minutes.
- `summary_stats()` - calculates the mean, standard deviation and 95% confidence interval.
Unlike unit tests, which check each function in isolation, system tests run all three steps together and verify the end‑to‑end workflow produces correct results.
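In outline, the end-to-end workflow we will exercise is simply the three calls chained together (the file path here is illustrative):

```python
df = import_patient_data("data/patients.csv")  # illustrative path
df = calculate_wait_times(df)
stats = summary_stats(df["waittime"])
```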
We will need the following imports in our test script:

```python
import numpy as np
import pandas as pd
import pytest

from waitingtimes.patient_analysis import (
    import_patient_data, calculate_wait_times, summary_stats
)
```

The R tests below call the same three functions from the R implementation of the package, with tibble and readr used to build the test fixtures.
How to write system tests
1. Identify the feature or workflow to test
Start by choosing the complete feature or workflow you want to validate.
In our case study, we only have a simple three-step pipeline - but more complex projects may have multiple intersecting workflows you want to focus on.
2. Define inputs and expected outputs
Think about realistic scenarios that cover:
- Clean/positive/success cases: standard inputs where everything should work correctly, including realistic variations in the inputs (e.g., different sample sizes, different distributions in the data input).
- Edge/extreme cases: unusual but plausible inputs (e.g., unusual sample sizes, boundary values); a sketch of one such test follows this list.
- Error/negative/dirty cases: invalid inputs that should trigger errors.
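As an example of a boundary value, a system test could check a wait that crosses midnight. This is a sketch, assuming the same package functions and column names as the tests below; two identical rows are used so the standard deviation is defined:

```python
def test_wait_crossing_midnight(tmp_path):
    """Workflow should handle waits that span midnight."""
    # Edge case: arrival just before midnight, service just after
    test_data = pd.DataFrame({
        "PATIENT_ID": ["p1", "p2"],
        "ARRIVAL_DATE": ["2024-01-01", "2024-01-01"],
        "ARRIVAL_TIME": ["2350", "2350"],
        "SERVICE_DATE": ["2024-01-02", "2024-01-02"],
        "SERVICE_TIME": ["0020", "0020"],
    })
    csv_path = tmp_path / "patients.csv"
    test_data.to_csv(csv_path, index=False)

    df = import_patient_data(csv_path)
    df = calculate_wait_times(df)
    stats = summary_stats(df["waittime"])

    # 23:50 on 1 Jan to 00:20 on 2 Jan is a 30-minute wait
    assert stats["mean"] == 30.0
    assert stats["std_dev"] == 0.0
```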
3. Write tests for these scenarios
Clean case: typical data
This test confirms the workflow succeeds with standard inputs and produces correct summary statistics.
```python
def test_workflow_success(tmp_path):
    """Complete workflow should calculate correct wait statistics."""
    # Create test data with known values
    test_data = pd.DataFrame({
        "PATIENT_ID": ["p1", "p2", "p3"],
        "ARRIVAL_DATE": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "ARRIVAL_TIME": ["0800", "0930", "1015"],
        "SERVICE_DATE": ["2024-01-01", "2024-01-01", "2024-01-02"],
        "SERVICE_TIME": ["0830", "1000", "1045"],
    })

    # Write test CSV
    csv_path = tmp_path / "patients.csv"
    test_data.to_csv(csv_path, index=False)

    # Run complete workflow
    df = import_patient_data(csv_path)
    df = calculate_wait_times(df)
    stats = summary_stats(df["waittime"])

    # Verify the workflow produces correct results
    # Expected wait times: 30, 30, 30 minutes
    assert stats["mean"] == 30.0
    assert stats["std_dev"] == 0.0
    assert stats["ci_lower"] == 30.0
    assert stats["ci_upper"] == 30.0
```

```r
test_that("complete workflow should calculate correct wait statistics", {
  # Create test data with known values
  test_data <- tibble::tibble(
    PATIENT_ID = c("p1", "p2", "p3"),
    ARRIVAL_DATE = c("2024-01-01", "2024-01-01", "2024-01-02"),
    ARRIVAL_TIME = c("0800", "0930", "1015"),
    SERVICE_DATE = c("2024-01-01", "2024-01-01", "2024-01-02"),
    SERVICE_TIME = c("0830", "1000", "1045")
  )

  # Write test CSV
  csv_path <- tempfile(fileext = ".csv")
  readr::write_csv(test_data, csv_path)

  # Run complete workflow
  df <- import_patient_data(csv_path)
  df <- calculate_wait_times(df)
  stats <- summary_stats(df$waittime)

  # Verify the workflow produces correct results
  # Expected wait times: 30, 30, 30 minutes
  expect_identical(stats$mean, 30)
  expect_identical(stats$std_dev, 0)
  expect_identical(stats$ci_lower, 30)
  expect_identical(stats$ci_upper, 30)
})
```

Clean case: variation using data with different distributions
This test confirms the workflow handles realistic variation in wait times.
```python
def test_workflow_with_variation(tmp_path):
    """Workflow should correctly compute statistics for variable wait times."""
    # Create test data with known wait times: 15, 30, 45 minutes
    test_data = pd.DataFrame({
        "PATIENT_ID": ["p1", "p2", "p3"],
        "ARRIVAL_DATE": ["2024-01-01", "2024-01-01", "2024-01-01"],
        "ARRIVAL_TIME": ["0800", "0900", "1000"],
        "SERVICE_DATE": ["2024-01-01", "2024-01-01", "2024-01-01"],
        "SERVICE_TIME": ["0815", "0930", "1045"],
    })
    csv_path = tmp_path / "patients.csv"
    test_data.to_csv(csv_path, index=False)

    # Run complete workflow
    df = import_patient_data(csv_path)
    df = calculate_wait_times(df)
    stats = summary_stats(df["waittime"])

    # Verify mean and standard deviation
    assert stats["mean"] == 30
    assert np.isclose(stats["std_dev"], 15)

    # The CI should bracket the mean for this small sample
    assert stats["ci_lower"] < stats["mean"] < stats["ci_upper"]
```

```r
test_that("workflow should give correct statistics for variable wait times", {
  # Create test data with known wait times: 15, 30, 45 minutes
  test_data <- tibble::tibble(
    PATIENT_ID = c("p1", "p2", "p3"),
    ARRIVAL_DATE = c("2024-01-01", "2024-01-01", "2024-01-01"),
    ARRIVAL_TIME = c("0800", "0900", "1000"),
    SERVICE_DATE = c("2024-01-01", "2024-01-01", "2024-01-01"),
    SERVICE_TIME = c("0815", "0930", "1045")
  )
  csv_path <- tempfile(fileext = ".csv")
  readr::write_csv(test_data, csv_path)

  # Run complete workflow
  df <- import_patient_data(csv_path)
  df <- calculate_wait_times(df)
  stats <- summary_stats(df$waittime)

  # Verify mean and standard deviation
  expect_identical(stats$mean, 30)
  expect_equal(stats$std_dev, 15, tolerance = 1e-8)

  # The CI should bracket the mean for this small sample
  expect_lt(stats$ci_lower, stats$mean)
  expect_gt(stats$ci_upper, stats$mean)
})
```
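Note that the final assertions only check that the confidence interval brackets the mean, rather than pinning exact bounds. With a sample mean of 30 and standard deviation of 15 over n = 3 observations, a t-based 95% interval would be 30 ± 4.30 × 15/√3 ≈ (-7.3, 67.3), while a normal approximation gives roughly (13.0, 47.0). Either way the interval contains the mean, so this looser assertion keeps the test valid whichever method summary_stats uses.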
Error case: invalid input data
This test confirms the workflow fails appropriately when given invalid data.
```python
def test_missing_date_error(tmp_path):
    """Workflow should raise an error when dates are missing."""
    test_data = pd.DataFrame({
        "PATIENT_ID": ["p1", "p2", "p3"],
        "ARRIVAL_DATE": ["2024-01-01", "2024-01-01", "2024-01-01"],
        "ARRIVAL_TIME": ["0800", "0900", "1000"],
        "SERVICE_DATE": ["2024-01-01", pd.NaT, "2024-01-01"],
        "SERVICE_TIME": ["0830", "1000", "1045"],
    })
    csv_path = tmp_path / "patients.csv"
    test_data.to_csv(csv_path, index=False)

    # Workflow should fail when calculating wait times with missing dates
    df = import_patient_data(csv_path)
    with pytest.raises(ValueError, match="time data"):
        df = calculate_wait_times(df)
```

```r
test_that("workflow should raise error when dates are missing", {
  test_data <- tibble::tibble(
    PATIENT_ID = c("p1", "p2", "p3"),
    ARRIVAL_DATE = c("2024-01-01", "2024-01-01", "2024-01-01"),
    ARRIVAL_TIME = c("0800", "0900", "1000"),
    SERVICE_DATE = c("2024-01-01", NA, "2024-01-01"),
    SERVICE_TIME = c("0830", "1000", "1045")
  )
  csv_path <- tempfile(fileext = ".csv")
  readr::write_csv(test_data, csv_path)

  # Workflow should fail when calculating wait times with missing dates
  # Will also produce a warning from ymd_hm() about returning NA
  df <- import_patient_data(csv_path)
  expect_warning(
    expect_error(
      calculate_wait_times(df),
      regexp = "Failed to parse arrival or service datetimes"
    ),
    regexp = "failed to parse"
  )
})
```

Running our example tests

Running the Python tests with pytest produces output like:
```
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /__w/stars-testing-intro/stars-testing-intro/examples/python_package
configfile: pyproject.toml
plugins: cov-7.0.0
collected 3 items

../examples/python_package/tests/test_system.py ...                      [100%]

============================== 3 passed in 0.99s ===============================
<ExitCode.OK: 0>
```
Running the R tests with testthat gives:

```
══ Testing test_system.R ═══════════════════════════════════════════════════════
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 0 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 1 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 2 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 3 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 4 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 5 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 6 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 7 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 8 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 9 ]
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 10 ] Done!
```
When to stop writing tests
You cannot test everything. You’ve written enough tests when:
- Critical workflows are covered with at least one success case.
- Input variations and edge cases are tested.
- Key error conditions are verified.
Focus your testing effort on workflows that matter to your research and scenarios you’re likely to encounter in practice. You’re building confidence in your code, not trying to test every theoretical possibility.