Unit tests
What is a unit test?
A unit test checks one small, isolated unit of code, usually a single function or method. Your aim is to check that, for specific inputs, the function behaves exactly as promised.
Example: import_patient_data()
Let’s use the import_patient_data() function from our case study. We will import it into our test script, alongside the other required packages.

import pandas as pd
import pytest
from waitingtimes.patient_analysis import import_patient_data

Its main behaviours are that it:
- Reads a CSV into a pandas DataFrame.
- Requires the columns to match a specific list exactly (names and order).
- Raises a ValueError if columns are incorrect.
- Returns a DataFrame with raw patient-level data.
In R, the function behaves in the same way. Its main behaviours are that it:
- Reads a CSV into a dataframe.
- Requires the columns to match a specific list exactly (names and order).
- Stops if columns are incorrect.
- Returns a dataframe with raw patient-level data.
import_patient_data()

from pathlib import Path

import pandas as pd


def import_patient_data(path):
    """
    Import raw patient data and check that required columns are present.

    Parameters
    ----------
    path : str or pathlib.Path
        Path to the CSV file containing the patient data.

    Returns
    -------
    pandas.DataFrame
        Dataframe containing the raw patient-level data.

    Raises
    ------
    ValueError
        If the CSV file does not contain exactly the expected columns
        in the expected order.
    """
    df = pd.read_csv(Path(path))

    # Expected columns in the raw data (names and order must match)
    expected = [
        "PATIENT_ID",
        "ARRIVAL_DATE", "ARRIVAL_TIME",
        "SERVICE_DATE", "SERVICE_TIME"
    ]
    if list(df.columns) != expected:
        raise ValueError(
            f"Unexpected columns: {list(df.columns)} (expected {expected})"
        )

    return df

#' Import raw patient data and check that required columns are present.
#'
#' Raises an error if the CSV file does not contain exactly the expected
#' columns in the expected order.
#'
#' @param path Character string giving path to the CSV file containing the
#' patient data.
#'
#' @return A data frame containing the raw patient-level data.
#'
#' @export
import_patient_data <- function(path) {
  df <- readr::read_csv(path, show_col_types = FALSE)

  # Expected columns in the raw data (names and order must match)
  expected <- c(
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME"
  )
  if (!identical(colnames(df), expected)) {
    stop(
      sprintf(
        "Unexpected columns: %s (expected %s)",
        paste(colnames(df), collapse = ", "),
        paste(expected, collapse = ", ")
      )
    )
  }

  return(df)
}

How to write unit tests
1. Start from the docstring
Always write docstrings for your code (see our docstring tutorial if you need guidance).
Your docstring makes promises about how the function should behave. For import_patient_data(), the docstring promises it will:
- Accept str or Path as the path parameter.
- Return a pandas DataFrame.
- Raise ValueError if columns are incorrect.
In R, the roxygen documentation promises it will:
- Return a dataframe.
- Stop if columns are incorrect.
Each of these becomes something you can check with a test.
2. Define what “success” looks like
Pick the simplest input that should work.
In our case, we can create a small CSV with correct columns in the right order and one or two data rows. We can then write tests that confirm:
- The result is a pandas DataFrame.
- The columns match exactly: list(df.columns) == expected_list
In R, the equivalent tests confirm:
- The result is a dataframe.
- The columns match exactly.
We check the columns even though the function already validates them, because we are testing the promised behaviour of the function, not just trusting its current implementation.
If someone edits the code later and accidentally removes that validation, your test will catch it!
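To make this concrete, here is a minimal sketch of a check that passes for the current function and fails once the validation is removed. It is illustrative only: import_without_validation() is a hypothetical later edit, raises_value_error() is a small helper written for this example (it mimics what pytest.raises(ValueError) checks), and the local copy of import_patient_data() is included only so the example is self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

EXPECTED_COLS = [
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME",
]


def import_patient_data(path):
    """Local copy of the case-study function (keeps the sketch standalone)."""
    df = pd.read_csv(Path(path))
    if list(df.columns) != EXPECTED_COLS:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    return df


def import_without_validation(path):
    """Hypothetical later edit that accidentally drops the column check."""
    return pd.read_csv(Path(path))


def raises_value_error(func, path):
    """Return True if func(path) raises ValueError (helper for this sketch)."""
    try:
        func(path)
    except ValueError:
        return True
    return False


# Write a CSV whose first two columns are swapped
wrong_order = ["ARRIVAL_DATE", "PATIENT_ID"] + EXPECTED_COLS[2:]
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "patients.csv"
    pd.DataFrame(
        [["2024-01-01", "p1", "08:00", "2024-01-01", "09:00"]],
        columns=wrong_order,
    ).to_csv(csv_path, index=False)

    # The promise is "wrong columns raise ValueError"
    original_keeps_promise = raises_value_error(import_patient_data, csv_path)
    edit_keeps_promise = raises_value_error(import_without_validation, csv_path)

print(original_keeps_promise, edit_keeps_promise)
```

A test that asserts the promise would pass against the original function and fail against the edited one, which is exactly the signal you want.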

def test_import_success(tmp_path):
    """Small CSV with correct columns should work."""
    expected_cols = [
        "PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME",
        "SERVICE_DATE", "SERVICE_TIME",
    ]

    # Create temporary CSV file
    df_in = pd.DataFrame(
        [["p1", "2024-01-01", "08:00", "2024-01-01", "09:00"]],
        columns=expected_cols,
    )
    csv_path = tmp_path / "patients.csv"
    df_in.to_csv(csv_path, index=False)

    # Run function and check it looks correct
    result = import_patient_data(csv_path)
    assert isinstance(result, pd.DataFrame)
    assert list(result.columns) == expected_cols
    pd.testing.assert_frame_equal(result, df_in)

test_that("small CSV with correct columns imports successfully", {
  expected_cols <- c(
    "PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME"
  )

  # Create temporary CSV file
  df_in <- tibble::tibble(
    PATIENT_ID = "p1",
    ARRIVAL_DATE = lubridate::ymd("2024-01-01"),
    ARRIVAL_TIME = hms::as_hms("08:00:00"),
    SERVICE_DATE = lubridate::ymd("2024-01-01"),
    SERVICE_TIME = hms::as_hms("09:00:00")
  )
  csv_path <- tempfile(fileext = ".csv")
  readr::write_csv(df_in, csv_path)

  # Run function and check it looks correct
  result <- import_patient_data(csv_path)
  expect_s3_class(result, "data.frame")
  expect_identical(names(result), expected_cols)
  expect_equal(as.data.frame(result), as.data.frame(df_in))
})

3. List ways things can go wrong
Now think: how can inputs break the promises?
For import_patient_data(), a ValueError should be raised when we have:
- Missing columns
- Extra columns
- Correct columns but wrong order
For each case, we can create a small DataFrame with the problem and assert that a ValueError is raised.
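
# One test run per case: missing column, extra column, wrong order
@pytest.mark.parametrize("columns", [
    ["PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME", "SERVICE_DATE"],  # missing
    ["PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME", "SERVICE_DATE",
     "SERVICE_TIME", "EXTRA"],  # extra
    ["ARRIVAL_DATE", "PATIENT_ID", "ARRIVAL_TIME", "SERVICE_DATE",
     "SERVICE_TIME"],  # wrong order
])
def test_import_errors(tmp_path, columns):
    """Incorrect columns should trigger ValueError."""
    # Create temporary CSV file
    df_in = pd.DataFrame([range(len(columns))], columns=columns)
    csv_path = tmp_path / "patients.csv"
    df_in.to_csv(csv_path, index=False)
    # Check it raises ValueError
    with pytest.raises(ValueError):
        import_patient_data(csv_path)

In R, import_patient_data() should stop with an error if we have: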
- Missing columns
- Extra columns
- Correct columns but wrong order
For each case, we can create a small dataframe with the problem and check that the function fails.

patrick::with_parameters_test_that(
  "incorrect columns cause import_patient_data() to fail",
  {
    # Create dataframe with incorrect columns
    df <- as.data.frame(as.list(seq_along(cols)))
    names(df) <- cols
    # Save as temporary CSV and run function, expecting an error
    csv_path <- tempfile(fileext = ".csv")
    readr::write_csv(df, csv_path)
    expect_error(import_patient_data(csv_path))
  },
  patrick::cases(
    # Example 1: Missing columns
    list(
      cols = c("PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME", "SERVICE_DATE")
    ),
    # Example 2: Extra columns
    list(
      cols = c(
        "PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME", "SERVICE_DATE",
        "SERVICE_TIME", "EXTRA"
      )
    ),
    # Example 3: Right columns, wrong order
    list(
      cols = c(
        "ARRIVAL_DATE", "PATIENT_ID", "ARRIVAL_TIME",
        "SERVICE_DATE", "SERVICE_TIME"
      )
    )
  )
)

4. Consider edge cases
Edge cases are inputs that are unusual but still realistic.
For example, what if the CSV has the correct headers but no data? Should that succeed and return an empty DataFrame, or should it fail?
In this case, you might decide that an empty CSV with correct headers is fine and does not raise an error. You may still choose to write a test though, as that makes this decision explicit so other coders know what “correct” means at the edges.
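A neighbouring edge case is a file that is completely empty, with no header row at all. The sketch below is one way to find out what happens before deciding what "correct" should mean. It assumes that pandas raises pandas.errors.EmptyDataError for such a file and that this exception is a subclass of ValueError, so check this against your installed pandas version. The local copy of import_patient_data() is included only to keep the example self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

EXPECTED_COLS = [
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME",
]


def import_patient_data(path):
    """Local copy of the case-study function (keeps the sketch standalone)."""
    df = pd.read_csv(Path(path))
    if list(df.columns) != EXPECTED_COLS:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    return df


outcome = "no error"
is_value_error = None
with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "empty.csv"
    csv_path.write_text("")  # zero bytes: no header and no rows
    try:
        import_patient_data(csv_path)
    except pd.errors.EmptyDataError as err:
        # The error comes from pd.read_csv, before the column check runs
        outcome = "EmptyDataError"
        is_value_error = isinstance(err, ValueError)

print(outcome, is_value_error)
```

If the assumption holds, a test written with pytest.raises(ValueError) would also pass for a zero-byte file, even though the error message differs from the "Unexpected columns" message. Whether that is acceptable is a decision worth recording in a test.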

def test_import_empty_csv(tmp_path):
    """Empty CSV with correct columns should succeed."""
    expected_cols = [
        "PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME",
        "SERVICE_DATE", "SERVICE_TIME",
    ]

    # Create empty CSV with correct header
    df_in = pd.DataFrame(columns=expected_cols)
    csv_path = tmp_path / "patients.csv"
    df_in.to_csv(csv_path, index=False)

    # Should succeed and return empty DataFrame
    result = import_patient_data(csv_path)
    assert len(result) == 0
    assert list(result.columns) == expected_cols

test_that("empty CSV with correct columns should succeed", {
  # Empty CSV with correct columns should succeed.
  expected_cols <- c(
    "PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME"
  )

  # Create empty CSV with correct header
  df_in <- tibble::tibble(
    PATIENT_ID = character(),
    ARRIVAL_DATE = character(),
    ARRIVAL_TIME = character(),
    SERVICE_DATE = character(),
    SERVICE_TIME = character()
  )
  csv_path <- tempfile(fileext = ".csv")
  readr::write_csv(df_in, csv_path)

  # Should succeed and return empty data frame
  result <- import_patient_data(csv_path)
  expect_identical(nrow(result), 0L)
  expect_identical(names(result), expected_cols)
})

5. Test all equivalent input forms
If the function promises to accept multiple equivalent input types, verify they really are equivalent.
With import_patient_data(), we expect both a str and a Path to succeed and return the same DataFrame.

def test_import_path_types(tmp_path):
    """str and Path inputs should behave identically."""
    # Create temporary CSV file
    expected_cols = [
        "PATIENT_ID",
        "ARRIVAL_DATE", "ARRIVAL_TIME",
        "SERVICE_DATE", "SERVICE_TIME",
    ]
    df_in = pd.DataFrame(
        [["p1", "2024-01-01", "08:00", "2024-01-01", "09:00"]],
        columns=expected_cols,
    )
    csv_path = tmp_path / "patients.csv"
    df_in.to_csv(csv_path, index=False)

    # Run function with str or Path inputs
    df_str = import_patient_data(str(csv_path))
    df_path = import_patient_data(csv_path)

    # Check that results are the same
    pd.testing.assert_frame_equal(df_str, df_path)

In R, import_patient_data() just accepts a character string for path, so there is nothing to test.
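In Python, an alternative is to parametrise over the input forms, so that each form is reported as its own test case. The following is a sketch rather than part of the case study: the test name and the convert parameter are made up for illustration, and the local copy of import_patient_data() is included only so the example is self-contained.

```python
from pathlib import Path

import pandas as pd
import pytest

EXPECTED_COLS = [
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME",
]


def import_patient_data(path):
    """Local copy of the case-study function (keeps the sketch standalone)."""
    df = pd.read_csv(Path(path))
    if list(df.columns) != EXPECTED_COLS:
        raise ValueError(f"Unexpected columns: {list(df.columns)}")
    return df


# convert is a hypothetical parameter: each callable turns the Path from
# tmp_path into one of the input forms the docstring promises to accept
@pytest.mark.parametrize("convert", [str, Path])
def test_import_accepts_path_form(tmp_path, convert):
    """Each supported path form should load the same data."""
    df_in = pd.DataFrame(
        [["p1", "2024-01-01", "08:00", "2024-01-01", "09:00"]],
        columns=EXPECTED_COLS,
    )
    csv_path = tmp_path / "patients.csv"
    df_in.to_csv(csv_path, index=False)

    result = import_patient_data(convert(csv_path))
    pd.testing.assert_frame_equal(result, df_in)
```

If a new input form is promised later, it can be added to the parametrize list without writing another test function.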
Running our example tests
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /__w/hdruk_tests/hdruk_tests/examples/python_package
configfile: pyproject.toml
plugins: cov-7.0.0
collected 6 items
../examples/python_package/tests/test_unit.py ...... [100%]
============================== 6 passed in 0.94s ===============================
<ExitCode.OK: 0>
══ Testing test_unit.R ═════════════════════════════════════════════════════════
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 8 ] Done!
When to stop writing tests
You cannot test everything. You’ve written enough tests when:
- Every promise in the docstring is tested.
- Every important code branch (like error handling) is tested.
Write additional tests based on real needs (e.g., bug reports, tricky edge cases in your context), and not by trying to anticipate every theoretical failure.
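As an illustration of a test driven by a real need, here is a sketch of a regression test. The bug report is hypothetical (a file exported with lower-case headers), as is the test name, and the local copy of import_patient_data() is included only so the example is self-contained.

```python
from pathlib import Path

import pandas as pd
import pytest

EXPECTED_COLS = [
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME",
]


def import_patient_data(path):
    """Local copy of the case-study function (keeps the sketch standalone)."""
    df = pd.read_csv(Path(path))
    if list(df.columns) != EXPECTED_COLS:
        raise ValueError(
            f"Unexpected columns: {list(df.columns)} (expected {EXPECTED_COLS})"
        )
    return df


def test_lowercase_headers_rejected(tmp_path):
    """Regression test for a hypothetical bug report: lower-case headers."""
    lower_cols = [col.lower() for col in EXPECTED_COLS]
    df_in = pd.DataFrame([range(len(lower_cols))], columns=lower_cols)
    csv_path = tmp_path / "patients.csv"
    df_in.to_csv(csv_path, index=False)

    # match= checks the error message as well as the exception type
    with pytest.raises(ValueError, match="Unexpected columns"):
        import_patient_data(csv_path)
```

A short docstring naming the report that motivated the test helps later readers understand why it exists.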
In real research projects, you won’t unit test every single function or every possible case. The aim is not perfection, but reasonable confidence in the most important behaviours of your code.