import pandas as pd
from pathlib import Path

def import_patient_data(path):
    """
    Import raw patient data and check that required columns are present.

    Parameters
    ----------
    path : str or pathlib.Path
        Path to the CSV file containing the patient data.

    Returns
    -------
    pandas.DataFrame
        Dataframe containing the raw patient-level data.

    Raises
    ------
    ValueError
        If the CSV file does not contain exactly the expected columns
        in the expected order.
    """
    df = pd.read_csv(Path(path))
    # Expected columns in the raw data (names and order must match)
    expected = [
        "PATIENT_ID",
        "ARRIVAL_DATE", "ARRIVAL_TIME",
        "SERVICE_DATE", "SERVICE_TIME",
    ]
    if list(df.columns) != expected:
        raise ValueError(
            f"Unexpected columns: {list(df.columns)} (expected {expected})"
        )
    return df
Data, temporary files and mocking
Another common requirement when writing tests for research code is to provide a file path as input. For example, you may have functions that read a CSV.
This page shows three ways to give a file path to your test code:
- A real data file.
- A temporary file created inside the test.
- Mocking, where we pretend to read a file but never touch the filesystem.
Later on, we also show an alternative design where you test the data-processing code directly without passing a file path at all. For most R workflows, you can get a long way with real or temporary files, and treat mocking as an advanced tool you may grow into later.
Example: import_patient_data()
To demonstrate, we will use the import_patient_data() function from our case study. This function expects a path to a CSV file.
#' Import raw patient data and check that required columns are present.
#'
#' Raises an error if the CSV file does not contain exactly the expected
#' columns in the expected order.
#'
#' @param path Character string giving path to the CSV file containing the
#' patient data.
#'
#' @return A data frame containing the raw patient-level data.
#'
#' @export
import_patient_data <- function(path) {
  df <- readr::read_csv(path, show_col_types = FALSE)
  # Expected columns in the raw data (names and order must match)
  expected <- c(
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME"
  )
  if (!identical(colnames(df), expected)) {
    stop(
      sprintf(
        "Unexpected columns: %s (expected %s)",
        paste(colnames(df), collapse = ", "),
        paste(expected, collapse = ", ")
      )
    )
  }
  df
}
On the rest of this page, we will focus on how to provide the path to the test; the tests themselves are just dummy examples confirming that the imported data exists, with no detailed checks (those come later!).
Option 1: Real data file
The simplest approach is to point your test at a real CSV file that lives alongside your project (e.g., in tests/data/ for Python, or tests/testthat/testdata for R).
This is straightforward, and is useful when you want to run your workflow on the same file your analysis uses. For example:
The data file is produced by an earlier part of your pipeline and you want to check that later steps still work on that file.
You expect the file to evolve over time (e.g., updated extracts or slightly different structure) and want to know if those changes affect your results.
However, there are some downsides:
The test is tightly coupled to that specific file and its content. Any change to the file (even a harmless one, such as reordering rows) can cause tests to fail unexpectedly.
If the file is large, tests can become slow and feel heavy to run.
Example
We have to find the folder containing the test file (Path(__file__).parent), and then locate data/patient_data.csv relative to the test file itself, rather than relying on the current working directory (which will change depending on where tests are run from).
from pathlib import Path

from waitingtimes.patient_analysis import import_patient_data

def test_real_data_file():
    """Importing a real data file to a test"""
    # Path to example test data
    csv_path = Path(__file__).parent.joinpath("data/patient_data.csv")
    # Load the data and check it was read
    df = import_patient_data(csv_path)
    assert not df.empty
We use testthat::test_path() to find the folder containing the test file, then locate data/patient_data.csv relative to the test file itself, rather than relying on the current working directory (which will change depending on where tests are run from).
test_that("real data file imports in test", {
  # Path to example test data
  csv_path <- testthat::test_path("data", "patient_data.csv")
  # Load the data and check it was read
  df <- import_patient_data(csv_path)
  expect_gt(nrow(df), 0)
})
Option 2: Temporary file
For many tests, we do not need a full real dataset. We just need a small CSV with the right structure so the code can run.
In that case we can build a tiny dataset inside the test, write it to a temporary file, and pass that temporary path to import_patient_data().
This has several advantages:
- The test is self-contained - all the data it needs is defined inside the test, and it does not rely on any particular file being present on someone’s machine.
- It is fast, because the file is tiny and only exists for the duration of the test.
The main downside is that we are still doing file I/O.
What is file I/O?
File I/O (input/output) just means reading from or writing to files on disk. This is slower than working purely in memory, and it depends on the filesystem behaving as expected (paths existing, permissions being correct, enough space, etc.).
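To make the distinction concrete, here is a minimal sketch (illustrative only, not part of the case study) showing pandas parsing CSV content straight from an in-memory buffer, with no file on disk involved:

```python
import io

import pandas as pd

# CSV content held entirely in memory - no file on disk
csv_text = "PATIENT_ID,ARRIVAL_DATE\np1,2024-01-01\n"

# io.StringIO wraps the string in a file-like object, so read_csv
# can parse it without any file I/O
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # one row, two columns
```

This is essentially what mocking (discussed below) achieves in a test: the data never touches the filesystem.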
Example
The tmp_path argument is a pytest fixture that provides a temporary directory unique to the test. We use this directory to construct a file path and save a CSV file during the test, ensuring file creation happens in an isolated location managed and cleaned up automatically by pytest.
import pandas as pd

from waitingtimes.patient_analysis import import_patient_data

def test_temporary_file(tmp_path):
    """Providing data to a test via a temporary file"""
    # Create sample patient data
    testdata = pd.DataFrame(
        [["p1", "2024-01-01", "08:00", "2024-01-01", "09:00"]],
        columns=[
            "PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME",
            "SERVICE_DATE", "SERVICE_TIME",
        ],
    )
    # Create a temporary CSV file
    csv_path = tmp_path / "patients.csv"
    testdata.to_csv(csv_path, index=False)
    # Load the data and check it was read
    df = import_patient_data(csv_path)
    assert not df.empty
The tempfile() function creates a temporary file path unique to the test session. We use this path to write a CSV file during the test, ensuring the file is created in an isolated temporary location and is automatically cleaned up by R after use.
test_that("tempfile is created and read by import_patient_data", {
  # Create temporary CSV file
  testdata <- tibble::tibble(
    PATIENT_ID = "p1",
    ARRIVAL_DATE = lubridate::ymd("2024-01-01"),
    ARRIVAL_TIME = hms::as_hms("08:00:00"),
    SERVICE_DATE = lubridate::ymd("2024-01-01"),
    SERVICE_TIME = hms::as_hms("09:00:00")
  )
  csv_path <- tempfile(fileext = ".csv")
  readr::write_csv(testdata, csv_path)
  # Load the data and check it was read
  df <- import_patient_data(csv_path)
  expect_gt(nrow(df), 0)
})
Option 3: Mocking
An alternative to the temporary file approach is to avoid writing any files at all.
Mocking means temporarily replacing parts of your code (for example, the function that reads a CSV, read_csv()) with a fake version during the test.
The fake function still returns a dataset, but it does so directly from memory, without reading or writing any files, even temporarily.
Benefits of mocking are:
No file I/O - tests are faster and do not touch the filesystem at all.
Better isolation - tests do not depend on the behaviour of third-party libraries like
pd.read_csv(), which themselves could change or break.
The trade‑off is that mocking is more complex to understand and set up than using a temporary file.
Example
Pytest provides the monkeypatch fixture for mocking. It allows us to temporarily modify functions or attributes during a test. Here, we replace pd.read_csv with a mock function that returns a pre-defined DataFrame.
Because pd.read_csv is patched, the file path passed to import_patient_data is irrelevant - no file is ever opened. The function behaves as if it has successfully read a CSV file, allowing us to focus purely on testing the data processing logic.
import pandas as pd

from waitingtimes.patient_analysis import import_patient_data

def test_mocking(monkeypatch):
    """Providing data to a test via mocking"""
    # Create sample patient data
    testdata = pd.DataFrame(
        [["p1", "2024-01-01", "08:00", "2024-01-01", "09:00"]],
        columns=[
            "PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME",
            "SERVICE_DATE", "SERVICE_TIME",
        ],
    )
    # Define a fake CSV reader that just returns our DataFrame
    def mock_read_csv(path):
        return testdata
    # Temporarily replace pd.read_csv with our fake version
    monkeypatch.setattr(pd, "read_csv", mock_read_csv)
    # Call the function with any path - it does not matter - it will use the
    # mocked reader, and the real pd.read_csv is never actually called
    df = import_patient_data("does_not_matter.csv")
    assert not df.empty
Alternative: design for testability
For small examples like reading a single CSV file, mocking pandas.read_csv is a simple way to show how tests can avoid real file I/O. In larger projects, a common alternative is to change the design of the code so that most tests do not need to mock external libraries at all.
A typical pattern is to separate your code into:
- One function that gets the data (I/O).
- One function that processes the data (logic only).
For example:
import pandas as pd

def get_patient_data(path):
    """I/O layer: read data from disk."""
    return pd.read_csv(path)

def process_patient_data(df):
    """Logic layer: all processing goes here, no file I/O."""
    # do whatever transformation you need
    return df

def import_patient_data(path):
    """Public entry point."""
    df = get_patient_data(path)
    return process_patient_data(df)
With this structure, you can write most of your unit tests against process_patient_data() by passing in a small, in-memory DataFrame, without touching the filesystem or mocking pd.read_csv at all.
For a simple CSV reader like this, splitting into get_patient_data() and process_patient_data() can feel like overkill, because import_patient_data() is already very small. But in real projects, especially when you are pulling data from databases, servers, or APIs, having a thin I/O layer and a separate "business logic" layer makes tests much easier to write and understand. This design also matches the "don't mock what you don't own" guideline, which advises against heavily mocking complex external APIs.
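A unit test for the logic layer then needs no file path at all. Here is a sketch; process_patient_data() is redefined inline so the example is self-contained, but in the real package it would be imported instead:

```python
import pandas as pd

def process_patient_data(df):
    """Logic layer: all processing goes here, no file I/O."""
    return df

def test_process_patient_data():
    """Unit test the logic layer with a small in-memory DataFrame."""
    testdata = pd.DataFrame(
        [["p1", "2024-01-01", "08:00", "2024-01-01", "09:00"]],
        columns=[
            "PATIENT_ID", "ARRIVAL_DATE", "ARRIVAL_TIME",
            "SERVICE_DATE", "SERVICE_TIME",
        ],
    )
    # No file path, no temporary file, no mocking - just data in, data out
    result = process_patient_data(testdata)
    assert not result.empty
```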
Running our example tests
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0
rootdir: /__w/stars-testing-intro/stars-testing-intro/examples/python_package
configfile: pyproject.toml
plugins: cov-7.0.0
collected 3 items
../examples/python_package/tests/test_data_real.py . [ 33%]
../examples/python_package/tests/test_data_temp.py . [ 66%]
../examples/python_package/tests/test_data_mock.py . [100%]
============================== 3 passed in 0.98s ===============================
✔ | F W S OK | Context
✔ | 1 | data_real
✔ | 1 | data_temp
══ Results ═════════════════════════════════════════════════════════════════════
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 2 ]
When to use each option
You do not need to pick a single approach for your whole project. Instead, choose the best option for the goal of each test.
In Python, we often mix all three patterns:
Real data file - best when you want to run your workflow on a specific dataset. For example, we will use this in regression tests where you compare current results to a saved output to check that behaviour stays stable over time.
Temporary file - a good default for many tests where you just need a small, representative dataset with the right structure. For example, this works well in smoke tests and system tests.
Mocking or alternative design - most useful when your code lives in a package and you want fast, isolated unit tests that separate your own logic from external libraries and the filesystem, avoiding file I/O entirely. However, not every unit test requires this: sometimes you deliberately call the real library function because you want to confirm you are using it correctly (for example, that you pass the right arguments, or that it can handle your expected input format).
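As an illustration of the first pattern, a regression test might compare the current results against a previously saved copy. A sketch using pandas.testing.assert_frame_equal (the data and column names here are hypothetical; in a real test both would come from files in your project):

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical current results; in a real regression test these would come
# from running the pipeline on the real data file
current = pd.DataFrame({"PATIENT_ID": ["p1"], "WAIT_MINUTES": [60]})

# Hypothetical saved results; in a real test, read from a saved file,
# e.g. a CSV stored under tests/data/
expected = pd.DataFrame({"PATIENT_ID": ["p1"], "WAIT_MINUTES": [60]})

# Fails with an informative diff if behaviour has drifted
assert_frame_equal(current, expected)
```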
In R, your choice will also depend on whether you are working inside a package:
Real data file - best when you want to run your workflow on a specific dataset. For example, we will use this in regression tests where you compare current results to a saved output to check that behaviour stays stable over time.
Temporary file - a good default for many tests where you just need a small, representative dataset with the right structure. For example, this works well in smoke tests, system tests and unit tests.
Optional advanced learning: mocking
Mocking means temporarily replacing parts of your code (for example, the function that reads a CSV, read_csv()) with a fake version during the test. The fake function still returns a dataset, but it does so directly from memory, without reading or writing any files, even temporarily.
Benefits of mocking are:
No file I/O - tests are faster and do not touch the filesystem at all.
Better isolation - tests do not depend on the behaviour of third-party libraries like
readr::read_csv(), which themselves could change or break.
The trade‑off is that mocking is more complex to understand and set up than using a temporary file, and in R it works a bit differently than in Python.
How mocking works in R
In R, the modern recommended tool for mocking is testthat::local_mocked_bindings(). It works by temporarily changing function bindings inside a package’s namespace. The namespace is a list that maps each function name to the code it runs, including both your own functions and any external functions you imported with @import or @importFrom. For example, the NAMESPACE for our case study package is:
# Generated by roxygen2: do not edit by hand
export(calculate_wait_times)
export(import_patient_data)
export(summary_stats)
Mocking temporarily changes that mapping for a given function name, so during the test the package calls your fake version instead of the original one, and then everything is restored afterwards.
The testthat documentation describes four places the function you want to mock might come from:
- A function defined inside your package.
- A function you imported from another package via NAMESPACE (using @import or @importFrom).
- A function from base.
- A function you call with pkg::fun.
The first three cases are treated the same: if the name exists in your package (either because you defined it or imported it), you can mock it directly. In our example, if we want to mock readr::read_csv(), we can:
- Add @importFrom readr read_csv to the function documentation so read_csv appears in NAMESPACE.
- Call read_csv(...) (without readr::) inside import_patient_data().
We would then be able to mock read_csv like this:
test_that("providing data to a test via mocking", {
  testdata <- tibble::tibble(
    PATIENT_ID = "p1",
    ARRIVAL_DATE = lubridate::ymd("2024-01-01"),
    ARRIVAL_TIME = hms::as_hms("08:00:00"),
    SERVICE_DATE = lubridate::ymd("2024-01-01"),
    SERVICE_TIME = hms::as_hms("09:00:00")
  )
  testthat::local_mocked_bindings(
    read_csv = function(path, show_col_types = FALSE) testdata,
    .package = "waitingtimes"
  )
  df <- import_patient_data("does_not_matter.csv")
  expect_gt(nrow(df), 0)
})
Here, local_mocked_bindings() temporarily replaces the read_csv() binding inside the waitingtimes namespace, so any call to read_csv() from our package code uses the fake version during the test.
The fourth case (pkg::fun) is trickier. If your code calls readr::read_csv() directly, then the function you want to mock lives in the readr namespace, not your own. You can point local_mocked_bindings() at another package by setting .package = "readr", but the testthat documentation cautions against this, because it will affect all calls to readr::read_csv().
The documentation instead recommends a safer pattern:
“It’s safer to either import the function into your package, or make a wrapper that you can mock”
We have seen the import approach above. The other option is to keep using readr::read_csv() inside a small helper that you own, and mock that helper in your tests. For example, in your package:
# Small helper that actually reads from disk
get_patient_data <- function(path) {
  readr::read_csv(path, show_col_types = FALSE)
}

# Function we want to test
import_patient_data <- function(path) {
  df <- get_patient_data(path)
  # Processing would go here
  df
}
In your test:
test_that("providing data to a test via mocking", {
  testdata <- tibble::tibble(
    PATIENT_ID = "p1",
    ARRIVAL_DATE = lubridate::ymd("2024-01-01"),
    ARRIVAL_TIME = hms::as_hms("08:00:00"),
    SERVICE_DATE = lubridate::ymd("2024-01-01"),
    SERVICE_TIME = hms::as_hms("09:00:00")
  )
  # Temporarily replace get_patient_data() inside the waitingtimes package
  testthat::local_mocked_bindings(
    get_patient_data = function(path) testdata,
    .package = "waitingtimes"
  )
  df <- import_patient_data("does_not_matter.csv")
  expect_gt(nrow(df), 0)
})
A closely related design, which you will often see in larger projects, is to split the workflow into two functions: one that gets the data (I/O) and one that processes the data (pure-ish logic), and then focus your tests entirely on process_patient_data(), without touching the filesystem or mocking anything.
# I/O layer
get_patient_data <- function(path) {
  readr::read_csv(path, show_col_types = FALSE)
}

# Logic layer
process_patient_data <- function(df) {
  # all processing goes here, no file I/O
  df
}

# Public entry point
import_patient_data <- function(path) {
  df <- get_patient_data(path)
  process_patient_data(df)
}
Final note: For this course, you do not need to use mocking in R to write useful tests. Real files and temporary files are usually enough. We show mocking here as an optional technique you might use later, especially in package code where you want very fast, isolated unit tests.