Case study

Each step below is shown first in Python and then in R.

Our tutorial uses a simple example of importing a small patient dataset and carrying out basic descriptive analysis.

In the example we will:

Import patient-level data from a CSV file and check that it has the expected columns.
Derive waiting times from arrival and service datetimes.
Compute simple summary statistics (mean, standard deviation, and 95% confidence interval) for waiting time.

Packages

import json
from pathlib import Path

import numpy as np
import pandas as pd
import scipy.stats as st

pd.set_option("display.max_columns", 8)

library(readr)
library(dplyr)
library(lubridate)

Import patient data

Our dataset is a small synthetic set of patient-level event times that could come from a healthcare setting (e.g., emergency department, outpatient clinic, doctor’s appointments). Each row records when a patient arrived and when service began.

Our function import_patient_data() reads the data from a CSV file and checks that it contains exactly the expected columns in the expected order. It returns a pandas DataFrame in Python and a tibble in R.

def import_patient_data(path):
    """
    Import raw patient data and check that required columns are present.

    Parameters
    ----------
    path : str or pathlib.Path
        Path to the CSV file containing the patient data.

    Returns
    -------
    pandas.DataFrame
        Dataframe containing the raw patient-level data.

    Raises
    ------
    ValueError
        If the CSV file does not contain exactly the expected columns
        in the expected order.
    """
    df = pd.read_csv(Path(path))

    # Expected columns in the raw data (names and order must match)
    expected = [
        "PATIENT_ID",
        "ARRIVAL_DATE", "ARRIVAL_TIME",
        "SERVICE_DATE", "SERVICE_TIME"
    ]
    if list(df.columns) != expected:
        raise ValueError(
            f"Unexpected columns: {list(df.columns)} (expected {expected})"
        )

    return df

#' Import raw patient data and check that required columns are present.
#'
#' Raises an error if the CSV file does not contain exactly the expected 
#' columns in the expected order.
#'
#' @param path Character string giving path to the CSV file containing the 
#'   patient data.
#'
#' @return A data frame containing the raw patient-level data.
#'
#' @export
import_patient_data <- function(path) {
  df <- readr::read_csv(path, show_col_types = FALSE)

  # Expected columns in the raw data (names and order must match)
  expected <- c(
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME"
  )
  if (!identical(colnames(df), expected)) {
    stop(
      sprintf(
        "Unexpected columns: %s (expected %s)",
        paste(colnames(df), collapse = ", "),
        paste(expected, collapse = ", ")
      )
    )
  }

  return(df)
}

We can run this function on our example dataset patient_data.csv.

raw_data = import_patient_data(
    "../examples/python_package/data/patient_data.csv"
)
raw_data

   PATIENT_ID ARRIVAL_DATE  ARRIVAL_TIME SERVICE_DATE  SERVICE_TIME
0           1   2025-01-01             1   2025-01-01             7
1           2   2025-01-01             2   2025-01-01             4
2           3   2025-01-01             3   2025-01-01            10
3           4   2025-01-01             7   2025-01-01            14
4           5   2025-01-01            10   2025-01-01            12
5           6   2025-01-01            10   2025-01-01            11

raw_data <- import_patient_data(
  file.path("..", "examples", "r_package", "inst", "extdata", "patient_data.csv")
)
raw_data

# A tibble: 6 × 5
  PATIENT_ID ARRIVAL_DATE ARRIVAL_TIME SERVICE_DATE SERVICE_TIME
       <dbl> <date>       <chr>        <date>       <chr>       
1          1 2025-01-01   0001         2025-01-01   0007        
2          2 2025-01-01   0002         2025-01-01   0004        
3          3 2025-01-01   0003         2025-01-01   0010        
4          4 2025-01-01   0007         2025-01-01   0014        
5          5 2025-01-01   0010         2025-01-01   0012        
6          6 2025-01-01   0010         2025-01-01   0011
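
To see the column check fail, the following minimal Python sketch writes a small CSV with one column missing and passes it to a condensed copy of import_patient_data(). The temporary file, its name, and its contents are made up for illustration; only the expected column names and the check itself come from the function above.

```python
import tempfile
from pathlib import Path

import pandas as pd

EXPECTED = [
    "PATIENT_ID",
    "ARRIVAL_DATE", "ARRIVAL_TIME",
    "SERVICE_DATE", "SERVICE_TIME",
]


def import_patient_data(path):
    """Condensed copy of the function above (same column check)."""
    df = pd.read_csv(Path(path))
    if list(df.columns) != EXPECTED:
        raise ValueError(
            f"Unexpected columns: {list(df.columns)} (expected {EXPECTED})"
        )
    return df


# Hypothetical malformed file: the SERVICE_TIME column is missing
bad_csv = (
    "PATIENT_ID,ARRIVAL_DATE,ARRIVAL_TIME,SERVICE_DATE\n"
    "1,2025-01-01,0001,2025-01-01\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    bad_path = Path(tmp_dir) / "bad_patient_data.csv"
    bad_path.write_text(bad_csv)
    try:
        import_patient_data(bad_path)
        error_message = None
    except ValueError as error:
        error_message = str(error)

print(error_message)
```

The function should raise a ValueError whose message lists both the columns it found and the columns it expected, which makes a malformed input file easy to diagnose.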

Calculate waiting times

Next, we convert the date and time fields into datetime columns and calculate each patient’s waiting time in minutes.

def calculate_wait_times(df):
    """
    Add arrival/service datetimes and waiting time in minutes.

    Parameters
    ----------
    df : pandas.DataFrame
        Patient-level data containing `ARRIVAL_DATE`, `ARRIVAL_TIME`,
        `SERVICE_DATE`, and `SERVICE_TIME` columns.

    Returns
    -------
    pandas.DataFrame
        Copy of the input DataFrame with additional columns:
        `arrival_datetime`, `service_datetime`, and `waittime`.
    """
    df = df.copy()

    # Combine date and time columns into datetime columns
    for prefix in ("ARRIVAL", "SERVICE"):
        df[f"{prefix.lower()}_datetime"] = pd.to_datetime(
            df[f"{prefix}_DATE"].astype(str) +
            " " +
            df[f"{prefix}_TIME"].astype(str).str.zfill(4),
            format="%Y-%m-%d %H%M"
        )

    # Waiting time in minutes
    df["waittime"] = (
        df["service_datetime"] - df["arrival_datetime"]
    ) / pd.Timedelta(minutes=1)

    return df

#' Add arrival/service datetimes and waiting time in minutes.
#'
#' @param df Data frame with patient-level data containing `ARRIVAL_DATE`, 
#'   `ARRIVAL_TIME`, `SERVICE_DATE`, and `SERVICE_TIME` columns.
#'
#' @return A copy of the input data frame with additional columns:
#'   `arrival_datetime`, `service_datetime`, and `waittime`.
#'
#' @export
calculate_wait_times <- function(df) {
  df <- df |>
    dplyr::mutate(
      arrival_datetime = lubridate::ymd_hm(
        paste(
          as.character(ARRIVAL_DATE),
          sprintf("%04d", as.integer(ARRIVAL_TIME))
        )
      ),
      service_datetime = lubridate::ymd_hm(
        paste(
          as.character(SERVICE_DATE),
          sprintf("%04d", as.integer(SERVICE_TIME))
        )
      )
    )

  if (any(is.na(df$arrival_datetime) | is.na(df$service_datetime))) {
    stop(
      "Failed to parse arrival or service datetimes; ",
      "check for missing or invalid dates/times."
    )
  }

  df <- df |>
    dplyr::mutate(
      waittime = as.numeric(
        difftime(service_datetime, arrival_datetime, units = "mins")
      )
    )

  df
}

We then apply this function to the raw data.

processed_data = calculate_wait_times(raw_data)
processed_data

   PATIENT_ID ARRIVAL_DATE  ARRIVAL_TIME SERVICE_DATE  SERVICE_TIME  \
0           1   2025-01-01             1   2025-01-01             7   
1           2   2025-01-01             2   2025-01-01             4   
2           3   2025-01-01             3   2025-01-01            10   
3           4   2025-01-01             7   2025-01-01            14   
4           5   2025-01-01            10   2025-01-01            12   
5           6   2025-01-01            10   2025-01-01            11   

     arrival_datetime    service_datetime  waittime  
0 2025-01-01 00:01:00 2025-01-01 00:07:00       6.0  
1 2025-01-01 00:02:00 2025-01-01 00:04:00       2.0  
2 2025-01-01 00:03:00 2025-01-01 00:10:00       7.0  
3 2025-01-01 00:07:00 2025-01-01 00:14:00       7.0  
4 2025-01-01 00:10:00 2025-01-01 00:12:00       2.0  
5 2025-01-01 00:10:00 2025-01-01 00:11:00       1.0

processed_data <- calculate_wait_times(raw_data)
processed_data

# A tibble: 6 × 8
  PATIENT_ID ARRIVAL_DATE ARRIVAL_TIME SERVICE_DATE SERVICE_TIME
       <dbl> <date>       <chr>        <date>       <chr>       
1          1 2025-01-01   0001         2025-01-01   0007        
2          2 2025-01-01   0002         2025-01-01   0004        
3          3 2025-01-01   0003         2025-01-01   0010        
4          4 2025-01-01   0007         2025-01-01   0014        
5          5 2025-01-01   0010         2025-01-01   0012        
6          6 2025-01-01   0010         2025-01-01   0011        
# ℹ 3 more variables: arrival_datetime <dttm>, service_datetime <dttm>,
#   waittime <dbl>
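
Combining each date with its time, rather than subtracting the times alone, matters when a wait crosses midnight. The Python sketch below applies the same zero-padding and "%Y-%m-%d %H%M" parsing as calculate_wait_times() to a single made-up patient (not part of patient_data.csv) who arrives at 23:55 and is seen at 00:10 the next day.

```python
import pandas as pd

# Hypothetical row: times stored as integers, as in the pandas output above
example = pd.DataFrame({
    "ARRIVAL_DATE": ["2025-01-01"],
    "ARRIVAL_TIME": [2355],
    "SERVICE_DATE": ["2025-01-02"],
    "SERVICE_TIME": [10],
})

for prefix in ("ARRIVAL", "SERVICE"):
    # Zero-pad so that 10 becomes "0010" (i.e. 00:10) before parsing
    padded_time = example[f"{prefix}_TIME"].astype(str).str.zfill(4)
    example[f"{prefix.lower()}_datetime"] = pd.to_datetime(
        example[f"{prefix}_DATE"] + " " + padded_time,
        format="%Y-%m-%d %H%M",
    )

# Dividing a timedelta by one minute gives the wait as a float
example["waittime"] = (
    example["service_datetime"] - example["arrival_datetime"]
) / pd.Timedelta(minutes=1)

print(example[["arrival_datetime", "service_datetime", "waittime"]])
```

Here the wait is 15 minutes. Subtracting the raw time values alone (10 minus 2355) would give a meaningless negative number.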

Calculate summary statistics

Finally, we define a small utility function that calculates the mean, standard deviation, and 95% confidence interval for a numeric series. The function handles edge cases explicitly (no observations, too few observations, or no variation) and bases the interval on the t-distribution, which is appropriate for small samples.

def summary_stats(data):
    """
    Calculate mean, standard deviation and 95% confidence interval (CI).

    CI is calculated using the t-distribution, which is appropriate for
    small samples and converges to the normal distribution as the sample
    size increases.

    Parameters
    ----------
    data : pandas.Series
        Data to use in the calculation.

    Returns
    -------
    dict[str, float]
        A dictionary with keys `mean`, `std_dev`, `ci_lower` and `ci_upper`.
        Each value is a float, or `numpy.nan` if it can't be computed.
    """
    # Drop missing values
    data = data.dropna()

    # Find number of observations
    count = len(data)

    # If there are no observations, then set all to NaN
    if count == 0:
        mean, std_dev, ci_lower, ci_upper = np.nan, np.nan, np.nan, np.nan

    # With only 1 or 2 observations, report the mean but treat the standard
    # deviation and CI as unavailable
    elif count < 3:
        mean = data.mean()
        std_dev, ci_lower, ci_upper = np.nan, np.nan, np.nan

    # With more than two observations, can calculate all...
    else:
        mean = data.mean()
        std_dev = data.std()

        # If there is no variation, then CI is equal to the mean
        if np.var(data) == 0:
            ci_lower, ci_upper = mean, mean
        else:
            # 95% CI based on the t-distribution
            ci_lower, ci_upper = st.t.interval(
                confidence=0.95,
                df=count-1,
                loc=mean,
                scale=st.sem(data)
            )

    return {
        "mean": mean,
        "std_dev": std_dev,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper
    }

#' Calculate mean, standard deviation and 95% confidence interval (CI).
#'
#' CI is calculated using the t-distribution, which is appropriate for
#' small samples and converges to the normal distribution as the sample
#' size increases.
#'
#' @param data Numeric vector of data to use in the calculation.
#'
#' @return A named list with elements `mean`, `std_dev`, `ci_lower` and 
#'   `ci_upper`. Each value is a numeric, or `NA` if it can't be computed.
#'
#' @export
summary_stats <- function(data) {
  tibble::tibble(value = data) |>
    dplyr::reframe(
      n_complete = sum(!is.na(value)),
      mean = mean(value, na.rm = TRUE),
      std_dev = stats::sd(value, na.rm = TRUE),
      ci_lower   = {
        if (n_complete < 2L) {
          NA_real_
        } else if (std_dev == 0 || is.na(std_dev)) {
          mean       # CI collapses to mean when no variation
        } else {
          stats::t.test(value)$conf.int[1L]
        }
      },
      ci_upper   = {
        if (n_complete < 2L) {
          NA_real_
        } else if (std_dev == 0 || is.na(std_dev)) {
          mean       # CI collapses to mean when no variation
        } else {
          stats::t.test(value)$conf.int[2L]
        }
      }
    ) |>
    dplyr::select(-n_complete) |>
    as.list()
}

We apply summary_stats() to the waiting times.

results = summary_stats(processed_data["waittime"])

# Format dictionary for display
formatted_results = json.dumps(results, indent=4)
print(formatted_results)

{
    "mean": 4.166666666666667,
    "std_dev": 2.786873995477131,
    "ci_lower": 1.2420217719136457,
    "ci_upper": 7.091311561419689
}

results <- summary_stats(processed_data[["waittime"]])
results

$mean
[1] 4.166667

$std_dev
[1] 2.786874

$ci_lower
[1] 1.242022

$ci_upper
[1] 7.091312
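
As a cross-check, the interval can be reproduced by hand as mean ± t × s / √n, where s is the sample standard deviation and t is the 97.5th percentile of the t-distribution with n − 1 degrees of freedom. The Python sketch below hard-codes the six waiting times from the processed data shown earlier; the variable names are illustrative and are not part of the tutorial's functions.

```python
import numpy as np
import scipy.stats as st

# Waiting times (minutes) copied from the processed data shown earlier
waits = np.array([6.0, 2.0, 7.0, 7.0, 2.0, 1.0])

n = len(waits)
mean = waits.mean()
std_dev = waits.std(ddof=1)          # sample standard deviation
std_err = std_dev / np.sqrt(n)       # standard error of the mean
t_crit = st.t.ppf(0.975, df=n - 1)   # two-sided 95% critical value

ci_lower = mean - t_crit * std_err
ci_upper = mean + t_crit * std_err

print(f"mean={mean:.4f}, std_dev={std_dev:.4f}")
print(f"95% CI=({ci_lower:.4f}, {ci_upper:.4f})")
```

These values should agree with the output of summary_stats() above: a mean of about 4.17 minutes with a 95% confidence interval of roughly 1.24 to 7.09 minutes.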