Sampling in R

Author

Thomas Monks, Alison Harper, Amy Heather

Overview

By default R offers less control over random sampling than python and commercial simulation software. It uses a single random number stream for all sampling and does not allow you to create individual random number streams (each with its own seed) for each sampling distribution.

This is not ideal for DES, and has a range of impacts. The one you will likely experience is additional noise between experiments due to this lack of control. Another way to describe this is that changes in sampling distribution parameters and particularly arrival rates may cause experiments to go out of sync where the same patients experience different activity duration and routing due to random sampling differences across scenarios rather than the systematic differences you have introduced in your experiments.

The result of this random noise is that typically you will need to run a lot more replications to carefully assess difference between experiments than if it were reduced/eliminated. It is also harder to debug experiments.

This notebook will:

Demonstrate the shortcomings of a single random number stream and how noise is introduced between experiments.
Illustrate that problem with a simple simmer model that varies arrival rates
Introduce up to 25 random streams for sampling using the SimEd R package.

1. Imports

library(simmer)
library(simmer.bricks)
library(magrittr)
suppressMessages(library(simEd))
suppressMessages(library(tidyr))

2. A simple motivating example

To illustrate the impact of this design consider the example below. Imagine we are going to simulate 5 patients length of stay (LoS) in an acute hospital followed by rehabilitation. Acute LoS is exponentially distribution while rehabilitation LoS follows a uniform distribution (the choice of distribution does not matter).

IMPORTANT: To make the as results “repeatable” as possible we will set a random seed. With a single random stream we will see that this does not guarantee repeatable samples for patients between experiments.

2.1 Constants

SEED <- 42
ACUTE_MEAN <- 32.0
REHAB_MIN <- 15.0
REHAB_MAX <- 80.0

2.2 Experiment 1

n_patients <- 5

set.seed(SEED)
acute_los <- rexp(n=n_patients, rate=1.0/ACUTE_MEAN)
rehab_los <- runif(n=n_patients, min=REHAB_MIN, max=REHAB_MAX)

print(acute_los)

[1]  6.346778 21.148648  9.071713  1.222141 15.141652

print(rehab_los)

[1] 23.75333 57.70450 60.82921 44.75322 61.74230

2.3 Experiment 2

We will now reset the random stream using the same seed and limit the number of patients simulated to 2.

When we re-run the code we might expect to get

Acute Los:
6.346778 21.148648
Rehab Los:
23.75333 57.70450

But we will see that this does not happen. This is because all sampling makes use of a pseudo random number stream that generates uniformly distribution numbers \(U\)’s between 0 and 1. When only 1 stream is used for all sampling we can end up with lots of noise between experiments simply because different \(U\)’s are used for the same patients.

n_patients <- 2

set.seed(SEED)
acute_los <- rexp(n=n_patients, rate=1.0/ACUTE_MEAN)
rehab_los <- runif(n=n_patients, min=REHAB_MIN, max=REHAB_MAX)

print(acute_los)

[1]  6.346778 21.148648

print(rehab_los)

[1] 56.71346 48.74124

3. Force the order of sampling

To force the order of sampling within a single random number stream each patient must do their sampling upfront and in process order.

Below we re-run Experiments 1 and 2, but this time we make sure the process is sampled in order (acute then rehab) for each patient.

3.1 A Return to Experiment 1

n_patients <- 3

set.seed(SEED)

for (patient_i in 1:n_patients) {
    print(paste("Patient ", patient_i))
    print(paste("Acute ", rexp(n=1, rate=1.0/ACUTE_MEAN)))
    print(paste("Rehab ", runif(n=1, min=REHAB_MIN, max=REHAB_MAX)))
}

[1] "Patient  1"
[1] "Acute  6.34677797708443"
[1] "Rehab  68.9790956943762"
[1] "Patient  2"
[1] "Acute  9.07171320915222"
[1] "Rehab  48.7412366934586"
[1] "Patient  3"
[1] "Acute  15.1416521370411"
[1] "Rehab  23.7533288204577"

3.2 A Return to Experiment 2

n_patients <- 2

set.seed(SEED)

for (patient_i in 1:n_patients) {
    print(paste("Patient ", patient_i))
    print(paste("Acute ", rexp(n=1, rate=1.0/ACUTE_MEAN)))
    print(paste("Rehab ", runif(n=1, min=REHAB_MIN, max=REHAB_MAX)))
}

[1] "Patient  1"
[1] "Acute  6.34677797708443"
[1] "Rehab  68.9790956943762"
[1] "Patient  2"
[1] "Acute  9.07171320915222"
[1] "Rehab  48.7412366934586"

4. A `simmer` model

By default simmer will be affected by the single random stream for all sampling. This is because in a DES there is no guarantee that sampling will not occur in process order like we saw above. The numbers generated will vary depending on when events are scheduled to take place.

We will first consider this in experiments where we set the exact number of arrivals to the model. In these experiments we will not use resources. This means that there is no impact on the model due to queuing if we increase or decrease the number of arrivals to the model.

4.1 Experiment 1 (5 arrivals)

# set the seed
set.seed(SEED)

# create simmer environment
env <- simmer("Experiment_1", log_level=1) 

# setup simple patient trajectory
patient <- trajectory("patient_pathway") %>% 
  set_attribute("start_acute", function() {now(env)}) %>%
  timeout(function() rexp(1, rate=1.0/ACUTE_MEAN)) %>% 
  set_attribute("acute_los", function() {now(env) - get_attribute(env, "start_acute")}) %>%
  log_(function() {paste("Acute LoS ", now(env) - get_attribute(env, "start_acute"))},
       level=1) %>%
  set_attribute("start_rehab", function() {now(env)}) %>%
  timeout(function() runif(n=1, min=REHAB_MIN, max=REHAB_MAX)) %>% 
  set_attribute("rehab_los", function() {now(env) - get_attribute(env, "start_rehab")}) %>%
  log_(function() {paste("Rehab LoS ", now(env) - get_attribute(env, "start_rehab"))},
       level=1)


env %>% 
  # add 5 arrivals all at the same time.
  add_generator("patient", patient, at(0, 0, 0, 0, 0)) %>% 
  invisible
 
env %>%
  run() %>% 
  invisible

1.22214: patient3: Acute LoS  1.2221407443285
6.34678: patient0: Acute LoS  6.34677797708443
9.07171: patient2: Acute LoS  9.07171320915222
15.1417: patient4: Acute LoS  15.1416521370411
21.1486: patient1: Acute LoS  21.1486480683088
24.9755: patient3: Rehab LoS  23.7533288204577
59.8949: patient4: Rehab LoS  44.7532154561486
64.0513: patient0: Rehab LoS  57.704498876119
69.9009: patient2: Rehab LoS  60.8292109624017
82.8909: patient1: Rehab LoS  61.7422963574063

4.2 Experiment 2 (3 arrivals)

Here we setup the model to simulate 3 patients that all arrive as the unit opens. Arrival times are the same, so we may expect the acute and rehab lengths of stay to remain the same. However, we can see that the acute length of stay and rehab length of stay quickly goes out of sync i.e. we have introduced noise between experiments that is nothing to do with the variation in the number of patients (that we changed between experiments). Let’s take patient 0 as an example.

In experiment 1:

Acute treatment is 6.3 days
Rehab treatment is 57.7 days

In experiment 2:

Acute treatment is 6.3 days
Rehab treatment is 48.7 days

# reset the seed
set.seed(SEED)

# create simmer environment
env <- simmer("Experiment_2", log_level=1) 

env %>% 
  # now limit to 3 patients.
  add_generator("patient", patient, at(0, 0, 0)) %>% 
  invisible
 
env %>%
  run() %>% 
  invisible

6.34678: patient0: Acute LoS  6.34677797708443
9.07171: patient2: Acute LoS  9.07171320915222
21.1486: patient1: Acute LoS  21.1486480683088
44.902: patient1: Rehab LoS  23.7533288204577
55.088: patient0: Rehab LoS  48.7412366934586
71.95: patient2: Rehab LoS  62.8782404516824

5. A `simmer` model with random arrivals

Finally we demonstrate that the effect is still observed across two experiments that vary the parameter of exponentially distributed inter-arrival times. We will run two new experiments. In the first IAT is 10 minutes. The second experiment increases the intensity of arrivals to an IAT of 2.0 minutes.

The function get_results_for_patient helps us trace patient 0 as they flowed through the model. The results illustratrate that the acute LoS remains the same, but the rehab LoS is different. As a single stream of random numbers was used, we were unable to control the order in which \(U\)’s were used to generate samples from the rehab L distributions.

# helper function to process results 
get_results_for_patient <- function(sim_env, patient_id){
  results <- subset(get_mon_attributes(sim_env), select = c(name, key, value))
  results <- spread(results, key, value)
  return(results[results$name == patient_id,])
}

5.1 IAT ~ Exponential(10.0)

mean_iat = 10.0

# reset the seed
set.seed(SEED)

# create simmer environment
env <- simmer("Experiment_simmer1") 

env %>% 
  # expontially distr arrivals mean IAT = 10.0
  add_generator("patient", patient, function() rexp(1, rate=1.0/mean_iat),
                mon=2) %>% 
  invisible
 
env %>%
  run(90) %>% 
  invisible


results_exp1 <- get_results_for_patient(env, "patient0")

5.2 IAT ~ Exponential(2.0)

mean_iat = 2.0

# reset the seed
set.seed(SEED)

# create simmer environment
env <- simmer("Experiment_simmer2")

env %>% 
  # modify expontially distr arrivals to mean IAT = 9.0
  add_generator("patient", patient, function() rexp(1, rate=1.0/mean_iat),
                mon=2) %>% 
  invisible
 
env %>%
  run(90) %>% 
  invisible

results_exp2 <- get_results_for_patient(env, "patient0")

comparison <- rbind(results_exp1, results_exp2)
comparison$name <- c("Patient 0 in Exp1", "Patient 0 in Exp2")
comparison

name	acute_los	rehab_los	start_acute	start_rehab
Patient 0 in Exp1	9.071713	60.82921	1.9833681	11.055081
Patient 0 in Exp2	9.071713	76.53344	0.3966736	9.468387

6. Using the `simEd` package

simEd is an R package aimed at improving simulation education. It makes use of a package called rstream that provides multiple random number streams for DES. simEd provides up to 25 streams and re-implements a useful selection of statistical distributions that can be used.

Details of the package can be found here: https://www.rdocumentation.org/packages/simEd/versions/2.0.1

B. Lawson and L. M. Leemis, “An R package for simulation education,” 2017 Winter Simulation Conference (WSC), Las Vegas, NV, USA, 2017, pp. 4175-4186, doi: 10.1109/WSC.2017.8248124 https://ieeexplore.ieee.org/document/8248124

6.1 Revisiting our initial experiment

The code below re-implements the original experiments we conducted using simEd equivalent sampling functions. The main difference is that we prefix functions with v instead of r. For example rexp becomes vexp. We also introduce a third parameter called stream (settable 1:25).

This time when we reduce the number of patients from 5 to 2 the acute and rehab samples remain the same for the first two patients.

n_patients <- 5

set.seed(SEED)
# replace rexp with vexp and set stream number to 1
acute_los <- vexp(n=n_patients, rate=1.0/ACUTE_MEAN, stream=1)

# replace runif with vunif and set stream number to 2
rehab_los <- vunif(n=n_patients, min=REHAB_MIN, max=REHAB_MAX, stream=2)

print(acute_los)

[1]  0.4526612 13.6325644 29.0375715 38.6327714 24.0971868

print(rehab_los)

[1] 79.39406 54.01103 59.64871 55.21520 36.20446

n_patients <- 2

set.seed(SEED)
# replace rexp with vexp and set stream number to 1
acute_los <- vexp(n=n_patients, rate=1.0/ACUTE_MEAN, stream=1)

# replace runif with vunif and set stream number to 2
rehab_los <- vunif(n=n_patients, min=REHAB_MIN, max=REHAB_MAX, stream=2)

print(acute_los)

[1]  0.4526612 13.6325644

print(rehab_los)

[1] 79.39406 54.01103

6.2 An updated simmer model

Finally we update the simmer model to use simEd and compare the results for patient 0 again. As expected we confirm that the sampling of acute and rehab duration is now in sync across the experiments.

# reset the seed
set.seed(SEED)

# create simmer environment
env <- simmer("Experiment_1_simEd") 

# redefine simple patient trajectory to use simEd variate functions
# each distribution in the model has its own stream
patient <- trajectory("patient_pathway") %>% 
  set_attribute("start_acute", function() {now(env)}) %>%
  timeout(function() vexp(1, rate=1.0/ACUTE_MEAN, stream=1)) %>%  
  set_attribute("acute_los", function() {now(env) - get_attribute(env, "start_acute")}) %>%
  set_attribute("start_rehab", function() {now(env)}) %>%
  timeout(function() vunif(n=1, min=REHAB_MIN, max=REHAB_MAX, stream=2)) %>% 
  set_attribute("rehab_los", function() {now(env) - get_attribute(env, "start_rehab")}) 

env %>% 
  # exponentially distr arrivals mean IAT = 10.0
  add_generator("patient", patient, function() vexp(1, rate=1.0/10.0, stream=3),
                mon=2) %>% 
  invisible
 
env %>%
  run(90) %>% 
  invisible


# store results for experiment 1 and patient 0
results_exp1 <- get_results_for_patient(env, "patient0")

# reset the seed
set.seed(SEED)

# create simmer environment
env <- simmer("Experiment_2_simEd") 

env %>% 
  # exponentially distr arrivals mean IAT = 9.0
  add_generator("patient", patient, function() vexp(1, rate=1.0/2.0, stream=3),
                mon=2) %>% 
  invisible
 
env %>%
  run(90) %>% 
  invisible

# store results for experiment 2 and patient 0
results_exp2 <- get_results_for_patient(env, "patient0")

# updated comparison
comparison <- rbind(results_exp1, results_exp2)
comparison$name <- c("Patient 0 in Exp1", "Patient 0 in Exp2")
comparison

name	acute_los	rehab_los	start_acute	start_rehab
Patient 0 in Exp1	0.4526612	79.39406	2.2381819	2.6908430
Patient 0 in Exp2	0.4526612	79.39406	0.4476364	0.9002976