Day 3 – Reproducing Lim et al. 2020

Note

Add parallel processing and reproduced supplementary tables 2-4 and Figure 2. Total time used: 10h 7m (25.3%)

09.40-11.12, 11.17-11.27: Adding parallel processing

Having now run the full model code that was provided, I can see that the loop produced a 10x4 table (for each of the end by day 7, 14 and 21). It varies:

Staff strength (4, 6)
Frequency of staff change (1, 3, 7, 14, 21)
Number of staff per shift (5, 10, 20, 30)

It doesn’t include:

Staff strength 2 - should be easy to add to loop
Scenarios from paper (e.g. varying probability of secondary infection) - although that example is easy to change with p = 0.15 # secondary attack rate

It had quite a long run time (19m 8s) which will make it difficult to run all these variants. Hence, I will first try to introduce parallel processing to help speed up the reproduction.

First, I modified model.py so the model is run by a function, that is then more easily callable from the reproduction notebook. Initially this was a single function, but then I realised I needed to divide it into a function that runs the model with a single set of parameters, and a function that loops through scenarios, to be easily compatable with parallel processing
To simplify/improve reusability of varying the number of staff per shift, I changed the results dataframe index from 0-4 to be based on the provided variants for staff per shift (e.g. 5, 10, 20, 30)
Whilst running these, noted we do so fluctuation between re-runs (e.g. 0.47 to 0.54 for day 21 4-1)
Then I introduced from multiprocessing import Pool and modified the nested for loop so that we can instead apply the function in parallel. I also had to modify how the results dataframes as created, as they were all modifying the same objects, which wouldn’t be possible in parallel.

This now ran in 2m 40s - 3m 6s. This is great, and will make running all the different scenarios much easier. By eye, the variation in results before and after parallel processing is very minimal, and all within the range I would expect due to there being no seed control.

Hid warnings and print progress message to enable cleaner output.

11.31-11.36, 11.39-12.05: Adding parameters

Modified model.py so it has staff strength [2, 4, 6] but received error:

File "/home/amy/Documents/stars/stars-reproduce-lim-2020/reproduction/scripts/model.py", line 175, in run_model
    fillroster1(staff_pool,f,Nday,stafflist,roster)
  File "/home/amy/Documents/stars/stars-reproduce-lim-2020/reproduction/scripts/model.py", line 65, in fillroster1
    temp = random.sample(stafflist[stafflist.loc[:,'rest']==0].index.values.tolist(),k=Nday)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/random.py", line 363, in sample
    raise ValueError("Sample larger than population or is negative")
ValueError: Sample larger than population or is negative

I think this is related to strength 2, which is set to NA in the paper when there are two or three shifts per day, stating that it is “not available as the number of staff per shift was too low to simulate under the required conditions”. Lim et al. (2020)

I wasn’t certain how to alter the shifts per day from the model, but I tried commenting out shift 2 and 3 in contact(). to see if the model would run. However, I still got the error from above. I introduced error handling, to set the result to NaN if there was a Value Error. With those sections still commented, and only running strength 2, the only results it output were for 2-21.

I tried uncommenting shifts 2 and 3 and running for strength 2. However, this again only output results for 2-21.

12.40-13.18, 13.20-13.44: Troubleshooting staff strength 2

We don’t expect it to work when there are three shifts per day, so I tried again to figure out how to successfully change it to one shift per day…

Commented out shift 2 and 3 sections from contact() abd creation of shift 2 and 3 rosters in restartsim(). Got error:

  File "/home/amy/Documents/stars/stars-reproduce-lim-2020/reproduction/scripts/model.py", line 169, in run_model
    fillroster1(staff_pool,f,Nday,stafflist,roster)
  File "/home/amy/Documents/stars/stars-reproduce-lim-2020/reproduction/scripts/model.py", line 63, in fillroster1
    roster.iloc[f*i+j] = temp
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/indexing.py", line 670, in __setitem__
    iloc._setitem_with_indexer(indexer, value)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/indexing.py", line 1802, in _setitem_with_indexer
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 534, in setitem
    return self.apply("setitem", indexer=indexer, value=value)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 406, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 885, in setitem
    values[indexer] = value
ValueError: cannot copy sequence with size 18 to array axis with dimension 10
"""

This might actually be the issue we were looking for - that we can’t fill up the roster with that staff number. I tried running with strength 4, which should work regardless, but get a similar error

  File "/home/amy/Documents/stars/stars-reproduce-lim-2020/reproduction/scripts/model.py", line 169, in run_model
    fillroster1(staff_pool,f,Nday,stafflist,roster)
  File "/home/amy/Documents/stars/stars-reproduce-lim-2020/reproduction/scripts/model.py", line 63, in fillroster1
    roster.iloc[f*i+j] = temp
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/indexing.py", line 670, in __setitem__
    iloc._setitem_with_indexer(indexer, value)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/indexing.py", line 1802, in _setitem_with_indexer
    self.obj._mgr = self.obj._mgr.setitem(indexer=indexer, value=value)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 534, in setitem
    return self.apply("setitem", indexer=indexer, value=value)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/internals/managers.py", line 406, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/pandas/core/internals/blocks.py", line 885, in setitem
    values[indexer] = value
ValueError: cannot copy sequence with size 9 to array axis with dimension 5

To be sure it’s not an error I’ve introduced, I went back to the original model code and made a fresh notebook, commenting out the same sections, but with the same result
I worked through the code, trying to spot anything I should’ve also changed, and noticed that:
- Nday = staffpershift1 + staffpershift2 + staffpershift3
- Then used in fillroster1(staff_pool,f,Nday)
- As temp = random.sample(stafflist[stafflist.loc[:,'rest']==0].index.values.tolist(),k=Nday)
I changed that to Nday = staffpershift1, and this then worked!

I then add a parameter to run_scenarios() to control number of shifts per day, and made those commented out sections into conditional logic depending on whether it’s 1, 2 or 3 days.

I had to change the results dataframe (as it was set up to only save results for one variant of shifts per day). I decided the simplest solution - that would also make creating the figures later easier - would be to save it in long format.

I also set the results to NaN if it found that the input parameters were staff strength 2 and more than one shift per day.

I then ran the whole model, with all the parameter variants.

13.55-14.09: Supplementary table 2

Compared the results for supplementary table 2, and found nearly all were identical.

The maximum absolute difference observed was 0.07, which is within the normal variation I observed above between runs of the model without seed control.

Hence, consider this successfully reproduced at 14.09.

import sys
sys.path.append('../')
from timings import calculate_times

# Minutes used prior to today
used_to_date = 277

# Times from today
times = [
    ('09.40', '11.12'),
    ('11.17', '11.27'),
    ('11.31', '11.36'),
    ('11.39', '12.05'),
    ('12.40', '13.18'),
    ('13.20', '13.44'),
    ('13.55', '14.09')]

calculate_times(used_to_date, times)

Time spent today: 209m, or 3h 29m
Total used to date: 486m, or 8h 6m
Time remaining: 1914m, or 31h 54m
Used 20.2% of 40 hours max

14.10-14.14, 14.25-14.28, 14.33-14.44, 14.49-14.58: Supplementary table 3 and 4

Re-ran and renamed base scenario with 15% probability of secondary infection, and then ran model again but with 5% and 30% probability. This took about 21 minutes.

While that ran, I worked on speeding up the model code further, by changing the n(0,100) for loop into a map statement.

Max difference for each was 0.1, which I feel is within normal variation observed. This was reaffirmed by looking at the few examples where the results were different by more than 0.05, which were few and not felt to be noteable.

Further confirmed by the fact that, upon re-run, the max difference for supplementary table 2 was now 0.12, when before it was 0.07.

In all cases, the overwhelming majority of points are very similar, and these could be considered normal variation within the stochasticity of the model.

Would considered each complete at 14.58.

import sys
sys.path.append('../')
from timings import calculate_times

# Minutes used prior to today
used_to_date = 277

# Times from today
times = [
    ('09.40', '11.12'),
    ('11.17', '11.27'),
    ('11.31', '11.36'),
    ('11.39', '12.05'),
    ('12.40', '13.18'),
    ('13.20', '13.44'),
    ('13.55', '14.09'),
    ('14.10', '14.14'),
    ('14.25', '14.28'),
    ('14.33', '14.44'),
    ('14.49', '14.58')]

calculate_times(used_to_date, times)

Time spent today: 236m, or 3h 56m
Total used to date: 513m, or 8h 33m
Time remaining: 1887m, or 31h 27m
Used 21.4% of 40 hours max

15.08-15.19: Improving model run time

Attempted to improve run time further, although then decided not to pursue this, as I have already add parallel processing, and as had hit an error and didn’t want to spend too much time on it as not essential to reproduciton.

  File "/home/amy/Documents/stars/stars-reproduce-lim-2020/reproduction/scripts/model.py", line 275, in run_model
    res_median = np.median(results, axis=0)
  File "<__array_function__ internals>", line 5, in median
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3520, in median
    r, k = _ureduce(a, func=_median, axis=axis, out=out,
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3429, in _ureduce
    r = func(a, **kwargs)
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/numpy/lib/function_base.py", line 3555, in _median
    part = partition(a, kth, axis=axis)
  File "<__array_function__ internals>", line 5, in partition
  File "/home/amy/mambaforge/envs/lim2020/lib/python3.8/site-packages/numpy/core/fromnumeric.py", line 748, in partition
    a.partition(kth, axis=axis, kind=kind, order=order)
TypeError: '<' not supported between instances of 'NoneType' and 'NoneType'
"""

15.25-15.42, 15.49-16.55: Figure 2

The total staff is 6, 120, or 4 * 30 staff per shift. The total number of staff is expressed as multiples of staff in the first shift, and referred to as staff strength, and discussed in section 2.1.2.

I realised that my current runs of the model had just been outputting the results for day 7, 14 and 21, but I would need to keep all the results to produce these figures. I modified the model.py to allow this, and then re-ran the base 15 model.

The model varies shifts_per_day, staff_per_shift and strength.

I was initially unsure on the frequency of staff changes. The shift arrangement is discussed in 2.1.3: “The staff was assumed to change shift after a single day (shift), as well as after working 3, 7, 14, 21 consecutive days. After each shift, the simulated staff is assumed to return and stay at home for at least the same number of days as the shift before being randomly assigned to a new shift with the other off-duty colleagues.” Lim et al. (2020) The figure 2 legend mentions that the simulated staff work non-consecutive days. From these sections, I think I am correct in assuming that the default is for it to be freq=1, and so I have set that.

Add matplotlib to environment but didn’t specify to older version, as appears the figures in the article are produced in excel.

Wrote code to extract the required results from the dataframe and produce the figure.

Looks just like paper. Happy that this is reproduced at 16.55.

Timings

import sys
sys.path.append('../')
from timings import calculate_times

# Minutes used prior to today
used_to_date = 277

# Times from today
times = [
    ('09.40', '11.12'),
    ('11.17', '11.27'),
    ('11.31', '11.36'),
    ('11.39', '12.05'),
    ('12.40', '13.18'),
    ('13.20', '13.44'),
    ('13.55', '14.09'),
    ('14.10', '14.14'),
    ('14.25', '14.28'),
    ('14.33', '14.44'),
    ('14.49', '14.58'),
    ('15.08', '15.19'),
    ('15.25', '15.42'),
    ('15.49', '16.55')]

calculate_times(used_to_date, times)

Time spent today: 330m, or 5h 30m
Total used to date: 607m, or 10h 7m
Time remaining: 1793m, or 29h 53m
Used 25.3% of 40 hours max

References

Lim, Chun Yee, Mary Kathryn Bohn, Giuseppe Lippi, Maurizio Ferrari, Tze Ping Loh, Kwok-Yung Yuen, Khosrow Adeli, and Andrea Rita Horvath. 2020. “Staff Rostering, Split Team Arrangement, Social Distancing (Physical Distancing) and Use of Personal Protective Equipment to Minimize Risk of Workplace Transmission During the COVID-19 Pandemic: A Simulation Study.” Clinical Biochemistry 86 (December): 15–22. https://doi.org/10.1016/j.clinbiochem.2020.09.003.