Day 5

reproduce
Author

Amy Heather

Published

September 4, 2024

Note

Fixed my loading of epicR so I now get model results! Experimented with agent numbers and started Table 3. Total time used: 6h 11m (15.5%)

09.19-09.41: Returning to run the model

Did a quick review of my prior logbook entries to get back up to speed. I will be trying to run Mini_Analysis.Rmd, which I created previously; it contains only scenario S1NoCD and saves the result to CSV (whilst the full script Case_Detection_Results.Rmd contains many more scenarios). I had previously reduced the number of agents (settings$n_base_agents <-) to try to run on my local machine, but this just returned 0/1/NaN, so I will now try with the full number of agents.

Case_Detection_Results.Rmd sets settings$n_base_agents <- 50e+6 but has the comment “change number of agents here: 100e+6 is used in the analysis”.

I decided to start with trying Mini_Analysis.Rmd with 50e+6. I logged into the remote computer, cloned the GitHub repository, then ran the following:

cd stars-reproduce-johnson-2021/reproduction
Rscript -e "renv::restore()"
Rscript -e "rmarkdown::render('scripts/Mini_Analysis.Rmd')"

This, however, returned an error that pandoc was required, so I ran sudo apt install pandoc and tried again. The .Rmd then rendered successfully. I initially ran with 5000 agents to make sure it worked and output correctly, and then with the original 50e+6 as mentioned.

11.29-11.34, 11.48-12.00: Reviewing results from 50e+6

This ran in 1.642896 hours (1 hour 38 minutes). The result was still wrong though:

import pandas as pd
pd.read_csv('mini_analysis_50e6.csv')
Scenario Agents PatientYears CopdPYs NCaseDetections DiagnosedPYs OverdiagnosedPYs SABA LAMA LAMALABA ... NoCOPD GOLD1 GOLD2 GOLD3 GOLD4 Cost CostpAgent QALY QALYpAgent NMB
0 S1NoCD 0 0 0 0 0 0 NaN NaN NaN ... 0 0 0 0 0 0 NaN 0 NaN NaN

1 rows × 29 columns

From the Case_Detection_Results.md in their repository, I can see that I should be getting numbers like 37195539 agents, 625946560 patient years, 71206641 CopdPYs, and so on.
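In hindsight, a failed run like this could be caught automatically before spending compute on larger agent counts. A minimal stdlib sketch (the row dict below is hypothetical, mimicking one row of the CSV; a real check would read the file):

```python
import math

# One row of the failed output, mimicking mini_analysis_50e6.csv (hypothetical layout)
row = {"Scenario": "S1NoCD", "Agents": 0, "PatientYears": 0.0,
       "CostpAgent": float("nan")}

def looks_failed(row):
    """Flag a result row whose agent count is zero or whose numeric
    values are all zero/NaN, as in the broken run above."""
    numeric = [v for v in row.values() if isinstance(v, (int, float))]
    return row["Agents"] == 0 or all(v == 0 or math.isnan(v) for v in numeric)
```

Running this straight after the model would have flagged the 50e+6 run as broken before any manual inspection.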

I’m sceptical that increasing to 100e+6 would fix that issue, as their notebooks appear to have been run with 50e+6, as I did.

I double-checked the code in Mini_Analysis.Rmd, but was satisfied that it matched theirs; the only changes were importing epicR from the local folder, adding a timer, and not having some processing of the table (which makes no difference here, as the numbers are 0 and NaN).

I’m presuming the issue is therefore in the epicR package I have, but I’m rather unsure where to start, or where the issue might be!

12.01-12.24: Basic run of epicR to resolve the “0” results issue

As a starting point, I tried following the instructions for running epicR generally (and not this specific study), to see if I could get it working and printing some results for a basic scenario.

On https://github.com/resplab/epicR, they give code for running epicR with default settings:

library(epicR)
init_session()
run()
Cget_output()
terminate_session()

And the default inputs can be viewed with:

input <- get_input()

However, I later realised this was a newer function in epicR and not in the version we have.

I loaded my local epicR package and, as I had not set warning=FALSE, realised it produced a warning:

Warning: ── Conflicts ───────────────────────────────────────────────────────────────────────────────── epicR conflicts
──
✖ `Cget_agent_events` masks `epicR::Cget_agent_events()`.
✖ `Cget_all_events` masks `epicR::Cget_all_events()`.
✖ `Cget_all_events_matrix` masks `epicR::Cget_all_events_matrix()`.
  … and more.
ℹ Did you accidentally source a file rather than using `load_all()`?
  Run `rm(list = c("Cget_agent_events", "Cget_all_events", "Cget_all_events_matrix", "Cget_event",
  "Cget_events_by_type", "Cget_inputs", "Cget_n_events", "Cget_output", "Cget_output_ex", "Cget_runtime_stats",
  "Cget_settings", "Cget_smith", "Cset_input_var", "Cset_settings_var", "get_sample_output", "mvrnormArma"))` to
  remove the conflicts.

I found that, if I ran the whole script with just the local devtools::load_all("../epicR"), it would appear to run but give results of 0. If, however, I cleared the environment and then ran library(epicR) below the load_all() statement, it would run and give actual results. I presume this relates to the conflicts message above: when I clear the environment, I remove the objects, states and data that came with load_all() but keep the package itself, so I am left with a “clean slate” with the package.

12.25-12.30, 14.08-14.53, 15.10-15.14, 15.20-15.25: Fixing my mini analysis and returning to lower numbers of agents

I removed Basic_Model.Rmd and returned to Mini_Analysis.Rmd, adding the new workflow of clearing the environment and reloading the library. I then ran it with fewer agents. This output results - hurrah!

As I’d tried before, I ran these (either on my local machine or the more high-powered remote machine):

  • 5000 (5e3)
  • 50,000 (5e4) (17 seconds local, 7.2 seconds remote)
  • 500,000 (5e5) (2.6 minutes local, 1.0 minutes remote)
  • 1 million (1e6) (4.7 minutes local, 2.0 minutes remote)
  • 5 million (5e6) (27.3 minutes local, 10.0 minutes remote)

And it was quicker than before, too (17s vs 2.4m for 5e4; 2.6m vs 18.2m for 5e5, both local).
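The remote runtimes above scale roughly linearly with agent count, so it is easy to sketch what a full 50e6 run would cost under that assumption (a rough extrapolation, not a measurement):

```python
# Remote runtimes from the runs above: (agents, seconds)
timings = [(5e4, 7.2), (5e5, 60.0), (1e6, 120.0), (5e6, 600.0)]

# Seconds per agent is roughly constant, i.e. runtime is ~linear in agents
rate = sum(seconds / agents for agents, seconds in timings) / len(timings)

# Extrapolate to the full 50e6 agents used in Case_Detection_Results.Rmd
estimate_hours = rate * 50e6 / 3600
print(f"Estimated 50e6 runtime: ~{estimate_hours:.1f} h remote")
```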

import pandas as pd

# Import results
df5e3 = pd.read_csv('mini_analysis_fixed_5e3.csv').rename(index={0: '5e3'})
df5e4 = pd.read_csv('mini_analysis_fixed_5e4.csv').rename(index={0: '5e4'})
df5e5 = pd.read_csv('mini_analysis_fixed_5e5.csv').rename(index={0: '5e5'})
df1e6 = pd.read_csv('mini_analysis_fixed_1e6.csv').rename(index={0: '1e6'})
df5e6 = pd.read_csv('mini_analysis_fixed_5e6.csv').rename(index={0: '5e6'})

# Combine into a single dataframe and show
df_s1nocd = pd.concat([df5e3, df5e4, df5e5, df1e6, df5e6])
df_s1nocd
Scenario Agents PatientYears CopdPYs NCaseDetections DiagnosedPYs OverdiagnosedPYs SABA LAMA LAMALABA ... NoCOPD GOLD1 GOLD2 GOLD3 GOLD4 Cost CostpAgent QALY QALYpAgent NMB
5e3 S1NoCD 3696 6.233997e+04 7.457940e+03 18933 1.396640e+03 1387 0.017348 0.118369 0.148212 ... 52174 3045 3113 749 178 8.072027e+06 2183.990032 4.642803e+04 12.561696 625900.832686
5e4 S1NoCD 37129 6.235741e+05 7.094660e+04 190251 1.322909e+04 13394 0.017161 0.134433 0.152504 ... 525258 29237 30306 6618 1303 7.887546e+07 2124.362691 4.649150e+05 12.521615 623956.370519
5e5 S1NoCD 371579 6.251100e+06 7.066933e+05 1907271 1.310409e+05 133904 0.017192 0.136018 0.151516 ... 5270314 287267 304912 67938 12088 7.985673e+08 2149.118458 4.660359e+06 12.542042 624952.967972
1e6 S1NoCD 743128 1.249876e+07 1.421786e+06 3814132 2.622621e+05 267345 0.017119 0.135762 0.151370 ... 10529668 577647 613813 136603 24218 1.600396e+09 2153.593806 9.317379e+06 12.538055 624749.138617
5e6 S1NoCD 3718120 6.256653e+07 7.114655e+06 19087564 1.318087e+06 1336660 0.017180 0.135601 0.151460 ... 52712576 2886521 3074656 685563 119768 7.978907e+09 2145.951883 4.663988e+07 12.543942 625051.169501

5 rows × 29 columns

I can’t calculate the ICER yet, as it needs incremental costs and QALYs, and those require scenarios S1a, S1b and S1c (which are compared against S1NoCD) to produce incrementalcosts, incrementalqaly and incrementalnmb.
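For reference, the incremental quantities are simple arithmetic once those scenarios exist. A minimal sketch, assuming a willingness-to-pay of 50,000 per QALY (an assumption, though it is consistent with NMB ≈ 50,000 × QALYpAgent − CostpAgent in the tables in this entry):

```python
WTP = 50_000  # assumed willingness-to-pay per QALY (matches NMB = WTP*QALY - Cost here)

def incremental(cost, qaly, cost_ref, qaly_ref, wtp=WTP):
    """Incremental cost, QALY, ICER and incremental NMB of a case-detection
    scenario against its no-case-detection comparator (e.g. S1a vs S1NoCD)."""
    d_cost = cost - cost_ref
    d_qaly = qaly - qaly_ref
    icer = d_cost / d_qaly if d_qaly else float("nan")
    inmb = wtp * d_qaly - d_cost
    return d_cost, d_qaly, icer, inmb

# Hypothetical per-agent values, in the same ballpark as the runs in this entry
d_cost, d_qaly, icer, inmb = incremental(2421.29, 12.5588, 2149.12, 12.5420)
```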

Comparing against S1NoCD from Case_Detection_Results.md:

  • Some columns are different as their numbers depend on number of agents:
    • Agents
    • PatientYears
    • CopdPYs
    • NCaseDetections
    • DiagnosedPYs
    • OverdiagnosedPYs
    • Mild, Moderate, Severe, VerySevere
    • NoCOPD
    • GOLD1, GOLD2, GOLD3, GOLD4
    • Cost
    • QALY
  • But some are clearly processed and so should look at these to see if same results:
    • SABA
    • LAMA
    • LAMALABA
    • ICSLAMALABA
    • MildPY, ModeratePY, SeverePY, VerySeverePY
    • CostpAgent
    • QALYpAgent
    • NMB

I filtered to those columns and added the results from Case_Detection_Results.md:

# Filter to columns not influenced by agent number
s1 = df_s1nocd[[
  'SABA', 'LAMA', 'LAMALABA', 'ICSLAMALABA', 'MildPY','ModeratePY', 'SeverePY',
  'VerySeverePY', 'CostpAgent', 'QALYpAgent', 'NMB']]

# Add original results to dataframe
original = {
  'SABA': 0.017,
  'LAMA': 0.135,
  'LAMALABA': 0.151,
  'ICSLAMALABA': 0.080,
  'MildPY': 0.217,
  'ModeratePY': 0.041,
  'SeverePY': 0.068,
  'VerySeverePY': 0.006,
  'CostpAgent': 2150.057,
  'QALYpAgent': 12.544,
  'NMB': 625073.2
}
original_row = pd.DataFrame(original, index=['original'])
s1 = pd.concat([s1, original_row])

# View dataframe, rounding to 3dp (as they present it in original)
round(s1, 3)
SABA LAMA LAMALABA ICSLAMALABA MildPY ModeratePY SeverePY VerySeverePY CostpAgent QALYpAgent NMB
5e3 0.017 0.118 0.148 0.076 0.233 0.035 0.062 0.006 2183.990 12.562 625900.833
5e4 0.017 0.134 0.153 0.081 0.212 0.039 0.067 0.006 2124.363 12.522 623956.371
5e5 0.017 0.136 0.152 0.081 0.215 0.041 0.069 0.006 2149.118 12.542 624952.968
1e6 0.017 0.136 0.151 0.080 0.216 0.041 0.068 0.006 2153.594 12.538 624749.139
5e6 0.017 0.136 0.151 0.080 0.217 0.041 0.068 0.006 2145.952 12.544 625051.170
original 0.017 0.135 0.151 0.080 0.217 0.041 0.068 0.006 2150.057 12.544 625073.200

We can see that many columns look very similar between them. However, for MildPY, CostpAgent, QALYpAgent and NMB, the 5e5 result is markedly closer to the original than 5e4.

I then added 1 million (1e6) and 5 million (5e6) (already shown above), and these looked pretty similar to 500,000 (5e5).
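To quantify "markedly better" and "pretty similar", a quick check of each run's worst relative error against the original, using the rounded summary values from the table above:

```python
# Rounded per-agent results from the table above, vs the original study's values
runs = {
    "5e4": {"CostpAgent": 2124.363, "QALYpAgent": 12.522, "NMB": 623956.371},
    "5e5": {"CostpAgent": 2149.118, "QALYpAgent": 12.542, "NMB": 624952.968},
    "5e6": {"CostpAgent": 2145.952, "QALYpAgent": 12.544, "NMB": 625051.170},
}
original = {"CostpAgent": 2150.057, "QALYpAgent": 12.544, "NMB": 625073.200}

# Worst relative error per run, across the three summary columns
worst = {
    name: max(abs(row[k] - original[k]) / abs(original[k]) for k in original)
    for name, row in runs.items()
}
for name, err in worst.items():
    print(f"{name}: worst relative error {err:.3%}")
```

On these columns the 5e4 run is over 1% out on CostpAgent, while 5e5 and above stay well under that.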

Looking at Case_Detection_Results.Rmd, there are 12 scenarios, and then another 12 for 5 years, so a total of 24.

I figured I could try running the whole thing with 500,000 to begin with, upping it to 1 million or 5 million if needed; 50 million probably isn’t necessary, as I anticipate getting similar enough results with lower numbers.
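A back-of-envelope budget supports that plan; the per-scenario minutes below assume each of the 24 scenarios takes roughly as long as the S1NoCD runs above (an assumption, and the 50e6 figure is itself extrapolated assuming linear scaling):

```python
n_scenarios = 24  # 12 scenarios plus the same 12 over 5 years

# Assumed remote minutes per scenario, from the S1NoCD runs above
per_scenario_min = {"5e5": 1.0, "1e6": 2.0, "5e6": 10.0, "50e6": 100.0}

total_hours = {k: n_scenarios * m / 60 for k, m in per_scenario_min.items()}
for agents, hours in total_hours.items():
    print(f"{agents}: ~{hours:.1f} h for all {n_scenarios} scenarios")
```

Under these assumptions the full analysis is well under an hour at 5e5 but roughly a 40-hour job at 50e6.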

15.26-15.38, 15.58-16.08, 16.18-16.53: Running the full original script

I re-copied the original script and only made the following amendments before running:

  • Altering import of epicR
  • Adding a timer
  • Reducing agent number from 50e6 (50 million) to 5e5 (500 thousand)
  • Saving results tables to csv files

I then ran it on the remote machine. It took a bit of tweaking to get saving into the outputs folder right: I initially got the path wrong, and it also wouldn’t work until I had made the “outputs” folder.

Finishing this, I got five results tables (saved to csv).

pd.read_csv('sall_5e5.csv').sort_values(by='Scenario')
Scenario Agents Cost CostpAgent QALY QALYpAgent ICER IncrementalNMB
6 S1NoCD 371579 7.985673e+08 2149.118458 4.660359e+06 12.542042 NaN 0.000000
2 S1a 371902 9.004813e+08 2421.286568 4.670660e+06 12.558846 16195.977030 568.065578
10 S1b 372140 8.821882e+08 2370.581549 4.668622e+06 12.545338 67189.317529 -56.657807
3 S1c 371919 8.780531e+08 2360.871744 4.668922e+06 12.553600 18321.203589 366.138021
7 S2NoCD 219759 5.923556e+08 2695.478273 2.696622e+06 12.270813 NaN 0.000000
1 S2a 221233 6.444787e+08 2913.121915 2.718304e+06 12.287065 13391.766206 594.958813
8 S3NoCD 172154 4.944031e+08 2871.865595 1.934054e+06 11.234440 NaN 0.000000
4 S3a 172660 5.270176e+08 3052.343287 1.941269e+06 11.243307 20354.541872 262.857493
5 S3b 172336 5.420539e+08 3145.331710 1.937179e+06 11.240712 43606.186231 40.097325
0 S3c 172314 5.350466e+08 3105.067680 1.939341e+06 11.254689 11516.850740 779.236513
9 S3d 172882 5.427277e+08 3139.295863 1.943060e+06 11.239224 55904.120380 -28.243723
pd.read_csv('ceplane_5e5.csv')
Scenario Agents PropAgents Cost CostpAgent CostpAgentExcluded CostpAgentAll QALY QALYpAgent QALYpAgentExcluded QALYpAgentAll IncrementalCosts IncrementalQALY ICERAdj ICER INMB
0 S1NoCDAvg 371714 1.000000 8.009440e+08 2154.731836 0.000000 2154.731836 4.662659e+06 12.543673 0.000000 12.543673 0.000000 0.000000 NaN NaN 0.000000
1 S1a 371902 1.000000 9.004813e+08 2421.286568 0.000000 2421.286568 4.670660e+06 12.558846 0.000000 12.558846 266.554732 0.015174 17566.697170 16195.977030 492.138634
2 S1b 372140 1.000000 8.821882e+08 2370.581549 0.000000 2370.581549 4.668622e+06 12.545338 0.000000 12.545338 215.849713 0.001665 129616.172631 67189.317529 -132.584751
3 S1c 371919 1.000000 8.780531e+08 2360.871744 0.000000 2360.871744 4.668922e+06 12.553600 0.000000 12.553600 206.139907 0.009927 20765.538289 18321.203589 290.211077
4 S2NoCD 219759 0.591205 5.923556e+08 2695.478273 1372.698365 2154.731836 2.696622e+06 12.270813 12.938285 12.543673 0.000000 0.000000 NaN NaN 0.000000
5 S2a 221233 0.595170 6.444787e+08 2913.121915 1372.698365 2289.512160 2.718304e+06 12.287065 12.938285 12.550698 134.780324 0.007026 19183.267805 13391.766206 216.516247
6 S3NoCD 172154 0.463136 4.944031e+08 2871.865595 1536.083584 2154.731836 1.934054e+06 11.234440 13.673105 12.543673 0.000000 0.000000 NaN NaN 0.000000
7 S3a 172660 0.464497 5.270176e+08 3052.343287 1536.083584 2240.381513 1.941269e+06 11.243307 13.673105 12.544471 85.649677 0.000799 107210.068856 20354.541872 -45.704885
8 S3b 172336 0.463625 5.420539e+08 3145.331710 1536.083584 2282.171665 1.937179e+06 11.240712 13.673105 12.545386 127.439829 0.001713 74374.399669 43606.186231 -41.765303
9 S3c 172314 0.463566 5.350466e+08 3105.067680 1536.083584 2263.411383 1.939341e+06 11.254689 13.673105 12.552009 108.679547 0.008337 13035.890064 11516.850740 308.167889
10 S3d 172882 0.465094 5.427277e+08 3139.295863 1536.083584 2281.728207 1.943060e+06 11.239224 13.673105 12.541121 126.996370 -0.002551 -49778.541469 55904.120380 -254.557732
pd.read_csv('clinicalresults_5e5.csv')
Scenario PropAgents ProppPatientYears ProppCopdPYs SABAAll LAMAAll LAMALABAAll ICSLAMALABAAll MildpAgentAll ModeratepAgentAll SeverepAgentAll VerySeverepAgentAll NoCOPDpPYAll GOLD1pPYAll GOLD2pPYAll GOLD3pPYAll GOLD4pPYAll DiagnosedpPYAll
0 S1NoCDAvg 1.000000 1.000000 1.000000 0.019067 0.136964 0.152488 0.081139 0.411860 0.077802 0.130921 0.011304 0.842808 0.046012 0.048997 0.010899 0.001925 0.187126
1 S1a 1.000000 1.000000 1.000000 0.026545 0.157022 0.308491 0.092651 0.399933 0.075006 0.126251 0.010954 0.841922 0.046363 0.049408 0.011021 0.001948 0.464262
2 S1b 1.000000 1.000000 1.000000 0.021459 0.149159 0.241721 0.088141 0.404149 0.077388 0.129586 0.011345 0.842348 0.046148 0.049331 0.010965 0.001861 0.333770
3 S1c 1.000000 1.000000 1.000000 0.020317 0.144204 0.214241 0.085843 0.405836 0.075828 0.127832 0.010900 0.843357 0.045779 0.048706 0.010946 0.001846 0.290201
4 S2NoCD 0.591205 0.580239 0.708846 0.019067 0.136964 0.152488 0.081139 0.411860 0.077802 0.130921 0.011304 0.842808 0.046012 0.048997 0.010899 0.001925 0.187126
5 S2a 0.595170 0.584381 0.711030 0.022459 0.150276 0.233361 0.090027 0.402526 0.077061 0.129553 0.010940 0.842892 0.046209 0.048552 0.011044 0.001959 0.315256
6 S3NoCD 0.463136 0.415130 0.587070 0.019067 0.136964 0.152488 0.081139 0.411860 0.077802 0.130921 0.011304 0.842808 0.046012 0.048997 0.010899 0.001925 0.187126
7 S3a 0.464497 0.416543 0.589477 0.020197 0.140746 0.182354 0.084633 0.411228 0.078085 0.130900 0.011011 0.842680 0.045921 0.049123 0.010895 0.002053 0.232586
8 S3b 0.463625 0.415283 0.585001 0.022672 0.148789 0.234355 0.088940 0.400451 0.076676 0.128972 0.011083 0.843023 0.045768 0.049076 0.010821 0.001914 0.318064
9 S3c 0.463566 0.415885 0.582899 0.021257 0.146624 0.211534 0.088250 0.404950 0.078611 0.131189 0.011164 0.843306 0.045817 0.048867 0.010668 0.001974 0.277774
10 S3d 0.465094 0.416844 0.590338 0.020732 0.144100 0.197982 0.085758 0.407402 0.078907 0.131801 0.011216 0.842570 0.045941 0.049306 0.010897 0.001906 0.255943

CEplot

I initially focused on sall, reworking it to be more similar to Table 3. Whilst doing this, I assumed that S1NoCD was S0, as its values were closest and it appears first (as opposed to S2NoCD or S3NoCD).

Timings

import sys
sys.path.append('../')
from timings import calculate_times

# Minutes used prior to today
used_to_date = 193

# Times from today
times = [
    ('09.19', '09.41'),
    ('11.29', '11.34'),
    ('11.48', '12.00'),
    ('12.01', '12.24'),
    ('12.25', '12.30'),
    ('14.08', '14.53'),
    ('15.10', '15.14'),
    ('15.20', '15.25'),
    ('15.26', '15.38'),
    ('15.58', '16.08'),
    ('16.18', '16.53')]

calculate_times(used_to_date, times)
Time spent today: 178m, or 2h 58m
Total used to date: 371m, or 6h 11m
Time remaining: 2029m, or 33h 49m
Used 15.5% of 40 hours max
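For reference, a minimal sketch of what a helper like calculate_times could do (the real timings.py is the project's own; this is a hypothetical reimplementation that reproduces the figures above):

```python
def calculate_times(used_to_date, times, max_hours=40):
    """Sum today's (start, end) intervals given as 'HH.MM' strings and
    report usage against a fixed budget. Hypothetical reimplementation
    of the project's timings.calculate_times helper."""
    def to_minutes(hhmm):
        h, m = hhmm.split(".")
        return int(h) * 60 + int(m)

    today = sum(to_minutes(end) - to_minutes(start) for start, end in times)
    total = used_to_date + today
    remaining = max_hours * 60 - total
    print(f"Time spent today: {today}m, or {today // 60}h {today % 60}m")
    print(f"Total used to date: {total}m, or {total // 60}h {total % 60}m")
    print(f"Time remaining: {remaining}m, or {remaining // 60}h {remaining % 60}m")
    print(f"Used {total / (max_hours * 60):.1%} of {max_hours} hours max")
    return today, total
```

Called with used_to_date=193 and today's intervals, this gives 178m today and 371m total, matching the output above.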