library(EpiNow2)
library(posterior)
library(bayesplot)
library(data.table)
library(scoringutils)
library(dplyr)
library(ggplot2)
How to evaluate the goodness of fit of an EpiNow2 model run
Introduction
This is a quick guide on how to assess or evaluate the output of an EpiNow2 model run.
{EpiNow2} uses Stan as its backend for model specification and inference, so a fitted model can be assessed using the same tools and techniques used for any Stan model. The package provides a convenient interface for running the models, but the underlying model is a Stan model, and the returned fit object is either a <stanfit> or a CmdStanMCMC object, depending on the backend used ({rstan} or {cmdstanr}). For example, you can use the {bayesplot} and {posterior} packages to visualize and summarize the model diagnostics.
Evaluating an {EpiNow2} model fit requires a holistic approach: checking the MCMC diagnostics as well as the performance of the forecast against observed data. We will cover both here.
Let’s start by setting up and running an example model, assessing the MCMC diagnostics, and then evaluating the forecast performance.
Running an example model
Input preparation
This will involve: loading required packages, setting up the model parameters, and running the model.
# Set number of cores for parallel processing.
options(mc.cores = min(4, parallel::detectCores() - 1))

# Set example generation time, incubation period, and reporting delay.
# In practice, these should use estimates from the literature or be estimated from data.
generation_time <- Gamma(
  shape = Normal(1.3, 0.3),
  rate = Normal(0.37, 0.09),
  max = 14
)
incubation_period <- LogNormal(
  meanlog = Normal(1.6, 0.06),
  sdlog = Normal(0.4, 0.07),
  max = 14
)
reporting_delay <- LogNormal(mean = 2, sd = 1, max = 10)

# Use example case data.
reported_cases <- example_confirmed[1:60]

# Plot the data
cases_plots <- ggplot(reported_cases, aes(x = date, y = confirm)) +
  geom_col() +
  labs(title = "Confirmed cases", x = "Date", y = "Cases") +
  theme_minimal()
cases_plots
Fitting the model
Estimate Rt and nowcast/forecast cases by date of infection.
out <- epinow(
  data = reported_cases,
  generation_time = gt_opts(generation_time),
  rt = rt_opts(prior = LogNormal(mean = 2, sd = 0.1)),
  delays = delay_opts(incubation_period + reporting_delay)
)
WARN [2025-08-19 12:21:14] epinow: partial match of 'length' to 'lengths' - $, consecutive, length
# Plot the model fit
plot(out)
Assessing model diagnostics
We’ll first extract the Stan fit object and compute diagnostics. The epinow() function returns a list of model outputs, including the Stan fit object, which is what we need to assess the fit. Stan provides diagnostic metrics for assessing convergence and sampling quality, and the Stan ecosystem has a rich set of tools for working with them, including the bayesplot and posterior packages, which can visualize and summarize the model diagnostics. For an in-depth look at MCMC diagnostics, see the bayesplot R package vignette on visual MCMC diagnostics. You can also do posterior predictive checks using the bayesplot package.
Diagnostics include:
- Divergent transitions. These should be minimized; ideally 0. Divergent transitions can be reduced by tuning Stan controls like adapt_delta, max_treedepth, and stepsize in stan_opts(). If there are divergent transitions, increase adapt_delta (e.g., 0.95 or 0.99), as shown in the sketch after this list. See an in-depth explanation at https://mc-stan.org/learn-stan/diagnostics-warnings.html.
- Rhat. These should be close to 1, indicating good mixing between chains; as a rule of thumb, Rhat values should be less than 1.05 (see ?rstan::Rhat).
- Treedepth. The number of iterations hitting the maximum treedepth should be low; many iterations at the maximum indicate the sampler is exploring the posterior inefficiently.
- ESS values. These should be high; low values indicate insufficient sampling. In particular, both bulk-ESS and tail-ESS should be at least ~100 per chain to ensure reliable posterior quantile estimates (see ?rstan::ess_bulk).
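For example, a higher adapt_delta can be passed to the sampler when calling epinow(). Below is a minimal sketch, assuming the model objects defined above and that stan_opts() forwards a control list to the sampler (as documented for the sampling options in recent {EpiNow2} releases); the specific values are illustrative, not a recommendation.

# Re-run the model with a stricter sampler configuration to reduce divergent
# transitions, at the cost of longer run times (values are illustrative).
out_strict <- epinow(
  data = reported_cases,
  generation_time = gt_opts(generation_time),
  rt = rt_opts(prior = LogNormal(mean = 2, sd = 0.1)),
  delays = delay_opts(incubation_period + reporting_delay),
  stan = stan_opts(control = list(adapt_delta = 0.99, max_treedepth = 12))
)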
# Extract the Stan fit object
fit <- out$estimates$fit

# Extract the NUTS sampler diagnostics
np <- bayesplot::nuts_params(fit)
divergence_data <- subset(np, Parameter == "divergent__")
treedepth_data <- subset(np, Parameter == "treedepth__")

# Summarise convergence measures (Rhat, bulk and tail ESS) plus basic ESS
posterior_summary <- posterior::summarize_draws(
  fit, c(posterior::default_convergence_measures(), "ess_basic")
) |> subset(variable != "lp__")

# Worst-case (minimum) effective sample sizes across parameters
fit_ess_basic <- min(posterior_summary$ess_basic, na.rm = TRUE)
fit_ess_bulk <- min(posterior_summary$ess_bulk, na.rm = TRUE)
fit_ess_tail <- min(posterior_summary$ess_tail, na.rm = TRUE)

# Collect the diagnostics into a single table
diagnostics <- data.table(
  "samples" = nrow(np) / length(unique(np$Parameter)),
  "max_rhat" = round(max(posterior_summary$rhat, na.rm = TRUE), 3),
  "divergent_transitions" = sum(divergence_data$Value),
  "per_divergent_transitions" = mean(divergence_data$Value),
  "max_treedepth" = max(treedepth_data$Value),
  "ess_basic" = fit_ess_basic,
  "ess_bulk" = fit_ess_bulk,
  "ess_tail" = fit_ess_tail
)
diagnostics[, no_at_max_treedepth := sum(treedepth_data$Value == max_treedepth)
            ][, per_at_max_treedepth := no_at_max_treedepth / samples]

knitr::kable(diagnostics)
samples | max_rhat | divergent_transitions | per_divergent_transitions | max_treedepth | ess_basic | ess_bulk | ess_tail | no_at_max_treedepth | per_at_max_treedepth |
---|---|---|---|---|---|---|---|---|---|
2000 | 1.01 | 0 | 0 | 9 | 1325.989 | 1277.945 | 1109.327 | 837 | 0.4185 |
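The same fit object can also be inspected visually with {bayesplot}. Here is a minimal sketch, assuming the {rstan} backend (so that fit is a <stanfit> object); the choice of what to trace is illustrative.

# Plot the distribution of Rhat values across all model parameters.
bayesplot::mcmc_rhat(bayesplot::rhat(fit))

# Trace plot of the log-posterior as a quick visual check of chain mixing;
# replace "lp__" with parameters of interest (see posterior_summary$variable).
bayesplot::mcmc_trace(fit, pars = "lp__")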
Evaluating forecast performance
Forecast performance can be evaluated by comparing the forecast of reported cases with the observed cases using proper scoring rules (Gneiting and Raftery 2007; Carvalho 2016), such as the Continuous Ranked Probability Score (CRPS) and the Weighted Interval Score (WIS). These are available via the scoringutils::score() function.
The scoringutils::score() function requires a dataset that contains at least the following columns:
- date: the date of the forecast
- prediction: the forecast values
- true_value: the observed values
- quantile: if evaluating quantile forecasts, the quantile level of the forecasted values (e.g., the median and the lower and upper bounds), in ascending order.
Let’s extract and prepare the forecasts and corresponding observed cases for the evaluation.
EpiNow2’s epinow() function returns a list of model outputs, including the forecasts. The forecasts of reported cases are stored in the out$estimates$summarised data.table, in the rows where type is "forecast" and variable is "reported_cases". These contain the forecasted values for each date, including the median and quantiles (lower and upper bounds).
# Extract the forecasts
forecasts <- out$estimates$summarised[variable == "reported_cases"][type == "forecast", ] %>%
  select(-c(mean, sd, variable, strat, type))
We’ll also extract the corresponding observed data and reshape the quantiles into a format suitable for use with scoringutils. The forecasts data.table contains the forecasted values for each date, with the median in one column and the quantiles stored in columns named lower_20, upper_20, lower_50, and so on, where the number indicates the width of the credible interval. For example, lower_90 is the lower bound of the 90% credible interval, which corresponds to the 5% quantile.
# Extract observed cases for the same period
obs_data <- example_confirmed[date >= min(forecasts$date) & date <= max(forecasts$date)]

# Combine forecasts with observed cases
eval_dt <- merge(forecasts, obs_data, by = "date")[, true_value := confirm][, confirm := NULL]
eval_dt
# Melt the data to long format
cols_to_melt <- c("median", grep("^(lower|upper)_", names(eval_dt), value = TRUE))
eval_long <- melt(
  eval_dt,
  id.vars = setdiff(names(eval_dt), cols_to_melt),
  measure.vars = cols_to_melt,
  variable.name = "prediction",
  value.name = "value"
)

# Prepare the evaluation data: convert the interval labels (e.g., lower_90)
# to numeric quantile levels (e.g., 0.05) and store the forecast values in
# the prediction column.
eval_long[, quantile := fifelse(
  prediction == "median", 0.5,
  fifelse(
    grepl("^lower_", prediction),
    (1 - as.numeric(sub("lower_", "", prediction)) / 100) / 2,
    fifelse(
      grepl("^upper_", prediction),
      (1 + as.numeric(sub("upper_", "", prediction)) / 100) / 2,
      NA_real_
    )
  )
)][, prediction := value][, value := NULL]
# Sort the data by quantile
setorder(eval_long, quantile) %>%
head(10)
date true_value prediction quantile
<Date> <num> <num> <num>
1: 2020-04-22 2729 1526.00 0.05
2: 2020-04-23 3370 1623.90 0.05
3: 2020-04-24 2646 1911.95 0.05
4: 2020-04-25 3021 1599.00 0.05
5: 2020-04-26 2357 1701.95 0.05
6: 2020-04-27 2324 1507.00 0.05
7: 2020-04-28 1739 1237.85 0.05
8: 2020-04-22 2729 1895.00 0.25
9: 2020-04-23 3370 2056.75 0.25
10: 2020-04-24 2646 2420.00 0.25
Now we can use the scoringutils package to compute the scores for the forecasts. By default, a range of scores will be computed, including the interval_score, dispersion, underprediction, overprediction, coverage_deviation, bias, and ae_median.
scored_scores <- scoringutils::score(eval_long) %>%
  summarise_scores()

knitr::kable(scored_scores, digits = 3)
date | model | interval_score | dispersion | underprediction | overprediction | coverage_deviation | bias | ae_median |
---|---|---|---|---|---|---|---|---|
2020-04-22 | Unspecified model | 353.429 | 91.429 | 262.000 | 0.000 | -0.171 | -0.9 | 542.0 |
2020-04-23 | Unspecified model | 663.620 | 105.477 | 558.143 | 0.000 | -0.171 | -0.9 | 993.5 |
2020-04-24 | Unspecified model | 168.424 | 138.710 | 0.000 | 29.714 | 0.114 | 0.5 | 178.0 |
2020-04-25 | Unspecified model | 365.889 | 129.789 | 236.100 | 0.000 | -0.171 | -0.9 | 603.5 |
2020-04-26 | Unspecified model | 249.103 | 150.246 | 0.000 | 98.857 | 0.114 | 0.5 | 356.0 |
2020-04-27 | Unspecified model | 166.162 | 149.591 | 0.000 | 16.571 | 0.400 | 0.2 | 116.0 |
2020-04-28 | Unspecified model | 230.661 | 138.061 | 0.000 | 92.600 | 0.114 | 0.5 | 321.0 |
Interpretation of scores
The table above summarises the forecast performance by metric. Note that these scores are hard to interpret in isolation; they only become meaningful when compared against the scores of other models. A common practice is to define a baseline model (Stapper and Funk 2025), such as a naive model, and compare the scores of the model of interest against that baseline, as sketched below.
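As an illustration only (not part of the original analysis), one possible baseline is to carry the last observation before the forecast window forward and use it at every quantile level, then score both models on the same data; the model labels and the baseline construction here are assumptions.

# Naive baseline: carry the last pre-forecast observation forward and use it
# as the predicted value at every quantile level (a degenerate forecast).
last_obs <- example_confirmed[date < min(forecasts$date)][.N, confirm]
baseline <- copy(eval_long)[, `:=`(model = "naive baseline", prediction = last_obs)]

# Label the EpiNow2 forecasts and score both models together, summarising by model.
combined <- rbind(copy(eval_long)[, model := "EpiNow2"], baseline)
scoringutils::score(combined) %>%
  summarise_scores(by = "model") %>%
  knitr::kable(digits = 3)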
Future updates
In future updates, I will explore more advanced techniques for model diagnostics and evaluation, including posterior predictive checks and cross-validation, using packages like loo, shinystan, and bayesplot.