Update: Be sure to check out the update to this post where we gratifyingly produce a huge chart with all the states on it.
Disclaimer: I am not an epidemiologist. Further to what Kevin said, I am not going to try to invent a new model, but rather to help estimate an already-existing one that was developed by epidemiologists.
The background on this is this post (and this technical notebook) by Kevin Systrom, founder of Instagram, which makes the case that we need to have accurate real-time estimates of \(R_t\), the reproduction number, and provides a methodology for estimating them.
While his approach is clear and awesome, I think that there are two major ways it can be improved:
I could well be wrong about #1 and have misunderstood his procedure at some point. However, the model below, which explicitly models \(R_t\) as a moving target, has wider uncertainty near the present day than Systrom’s model does.
The main equation we start with is
This is equation (2) in the reference paper
\[I_{t + \tau} = I_t e^{\tau\gamma(R_t - 1)}\]
Finance people will notice that this formula is the same as continuously compounded interest, \(PV e^{i\tau}\), with \(i\) the interest rate and \(\tau\) the time period. So we can think of \(\gamma(R_t-1)\) as our “interest rate”, which we call \(\theta\) below.
Here \(I_t\) is the number of infections at time \(t\), \(\tau\) is a number of days, \(\gamma\) is the reciprocal of the serial interval, and \(R_t\) is the effective reproduction number.
There is a major issue with \(I_t\). Specifically, we never observe it. All we observe are the number of new cases each day and the cumulative case count. And deaths, of course, but they won’t figure into the analysis below.
I am not sure if it boils down to the same issue or not, but Systrom says:
For epidemics that have \(R_t\gt1\) for a long time and then become under control (\(R_t\lt1\)), the posterior gets stuck. It cannot forget about the many days where \(R_t\gt1\) so eventually \(P(R_t|k)\) asymptotically approaches 1 when we know it’s well under 1. The authors note this in the paper as a footnote. Unfortunately this won’t work for us. The most critical thing to know is when we’ve dipped below the 1.0 threshold!
So, I propose to only incorporate the last \(m\) days of the likelihood function.
We have to do something similar. Here’s why:
If we’re modeling
\[I_{t + 1} \sim poisson(I_t e^{\gamma(R_t - 1)})\]
and we use cumulative cases covering the entire history as \(I_t\), then it will always be the case that
\[I_{t + 1} \geq I_t\]
Which implies
\[e^{\gamma(R_t - 1)} \geq 1\]
\[\gamma(R_t - 1) \geq 0\] \[R_t \geq 1\]
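Equivalently, at the level of the Poisson mean, solving for \(R_t\) makes the constraint explicit (it is the same inversion used at the end of the analysis to recover \(R_t\) from \(\theta\)):
\[R_t = 1 + \frac{1}{\gamma}\log\left(\frac{I_{t+1}}{I_t}\right)\]
A series that never decreases can therefore never produce an estimate below 1.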
I set \(W\) (WINDOW in my code) to 20.
So, to allow \(R_t \lt 1\), we need to allow \(I_{t + \tau} \lt I_t\), which we can accomplish the same way – by only considering the past \(W\) days in our analysis.
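As a toy illustration (not part of the analysis below, and using a made-up window of 3 days): when daily new cases are declining, the rolling sum declines too, so the windowed series can shrink and \(R_t \lt 1\) becomes estimable.
library(zoo)
new_cases = c(10, 9, 8, 7, 6, 5) # made-up declining daily counts
rollsum(new_cases, 3) # 27 24 21 18: the windowed "cumulative" series decreases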
Using a state space approach seems natural for this problem, since we’re dealing with streaming data that we want to use to recursively estimate a moving target.
Going back to our model setup:
\[I_{t + 1} \sim poisson(I_te^{\gamma(R_t - 1)})\]
This is exactly equivalent to a Poisson regression with an offset or exposure term equal to \(I_t\).
Pulling from Wikipedia: the Poisson model with an exposure term is set up as \[E[Y \mid \theta] = \text{exposure} \cdot e^{\theta}\] We then apply the log link function: \[\log(E[Y \mid \theta]) = \log(\text{exposure}) + \theta\] And here, we can substitute \(\gamma(R_t-1)\) for \(\theta\) and \(I_t\) for the exposure.
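To make the equivalence concrete, here is a minimal sketch of that static Poisson regression on simulated data; theta_true and the simulated counts are invented purely for illustration and play no role in the analysis below.
# Sketch only: a static Poisson regression with an offset/exposure term
set.seed(1)
theta_true = 0.05 # a constant stand-in for gamma * (Rt - 1)
I_t = round(1000 * exp(theta_true * 0:29)) # exposure series
I_tp1 = rpois(30, lambda = I_t * exp(theta_true)) # next-day counts
fit = glm(I_tp1 ~ 1, offset = log(I_t), family = poisson())
coef(fit) # the intercept should recover (roughly) theta_true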
The other change we’ll make is to call \(\gamma(R_t-1) = \theta\) (see the note to the right).
Then a state space model with just an intercept \(\theta_t\) and offset \(I_t\) will be:
\[I_{t + 1} \sim poisson(I_te^{\theta_t})\] \[\theta_t \sim N(\theta_{t-1}, \sigma)\] It is this second term in the model that allows us to explicitly model the process variance in the “interest rate” \(\gamma(R_t-1)\)
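To build intuition for this generative model, here is a minimal simulation sketch; the values of sigma and the starting count are made up purely for illustration.
# Simulate from the model above: a random-walk "interest rate" theta_t
# driving a Poisson count process (theta_0 = 0; sigma and I[1] are made up)
set.seed(123)
n = 50
sigma = 0.02
theta = cumsum(rnorm(n, mean = 0, sd = sigma)) # theta_t ~ N(theta_{t-1}, sigma)
I = numeric(n + 1)
I[1] = 100
for (t in 1:n) I[t + 1] = rpois(1, lambda = I[t] * exp(theta[t]))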
Fitting this model in R is straightforward:
KFAS is a package for “Exponential Family State Space Models in R”. We use zoo for its rollsum function.
library(tidyverse)
library(KFAS)
library(zoo)
#### data ####
url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
dat = read_csv(url)
#### constants ####
WINDOW = 20
SERIAL_INTERVAL = 4
GAMMA = 1 / SERIAL_INTERVAL
STATE = "New York"
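For reference, with a serial interval of 4 days these constants give
\[\gamma = \frac{1}{4} = 0.25, \qquad \theta = \gamma(R_t - 1) = 0.25\,(R_t - 1)\]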
All we’re doing here is rebuilding the “cumulative cases”, only considering the new cases in the last WINDOW days. We left-pad the series to make things line up correctly.
#### building the dataset ####
# rolling window: daily new cases, left-padded with zeros,
# then summed over the trailing WINDOW days
series =
dat %>%
filter(state == STATE) %>%
filter(cases > 0) %>%
pull(cases) %>% # cumulative case counts
diff %>% # daily new cases
{. + 1} %>% # add 1 to each day's count
{c(rep(0, WINDOW-1), .)} %>% # left pad
rollsum(., WINDOW) # rolling WINDOW-day sum
# dates (drop the first two so they line up with the modelled series below)
dates = dat %>%
filter(state == STATE) %>%
filter(cases > 0) %>%
pull(date) %>%
.[c(-1,-2)]
it here is equal to \(I_t\), and itp1 is \(I_{t+1}\).
it = series[-length(series)]
itp1 = series[-1]
u is the exposure parameter. Often you will see this entered into a glm as log(.) (see the margin note above). However, the KFAS documentation makes clear that it logs this term inside the function.
mod = SSModel(
itp1 ~ 1,
u = it,
distribution = "poisson"
)
Setting this element of Q to NA and calling fitSSM is KFAS’s way of estimating the parameter with maximum likelihood. It corresponds to \(\sigma\) above (the process variance in \(\theta\), which is our “interest rate” \(\gamma(R_t-1)\)).
mod$Q[1,1,1] = NA
mod_fit = fitSSM(mod, c(1,1))
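If you are curious about the estimated value of \(\sigma\) itself, it can be read back off the fitted model object; a quick check, assuming the objects above:
# estimated process standard deviation of theta (sigma above),
# read from the Q entry that fitSSM filled in
sqrt(mod_fit$model$Q[1, 1, 1])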
Once we’ve estimated \(\sigma\), we can recursively filter and smooth the \(\theta\)s
mod_fit_filtered = KFS(
mod_fit$model, filtering = c("state", "mean"), smoothing = c("state", "mean"))
We can inspect how the model fits by comparing our one-step-ahead forecasts of cases with actuals (on a log scale):
It’s important to note that these are true forecasts of the day ahead. They are not smoothed values (which would be trivial to extract as well). The reason is that we want to estimate the true uncertainty a decision-maker would face, which concerns the estimate of \(R_t\) in the future.
tibble(predictions = mod_fit_filtered$m,
actuals = series[-1]) %>%
mutate_all(~ c(NA, diff(.x))) %>%
mutate(date = dates) %>%
.[-1, ] %>%
gather(-date, key = series, value = value) %>%
ggplot() +
aes(x = date, y = value, color = series) +
geom_line() +
scale_y_log10("New case count") +
labs(x = "") +
scale_color_brewer("", type = "qual", palette = 1) +
theme(legend.position = c(0.2, 0.8))
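As noted in the margin, the smoothed (two-sided) values would be trivial to extract as well; a minimal sketch, assuming the KFS object above:
# smoothed estimates, as opposed to the one-step-ahead forecasts plotted above
smoothed_theta = mod_fit_filtered$alphahat[, 1] # smoothed state (theta_t)
smoothed_mean = mod_fit_filtered$muhat # smoothed mean of the observations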
Satisfied that the model fits well, we can proceed to estimate where \(R_t\) is:
Here we extract the estimates of \(\theta\) with a traditional 95% confidence interval
theta = tibble(
mean_estimate = mod_fit_filtered$a[, 1],
upper = mean_estimate + 1.96 * sqrt(mod_fit_filtered$P[1,1,]),
lower = mean_estimate - 1.96 * sqrt(mod_fit_filtered$P[1,1,])
)[-1, ] # throw away the initial observation
We now have to invert \(\theta = \gamma(R_t-1)\):
rt = theta / GAMMA + 1
And voilà, we can now plot our estimate of \(R_t\), along with its associated uncertainty, over time:
rt %>%
mutate(date = dates) %>%
filter(date > lubridate::ymd("20200301")) %>%
ggplot() +
aes(x = date, y = mean_estimate, ymin = lower, ymax = upper) +
geom_line(color = "grey") +
geom_ribbon(alpha = .5, fill = "grey") +
geom_hline(yintercept = 1) +
labs(y = "Estimate of Rt", x = "") +
scale_y_continuous(breaks = c(1,2,3,4)) +
coord_cartesian(ylim=c(NA, 5)) +
NULL
Our final estimate of \(R_t\) is 1.13, with a standard error of 0.26.
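Working backwards through the transformation with \(\gamma = 1/4\), this corresponds to \[\hat\theta = \gamma(\hat R_t - 1) = 0.25 \times 0.13 \approx 0.033, \qquad \mathrm{se}(\hat\theta) = 0.25 \times 0.26 = 0.065\]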
This is very similar to Systrom’s estimate for New York State:
This is due to the inclusion of \[\theta_t \sim N(\theta_{t-1}, \sigma)\] in our model specification
The main difference, however, is that the uncertainty in the state space version of this model never collapses as far as it does in Systrom’s model. This is due to the fact that we are assuming that \(\gamma(R_t-1)\) (and by extension, \(R_t\)) is a moving target.
Further extensions of the state space approach could allow for including additional structure in the model.
How can we do this? It’s fairly trivial using known state space approaches.
Let’s add a trend as an example:
We have the mean of our Poisson process as \(\lambda = e^\theta\) (times the exposure \(I_t\)). Our existing model assumed that \(\theta\) was simply a varying intercept, but there is no reason that we couldn’t set it up as a time-varying trend:
\[\theta_t \sim N(\mu_t, \sigma_\theta)\] \[\mu_t \sim N(\mu_{t-1} + \rho_{t-1}, \sigma_\mu)\] \[\rho_t \sim N(\rho_{t-1}, \sigma_\rho)\]
with \(\rho\) as the slope that varies in time (notice that it’s getting added to the level, \(\mu\), at every time step)
Here’s how to do that in R with KFAS:
This doesn’t noticeably improve this model – so no need to include it. If we had longer time series, or were pooling different time series, this may have a larger effect.
mod2 = SSModel(
itp1 ~ SSMtrend(2, Q = list(NA, NA)), # local linear trend: level + slope, each with its own variance to estimate
u = it,
distribution = "poisson"
)
mod2_fit = fitSSM(mod2, c(1,1))
mod2_fit_filtered = KFS(mod2_fit$model, filtering = "state")
theta2 = tibble(
mean_estimate = mod2_fit_filtered$a[, 1],
upper = mean_estimate + 1.96 * sqrt(mod2_fit_filtered$P[1,1,]),
lower = mean_estimate - 1.96 * sqrt(mod2_fit_filtered$P[1,1,])
)[-1, ] # throw away the initial observation
rt2 = theta2 / GAMMA + 1
rt2 %>%
mutate(date = dates) %>%
filter(date > lubridate::ymd("20200301")) %>%
ggplot() +
aes(x = date, y = mean_estimate, ymin = lower, ymax = upper) +
geom_line(color = "grey") +
geom_ribbon(alpha = .5, fill = "grey") +
geom_hline(yintercept = 1) +
labs(y = "Estimate of Rt", x = "") +
scale_y_continuous(breaks = c(1,2,3,4)) +
coord_cartesian(ylim=c(NA, 5)) +
NULL
Thoughts, questions?
tom at gradientmetrics dot com