Using State Space Models to Estimate Rt

Thomas P. Vladeck

2020-04-16

Update: Be sure to check out the update to this post where we gratifyingly produce a huge chart with all the states on it.

Disclaimer: I am not an epidemiologist. Further to what Kevin said, I am not going to try to invent a new model, but rather to help estimate an already-existing one that was developed by epidemiologists.


The background on this is this post (and this technical notebook) by Kevin Systrom, founder of Instagram, which makes the case that we need accurate real-time estimates of \(R_t\), the reproduction number, and provides a methodology for estimating it.

While his approach is clear and awesome, I think that there are two major ways it can be improved:

  1. Systrom’s procedure to estimate \(R_t\) seemed to assume a static \(R_t\), of which we get successively better estimates with additional data. Ideally we would assume that \(R_t\) can evolve over time (indeed, this seemed to be implicitly assumed in the post), but to accommodate that we need to include some kind of process variance
  2. I had a hard time going through Systrom’s estimation procedure. I am sure it’s right, but it’s also purpose-built for just this problem. If my math is right below, we can use the tried-and-true Kalman Filter to recursively estimate \(R_t\). This has the advantage of using well-tested approaches; it also brings with it a suite of additional enhancements (like modeling trend, seasonal effects, etc.)

I could well be wrong about #1, and may have misunderstood his procedure at some point. However, the model below, which explicitly models \(R_t\) as a moving target, has wider uncertainty near the present day than Systrom’s model does.

The main equation we start with is this, which is equation (2) in the reference paper:

\[I_{t + \tau} = I_t e^{\tau\gamma(R_t - 1)}\]

Finance people will notice that this formula is the same as continuously compounded interest, \(PV e^{i\tau}\), with \(i\) the interest rate and \(\tau\) the time period. So we can think of \(\gamma(R_t-1)\) as our “interest rate”, which we call \(\theta\) below.
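
To make the “interest rate” analogy concrete, here is a quick numeric illustration (using the \(\gamma = 1/4\) we adopt in the code below): with a serial interval of 4 days and \(R_t = 2\),

\[\theta = \gamma(R_t - 1) = \tfrac{1}{4}(2 - 1) = 0.25, \qquad e^{0.25} \approx 1.28\]

so the case count would grow by roughly 28% per day.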

Where \(I_t\) is the number of cases at time \(t\), \(\tau\) is the number of days ahead, \(\gamma\) is the reciprocal of the serial interval, and \(R_t\) is the effective reproduction number.

There is a major issue with \(I_t\): we never observe it. All we observe are the number of new cases each day, and the cumulative case count. And deaths, of course, but those won’t figure into the analysis below.

I am not sure if it boils down to the same issue or not, but Systrom says:

For epidemics that have \(R_t\gt1\) for a long time and then become under control (\(R_t\lt1\)), the posterior gets stuck. It cannot forget about the many days where \(R_t\gt1\) so eventually \(P(R_t|k)\) asymptotically approaches 1 when we know it’s well under 1. The authors note this in the paper as a footnote. Unfortunately this won’t work for us. The most critical thing to know is when we’ve dipped below the 1.0 threshold!

So, I propose to only incorporate the last \(m\) days of the likelihood function.

We have to do something similar. Here’s why:

If we’re modeling

\[I_{t + 1} \sim poisson(I_te^{\gamma(R_t - 1)})\]

and we use cumulative cases covering the entire history as \(I_t\), then it will always be the case that

\[I_{t + 1} \geq I_t\]

which implies

\[e^{\gamma(R_t - 1)} \geq 1\] \[\gamma(R_t - 1) \geq 0\] \[R_t \geq 1\]

So, to allow \(R_t \lt 1\), we need to allow \(I_{t + \tau} \lt I_t\), which we can accomplish the same way – by only considering the past \(W\) days in our analysis. I set \(W\) (WINDOW in my code) to 20.
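
Here is a toy illustration of why the windowing helps (the numbers are made up; it uses the same rollsum function from zoo that we use below). A true cumulative series can never decrease, but the windowed version can:

new_cases = c(1, 2, 4, 8, 4, 2, 1)   # daily new cases rise, then fall
cumsum(new_cases)                    # 1  3  7 15 19 21 22 -- never decreases
zoo::rollsum(new_cases, 3)           # 7 14 16 14  7       -- rises, then falls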

Using a state space approach seems natural for this problem, since we’re dealing with streaming data that we want to use to recursively estimate a moving target.

Going back to our model setup:

\[I_{t + 1} \sim poisson(I_te^{\gamma(R_t - 1)})\]

This is exactly equivalent to a Poisson regression with an offset, or exposure, term equal to \(I_t\).

Pulling from Wikipedia, the Poisson model is set up as: \[E[Y | \theta] = exposure \cdot e^{\theta}\] We then apply the log link function: \[\log(E[Y | \theta]) = \log(exposure) + \theta\] And here, we can substitute \(\gamma(R_t-1)\) for \(\theta\) and \(I_t\) for \(exposure\).
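
As a quick sanity check on this equivalence (a sketch, not part of the analysis; it, itp1, and GAMMA are all defined further below), an intercept-only Poisson GLM with \(\log(I_t)\) as the offset recovers a single, static \(\theta\):

# static-theta version: intercept-only Poisson regression with
# log(I_t) as the offset (it, itp1, GAMMA are defined further below)
glm_fit = glm(itp1 ~ 1, offset = log(it), family = poisson())
theta_hat = coef(glm_fit)[["(Intercept)"]]  # pooled estimate of gamma * (Rt - 1)
theta_hat / GAMMA + 1                       # implied (static) Rt

The state space model below is this same regression, except that the intercept is allowed to drift over time.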

The other change we’ll make is to call \(\gamma(R_t-1) = \theta\) (see the note above).

Then a state space model with just an intercept \(\theta_t\) and offset \(= I_t\) will be:

\[I_{t + 1} \sim poisson(I_te^{\theta_t})\] \[\theta_t \sim N(\theta_{t-1}, \sigma)\]

It is this second equation that allows us to explicitly model the process variance in the “interest rate” \(\gamma(R_t-1)\).
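
In standard state space notation (a clarifying aside, not from the reference paper), this is just a local level model for \(\theta_t\) under a Poisson observation density:

\[\theta_t = \theta_{t-1} + \eta_t, \qquad \eta_t \sim N(0, \sigma^2)\]

The transition “matrix” is the scalar 1, and \(\sigma^2\) is the single unknown element of the state covariance \(Q\) that we estimate below.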

Doing this in R is straightforward:

KFAS is a package for “Exponential Family State Space Models in R”; we use zoo for its rollsum function.

library(tidyverse)
library(KFAS)
library(zoo)
#### data ####

url = 'https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv'
dat = read_csv(url)
#### constants ####

WINDOW = 20                  # only use the last WINDOW days of new cases
SERIAL_INTERVAL = 4          # assumed serial interval, in days
GAMMA = 1 / SERIAL_INTERVAL
STATE = "New York"

All we’re doing here is rebuilding the “cumulative cases” series, counting only the new cases within the last WINDOW days. We left-pad the series to make things line up correctly.

#### building the dataset #### 
# rolling window
series = 
  dat %>% 
  filter(state == STATE) %>%
  filter(cases > 0) %>% 
  pull(cases) %>%                  # cumulative case counts
  diff %>%                         # daily new cases
  {. + 1} %>%                      # add 1 to each day's count (avoids zeros)
  {c(rep(0, WINDOW - 1), .)} %>%   # left-pad so the rolling sum lines up
  rollsum(., WINDOW)               # windowed "cumulative" cases

# dates
dates = dat %>% 
  filter(state == STATE) %>%
  filter(cases > 0) %>% 
  pull(date) %>% 
  .[c(-1, -2)]                     # drop two dates to align with itp1 below

The vector it here is equal to \(I_t\), and itp1 is \(I_{t+1}\):

it = series[-length(series)]
itp1 = series[-1]

u is the exposure parameter. Often you will see this entered into a GLM as log(.) (see the margin note above); however, the KFAS documentation makes clear that this term is logged inside the function.

mod = SSModel(
  itp1 ~ 1, 
  u = it,
  distribution = "poisson"
)


Setting an element of Q to NA is KFAS’s way of flagging it for estimation by maximum likelihood. Q here corresponds to \(\sigma\) above (the process variance in \(\theta\), which is our “interest rate” \(\gamma(R_t-1)\)).

mod$Q[1,1,1] = NA                # flag the process variance for estimation
mod_fit = fitSSM(mod, c(1,1))    # fit by maximum likelihood


Once we’ve estimated \(\sigma\), we can recursively filter and smooth the \(\theta\)s:

mod_fit_filtered = KFS(
  mod_fit$model, c("state", "mean"), c("state", "mean"))


We can inspect how the model fits by comparing our one-step-ahead forecasts of cases with actuals (on a log scale):

It’s important to note that these are true forecasts of the day ahead; they are not smoothed values (which would be trivial to extract as well). The reason is that we want to estimate the true uncertainty that a decision-maker would face, which concerns the estimate of \(R_t\) in the future.

tibble(predictions = mod_fit_filtered$m,   # one-step-ahead predicted means
       actuals = series[-1]) %>% 
  mutate_all(~ c(NA, diff(.x))) %>%        # first-difference both series
  mutate(date = dates) %>% 
  .[-1, ] %>%                              # drop the leading NA row
  gather(-date, key = series, value = value) %>% 
  ggplot() + 
  aes(x = date, y = value, color = series) + 
  geom_line() + 
  scale_y_log10("New case count") + 
  labs(x = "") + 
  scale_color_brewer("", type = "qual", palette = 1) + 
  theme(legend.position = c(0.2, 0.8))


Satisfied that the model fits well, we can proceed to estimate where \(R_t\) is:

Here we extract the estimates of \(\theta\) with a traditional 95% confidence interval

theta  = tibble(
  mean_estimate = mod_fit_filtered$a[, 1],
  upper = mean_estimate + 1.96 * sqrt(mod_fit_filtered$P[1,1,]),
  lower = mean_estimate - 1.96 * sqrt(mod_fit_filtered$P[1,1,])
)[-1, ] # throw away the initial observation


We now have to invert \(\theta = \gamma(R_t-1)\), giving \(R_t = \theta/\gamma + 1\):

rt = theta / GAMMA + 1


And voilà, we can now plot our estimate of \(R_t\), along with its associated uncertainty, over time:

rt %>% 
  mutate(date = dates) %>% 
  filter(date > lubridate::ymd("20200301")) %>% 
  ggplot() + 
  aes(x = date, y = mean_estimate, ymin = lower, ymax = upper) + 
  geom_line(color = "grey") + 
  geom_ribbon(alpha = .5, fill = "grey") + 
  geom_hline(yintercept = 1) + 
  labs(y = "Estimate of Rt", x = "") + 
  scale_y_continuous(breaks = c(1,2,3,4)) +
  coord_cartesian(ylim=c(NA, 5)) + 
  NULL

Our final estimate is 1.13, with a standard error of 0.26.
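
For reference, here is a sketch of one way to recover those headline numbers from the filtered model (it takes the last one-step-ahead state estimate and maps it through \(R_t = \theta/\gamma + 1\)):

# last one-step-ahead state estimate and its standard error
last_theta    = tail(mod_fit_filtered$a[, 1], 1)
last_theta_se = sqrt(tail(mod_fit_filtered$P[1, 1, ], 1))

last_theta / GAMMA + 1     # point estimate of Rt
last_theta_se / GAMMA      # its standard error: Rt is linear in theta, so it scales by 1/gamma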

This is very similar to Systrom’s estimate for New York State.

The main difference, however, is that the uncertainty in the state space version of this model never collapses as far as it does in Systrom’s model. This is because we are assuming that \(\gamma(R_t-1)\) (and by extension, \(R_t\)) is a moving target; it is the inclusion of \[\theta_t \sim N(\theta_{t-1}, \sigma)\] in our model specification that has this effect.
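
There is a one-line way to see why (writing \(\sigma^2\) for the process variance): in the Kalman filter’s prediction step for our random-walk state, the state variance is propagated as

\[P_{t|t-1} = P_{t-1|t-1} + \sigma^2\]

so as long as \(\sigma^2\) is positive, fresh uncertainty is injected at every step, and the filtered variance stays bounded away from zero no matter how much data we accumulate.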

Further extensions of the state space approach could allow for including trend, seasonal effects, and more. How can we do this? It’s fairly trivial using known state space approaches.

Let’s add a trend as an example:

We have the mean of our Poisson process as \(\lambda = e^\theta\) (up to the exposure term \(I_t\)). Our existing model assumed that \(\theta\) was simply a varying intercept, but there is no reason that we couldn’t set it up as a time-varying trend:

\[\theta_t \sim N(\mu_t, \sigma_\theta)\] \[\mu_t \sim N(\mu_{t-1} + \rho_{t-1}, \sigma_\mu)\] \[\rho_t \sim N(\rho_{t-1}, \sigma_\rho)\]

with \(\rho\) as the slope that varies in time (notice that it’s getting added to the level, \(\mu\), at every time step)

Here’s how to do that in R and KFAS:

This doesn’t noticeably improve the model, so there’s no need to include it here. If we had longer time series, or were pooling different time series, it might have a larger effect.

mod2 = SSModel(
  itp1 ~ SSMtrend(2, Q = list(NA, NA)),  # degree 2: local linear trend (level + slope)
  u = it,
  distribution = "poisson"
)

mod2_fit = fitSSM(mod2, c(1,1))          # estimate both process variances by maximum likelihood

mod2_fit_filtered = KFS(mod2_fit$model, "state")

theta2  = tibble(
  mean_estimate = mod2_fit_filtered$a[, 1],
  upper = mean_estimate + 1.96 * sqrt(mod2_fit_filtered$P[1,1,]),
  lower = mean_estimate - 1.96 * sqrt(mod2_fit_filtered$P[1,1,])
)[-1, ] # throw away the initial observation

rt2 = theta2 / GAMMA + 1

rt2 %>% 
  mutate(date = dates) %>% 
  filter(date > lubridate::ymd("20200301")) %>% 
  ggplot() + 
  aes(x = date, y = mean_estimate, ymin = lower, ymax = upper) + 
  geom_line(color = "grey") + 
  geom_ribbon(alpha = .5, fill = "grey") + 
  geom_hline(yintercept = 1) + 
  labs(y = "Estimate of Rt", x = "") + 
  scale_y_continuous(breaks = c(1,2,3,4)) +
  coord_cartesian(ylim=c(NA, 5)) + 
  NULL

Thoughts, questions?