As anyone who’s read this blog recently can surmise, I’m pretty interested in how this election turned out, and have been doing some exploratory research into the makeup of our electorate. Over the past few weeks I’ve taken the analysis a step further and built a sophisticated regression that goes as far as anything I’ve seen to unpack what happened.
Background on probability distributions
(Skip this section if you’re familiar with the beta and binomial distributions.)
Before I get started explaining how the model works, we need to discuss some important probability distributions.
The first one is easy: the coin flip. In math, we call a coin flip a Bernoulli trial, but they’re the same thing. A flip of a fair coin is what a mathematician would call a “Bernoulli trial with p = 0.5”. The “p = 0.5” part simply means that the coin has a 50% chance of landing heads (and 50% chance of landing tails). But in principle you can weight coins however you want, and you can have Bernoulli trials with p = 0.1, p = 0.75, p = 0.9999999, or whatever.
Now let’s imagine we flip one of these coins 100 times. What is the probability that it comes up heads 50 times? Even if the coin is fair (p = 0.5), just by random chance it may come up heads only 40 times, or may come up heads more than you’d expect, like 60 times. It is even possible for it to come up heads 100 times in a row, although the odds of that are vanishingly small.
The distribution of the number of times the coin comes up heads is called a binomial distribution. A probability distribution assigns a probability to every possible outcome. In the case of 100 coin flips, the binomial distribution will assign a probability to every whole number from 0 to 100 (which are all the possible numbers of times the coin could come up heads), and all of these probabilities will sum to 1.
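To make the coin-flip numbers concrete, here is a minimal Python sketch that computes the binomial probabilities for 100 flips of a fair coin. The function name `binomial_pmf` is just a label chosen for this illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k heads in n flips of a coin with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n_flips = 100
# One probability for every possible number of heads, 0 through 100.
fair_coin = [binomial_pmf(k, n_flips, 0.5) for k in range(n_flips + 1)]

print(f"P(exactly 50 heads)  = {fair_coin[50]:.4f}")
print(f"P(exactly 40 heads)  = {fair_coin[40]:.4f}")
print(f"P(exactly 100 heads) = {fair_coin[100]:.3e}")
print(f"sum over 0..100      = {sum(fair_coin):.6f}")
```

Even the single most likely outcome, exactly 50 heads, only happens about 8% of the time; the rest of the probability is spread over the neighboring counts.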
Now let’s go one step further. Let’s imagine you have a big bag of different coins, all with different weights. Let’s imagine we grab a bunch of coins out of the bag and then flip them. How can we model the distribution of the number of times those coins will come up heads?
First, we need to think about the distribution of possible weights the coins have. Let’s imagine we line up the coins from the lowest weight to the highest weight, and stack coins with the same weight on top of each other. The relative “heights” of each stack tell us how likely it is that we grab a coin with that weight.
Now we basically have something called the beta distribution, which is a family of distributions that tell us how likely it is we’ll get a number between 0 and 1. Beta distributions are very flexible, and they can look like any of these shapes and almost everything in between:
Taken from Bruce Hardie: http://www.brucehardie.com/talks/cba_tut_art_16_HO.pdf
So if you had a bag like the upper left, most of the coins would be weighted to come up tails, and if you had a bag like the lower right, most of the coins would be weighted to come up heads; if you had a bag like the lower left, the coins would either be weighted very strongly to come up tails or very strongly to come up heads.
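When the probability of heads is itself drawn from a beta distribution like this, the resulting distribution of the number of heads is called the beta-binomial.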
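A quick simulation can show what the extra layer of randomness does. This is only a sketch under assumptions: the bag is a hypothetical beta distribution with parameters 2 and 2 (average weight 0.5), and each batch of 100 flips uses one weight drawn from that bag, which is how the beta-binomial is defined. The helper names are invented for this example.

```python
import random


def flips_from_bag(n_flips, alpha, beta, rng):
    """Draw one coin weight from a beta(alpha, beta) bag, then flip it n_flips times."""
    p_heads = rng.betavariate(alpha, beta)
    return sum(rng.random() < p_heads for _ in range(n_flips))


def flips_fixed_coin(n_flips, p_heads, rng):
    """Flip a single coin with a known weight n_flips times (plain binomial)."""
    return sum(rng.random() < p_heads for _ in range(n_flips))


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


rng = random.Random(0)  # fixed seed so the run is repeatable
n_flips, n_draws = 100, 5000

bag_heads = [flips_from_bag(n_flips, 2.0, 2.0, rng) for _ in range(n_draws)]
fair_heads = [flips_fixed_coin(n_flips, 0.5, rng) for _ in range(n_draws)]

print(f"bag:  mean {mean(bag_heads):.1f}, variance {variance(bag_heads):.1f}")
print(f"fair: mean {mean(fair_heads):.1f}, variance {variance(fair_heads):.1f}")
```

Both sets of draws average around 50 heads, but the counts from the bag are far more spread out. That extra spread is what the beta-binomial captures and the plain binomial cannot.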
Model set up
You might now be seeing where this is going. While we can’t observe individuals’ voting behavior (other than whether or not they voted), we can look at the tallies at local levels, like counties. And let’s say, some time before the election, you lined up every voter in a county and stacked them the same way you did with coins as before, but instead of the probability of “coming up heads”, you’d be looking at a voter’s probability of voting for one of the two major candidates. That would look like a beta distribution. You could then model the number of votes for a particular candidate in a particular county as a beta-binomial distribution.
So in our model we can say the number of votes V[i] in county i is distributed beta-binomial with N[i] voters and voters with p[i] propensity to vote for that candidate:
V[i] ~ binomial(p[i], N[i])
But we’re keeping in mind that p[i] is not a fixed number but is itself drawn from a beta distribution with parameters alpha[i] and beta[i]:
p[i] ~ beta(alpha[i], beta[i])
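Combining the two lines above gives the beta-binomial probability of seeing exactly v votes out of N, which has a closed form in terms of the beta function. Below is a minimal sketch using only the Python standard library. The county size and the alpha and beta values are hypothetical, picked for illustration and not taken from the fitted model.

```python
from math import exp, lgamma


def log_beta_fn(a: float, b: float) -> float:
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n: int, k: int) -> float:
    """Log of the binomial coefficient 'n choose k'."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_pmf(v: int, n: int, alpha: float, beta: float) -> float:
    """P(V = v) when V ~ binomial(p, n) and p ~ beta(alpha, beta)."""
    log_p = (
        log_choose(n, v)
        + log_beta_fn(v + alpha, n - v + beta)
        - log_beta_fn(alpha, beta)
    )
    return exp(log_p)


# Hypothetical county: 1000 voters, alpha = 6, beta = 4.
n_voters, alpha, beta = 1000, 6.0, 4.0
pmf = [beta_binomial_pmf(v, n_voters, alpha, beta) for v in range(n_voters + 1)]
expected_votes = sum(v * p for v, p in enumerate(pmf))

print(f"probabilities sum to {sum(pmf):.6f}")
print(f"expected votes: {expected_votes:.1f}")
```

Working on the log scale with `lgamma` keeps the large factorials from overflowing when N is in the thousands.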
So now we need to talk about alpha and beta. A beta distribution needs two parameters to tell you what kind of shape it has. Commonly, these are called alpha and beta (I know, it’s confusing to have the name of the distribution and one of its parameters be the same), and the way you can think about it is that alpha “pushes” the distribution to the right (i.e. in the lower right above) and that beta “pushes” the distribution to the left (i.e. in the upper left above). Both alpha and beta have to be greater than zero.
Unfortunately, while this helps us understand what’s going on with the shape of the distribution, it’s not a useful way to encapsulate the information if we were to talk about voting behavior. If something (say unemployment) were to “push” the distribution one way (say having an effect on alpha), it would also likely have an effect on beta (because they push in opposite directions). Ideally, we’d separate alpha and beta into two unrelated pieces of information. Let’s see how we can do that.
It’s a property of the beta distribution that its average is:
alpha / (alpha + beta)
So let’s just define a new term called mu that’s equal to this average:
mu = alpha / (alpha + beta)
And then we can define a new term phi like so:
phi = alpha / mu
With a few lines of arithmetic, we can solve for everything else:
phi = alpha + beta
alpha = mu * phi
beta = (1 - mu) * phi
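The conversion in both directions can be sketched in a few lines of Python. The function names are made up for this example; the arithmetic is just the three equations above.

```python
def to_mu_phi(alpha: float, beta: float) -> tuple[float, float]:
    """Convert (alpha, beta) to the mean mu and the total 'pushing' phi."""
    phi = alpha + beta
    mu = alpha / phi
    return mu, phi


def to_alpha_beta(mu: float, phi: float) -> tuple[float, float]:
    """Convert (mu, phi) back to the usual (alpha, beta) parameters."""
    return mu * phi, (1.0 - mu) * phi


# Round trip with arbitrary illustrative values.
mu, phi = to_mu_phi(6.0, 4.0)
alpha, beta = to_alpha_beta(mu, phi)
print(f"mu = {mu}, phi = {phi}")
print(f"alpha = {alpha}, beta = {beta}")
```

The round trip returns the starting values, which is a handy sanity check that the two parameterizations carry the same information.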
If alpha is the amount of “pushing” to the right and beta is the amount of “pushing” to the left in the distribution, then phi is all of the pushing (either left or right) in the distribution. This is a sort of “uniformity” parameter. Large values of phi mean that almost all of the distribution is near the average (think the upper right beta distribution above), because alpha and beta are pushing up against each other, and small values of phi mean that almost all the values are away from the average (think the beta distribution on the lower left above).
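One way to check this intuition is through the variance of the beta distribution, which in this parameterization works out to mu * (1 - mu) / (phi + 1). The sketch below compares a small and a large phi at the same average. The specific values (0.5 and 50) are arbitrary choices for illustration, and the shape labels follow the standard rule that a beta distribution is U-shaped when both alpha and beta are below 1.

```python
def beta_variance(mu: float, phi: float) -> float:
    """Variance of a beta distribution with mean mu and phi = alpha + beta."""
    return mu * (1.0 - mu) / (phi + 1.0)


def describe_shape(mu: float, phi: float) -> str:
    """Rough shape label based on the implied alpha and beta."""
    alpha, beta = mu * phi, (1.0 - mu) * phi
    if alpha < 1.0 and beta < 1.0:
        return "U-shaped: mass piles up near 0 and 1"
    if alpha > 1.0 and beta > 1.0:
        return "single interior peak"
    return "other (J-shaped or flat)"


for phi in (0.5, 50.0):
    print(
        f"phi = {phi:>4}: variance = {beta_variance(0.5, phi):.4f}, "
        f"{describe_shape(0.5, phi)}"
    )
```

Holding mu fixed, raising phi shrinks the variance, so the same average can describe either a polarized county or a uniform one.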
In this parameterization, we can model propensity and polarization independently.
So now we can use county-level information to set up regressions on mu and phi, and therefore on the county’s distribution of voters and how they ended up voting. Since mu has to be between 0 and 1 we use the logit link function, and since phi has to be greater than zero, we use the log link function:
logit(mu[i]) = linear function of predictors in county i
log(phi[i]) = linear function of predictors in county i
The “linear functions of predictors” have the format:
coef[uninsured] * uninsured[i] + coef[unemployment] * unemployment[i] + ...
Here, uninsured[i] is the uninsurance rate in that county and coef[uninsured] is the effect that uninsurance has on the average propensity of voters in that county (in the first equation) or the polarity/centrality of the voting distribution (in the second equation).
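To show how the two link functions turn linear predictors into valid beta parameters, here is a minimal Python sketch. Everything numeric in it is hypothetical: the coefficient values, the intercepts, and the county’s predictor values are invented for illustration and are not the fitted estimates. It also assumes each linear function includes an intercept term.

```python
from math import exp


def inv_logit(x: float) -> float:
    """Inverse of the logit link: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def linear_predictor(intercept: float, coefs: dict, predictors: dict) -> float:
    """intercept + coef[uninsured] * uninsured[i] + coef[unemployment] * ..."""
    return intercept + sum(coefs[name] * predictors[name] for name in coefs)


def county_alpha_beta(mu_intercept, mu_coefs, phi_intercept, phi_coefs, predictors):
    """Turn one county's predictors into the (alpha, beta) of its beta distribution."""
    mu = inv_logit(linear_predictor(mu_intercept, mu_coefs, predictors))
    phi = exp(linear_predictor(phi_intercept, phi_coefs, predictors))
    return mu * phi, (1.0 - mu) * phi


# Hypothetical standardized predictor values for one county.
county = {"uninsured": 0.8, "unemployment": -0.3}
# Hypothetical coefficients, NOT the estimates from the fitted model.
mu_coefs = {"uninsured": -0.4, "unemployment": 0.1}
phi_coefs = {"uninsured": 0.2, "unemployment": -0.1}

alpha, beta = county_alpha_beta(-0.5, mu_coefs, 3.0, phi_coefs, county)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

Whatever values the linear predictors take, the inverse logit keeps mu strictly between 0 and 1 and the exponential keeps phi positive, so alpha and beta always come out greater than zero.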
For each county, I extracted nine pieces of information:
- The proportion of residents that do not have insurance
- The rate of unemployment
- The rate of diabetes (a proxy for overall health levels)
- The median income
- The violent crime rate
- The median age
- The Gini coefficient (an index of income heterogeneity)
- The rate of high-school graduation
- The proportion of residents that are white
Since each of the above pieces of information had two coefficients (one each for the equations for mu and phi), and each equation also has an intercept, the model I used had twenty parameters against 3111 observations.
The source for the data is the same as in this post, and is available and described here.
BUGS model code is below: (all of the code is available here and the model code is in the file county_binom_model.bugs.R)
Model results / validation
The model performs very well on first inspection, especially when we take the log of the actual votes and the prediction (upper right plot), and even more so when we do that and restrict it only to counties with greater than 20,000 votes (lower left plot):
This is actually cheating a bit, since the number of votes for HRC (which the model is fitting) in any county is constrained by the number of votes overall. Here’s a plot showing the estimated proportion vs. the actual proportion of votes for HRC, weighted by the number of votes overall:
Here is the plot of coefficients for mu (the average propensity within a county):
All else being equal, coefficients to the left of the vertical bar helped Trump, and to the right helped Clinton. As we can see, since more Democratic support is concentrated in dense urban areas, there are many more counties that supported Trump, so the intercept is far to the left. Unsurprisingly (but perhaps sadly) whiteness was the strongest predictor overall and was very strong for Trump.
In addition, the rate of uninsurance was a relatively strong predictor for Trump support, and diabetes (a proxy for overall health) was a smaller but significant factor.
Economic factors (gini / income inequality, and unemployment) were either not a factor or predicted support for Clinton.
The effects on polarity can be seen here:
What we can see here (as the intercept is far to the right) is that most individual counties have a fairly uniform voter base. High rates of whiteness predict high uniformity, and basically nothing except for income inequality predicts diversity in voting patterns (and this is unsurprising).
What is also striking is that we can map mu and phi against each other. This is a plot of “uniformity” (how similar voting preferences are within a county) vs. “propensity” (the average direction a vote will go within a county). In this graph, mu is on the y axis, and log(phi) is on the x axis, and the size of a county is represented by the size of a circle:
What we see is a positive relationship between support for Trump and uniformity within a county and vice versa.
And if you’re interested in Bayesian inference using Gibbs sampling, here are the trace plots for the parameters to show they converged nicely: mu trace / phi trace.
Conclusion and potential next steps
This modeling approach has the advantage of closely approximating the underlying dynamics of voting, and the plots showing the actual outcome vs. predicted outcome show the model has pretty good fit.
It also shows that whiteness was a major driver of Trump support, and that economic factors on their own were decidedly not a factor in supporting Trump. If anything, they predicted support for Clinton. It also provides an interesting way of directly modeling unit-level (in this case, county-level) uniformity / polarity among the electorate. This approach could perhaps be of use in better identifying “swing counties” (or at least a different approach in identifying them).
This modeling approach can be extended in a number of interesting ways. For example, instead of using a beta-binomial distribution to model two-way voting patterns, we could use a Dirichlet-multinomial distribution (basically, the extension of beta-binomial to more than 2 possible outcomes) to model voting patterns across all candidates (including Libertarian and Green), and even flexibly model turnout by including not voting as an outcome in the distribution.
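As a rough illustration of what that extension would look like, here is a minimal Python sketch of a single draw from a Dirichlet-multinomial. The outcome labels, the concentration values, and the county size are all hypothetical, and this is not part of the model fitted in this post. The Dirichlet draw uses the standard trick of normalizing independent gamma draws.

```python
import random


def dirichlet_multinomial_draw(n_voters, concentrations, rng):
    """Draw outcome shares from a Dirichlet, then assign n_voters to outcomes."""
    # Normalized independent gamma draws follow a Dirichlet distribution.
    gammas = [rng.gammavariate(a, 1.0) for a in concentrations]
    total = sum(gammas)
    shares = [g / total for g in gammas]
    # Each voter independently lands on one outcome with these shares.
    picks = rng.choices(range(len(shares)), weights=shares, k=n_voters)
    counts = [0] * len(shares)
    for outcome in picks:
        counts[outcome] += 1
    return counts


# Hypothetical outcomes and concentration parameters for one county.
outcomes = ["Clinton", "Trump", "other", "did not vote"]
concentrations = [4.0, 5.0, 0.5, 6.0]

rng = random.Random(1)  # fixed seed so the run is repeatable
counts = dirichlet_multinomial_draw(10_000, concentrations, rng)
for name, count in zip(outcomes, counts):
    print(f"{name:>12}: {count}")
```

With only two outcomes this reduces to the beta-binomial set up used above, with the two concentration values playing the roles of alpha and beta.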
We could build similar regressions for past elections and see how coefficients have changed over time.
We could even match voting records across the ’12 and ’16 elections to make inferences about the components of the county-level vote swing: voters flipping their vote, voting in ’12 and not voting in ’16, or not voting in ’12 and then voting in ’16 – and which candidate they came to support.