Analytics · 2026-05-29

Bayesian opponent-adjusted FVOA: stabler week-to-week NFL rankings

By Nick · May 29, 2026 · 5 min read

The problem with raw EPA leaderboards

Expected Points Added per play (EPA/play) is the right currency for team efficiency. It already credits a 4-yard pass on 3rd-and-3 differently from a 4-yard pass on 3rd-and-7, because Expected Points already knows the down and distance. But the moment you sort a leaderboard by team-average EPA, you walk into two well-known problems:

No opponent adjustment. Through Week 4, the Bears could lead the league in offensive EPA because they've played four bottom-10 defenses. Or the Broncos' defense could rank #1 because they've drawn three rookie quarterbacks in a row. Raw averages can't see that.

Small-sample variance. A team that has 47 offensive snaps through Week 1 will have a wider distribution of plausible "true" EPAs than a team with 280 snaps. Sorting them as if they're equally precise produces a leaderboard whose top and bottom flip wildly week to week.
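To put rough numbers on the small-sample point, here is a minimal sketch of the standard error of a team's mean EPA/play at the two snap counts mentioned above. The per-play standard deviation of 1.3 is an assumed, illustrative value, not one taken from our fits.

```python
import math

# Assumed per-play EPA standard deviation (illustrative, not a fitted value).
SIGMA_PLAY = 1.3


def standard_error(n_snaps: int, sigma_play: float = SIGMA_PLAY) -> float:
    """Standard error of a team's mean EPA/play after n_snaps plays."""
    return sigma_play / math.sqrt(n_snaps)


for n in (47, 280):
    se = standard_error(n)
    # A rough 95% interval half-width is about 1.96 standard errors.
    print(f"{n:>3} snaps: SE = {se:.3f}, 95% half-width = {1.96 * se:.3f}")
```

Under that assumption the 47-snap team's mean is a bit more than twice as uncertain as the 280-snap team's, yet a raw leaderboard sorts them side by side.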

The Bayesian opponent-adjusted FVOA we now ship on /nfl/rankings (the "Adj FVOA" column) and on every team's profile page solves both problems with one hierarchical model.

The model, in one paragraph

For every offensive play i in the season:


epa[i] ~ Normal(α + offense_effect[posteam[i]] - defense_effect[defteam[i]], σ_play)

offense_effect[t] and defense_effect[t] are per-team coefficients drawn from a shared Normal(0, σ) prior. They represent "this team's offense plays this much above/below league average per snap, net of the opponents they faced."

The minus sign on defense_effect is a sign convention: a defense that suppresses EPA gets a positive coefficient, so when we report fvoa_net = offense_effect + defense_effect, a 0.18 means the team is +0.18 EPA/play above average across both sides of the ball.
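The sign convention is easy to get backwards, so here is a small sketch of how the coefficients combine. The team values and the zero intercept are hypothetical numbers chosen for illustration; only the formulas come from the model above.

```python
# Hypothetical coefficients for illustration only; these are not fitted values.
ALPHA = 0.0  # assumed league-average EPA/play intercept
offense_effect = {"KC": 0.10, "CHI": -0.05}
defense_effect = {"KC": 0.08, "CHI": 0.02}  # positive = suppresses opponent EPA


def expected_epa(posteam: str, defteam: str) -> float:
    """Mean of the play-level likelihood for one offense/defense pairing."""
    return ALPHA + offense_effect[posteam] - defense_effect[defteam]


def fvoa_net(team: str) -> float:
    """Net rating across both sides of the ball; higher is always better."""
    return offense_effect[team] + defense_effect[team]


print(expected_epa("KC", "CHI"))  # KC offense facing the CHI defense
print(expected_epa("CHI", "KC"))  # CHI offense facing the KC defense
print(fvoa_net("KC"))
```

Because a good defense carries a positive coefficient, it lowers the opponent's expected EPA in the likelihood but raises the team's own fvoa_net.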

Why partial pooling fixes the week-3 problem

Without a hierarchical prior, you have two options:

- No pooling: fit a free coefficient per team. Tiny samples produce huge per-team variances. Week 3 leaderboards swing wildly.

- Full pooling: force every team to the same number. The leaderboard is useless.

The hierarchical prior offense_effect[t] ~ Normal(0, σ_off) sits between the two. σ_off is itself a parameter, fit from the data. When the season is young and each team's sample is small, the posterior pulls every team's coefficient toward zero: that is partial pooling. As the season matures, each team's own plays carry more weight and the estimated σ_off reflects the real spread between teams, so the pooling weakens.

The practical effect: Mahomes throwing a 65-yard garbage-time touchdown in Week 1 does not make Kansas City the #1 offense by Week 2. The model knows it shouldn't be that confident yet.
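The amount of shrinkage can be sketched with the textbook normal-normal formula. This is a simplification under stated assumptions: it treats σ_off and σ_play as known (the values 0.10 and 1.3 are illustrative), ignores the opponent term, and is not the full model.

```python
def shrinkage_weight(n_snaps: int, sigma_off: float, sigma_play: float) -> float:
    """Share of the raw team mean that survives pooling (between 0 and 1)."""
    prior_var = sigma_off ** 2                # spread of true team effects
    sampling_var = sigma_play ** 2 / n_snaps  # noise in the team's raw mean
    return prior_var / (prior_var + sampling_var)


def pooled_estimate(raw_mean: float, n_snaps: int,
                    sigma_off: float = 0.10, sigma_play: float = 1.3) -> float:
    """Posterior mean of a team effect under a Normal(0, sigma_off) prior."""
    return shrinkage_weight(n_snaps, sigma_off, sigma_play) * raw_mean


# The same raw +0.30 EPA/play is trusted more as the snap count grows.
for n in (47, 280, 1000):
    print(n, round(pooled_estimate(0.30, n), 3))
```

With these assumed scales the weight on the raw mean is roughly 0.22 at 47 snaps, 0.62 at 280 and 0.86 at 1,000, which is why one big Week 1 play barely moves the ranking.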

Identifiability: the sum-to-zero reparam

There's a sneaky degree of freedom in the model as written. You can add any constant c to every team's offense_effect and subtract c from α without changing the likelihood. Only the priors pin down that shared offset, so with flat priors the posterior would be improper, and even with a weak prior on α you'd see the chains slowly drift along this ridge.

We fix it the standard way: after sampling offense_effect_raw ~ Normal(0, σ_off).expand([n_teams]), we deterministically subtract the cross-team mean before reporting:


off = numpyro.deterministic("offense_effect", off_raw - off_raw.mean())

This isn't centering for cosmetic reasons. It makes the parameter we report, "team T's deviation from the league average," a quantity recorded in every posterior draw, which is also exactly what every downstream consumer of this number wants.
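The ridge and its fix can be checked numerically. A minimal sketch with four hypothetical teams and made-up values, with the defense term left out for brevity:

```python
from statistics import fmean

# Hypothetical raw draws for four teams; illustrative values only.
alpha = 0.02
off_raw = [0.15, -0.05, 0.08, -0.10]


def play_means(intercept, offense):
    """Likelihood mean per offense against a league-average defense."""
    return [intercept + o for o in offense]


def center(values):
    """Subtract the cross-team mean, as in the deterministic site above."""
    mean = fmean(values)
    return [v - mean for v in values]


# Shift every offense effect by c and absorb the shift into the intercept.
c = 0.3
shifted_alpha = alpha - c
shifted_off = [o + c for o in off_raw]

same_likelihood = all(
    abs(a - b) < 1e-12
    for a, b in zip(play_means(alpha, off_raw),
                    play_means(shifted_alpha, shifted_off))
)
same_after_centering = all(
    abs(a - b) < 1e-12
    for a, b in zip(center(off_raw), center(shifted_off))
)
print(same_likelihood, same_after_centering)
```

Both parameterizations predict the same play-level means, so the data cannot tell them apart, while the centered effects are identical in either case.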

Why we sample with NUTS

The conjugate-Bayes shortcut for this model would require restrictive prior choices (e.g., Normal-Inverse-Gamma) and would still need numerical work for the per-team credible intervals. We use the No-U-Turn Sampler (NUTS) via NumPyro because:

- We can keep the priors honest (HalfNormal on the variance components, not Inverse-Gamma).

- We get per-team posterior samples for free, and credible intervals are just np.quantile(samples, [0.025, 0.975]).

- JAX's JIT-compiled gradient evaluations make this fast enough to refit a whole season in roughly a minute on CI hardware.
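To show how little work the interval step is once draws exist, here is a standard-library sketch of the quantile calculation. The draws are simulated from an assumed Normal(0.12, 0.04) posterior for one hypothetical team; in a real fit they would come from the sampler.

```python
import random
from statistics import quantiles


def credible_interval_95(samples):
    """Equal-tailed 95% interval from posterior draws."""
    # n=40 yields 39 cut points at 2.5%, 5%, ..., 97.5%.
    cuts = quantiles(samples, n=40, method="inclusive")
    return cuts[0], cuts[-1]


# Simulated stand-in for 2 chains x 2,000 draws of one team's effect.
rng = random.Random(0)
draws = [rng.gauss(0.12, 0.04) for _ in range(4000)]

lo, hi = credible_interval_95(draws)
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```

The same two cut points are what np.quantile(samples, [0.025, 0.975]) returns on an array of draws.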

We run 1,000 warmup steps + 2,000 sampling steps × 2 sequential chains. The Gelman-Rubin r̂ statistic is reported per team — when a team's chains haven't mixed (r̂ > 1.5), we drop the row rather than publishing a number we can't stand behind.
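For readers who want to see what the r̂ check measures, below is a sketch of the classic (non-split) Gelman-Rubin statistic in plain Python, applied to simulated chains. The helper names and the simulated draws are hypothetical; the 1.5 cutoff is the one stated above.

```python
import math
import random
from statistics import fmean, variance

RHAT_MAX = 1.5  # publication cutoff quoted in the text


def gelman_rubin_rhat(chains):
    """Classic Gelman-Rubin statistic for equal-length chains of one parameter."""
    n = len(chains[0])
    within = fmean(variance(chain) for chain in chains)         # W
    between = n * variance([fmean(chain) for chain in chains])  # B
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


def publishable(chains):
    """Keep a team's row only when its chains agree closely enough."""
    return gelman_rubin_rhat(chains) <= RHAT_MAX


rng = random.Random(1)
# Two chains exploring the same distribution: r-hat should sit near 1.
mixed = [[rng.gauss(0.12, 0.04) for _ in range(2000)] for _ in range(2)]
# Two chains stuck in different places: r-hat should be far above 1.
stuck = [[rng.gauss(mu, 0.04) for _ in range(2000)] for mu in (0.0, 0.3)]

print(round(gelman_rubin_rhat(mixed), 3), publishable(mixed))
print(round(gelman_rubin_rhat(stuck), 3), publishable(stuck))
```

The statistic compares the variance between chain means to the variance within chains, so it approaches 1 only when both chains describe the same posterior.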

What this isn't

This isn't ESPN's Football Power Index (FPI), which folds in preseason priors, schedule strength, and prior-season weights. We deliberately stay within the current season's plays so the rating is interpretable as "what this roster has actually produced in 2025."

This isn't Football Outsiders' DVOA, which uses success rate against league-average baselines computed per situation. Our raw input is EPA, which is itself derived from a model of expected points conditional on situation, so situational adjustment is already baked in. The thing we add on top is opponent adjustment with proper uncertainty.

Caveats we live with

- Early-season uncertainty. The 95% credible bands in Week 4 are wide. That's correct — we don't know the rankings yet. If our band overlaps with another team's, we genuinely don't know which is better.

- No score-state weighting. A team's blowout-loss plays count the same as their close-game plays. There are good arguments for downweighting garbage-time snaps, but we'd rather not put our thumb on the scale, so users see the raw posterior. A future revision will likely add a score_diff_abs < 21 filter as a toggle.

- Per-season fits, not per-week. We refit once per season using all available weeks. Refitting weekly would cost ~17× the compute for no qualitative gain past Week 4-5, where the partial pooling has already done its work.

- Special teams + penalties. EPA already credits them when they happen on offensive snaps. But a team that wins on returns and field-position swings will be slightly underrated by this rating, which only looks at scrimmage plays.
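The score_diff_abs < 21 toggle mentioned in the caveats could look something like the sketch below. The play records and the function name are hypothetical; only the field name and threshold come from the text, and this is not shipped code.

```python
from typing import List, TypedDict


class Play(TypedDict):
    epa: float
    score_diff_abs: int  # absolute score margin when the play was run


def filter_close_game(plays: List[Play], max_abs_diff: int = 21,
                      enabled: bool = True) -> List[Play]:
    """Optionally drop plays run with the margin at or beyond max_abs_diff."""
    if not enabled:
        return list(plays)
    return [p for p in plays if p["score_diff_abs"] < max_abs_diff]


# Hypothetical plays: the last one is a garbage-time snap.
plays: List[Play] = [
    {"epa": 0.4, "score_diff_abs": 3},
    {"epa": -0.2, "score_diff_abs": 20},
    {"epa": 1.9, "score_diff_abs": 28},
]

print(len(filter_close_game(plays)))                 # toggle on
print(len(filter_close_game(plays, enabled=False)))  # toggle off
```

Keeping the filter as a pre-processing step would leave the model itself unchanged, so both versions of the posterior stay directly comparable.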

Reading the column

On /nfl/rankings, "Adj FVOA" is the team's fvoa_net_mean × 100 — a positive number means the team is above league average, scaled to be readable as EPA-per-100-plays. The color shade is the percentile of that team within the season's pool of 32. On a player profile, the "Team Adj FVOA" chip in the header carries the same number plus the team's 1-32 rank.
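As a rough sketch of how the displayed values relate to the posterior means, the snippet below scales, ranks and shades four hypothetical teams. The team values are made up, and the percentile definition (share of other teams outranked) is an assumption rather than the site's exact formula.

```python
# Hypothetical posterior means of fvoa_net; illustrative values only.
fvoa_net_mean = {"KC": 0.18, "CHI": -0.04, "DEN": 0.07, "NYJ": -0.11}


def adj_fvoa_table(means):
    """Scale to EPA per 100 plays, then attach rank and an assumed percentile."""
    ordered = sorted(means, key=means.get, reverse=True)
    n = len(ordered)
    table = {}
    for rank, team in enumerate(ordered, start=1):
        table[team] = {
            "adj_fvoa": round(means[team] * 100, 1),
            "rank": rank,
            # Assumed definition: share of the other teams this team outranks.
            "percentile": (n - rank) / (n - 1) if n > 1 else 1.0,
        }
    return table


for team, row in adj_fvoa_table(fvoa_net_mean).items():
    print(team, row)
```

A team with fvoa_net_mean of 0.18 therefore shows as 18.0, meaning about 18 expected points above average per 100 plays across both sides of the ball.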

If you'd rather see receiving-room context than team strength, the same page now has a "Stable snap %" column — empirical-Bayes smoothing for snap shares, the topic of the next article in this series.


