The Negative Binomial GLM

Predicting Disease Waves: The Negative Binomial GLM for Weekly Epidemic Incidence

Your Statistical Surfboard for Riding Epidemic Waves


🌊 Introduction: Why Weekly Counts Need Special Handling

Imagine you’re tracking weekly flu cases in your city. Some weeks you see 120 cases, the next week 85, then 210—what’s going on? Is this just random noise, or is a real outbreak brewing?

This is where a Generalized Linear Model (GLM) for weekly incidence with a Negative Binomial response distribution earns its place. Unlike simple linear regression (which assumes normally distributed errors with constant variance), this model respects the fundamental nature of disease count data: discrete, non-negative integers that often show more variability than a Poisson model allows.

The key insight? Disease counts are often overdispersed, meaning their variance exceeds their mean, whereas a Poisson model forces the two to be equal. One week might have a superspreading event, another might catch the tail end of an outbreak, creating “extra” variability that needs proper statistical accounting.

Developed from the foundational work of Nelder and Wedderburn on GLMs [1] and refined through decades of epidemiological applications [2-3], the Negative Binomial GLM has become the workhorse model for routine epidemic surveillance, seasonal forecasting, and intervention evaluation.


🧮 Model Description: The Mathematical Engine

At its heart, the Negative Binomial GLM for weekly incidence combines three essential components: a probability distribution for the data, a linear predictor for systematic effects, and a link function connecting them.

Core Model Specification

Yₜ ~ NegBin(λₜ, θ)
log λₜ = α + β · Xₜ + A · sin(2πt / P) + C · cos(2πt / P)

Let’s unpack this elegant equation step by step:

  • Yₜ: Observed number of disease cases in week t
  • λₜ: Expected (mean) number of cases in week t
  • θ (theta): Dispersion parameter (θ > 0); smaller values mean stronger overdispersion
  • α (alpha): Intercept—baseline log-incidence when all other factors are zero
  • β (beta): Vector of coefficients for measured covariates
  • Xₜ: Vector of explanatory variables in week t (e.g., temperature, vaccination coverage, mobility)
  • A, C: Coefficients of the sine and cosine terms; together they set the amplitude and timing of the seasonal cycle
  • P: Period of seasonal cycle (typically P = 52 weeks for annual seasonality)
  • t: Week number (t = 1, 2, …, T)
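
To make the specification concrete, here is a minimal sketch that evaluates the linear predictor and back-transforms it to the expected count λₜ. The function name expected_cases and every parameter value are illustrative assumptions, not estimates from real data.

```python
import math

def expected_cases(t, x, alpha, beta, A, C, P=52):
    """Return lambda_t = exp(alpha + beta . x + A sin(2 pi t / P) + C cos(2 pi t / P))."""
    angle = 2 * math.pi * t / P
    covariate_part = sum(b * xi for b, xi in zip(beta, x))
    log_lambda = alpha + covariate_part + A * math.sin(angle) + C * math.cos(angle)
    return math.exp(log_lambda)

# Hypothetical inputs: a single covariate value of 0.5 and modest seasonality.
lam_week_13 = expected_cases(t=13, x=[0.5], alpha=3.0, beta=[0.3], A=0.8, C=0.2)
print(round(lam_week_13, 1))
```

Because the model is linear on the log scale, each term acts multiplicatively on λₜ once exponentiated.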

Understanding the Negative Binomial Distribution

The Negative Binomial distribution is defined by its mean-variance relationship:

E[Yₜ] = λₜ
Var[Yₜ] = λₜ + λₜ² / θ

This is crucial! Notice that the variance has two components:

  • λₜ: The “Poisson-like” component (variance equals mean)
  • λₜ² / θ: The “extra” overdispersion component

When θ → ∞, the model approaches a Poisson distribution (Var[Yₜ] → λₜ). When θ is small, overdispersion is strong, accommodating the bursty nature of real epidemic data.
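
The mean-variance relationship can be checked numerically. The sketch below writes the Negative Binomial probability mass function in the (λ, θ) parameterization used here and sums it over a long range of counts; the helper name negbin_pmf and the values λ = 10, θ = 2 are assumptions chosen for illustration.

```python
import math

def negbin_pmf(k, lam, theta):
    """P(Y = k) for a Negative Binomial with mean lam and dispersion theta."""
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + lam))
        + k * math.log(lam / (theta + lam))
    )
    return math.exp(log_p)

lam, theta = 10.0, 2.0
ks = range(0, 2000)  # long enough that the truncated tail is negligible
probs = [negbin_pmf(k, lam, theta) for k in ks]

mean = sum(k * p for k, p in zip(ks, probs))
var = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

print(round(mean, 4), round(var, 4))  # compare with lam and lam + lam**2 / theta
```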

Seasonal Modeling with Sine and Cosine

The seasonal terms A · sin(2πt / P) + C · cos(2πt / P) create a flexible annual cycle. This can be rewritten as:

R · sin(2πt / P + φ)

Where:

  • R = √(A² + C²): Overall seasonal amplitude
  • φ = atan2(C, A): Phase shift (sets the timing of the peak season); this equals arctan(C / A) when A > 0

This formulation ensures smooth, periodic seasonal patterns without artificial discontinuities at year boundaries.
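
The conversion from (A, C) to amplitude and phase can be sketched as follows. The two-argument arctangent is used so the phase lands in the correct quadrant for any signs of A and C; the function names and the example coefficients are hypothetical.

```python
import math

def amplitude_phase(A, C):
    """Return (R, phi) such that A sin(x) + C cos(x) = R sin(x + phi)."""
    R = math.hypot(A, C)
    phi = math.atan2(C, A)
    return R, phi

def peak_week(A, C, P=52):
    """Week in [0, P) at which the seasonal term reaches its maximum."""
    _, phi = amplitude_phase(A, C)
    # The sine peaks when 2 pi t / P + phi = pi / 2.
    return (P * (math.pi / 2 - phi) / (2 * math.pi)) % P

A, C = 0.8, 0.2  # hypothetical fitted coefficients
R, phi = amplitude_phase(A, C)
print(round(R, 3), round(phi, 3), round(peak_week(A, C), 1))
```

Reporting the peak week rather than A and C separately is usually easier for a surveillance audience to read.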


📊 Key Parameter Definitions and Typical Values

Understanding these parameters helps you interpret model outputs and assess model fit.

Parameter | Meaning                | Typical values | Interpretation
α         | Baseline log-incidence | -5 to 2        | Higher α = higher overall disease burden
β         | Covariate effects      | -1 to 1        | β = 0.3 means 35% higher incidence per unit X
A, C      | Seasonal amplitudes    | -2 to 2        | Larger
θ         | Overdispersion         | 0.1 – 10       | Smaller θ = more overdispersion
P         | Seasonal period        | 52 (annual)    | Fixed for most infectious diseases
T         | Total weeks            | 52 – 520       | Longer series = better seasonal estimation
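
The 35% figure in the β entry above comes from exponentiating the coefficient, since the model is linear on the log scale. A small sketch of that arithmetic, plus the peak-to-trough ratio implied by the seasonal coefficients, is shown here; both function names are hypothetical helpers.

```python
import math

def percent_change(beta, delta_x=1.0):
    """Percent change in expected incidence for a delta_x increase in the covariate."""
    return 100.0 * (math.exp(beta * delta_x) - 1.0)

def peak_to_trough_ratio(A, C):
    """Ratio of expected incidence at the seasonal peak to that at the trough."""
    # The seasonal term swings between +R and -R on the log scale, R = sqrt(A^2 + C^2).
    return math.exp(2.0 * math.hypot(A, C))

print(round(percent_change(0.3), 1))    # beta = 0.3
print(round(percent_change(-0.3), 1))   # a negative effect is not symmetric in percent terms
print(round(peak_to_trough_ratio(0.8, 0.2), 2))
```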

Variance Interpretation Examples

  • If λₜ = 100 and θ = 10: Var[Yₜ] = 100 + 100²/10 = 1,100 (SD ≈ 33)
  • If λₜ = 100 and θ = 100: Var[Yₜ] = 100 + 100²/100 = 200 (SD ≈ 14)
  • Poisson case (θ → ∞): Var[Yₜ] = 100 (SD = 10)

Real epidemic data typically show an SD much larger than √λₜ, which is what justifies the Negative Binomial approach.
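
A quick way to see this in data is to compare the sample variance with the sample mean. The sketch below reproduces the variance arithmetic above and then computes a crude method-of-moments estimate of θ. The weekly counts are invented for illustration, and the moment estimate assumes a roughly constant mean, so it is a rough diagnostic rather than a substitute for fitting the model.

```python
import math
import statistics

def negbin_variance(lam, theta=math.inf):
    """Var[Y] = lam + lam^2 / theta; theta = inf gives the Poisson case."""
    return lam + lam ** 2 / theta

def moment_theta(counts):
    """Method-of-moments dispersion estimate: m^2 / (s^2 - m), if s^2 > m."""
    m = statistics.mean(counts)
    s2 = statistics.variance(counts)
    if s2 <= m:
        return math.inf  # no evidence of overdispersion
    return m ** 2 / (s2 - m)

for theta in (10, 100, math.inf):
    print(theta, negbin_variance(100, theta), round(math.sqrt(negbin_variance(100, theta)), 1))

weekly_counts = [120, 85, 210, 95, 160, 70, 140, 180]  # hypothetical data
dispersion_index = statistics.variance(weekly_counts) / statistics.mean(weekly_counts)
theta_hat = moment_theta(weekly_counts)
print(round(dispersion_index, 1), round(theta_hat, 2))
```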


⚠️ Assumptions and Applicability: When This Model Shines

The Negative Binomial GLM is powerful but works best under specific conditions.

✅ Ideal Applications

  • Weekly or monthly aggregated data: Natural time scale for surveillance
  • Moderate to high case counts: Yₜ typically > 5 on average
  • Clear seasonal patterns: Diseases like influenza, RSV, or enteroviruses
  • Available covariates: Weather, interventions, or behavioral data
  • Stationary seasonality: Seasonal pattern doesn’t change dramatically over time

❌ Limitations and Challenges

  • Very rare diseases: When Yₜ = 0 frequently, models become unstable
  • Rapidly changing dynamics: During emerging outbreaks with exponential growth
  • Non-stationary seasonality: Climate change altering seasonal patterns
  • Autocorrelation: Current week’s cases depend on previous weeks (requires time series extensions)

💡 Pro Tip: Always check residual plots—if you see patterns in residuals over time, you might need to add autoregressive terms or use a more complex time series model [4].
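
One simple version of that residual check is sketched here: Pearson residuals scaled by the Negative Binomial standard deviation, followed by their lag-1 autocorrelation. The observed counts, fitted means, and θ are hypothetical inputs; in practice they would come from your fitted model.

```python
import math

def pearson_residuals(y, mu, theta):
    """(y - mu) / sqrt(mu + mu^2 / theta) for each week."""
    return [(yi - mi) / math.sqrt(mi + mi ** 2 / theta) for yi, mi in zip(y, mu)]

def lag1_autocorrelation(r):
    """Sample lag-1 autocorrelation of a residual series."""
    mean_r = sum(r) / len(r)
    centered = [ri - mean_r for ri in r]
    denom = sum(c * c for c in centered)
    num = sum(a * b for a, b in zip(centered[:-1], centered[1:]))
    return num / denom

y = [110, 125, 140, 120, 95, 80, 90, 105]   # hypothetical observed counts
mu = [100.0] * len(y)                        # hypothetical fitted means
residuals = pearson_residuals(y, mu, theta=10.0)
print(round(lag1_autocorrelation(residuals), 2))
```

A lag-1 value well away from zero suggests week-to-week dependence that the seasonal terms and covariates have not absorbed.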


🚀 Model Extensions and Variants: Beyond Basic Seasonality

The basic Negative Binomial GLM serves as a foundation for numerous sophisticated extensions.

1. Autoregressive Negative Binomial Model

To account for week-to-week dependence (common in epidemics):

log λₜ = α + β · Xₜ + A · sin(2πt / P) + C · cos(2πt / P) + γ · log(Yₜ₋₁ + c)

Where γ captures the autoregressive effect and c is a small constant to handle zero counts [5].
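
As a rough sketch of how the autoregressive term behaves, the function below takes the non-autoregressive part of the linear predictor as a single number and adds γ · log(Yₜ₋₁ + c). The name ar_expected_cases and the default c = 1 are assumptions for illustration.

```python
import math

def ar_expected_cases(base_log_lambda, y_prev, gamma, c=1.0):
    """lambda_t = exp(base_log_lambda + gamma * log(y_prev + c))."""
    return math.exp(base_log_lambda + gamma * math.log(y_prev + c))

# Hypothetical one-step-ahead expectations after a quiet week and a busy week.
quiet = ar_expected_cases(base_log_lambda=1.0, y_prev=0, gamma=0.5)
busy = ar_expected_cases(base_log_lambda=1.0, y_prev=99, gamma=0.5)
print(round(quiet, 2), round(busy, 2))
```

Algebraically this is exp(base) · (Yₜ₋₁ + c)^γ, so γ acts as an elasticity of this week's expected count with respect to last week's.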

2. Piecewise Linear Trend Model

For diseases with changing long-term trends (e.g., due to vaccination programs):

log λₜ = α + β · Xₜ + δ · (t − τ)₊ + seasonal terms

Where (t − τ)₊ = max(0, t − τ) and τ is a changepoint (e.g., vaccine introduction) [6].
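
The hinge term is easy to construct by hand. In this sketch the changepoint τ = 20 and slope δ = −0.01 are made-up values; the multiplier shows how much the post-changepoint trend scales expected incidence relative to no trend.

```python
import math

def hinge(t, tau):
    """(t - tau)_+ = max(0, t - tau)."""
    return max(0, t - tau)

def trend_multiplier(t, tau, delta):
    """Multiplicative effect exp(delta * (t - tau)_+) on expected incidence."""
    return math.exp(delta * hinge(t, tau))

tau, delta = 20, -0.01  # hypothetical changepoint week and weekly log-slope
for week in (10, 20, 72):
    print(week, hinge(week, tau), round(trend_multiplier(week, tau, delta), 3))
```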

3. Multivariate Negative Binomial Model

For multiple related diseases or age groups simultaneously:

Yₜ,ₖ ~ NegBin(λₜ,ₖ, θₖ)
log λₜ,ₖ = αₖ + βₖ · Xₜ + seasonalₖ + εₜ,ₖ

With Cov(εₜ,ₖ, εₜ,ₖ’) capturing cross-disease correlations [7].

4. Zero-Inflated Negative Binomial Model

For diseases with excess zeros (many weeks with no cases):

Yₜ ~ (1−πₜ) · δ₀ + πₜ · NegBin(λₜ, θ)
logit(πₜ) = α₀ + β₀ · Zₜ

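Here πₜ is the probability that week t belongs to the count-generating component, and δ₀ is a point mass at zero. This separates structural zeros (weeks in which no cases could be reported) from the intensity of the count process, although the Negative Binomial component can still produce zeros of its own [8].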
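
A minimal sketch of the mixture's probability mass function follows, using the same convention as the equations above (πₜ weights the Negative Binomial component). The function names and the values λ = 4, θ = 2, π = 0.7 are illustrative assumptions.

```python
import math

def negbin_pmf(k, lam, theta):
    """P(Y = k) for a Negative Binomial with mean lam and dispersion theta."""
    log_p = (
        math.lgamma(k + theta) - math.lgamma(theta) - math.lgamma(k + 1)
        + theta * math.log(theta / (theta + lam))
        + k * math.log(lam / (theta + lam))
    )
    return math.exp(log_p)

def zinb_pmf(k, lam, theta, pi):
    """Zero-inflated NB: point mass at zero with weight 1 - pi, NB with weight pi."""
    count_part = pi * negbin_pmf(k, lam, theta)
    return (1.0 - pi) + count_part if k == 0 else count_part

lam, theta, pi = 4.0, 2.0, 0.7
print(round(negbin_pmf(0, lam, theta), 4), round(zinb_pmf(0, lam, theta, pi), 4))
```

Comparing the two zero probabilities shows how much of the zero mass comes from the inflation component rather than from the count process.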

5. Hierarchical Negative Binomial Model

For multi-region surveillance (e.g., all US states):

log λₜ,ᵣ = αᵣ + βᵣ · Xₜ,ᵣ + seasonalᵣ + uᵣ
uᵣ ~ Normal(0, σ²ᵤ)

Where regional parameters αᵣ, βᵣ are drawn from common distributions, enabling information sharing [9].
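
To get a feel for what the random effect does, the sketch below simulates regional baseline incidence under a simplified version of the model: a shared intercept α plus a regional deviation uᵣ, with covariates and seasonality left out. The function name, the seed, and all values are assumptions made for illustration only.

```python
import math
import random

def simulate_regional_baselines(n_regions, alpha, sigma_u, seed=0):
    """Draw u_r ~ Normal(0, sigma_u^2) and return exp(alpha + u_r) per region."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, sigma_u) for _ in range(n_regions)]
    return [math.exp(alpha + u_r) for u_r in u]

baselines = simulate_regional_baselines(n_regions=5, alpha=3.0, sigma_u=0.5, seed=42)
print([round(b, 1) for b in baselines])
```

A larger σᵤ spreads the regional baselines further apart; as σᵤ approaches zero, every region collapses onto the shared baseline exp(α).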

6. Real-Time Nowcasting Extension

To handle reporting delays in recent weeks:

Yₜᵒᵇˢ = Σ_{d=0}^{D} Y_{t−d} · p(d)
Yₜ ~ NegBin(λₜ, θ)

Where p(d) is the reporting delay distribution and Yₜᵒᵇˢ are observed (delayed) counts [10].
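
The delay convolution can be sketched directly. This computes the expected observed series from a given true series and delay distribution; it is the forward calculation only, not the inference step that nowcasting performs. The true counts and the delay probabilities are hypothetical.

```python
def expected_observed(true_counts, delay_pmf):
    """Convolve true weekly counts with the reporting-delay distribution p(d)."""
    observed = []
    for t in range(len(true_counts)):
        total = 0.0
        for d, p in enumerate(delay_pmf):
            if t - d >= 0:  # weeks before the start of the series contribute nothing
                total += true_counts[t - d] * p
        observed.append(total)
    return observed

true_counts = [100, 100, 100, 200]   # hypothetical true incidence
delay_pmf = [0.5, 0.3, 0.2]          # hypothetical p(0), p(1), p(2); sums to 1
print([round(v, 1) for v in expected_observed(true_counts, delay_pmf)])
```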


🎯 Conclusion: The Workhorse of Epidemic Surveillance

The Negative Binomial GLM for weekly incidence represents statistical epidemiology at its most practical and powerful. By respecting the discrete, overdispersed nature of disease count data while incorporating seasonal patterns and covariate effects, it provides a robust foundation for routine epidemic monitoring.

What makes this model particularly valuable is its interpretability—each parameter has a clear epidemiological meaning, making results accessible to public health practitioners and policymakers alike. The seasonal terms tell you when to expect peaks, the covariate coefficients quantify intervention effects, and the overdispersion parameter reveals how “bursty” your disease truly is.

As we’ve seen during the COVID-19 pandemic and seasonal influenza surveillance, models that balance simplicity with statistical rigor are essential for distinguishing real signals from random noise. The Negative Binomial GLM provides exactly this balance, serving as both a diagnostic tool for current conditions and a forecasting engine for near-term planning.

Whether you’re monitoring routine seasonal patterns, evaluating the impact of a new vaccine, or establishing baseline expectations for outbreak detection, this model belongs in every Statistical Epidemics Toolbox. In the complex world of infectious disease dynamics, sometimes the most powerful insights come from models that respect the fundamental nature of the data—count by count, week by week.


📚 References

[1] Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A, 135(3), 370–384. https://doi.org/10.2307/2344614

[2] Cameron, A. C., & Trivedi, P. K. (2013). Regression Analysis of Count Data (2nd ed.). Cambridge University Press.

[3] Held, L., & Paul, M. (2012). Modeling seasonality in space-time infectious disease surveillance data. Biometrical Journal, 54(6), 824–843. https://doi.org/10.1002/bimj.201200048

[4] Chatfield, C. (2003). The Analysis of Time Series: An Introduction (6th ed.). Chapman & Hall/CRC.

[5] Jung, R. C., & Tremayne, A. R. (2011). Useful models for time series of counts or simply wrong ones? Advances in Statistical Analysis, 95(1), 59–91. https://doi.org/10.1007/s10182-010-0139-9

[6] Serfling, R. E. (1963). Methods for current statistical analysis of excess pneumonia-influenza deaths. Public Health Reports, 78(6), 494–506. https://doi.org/10.2307/4591848

[7] Min, Y., & Agresti, A. (2005). Random effect models for repeated measures of zero-inflated count data. Statistical Modelling, 5(1), 1–19. https://doi.org/10.1191/1471082X05st084oa

[8] Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14. https://doi.org/10.2307/1269547

[9] Lawson, A. B. (2018). Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology (3rd ed.). CRC Press.

[10] Höhle, M., & an der Heiden, M. (2014). Bayesian nowcasting during the STEC O104:H4 outbreak in Germany, 2011. Biometrics, 70(4), 993–1002. https://doi.org/10.1111/biom.12218