Predicting Disease Waves: The Negative Binomial GLM for Weekly Epidemic Incidence
Your Statistical Surfboard for Riding Epidemic Waves
🌊 Introduction: Why Weekly Counts Need Special Handling
Imagine you’re tracking weekly flu cases in your city. Some weeks you see 120 cases, the next week 85, then 210—what’s going on? Is this just random noise, or is a real outbreak brewing?
This is where the Generalized Linear Model (GLM) for weekly incidence with Negative Binomial distribution becomes your statistical lifeline. Unlike simple linear regression (which assumes normally distributed errors), this model respects the fundamental nature of disease count data: discrete, non-negative integers that often show more variability than expected.
The key insight? Disease counts are typically overdispersed: they vary more than a simple Poisson model, whose variance equals its mean, would predict. One week might have a superspreading event, another might catch the tail end of an outbreak, creating “extra” variability that needs proper statistical accounting.
Developed from the foundational work of Nelder and Wedderburn on GLMs [1] and refined through decades of epidemiological applications [2-3], the Negative Binomial GLM has become the workhorse model for routine epidemic surveillance, seasonal forecasting, and intervention evaluation.
🧮 Model Description: The Mathematical Engine
At its heart, the Negative Binomial GLM for weekly incidence combines three essential components: a probability distribution for the data, a linear predictor for systematic effects, and a link function connecting them.
Core Model Specification
Yₜ ~ NegBin(λₜ, θ)
log λₜ = α + β · Xₜ + A · sin(2πt / P) + C · cos(2πt / P)
Let’s unpack this elegant equation step by step:
- Yₜ: Observed number of disease cases in week t
- λₜ: Expected (mean) number of cases in week t
- θ (theta): Overdispersion parameter (θ > 0)
- α (alpha): Intercept—baseline log-incidence when all other factors are zero
- β (beta): Vector of coefficients for measured covariates
- Xₜ: Vector of explanatory variables in week t (e.g., temperature, vaccination coverage, mobility)
- A, C: Coefficients of the sine and cosine terms; together they set the amplitude and timing of the seasonal cycle
- P: Period of seasonal cycle (typically P = 52 weeks for annual seasonality)
- t: Week number (t = 1, 2, …, T)
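To make the linear predictor concrete, here is a minimal Python sketch that evaluates λₜ for a given week. The function name `expected_cases` and all numeric values are illustrative assumptions, not fitted estimates.

```python
import math

def expected_cases(t, alpha, beta, x, A, C, P=52):
    """Mean weekly count lambda_t from the log-linear predictor.

    beta and x are equal-length sequences (coefficients and covariate values).
    """
    covariate_part = sum(b * xi for b, xi in zip(beta, x))
    angle = 2 * math.pi * t / P
    log_lambda = alpha + covariate_part + A * math.sin(angle) + C * math.cos(angle)
    return math.exp(log_lambda)

# Hypothetical inputs: a baseline of 50 cases, one covariate, no seasonality.
baseline = expected_cases(t=13, alpha=math.log(50), beta=[0.3], x=[0.0], A=0.0, C=0.0)
with_covariate = expected_cases(t=13, alpha=math.log(50), beta=[0.3], x=[1.0], A=0.0, C=0.0)
```

Because the link is logarithmic, every term acts multiplicatively on the mean: raising the covariate by one unit scales the baseline by exp(0.3).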
Understanding the Negative Binomial Distribution
The Negative Binomial distribution is defined by its mean-variance relationship:
E[Yₜ] = λₜ
Var[Yₜ] = λₜ + λₜ² / θ
This is crucial! Notice that the variance has two components:
- λₜ: The “Poisson-like” component (variance equals mean)
- λₜ² / θ: The “extra” overdispersion component
When θ → ∞, the model approaches a Poisson distribution (Var[Yₜ] → λₜ). When θ is small, overdispersion is strong, accommodating the bursty nature of real epidemic data.
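The mean-variance relationship can be checked numerically. The sketch below assumes the usual mean-dispersion form of the Negative Binomial probability mass function, P(Y = y) = Γ(y + θ) / (Γ(θ) · y!) · (θ / (θ + λ))^θ · (λ / (θ + λ))^y, and sums it over a long but finite range. The helper names are hypothetical.

```python
import math

def nb_log_pmf(y, lam, theta):
    """log P(Y = y) for a Negative Binomial with mean lam and dispersion theta."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + lam))
        + y * math.log(lam / (theta + lam))
    )

def nb_moments(lam, theta, y_max=2000):
    """Mean and variance obtained by summing the pmf over 0..y_max."""
    probs = [math.exp(nb_log_pmf(y, lam, theta)) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
    return mean, var

# With lam = 20 and theta = 2 the formula gives Var = 20 + 20**2 / 2 = 220.
mean, var = nb_moments(lam=20.0, theta=2.0)
```

The truncation at `y_max` is an approximation; it is harmless here because the tail probability beyond 2000 is negligible for these values.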
Seasonal Modeling with Sine and Cosine
The seasonal terms A · sin(2πt / P) + C · cos(2πt / P) create a flexible annual cycle. This can be rewritten as:
R · sin(2πt / P + φ)
Where:
- R = √(A² + C²): Overall seasonal amplitude
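- φ = atan2(C, A): Phase shift, which sets the timing of the seasonal peak (this equals arctan(C / A) when A > 0; the two-argument form picks the correct quadrant otherwise)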
This formulation ensures smooth, periodic seasonal patterns without artificial discontinuities at year boundaries.
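As a rough sketch of how fitted sine and cosine coefficients translate into an amplitude and a peak week, the snippet below converts (A, C) into (R, φ) and solves 2πt / P + φ = π / 2 for t. The function names are illustrative assumptions.

```python
import math

def amplitude_and_phase(A, C):
    """Return (R, phi) such that A*sin(x) + C*cos(x) == R*sin(x + phi)."""
    R = math.hypot(A, C)
    phi = math.atan2(C, A)  # two-argument arctangent handles every quadrant
    return R, phi

def peak_week(A, C, P=52):
    """Week in [0, P) at which the seasonal term reaches its maximum."""
    _, phi = amplitude_and_phase(A, C)
    return ((math.pi / 2 - phi) * P / (2 * math.pi)) % P

# A pure sine term (A = 1, C = 0) peaks a quarter of the way through the cycle.
R, phi = amplitude_and_phase(1.0, 0.0)
quarter_cycle_peak = peak_week(1.0, 0.0)
```

On the incidence scale, the peak-to-trough ratio of the seasonal component is exp(2R), since the seasonal term swings between −R and +R on the log scale.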
📊 Key Parameter Definitions and Typical Values
Understanding these parameters helps you interpret model outputs and assess model fit.
| Parameter | Meaning | Typical values | Interpretation |
|---|---|---|---|
| α | Baseline log-incidence | -5 to 2 | Higher α = higher overall disease burden |
| β | Covariate effects | -1 to 1 | β = 0.3 means about 35% higher incidence per unit X (exp(0.3) ≈ 1.35) |
| A, C | Seasonal amplitudes | -2 to 2 | Larger magnitude = stronger seasonal swing |
| θ | Overdispersion | 0.1 – 10 | Smaller θ = more overdispersion |
| P | Seasonal period | 52 (annual) | Fixed for most infectious diseases |
| T | Total weeks | 52 – 520 | Longer series = better seasonal estimation |
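Coefficients on the log scale are easiest to read as rate ratios. The following small sketch (helper names are hypothetical) converts a coefficient β into the multiplicative change in expected incidence, exp(β · Δx), and the corresponding percentage change.

```python
import math

def rate_ratio(beta, delta_x=1.0):
    """Multiplicative change in expected incidence for a delta_x change in X."""
    return math.exp(beta * delta_x)

def percent_change(beta, delta_x=1.0):
    """The same effect expressed as a percentage change."""
    return 100.0 * (rate_ratio(beta, delta_x) - 1.0)

increase = percent_change(0.3)    # roughly a 35% increase
decrease = percent_change(-0.3)   # roughly a 26% decrease, not 35%
```

Note the asymmetry: coefficients of +0.3 and −0.3 are mirror images on the log scale but not on the percentage scale.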
Variance Interpretation Examples
- If λₜ = 100 and θ = 10: Var[Yₜ] = 100 + 100²/10 = 1,100 (SD ≈ 33)
- If λₜ = 100 and θ = 100: Var[Yₜ] = 100 + 100²/100 = 200 (SD ≈ 14)
- Poisson case (θ → ∞): Var[Yₜ] = 100 (SD = 10)
Real epidemic data typically shows SD much larger than √λₜ, justifying the Negative Binomial approach.
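The arithmetic above, plus a quick way to gauge overdispersion in a series, can be sketched as follows. The method-of-moments estimate θ̂ = m² / (s² − m) comes from solving Var = λ + λ² / θ for θ, and it assumes a roughly constant mean over the weeks used, which is a simplification. The sample counts are invented for illustration only.

```python
import math
import statistics

def nb_variance(lam, theta):
    """Negative Binomial variance: lam + lam**2 / theta."""
    return lam + lam ** 2 / theta

def moment_theta(counts):
    """Method-of-moments estimate of theta, assuming a constant mean.

    Returns math.inf when the sample variance does not exceed the mean,
    i.e. when the data show no overdispersion relative to Poisson.
    """
    m = statistics.mean(counts)
    s2 = statistics.variance(counts)
    if s2 <= m:
        return math.inf
    return m ** 2 / (s2 - m)

sd_theta_10 = math.sqrt(nb_variance(100, 10))    # about 33
sd_theta_100 = math.sqrt(nb_variance(100, 100))  # about 14

weekly = [120, 85, 210, 95, 160, 70, 140]  # hypothetical weekly counts
theta_hat = moment_theta(weekly)
```

In a real analysis θ would be estimated jointly with the regression coefficients; the moment estimate is only a first look.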
⚠️ Assumptions and Applicability: When This Model Shines
The Negative Binomial GLM is powerful but works best under specific conditions.
✅ Ideal Applications
- Weekly or monthly aggregated data: Natural time scale for surveillance
- Moderate to high case counts: Yₜ typically > 5 on average
- Clear seasonal patterns: Diseases like influenza, RSV, or enteroviruses
- Available covariates: Weather, interventions, or behavioral data
- Stationary seasonality: Seasonal pattern doesn’t change dramatically over time
❌ Limitations and Challenges
- Very rare diseases: When Yₜ = 0 frequently, models become unstable
- Rapidly changing dynamics: During emerging outbreaks with exponential growth
- Non-stationary seasonality: Climate change altering seasonal patterns
- Autocorrelation: Current week’s cases depend on previous weeks (requires time series extensions)
💡 Pro Tip: Always check residual plots—if you see patterns in residuals over time, you might need to add autoregressive terms or use a more complex time series model [4].
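One simple residual check, sketched under the assumption that fitted means and an estimate of θ are already available, is to compute Pearson residuals, (y − λ) / √(λ + λ² / θ), and then their lag-1 autocorrelation. The function names are illustrative.

```python
import math

def pearson_residuals(observed, fitted, theta):
    """Residuals scaled by the Negative Binomial standard deviation."""
    return [
        (y - lam) / math.sqrt(lam + lam ** 2 / theta)
        for y, lam in zip(observed, fitted)
    ]

def lag1_autocorrelation(values):
    """Sample autocorrelation at lag 1 (0.0 if the values are constant)."""
    n = len(values)
    mean = sum(values) / n
    denom = sum((v - mean) ** 2 for v in values)
    if denom == 0:
        return 0.0
    num = sum((values[i] - mean) * (values[i - 1] - mean) for i in range(1, n))
    return num / denom

# Residuals that drift steadily upward give a clearly positive lag-1 value.
drifting = lag1_autocorrelation([1, 2, 3, 4, 5])
alternating = lag1_autocorrelation([1, -1, 1, -1])
```

A lag-1 value well away from zero suggests that week-to-week dependence remains unexplained.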
🚀 Model Extensions and Variants: Beyond Basic Seasonality
The basic Negative Binomial GLM serves as a foundation for numerous sophisticated extensions.
1. Autoregressive Negative Binomial Model
To account for week-to-week dependence (common in epidemics):
log λₜ = α + β · Xₜ + A · sin(2πt / P) + C · cos(2πt / P) + γ · log(Yₜ₋₁ + c)
Where γ captures the autoregressive effect and c is a small constant to handle zero counts [5].
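In practice the autoregressive term is just one more column in the design matrix. A minimal sketch of building it follows; the choice c = 1 is an assumption made for illustration, and the helper names are hypothetical.

```python
import math

def lagged_log_counts(counts, c=1.0):
    """log(Y_{t-1} + c) for weeks 2..T; week 1 has no predecessor and is dropped."""
    return [math.log(y_prev + c) for y_prev in counts[:-1]]

def ar_log_mean(base_log_lambda, y_prev, gamma, c=1.0):
    """Add the autoregressive contribution to the rest of the linear predictor."""
    return base_log_lambda + gamma * math.log(y_prev + c)

counts = [0, 9, 99]                   # hypothetical weekly counts
lag_feature = lagged_log_counts(counts)  # aligned with counts[1:]
```

Because the first week has no lagged value, the model is fitted to weeks 2 through T when this term is included.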
2. Piecewise Linear Trend Model
For diseases with changing long-term trends (e.g., due to vaccination programs):
log λₜ = α + β · Xₜ + δ · (t − τ)₊ + seasonal terms
Where (t − τ)₊ = max(0, t − τ) and τ is a changepoint (e.g., vaccine introduction) [6].
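The hinge term is simple to construct. In this sketch the changepoint and slope are made-up values, and `hinge` and `trend_contribution` are hypothetical helper names.

```python
def hinge(t, tau):
    """(t - tau)_+ : zero before the changepoint, linear afterwards."""
    return max(0, t - tau)

def trend_contribution(t, tau, delta):
    """Contribution of the piecewise linear trend to log(lambda_t)."""
    return delta * hinge(t, tau)

# Hypothetical: changepoint at week 10, log-incidence falling by 0.01 per week after it.
before = trend_contribution(t=5, tau=10, delta=-0.01)
after = trend_contribution(t=20, tau=10, delta=-0.01)
```

Before week τ the term contributes nothing, so δ measures the change in slope of log-incidence after the changepoint.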
3. Multivariate Negative Binomial Model
For multiple related diseases or age groups simultaneously:
Yₜ,ₖ ~ NegBin(λₜ,ₖ, θₖ)
log λₜ,ₖ = αₖ + βₖ · Xₜ + seasonalₖ + εₜ,ₖ
With Cov(εₜ,ₖ, εₜ,ₖ’) capturing cross-disease correlations [7].
4. Zero-Inflated Negative Binomial Model
For diseases with excess zeros (many weeks with no cases):
Yₜ ~ (1−πₜ) · δ₀ + πₜ · NegBin(λₜ, θ)
logit(πₜ) = α₀ + β₀ · Zₜ
This separates the excess zeros (which occur with probability 1 − πₜ) from the Negative Binomial count process, which can itself still produce zeros [8].
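A minimal sketch of the mixture's probability mass function, following the parameterization in the text where πₜ is the weight on the Negative Binomial component, might look like this (function names are illustrative):

```python
import math

def nb_pmf(y, lam, theta):
    """Negative Binomial pmf with mean lam and dispersion theta."""
    log_p = (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + lam))
        + y * math.log(lam / (theta + lam))
    )
    return math.exp(log_p)

def zinb_pmf(y, lam, theta, pi):
    """Zero-inflated NB: point mass at 0 with weight (1 - pi), NB with weight pi."""
    point_mass = (1.0 - pi) if y == 0 else 0.0
    return point_mass + pi * nb_pmf(y, lam, theta)

# Hypothetical values: the zero probability combines both sources of zeros.
p_zero = zinb_pmf(0, lam=5.0, theta=2.0, pi=0.7)
```

Be aware that many software packages instead model the probability of the extra-zero component, so the sign of the logit coefficients may be flipped relative to the equation above.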
5. Hierarchical Negative Binomial Model
For multi-region surveillance (e.g., all US states):
log λₜ,ᵣ = αᵣ + βᵣ · Xₜ,ᵣ + seasonalᵣ + uᵣ
uᵣ ~ Normal(0, σ²ᵤ)
Where regional parameters αᵣ, βᵣ are drawn from common distributions, enabling information sharing [9].
6. Real-Time Nowcasting Extension
To handle reporting delays in recent weeks:
Yₜᵒᵇˢ = ∑ (d = 0 to D) Yₜ₋d · p(d)
Yₜ ~ NegBin(λₜ, θ)
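Where p(d) is the reporting delay distribution (the probability that a case is reported d weeks after it occurs), D is the maximum delay considered, and Yₜᵒᵇˢ are the observed (delayed) counts [10].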
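The delay convolution can be sketched in a few lines. This illustrates only the forward calculation of the observed series from known true counts and a known delay distribution (both hypothetical here); it is not the full nowcasting inference, which works in the opposite direction.

```python
def expected_observed(true_counts, delay_pmf):
    """Sum over d of Y_{t-d} * p(d) for each week t.

    Weeks before the start of the series contribute nothing.
    """
    observed = []
    for t in range(len(true_counts)):
        total = 0.0
        for d, p in enumerate(delay_pmf):
            if t - d >= 0:
                total += true_counts[t - d] * p
        observed.append(total)
    return observed

# Hypothetical: 100 cases occur in week 0; half are reported at once,
# 30% one week late and 20% two weeks late.
reports = expected_observed([100, 0, 0], [0.5, 0.3, 0.2])
```

The example shows why the most recent weeks look artificially low in real time: part of their cases has not been reported yet.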
🎯 Conclusion: The Workhorse of Epidemic Surveillance
The Negative Binomial GLM for weekly incidence represents statistical epidemiology at its most practical and powerful. By respecting the discrete, overdispersed nature of disease count data while incorporating seasonal patterns and covariate effects, it provides a robust foundation for routine epidemic monitoring.
What makes this model particularly valuable is its interpretability—each parameter has a clear epidemiological meaning, making results accessible to public health practitioners and policymakers alike. The seasonal terms tell you when to expect peaks, the covariate coefficients quantify intervention effects, and the overdispersion parameter reveals how “bursty” your disease truly is.
As we’ve seen during the COVID-19 pandemic and seasonal influenza surveillance, models that balance simplicity with statistical rigor are essential for distinguishing real signals from random noise. The Negative Binomial GLM provides exactly this balance, serving as both a diagnostic tool for current conditions and a forecasting engine for near-term planning.
Whether you’re monitoring routine seasonal patterns, evaluating the impact of a new vaccine, or establishing baseline expectations for outbreak detection, this model belongs in every Statistical Epidemics Toolbox. In the complex world of infectious disease dynamics, sometimes the most powerful insights come from models that respect the fundamental nature of the data—count by count, week by week.
📚 References
[1] Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society: Series A, 135(3), 370–384. https://doi.org/10.2307/2344614
[2] Cameron, A. C., & Trivedi, P. K. (2013). Regression Analysis of Count Data (2nd ed.). Cambridge University Press.
[3] Held, L., & Paul, M. (2012). Modeling seasonality in space-time infectious disease surveillance data. Biometrical Journal, 54(6), 824–843. https://doi.org/10.1002/bimj.201200048
[4] Chatfield, C. (2003). The Analysis of Time Series: An Introduction (6th ed.). Chapman & Hall/CRC.
[5] Jung, R. C., & Tremayne, A. R. (2011). Useful models for time series of counts or simply wrong ones? AStA Advances in Statistical Analysis, 95(1), 59–91. https://doi.org/10.1007/s10182-010-0139-9
[6] Serfling, R. E. (1963). Methods for current statistical analysis of excess pneumonia-influenza deaths. Public Health Reports, 78(6), 494–506. https://doi.org/10.2307/4591848
[7] Min, Y., & Agresti, A. (2005). Random effect models for repeated measures of zero-inflated count data. Statistical Modelling, 5(1), 1–19. https://doi.org/10.1191/1471082X05st084oa
[8] Lambert, D. (1992). Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34(1), 1–14. https://doi.org/10.2307/1269547
[9] Lawson, A. B. (2018). Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology (3rd ed.). CRC Press.
[10] Höhle, M., & an der Heiden, M. (2014). Bayesian nowcasting during the STEC O104:H4 outbreak in Germany, 2011. Biometrics, 70(4), 993–1002. https://doi.org/10.1111/biom.12218