Lecture 05

Bayesian Model Fit and Comparison

Jihong Zhang

Educational Statistics and Research Methods

Today’s Lecture Objectives

  1. Bayesian methods for determining how well a model fits the data (absolute fit)

  2. Bayesian methods for determining which model fits better (relative model fit)

  3. In previous class…

    1. We estimated the empty model and the full model (you can also try other constrained models between the empty and full models)

      • Empty model: a model without any covariates (predictors)

      • Full model: a model with all possible covariates, up to the highest-order interaction effects

    2. We also made the Stan code more efficient by vectorizing the parameters and the data

    3. The next question is how to determine whether the model is “good” enough

      • “good” has multiple meanings

      • a typical criterion is the degree to which the model can generate “simulated” data that are similar to the original data

      • or, the likelihood of the original data given the model parameters

Types of Model fit

  • Absolute model fit

    • PPMC

    • In SEM, RMSEA, chi-square, SRMR, and GFI

  • Relative model fit

    • Information criteria (e.g., DIC, WAIC, LOOIC)

  • Incremental model fit (not frequently used outside of SEM)

    • A special type of model fit: how much better a model fits relative to a baseline (null) model

    • In SEM, the comparative fit index (CFI) and the Tucker-Lewis index (TLI)

Absolute Model Fit: PPMC

Posterior predictive model checking (Gelman, Meng, and Stern 1996) is a Bayesian model evaluation technique for determining whether a model fits the data.

  • Absolute model fit: “Does my model fit my data well?”

    • Recall that “a model is a simplified version of the true data-generating process”

    • Thus, the model should be able to reproduce data that are similar to the observed data

    • In machine learning, this is similar to “validation,” which is typically performed on a separate validation data set

  • Overall idea: if a model fits the data well, then simulated data based on the model will resemble the observed data

Ingredients in PPMC:

  • Original data

    • Typically characterized by some set of statistics (e.g., sample mean, standard deviation, covariance) applied to the data

  • Simulated data

    • Generated from some or all posterior draws in the Markov chain

    • Summarized by the same set of statistics

PPMC Example: Linear Models

The linear model from our example was:

\[ \begin{aligned} \text{WeightLB}_p = & \ \beta_0 + \beta_1 \text{HeightIN}_p + \beta_2 \text{Group2}_p + \beta_3\text{Group3}_p \\ & + \beta_4 \text{HeightIN}_p\text{Group2}_p + \beta_5 \text{HeightIN}_p\text{Group3}_p + e_p \end{aligned} \]

with:

  • \(\text{Group2}_p\) the binary indicator of person \(p\) being in group 2
  • \(\text{Group3}_p\) the binary indicator of person \(p\) being in group 3
  • \(e_p \sim N(0, \sigma_e)\)

fit_full_new$summary(variables = c('beta', 'sigma'))

PPMC Process

The PPMC Process can be summarized as follows:

  1. Select parameters from a single (sampling) iteration of the Markov chain
  2. Use the selected parameters and the model to simulate a data set of the same size (same number of observations and variables)
  3. From the simulated data set, calculate the selected summary statistics (e.g., mean, SD)
  4. Repeat steps 1-3 for a fixed number of iterations (perhaps across the whole chain)
  5. When done, compare the position of the observed summary statistics to the distribution of summary statistics from the simulated data sets (the posterior predictive distribution)

Step 1: Assemble the posterior draws

For our linear model, let’s denote our observed dependent variable as \(Y\)

  • Note that the independent variables are not modeled (they are not given a distribution by the statistical model), so we cannot examine them.

First, let’s assemble the posterior draws (1,000 iterations \(\times\) 4 chains \(\times\) 7 parameters, stored as a \(4000 \times 7\) matrix):

posteriorSample = fit_full_new$draws(variables = c('beta','sigma'), format = 'draws_matrix')
posteriorSample

Step 2: One draw from posterior

Next, let’s draw one set of posterior parameter values at random (with replacement):

set.seed(1234)
sampleIteration = sample(x = 1:nrow(posteriorSample), size = 1, replace = TRUE)
sampleIteration
posteriorSample[sampleIteration,]

Those draws of \(\boldsymbol{\beta}\) and \(\sigma\) can be used to generate a predicted \(\hat{y}\):

\[ \hat{y}_i \sim N(\mathbf{X}\boldsymbol{\beta}_i, \sigma_i) \]

  • \(i\) represents the \(i\)th posterior draw

Step 3: One simulated data

We then generate data based on this sampled iteration and our model’s distributional assumptions:

# Sample Size
N = nrow(dat)
# beta
betaVector = matrix(posteriorSample[sampleIteration, 1:6], ncol = 1)
betaVector
# sigma
sigma = posteriorSample[sampleIteration, 7]
# X
FullModelFormula = as.formula("WeightLB ~ HeightIN60 + DietGroup + HeightIN60*DietGroup")
X = model.matrix(FullModelFormula, data = dat)

simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma)
head(simY)

Step 4: Summary Statistics of Simulated Data

Note that we do not want to directly compare simulated data to the observed data.

Instead, we extract some characteristics of the simulated and observed data for comparison using summary statistics.

This has some advantages:

  1. We may be interested in only some characteristics of the data (e.g., whether our model predicts the mean of the dependent variable)
  2. We can quality-check (QC) specific aspects of the model-fitting process

In this case, for example, we may be interested in whether the model captures the “location” or “scale” of WeightLB

mean(dat$WeightLB)
(simMean = mean(simY))
sd(dat$WeightLB)
(simSD = sd(simY))

Step 5: Looping across all posterior samples

We can repeat steps 2-4 for a set number of samples

In practice, we can optionally use all of the iterations in our Markov chains (\(I = 4000\))

I = nrow(posteriorSample)
## create empty simSD and simMean as placeholders
simSD = simMean = rep(NA, I)
for (i in 1:I) {
  # beta
  betaVector = matrix(posteriorSample[i, 1:6], ncol = 1)
  # sigma
  sigma = posteriorSample[i, 7]
  # X
  simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma)
  simMean[i] = mean(simY)
  simSD[i] = sd(simY)
}
par(mfrow = c(1, 2))
hist(simMean)
hist(simSD)

Compare to the observed mean

We can now compare our observed mean and standard deviation with those of the simulated values.

  • Blue line: the average of the predicted (posterior predictive) means of WeightLB

  • Red line: the observed mean of WeightLB

  • The distribution of these predicted summary statistics of WeightLB is called the posterior predictive distribution

PPMC can be checked using visual inspection:

library(ggplot2)
ppp <- data.frame(
  simMean = simMean, 
  simSD = simSD
)
ggplot(ppp) +
  geom_histogram(aes(x = simMean), fill = "grey", col = "black") +
  geom_vline(xintercept = mean(dat$WeightLB), col = "red", size = 1.4) + # red line: observed mean of WeightLB
  geom_vline(xintercept = mean(simMean), col = "blue", size = 1.4, alpha = 0.5) # blue line: mean of the posterior predictive means

Compare to the observed SD

Similarly, let’s compare the observed SD to the posterior predictive distribution of the SD of WeightLB

  • the observed SD is located near the center of the posterior predictive distribution (PPD)

  • the mean of the PPD is slightly higher than the observed SD

ggplot(ppp) +
  geom_histogram(aes(x = simSD), fill = "grey", col = "black") +
  geom_vline(xintercept = sd(dat$WeightLB), col = "red", size = 1.4) + # red line: observed SD of WeightLB
  geom_vline(xintercept = mean(simSD), col = "blue", size = 1.4, alpha = 0.5) # blue line: mean of the posterior predictive SDs

PPMC Characteristics

PPMC methods are very useful

  • They provide a visual way to determine whether the model fits the observed data
  • They are the main method of assessing absolute fit in Bayesian models
  • Absolute fit assesses whether a model fits the data, rather than comparing it to another model

But, there are some drawbacks to PPMC methods

  • Almost any statistic can be used
    • Some are better than others (the mean and SD of the outcome are good choices for linear regression)
  • There is no standard for determining how much misfit is too much
  • May be computationally demanding, depending on your model

Question: Can PPMC be used for models with maximum likelihood estimation or ordinary least squares?

Posterior Predictive P-values

We can summarize PPMC using a type of “p-value”

Personally, I don’t like the name “p-value”; it sounds like we are trying to justify our results using significance testing

It is different from the frequentist “p-value” (the probability of observing data at least as extreme as the observed data, if the null hypothesis is true)

  • The PPP-value: the proportion of times the statistic from the simulated data exceeds that of the observed data

  • Useful to determine how far off a statistic is from its posterior predictive distribution

If these p-values are:

  1. near 0 or 1: your model is far off from your data
  2. near .5: your model fits your data well, in terms of the statistics you examined

The PPP-value for mean:

mean(simMean > mean(dat$WeightLB))

The PPP-value for SD:

mean(simSD > sd(dat$WeightLB))

Compute PPP-values within Stan

We can use the generated quantities block of Stan to compute PPP-values for us:

generated quantities{
  // simulated data
  array[N] real weightLB_rep = normal_rng(X*beta, sigma);
  // posterior predictive distribution for mean and SD
  real mean_weightLB = mean(weightLB);
  real sd_weightLB = sd(weightLB);
  real mean_weightLB_rep = mean(to_vector(weightLB_rep));
  real<lower=0> sd_weightLB_rep = sd(to_vector(weightLB_rep));
  // ppp-values for mean and sd
  int<lower=0, upper=1> ppp_mean = (mean_weightLB_rep > mean_weightLB);
  int<lower=0, upper=1> ppp_sd = (sd_weightLB_rep > sd_weightLB);
}

It will give us:

fit_full_ppp$summary(variables = c('mean_weightLB_rep', 'sd_weightLB_rep', 'ppp_mean', 'ppp_sd'))

Advantages or disadvantages of Computing PPP within Stan

Pros:

  1. Stan’s built-in random number generators (e.g., normal_rng()) generate the simulated data, making PPP-values much faster to compute
  2. Nice visual inspection tools exist, e.g., the bayesplot package

Cons:

  1. It is harder to debug each step of PPMC if something goes wrong
  2. Changing the statistics requires re-running the whole MCMC sampling, which is time-consuming

bayesplot::mcmc_dens_chains(fit_full_ppp$draws('mean_weightLB_rep'))

Relative Model Fit

Relative model fit: used to compare two or more competing models in terms of their model fit. Sometimes this is also called model selection.

  • In non-Bayesian models, Information Criteria are often used to make comparisons

    • AIC, BIC, DIC etc.

    • Typically, an IC is a function of the log-likelihood and a penalty term

    • The model with the lowest IC is the model that fits best

  • Bayesian model fit is similar

    • Uses an index value

    • The model with the lowest index is the model that fits best

  • Recent advances in Bayesian model fit use indices that are tied to cross-validation predictions (inspired by machine learning):

    • Fit the model leaving one observation out (LOO)

    • Calculate statistics related to prediction (for instance, the log-likelihood of that observation conditional on the model parameters)

    • Do this for all observations

  • New Bayesian indices try to mirror these leave-one-out predictions (but approximate them to save computation time); a conceptual R sketch of exact LOO-CV follows below
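
To make the idea concrete, here is a conceptual (deliberately non-Bayesian) R sketch of exact LOO-CV for our linear model, assuming the dat object from earlier slides; it uses lm() plug-in estimates purely to illustrate the quantity that the Bayesian indices approximate:

# Exact LOO-CV, conceptually: refit the model N times, each time leaving out
# one observation, and score that observation's predictive log-density
loo_loglik <- rep(NA, nrow(dat))
for (p in seq_len(nrow(dat))) {
  fit_minus_p <- lm(WeightLB ~ HeightIN60 * DietGroup, data = dat[-p, ])
  mu_p    <- predict(fit_minus_p, newdata = dat[p, ])
  sigma_p <- summary(fit_minus_p)$sigma
  loo_loglik[p] <- dnorm(dat$WeightLB[p], mean = mu_p, sd = sigma_p, log = TRUE)
}
sum(loo_loglik) # analogous in spirit to elpd_loo (plug-in estimates, not full posteriors)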

Deviance Information Indices

In the late 1990s and early 2000s, the Deviance Information Criterion (DIC) was popular for relative Bayesian model fit comparisons. It has since been shown to be less effective than LOO or WAIC, but let’s have a look:

\[ \text{DIC} = p_D + \overline{D(\theta)} \]

where \(p_D\) is the estimated (effective) number of parameters, computed as follows:

\[p_D = \overline{D(\theta)} - D(\bar\theta)\]

and where

\[ D(\theta) = -2 \log(p(y \mid \theta)) + C \]

\(C\) is a constant that cancels out when model comparisons are made

Here,

  • \(\overline{D(\theta)}\) is the deviance (\(-2\) times the log-likelihood of the data \(y\) given the parameters \(\theta\)) averaged across all posterior samples

  • \(D(\bar\theta)\) is the deviance computed at the average of the parameters \(\bar\theta\) across all samples (both quantities are sketched in R below)
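
As a minimal R sketch of these two quantities (hypothetical objects: loglik_draws is a draws-by-observations matrix of pointwise log-likelihoods, and D_at_thetabar is the deviance evaluated at a posterior point estimate):

Dbar <- mean(-2 * rowSums(loglik_draws)) # average deviance across posterior draws
p_D  <- Dbar - D_at_thetabar             # estimated (effective) number of parameters
DIC  <- Dbar + p_D                       # equivalently, 2 * Dbar - D_at_thetabar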

DIC in Stan

Some programs, such as JAGS, provide the DIC directly, but Stan does not provide a direct way of calculating it.

We can manually calculate an approximate DIC by replacing \(\bar\theta\) with the posterior median \(\dot\theta\), provided the posterior distributions are approximately normal.

DIC for Full Model

# extract log-likelihood and parameters
posterior_draws <- fit_full_ppp$draws(c('log_lik', 'beta'), format = 'draws_matrix')
log_like_draws <- rowSums(posterior_draws[,colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")]) 
# find the draws whose parameter values equal their (rounded) posterior medians
theta_draws <- posterior_draws[, colnames(posterior_draws) %in% paste0("beta[", 1:6, "]")]
theta_ave_iteration = apply(round(theta_draws, 3), 2, \(x) which(x == median(x))) |> unlist() |> as.numeric()
log_like_draws_meantheta = (-2)*mean(log_like_draws[theta_ave_iteration])
mean_log_like_draws = (-2)*mean(log_like_draws)
# compute DIC
DIC_full = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_full

DIC for Empty Model

# extract log-likelihood and parameters
posterior_draws <- fit_empty_ppp$draws(c('log_lik', 'beta0'), format = 'draws_matrix')
log_like_draws <- rowSums(posterior_draws[,colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")]) 
# find the draws whose parameter value equals its (rounded) posterior median
theta_draws <- posterior_draws[, colnames(posterior_draws) == "beta0"]
theta_ave_iteration = which(round(theta_draws,2) == median(round(theta_draws,2)))
log_like_draws_meantheta = (-2)*mean(log_like_draws[theta_ave_iteration])
mean_log_like_draws = (-2)*mean(log_like_draws)
# compute DIC
DIC_empty = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_empty

Issues with DIC

The DIC has fallen out of favor recently (it is not used in Stan)

  • It has issues when parameters are discrete

  • It is not fully Bayesian (it relies on a point estimate, the average of the parameter values)

  • It can give negative values for the estimated number of parameters in a model

The WAIC (widely applicable or Watanabe-Akaike information criterion; Watanabe, 2010) corrects some of the problems with DIC:

  • Fully Bayesian (uses entire posterior distribution)

  • Asymptotically equal to Bayesian cross-validation

  • Invariant to parameterization

Watanabe-Akaike information criterion (WAIC)

  • A more frequently used model comparison index for Bayesian analysis

  • Using the loo R package, we can calculate the WAIC with the waic() function:

WAIC for the full model

loo::waic(fit_full_ppp$draws('log_lik'))

WAIC for the empty model

loo::waic(fit_empty_ppp$draws('log_lik'))

  • Here:

    • elpd_waic is the expected log pointwise predictive density for WAIC

    • p_waic is the WAIC estimate of the effective number of parameters, similar to \(p_D\) in DIC; it penalizes the likelihood for additional parameters

    • waic is the WAIC index used for model comparisons (the lowest value indicates the best fit; waic = -2*elpd_waic); a sketch of how these quantities are computed appears below

  • Note that WAIC needs a log_lik variable in the model (typically computed in the generated quantities block) to be calculated correctly.
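
As a minimal sketch of what waic() essentially computes from a draws-by-observations log-likelihood matrix (here called loglik_draws, a hypothetical name):

lppd      <- sum(log(colMeans(exp(loglik_draws)))) # log pointwise predictive density
p_waic    <- sum(apply(loglik_draws, 2, var))      # penalty: effective number of parameters
elpd_waic <- lppd - p_waic
waic      <- -2 * elpd_waic                        # lower is better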

LOO: Approximation to Leave-one-out

Big picture:

  1. Besides WAIC, another comparative fit index is LOO via Pareto smoothed importance sampling (PSIS), available through the loo R package

  2. WAIC/LOO can be used for model comparison, with the lowest value suggesting the best model fit

  3. Unlike DIC, LOO via PSIS attempts to approximate the process of leave-one-out cross-validation (LOO-CV) using a sampling-based approach

    • Gives a finite-sample approximation
    • Implemented for Stan models
    • Can quickly compare models
    • Gives warnings when it may be less reliable to use

  4. The details of the computation of LOO are very technical, but they are nicely compiled in Vehtari, Gelman, and Gabry (2017).

LOO: Leave-one-out in Stan

Using the loo package, we can calculate an efficient approximation to leave-one-out cross-validation (LOO-CV)

Full model’s LOO:

full_loo_res <- fit_full_ppp$loo('log_lik', save_psis = TRUE)
full_loo_res$estimates

Empty model’s LOO:

empty_loo_res <- fit_empty_ppp$loo('log_lik', save_psis = TRUE)
empty_loo_res$estimates

Here:

  • elpd_loo is the expected log pointwise predictive density for LOO (recall that the posterior predictive distribution has some uncertainty around its mean value…)

  • p_loo is the LOO estimate of the effective number of model parameters (a penalty to the likelihood for additional parameters)

  • looic is the LOO index used for model comparisons (the lowest value suggests the best fit; looic = -2*elpd_loo)

LOO: Comparing Models with LOO

Alternatively, you can use the built-in function loo_compare() from the loo package to compare alternative models:

loo::loo_compare(list(empty = fit_empty_ppp$loo(), full = fit_full_ppp$loo()))

This function calculates the standard error of the difference in elpd (expected log pointwise predictive density) between models

  • The SE gives an indication of the uncertainty in the estimated difference (relative to its size)
  • We can use this to downweight the choice of models when the standard error is high (open question: how high is high?); one informal check, sketched in R after this list, is whether
    • \[ 0 \in \text{elpd\_diff}\ \pm\ 1.96 \times \text{se\_diff} \]
  • Note: elpd_diff is looic divided by -2 (it is on the log-likelihood scale, not the deviance scale)
    • Here, we interpret the results as: the full model is preferred to the empty model.
    • The approximate “confidence interval” for elpd_diff does not include 0, indicating we can be fairly certain of this result
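
A minimal sketch of that interval check, assuming the full_loo_res and empty_loo_res objects created on the previous slides:

comp <- loo::loo_compare(list(empty = empty_loo_res, full = full_loo_res))
comp                                                       # row 2 holds the lower-ranked model
comp[2, "elpd_diff"] + c(-1.96, 1.96) * comp[2, "se_diff"] # approximate 95% interval for the elpd difference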

LOO: Pareto smoothed importance sampling (PSIS)

  • PSIS estimates a Pareto shape parameter \(k\) for each observation, which represents the reliability of the importance sampling:

    • \(k < .5\) (good): the estimate converges quickly

    • \(.5 < k < .7\) (ok): the estimate converges slowly

    • \(.7 < k < 1\) (bad): poor performance

    • \(k > 1\) (very bad): the estimate is unreliable

  • PSIS flags all cases with bad diagnostic values. The percentage of such cases provides some information about the reliability of the Bayesian estimation.

  • Criterion: the estimated tail shape parameter \(k\)

  • This is a new area; please see Vehtari, Gelman, and Gabry (2017) for more references

pareto_k_table(full_loo_res)
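
For a closer look at individual observations, the per-observation \(k\) values can also be extracted directly (a minimal sketch, assuming the full_loo_res object from above):

pareto_k <- loo::pareto_k_values(full_loo_res) # one k per observation
summary(pareto_k)
which(pareto_k > 0.7)                          # observations, if any, with unreliable PSIS estimates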

Compile the relative model fit indices

The table below shows all three model comparison fit indices for the empty model and the full model.
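
The scalar objects in the table were not created explicitly on the earlier slides; one minimal sketch of how they could be extracted, assuming the fit objects and loo results from above:

WAIC_full  <- loo::waic(fit_full_ppp$draws('log_lik'))$estimates["waic", "Estimate"]
WAIC_empty <- loo::waic(fit_empty_ppp$draws('log_lik'))$estimates["waic", "Estimate"]
LOO_full   <- full_loo_res$estimates["looic", "Estimate"]
LOO_empty  <- empty_loo_res$estimates["looic", "Estimate"]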

data.frame(
  Model = c("Full Model", "Empty Model"),
  DIC = c(DIC_full, DIC_empty),
  WAIC = c(WAIC_full, WAIC_empty),
  LOOIC = c(LOO_full, LOO_empty)
)

  • All fit indices suggest the full model is better than the empty model

  • For simple models, the results of DIC/WAIC/LOOIC should be consistent

  • In situations where they are inconsistent, prefer WAIC/LOO as the standard

General points about Bayesian Model Comparison

  • Note that WAIC and LOO converge as the sample size increases (WAIC is asymptotically equal to LOO)
  • Latent variable models present challenges (this could be your dissertation project)
    • Need the log-likelihood with the latent variable integrated out (see the sketch after this list)
  • Missing data present challenges
    • Need the log-likelihood with the missing data integrated out
  • Generally, using LOO is recommended (but reporting both is appropriate)
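
For example, for a latent variable model, the pointwise likelihood used by WAIC/LOO should marginalize over person \(p\)’s latent variable \(\eta_p\) (a sketch of the general form):

\[ p(y_p \mid \theta) = \int p(y_p \mid \eta_p, \theta)\, p(\eta_p \mid \theta)\, d\eta_p \]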

Cutting-Edge Research Field: Approximation Algorithm

  • MCMC sampling can be very computationally time-consuming.

  • With thousands or even millions of parameters, estimation can become a real challenge.

  • Thus, approximation algorithms have been developed, such as the Laplace approximation, variational inference (Dhaka et al. 2020; Yao et al. 2018), and the Pathfinder method (Zhang et al. 2021).

Quoted from Gelman’s blog:

Advantages of Laplace:

  • Relatively cheap to compute, given that we already have a mode-finder in Stan.
  • Easy to understand, use, and communicate.
  • Works reasonably well in many examples (wherever the posterior can be well approximated by a normal distribution).
  • Easy to take draws from the normal approximation, also easy to compute importance ratios and use Pareto-smoothed importance sampling.

Limitations of Laplace:

  • Sometimes the normal approx is pretty bad (funnels, multimodal distributions, long tails).

  • Sometimes the joint mode is useless or does not even exist (funnels, etc.), in which case the model itself would need to be altered in some way to get a stable mode.

Code
# Stan's LBFGS algorithm
fit_full_optim <- mod_full_ppp$optimize(data = data_full_new, seed = 1234, jacobian = TRUE)
fit_full_laplace <- mod_full_ppp$laplace(data = data_full_new, mode = fit_full_optim, draws = 4000)

# Run 'variational' method to use ADVI to approximate posterior
fit_full_vb <- mod_full_ppp$variational(data = data_full_new, seed = 1234, draws = 4000)

# Run 'pathfinder' method, a new alternative to the variational method
fit_pf <- mod_full_ppp$pathfinder(data = data_full_new, seed = 1234, draws = 4000)

Approximation Algorithm vs. Full MCMC

Though they converge much more quickly, the accuracy of these approximation algorithms largely depends on the complexity of the model (number of parameters, prior distributions, latent variables, etc.).

Code
summ_list <- lapply(list(fit_full_ppp, fit_full_laplace, fit_full_vb, fit_pf), \(x) 
                    x$summary(c("beta", "sigma"))[c('variable', 'mean',"q5", "q95")])
summ_list[[1]]$Algorithm = "MCMC"
summ_list[[2]]$Algorithm = "Laplace Approx."
summ_list[[3]]$Algorithm = "Variational Inference"
summ_list[[4]]$Algorithm = "Pathfinder"
summ_forplot <- Reduce(rbind, summ_list)  
summ_forplot$Algorithm = factor(summ_forplot$Algorithm, levels = c("MCMC", "Laplace Approx.", "Pathfinder", "Variational Inference"))

ggplot(summ_forplot) +
  geom_col(aes(x = variable, y = mean, fill = Algorithm), position = position_dodge()) +
  geom_errorbar(aes(x = variable, ymax = q95, ymin = q5, fill = Algorithm),position = position_dodge2(padding = 0.6)) +
  labs(x = '', y = '') +
  theme_classic()
library(tidyverse)
rbind(
  cbind(data.frame(Algorithm = "Laplace Approx."), fit_full_laplace$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6]),
  cbind(data.frame(Algorithm = "Variational Inference"), fit_full_vb$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6]),
  cbind(data.frame(Algorithm = "Pathfinder"), fit_pf$draws(paste0("beta[", 1:6, "]"))[, 1:6]),
  cbind(data.frame(Algorithm = "MCMC"), fit_full_ppp$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6])
) |>
  mutate(Algorithm = factor(Algorithm, levels = c("MCMC", "Laplace Approx.", "Pathfinder", "Variational Inference"))) |> 
  pivot_longer(starts_with("beta"), names_to = "Parameter", values_to = "Draws") |>
  ggplot() +
  geom_density(aes(x = Draws, fill = Algorithm), alpha = 0.4) +
  facet_wrap(~Parameter, scales = "free")

Wrapping up

The three lectures using linear models were built to show nearly all of the parts needed in a Bayesian analysis:

  • MCMC specifications

  • Prior specifications

  • Assessing MCMC convergence

  • Reporting MCMC results

  • Determining if a model fits the data (absolute fit)

  • Determining which model fits the data better (relative fit)

All of these topics will be with us when we start building more complicated models in future lectures.

Next Class

  1. Generalized measurement models

References

Dhaka, Akash Kumar, Alejandro Catalina, Michael Riis Andersen, Måns Magnusson, Jonathan H. Huggins, and Aki Vehtari. 2020. “Robust, Accurate Stochastic Optimization for Variational Inference.” arXiv. https://doi.org/10.48550/ARXIV.2009.00666.
Gelman, Andrew, Xiao-Li Meng, and Hal Stern. 1996. “Posterior Predictive Assessment of Model Fitness Via Realized Discrepancies.” Statistica Sinica.
Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.
Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Yes, but Did It Work?: Evaluating Variational Inference.” arXiv. https://doi.org/10.48550/ARXIV.1802.02538.
Zhang, Lu, Bob Carpenter, Andrew Gelman, and Aki Vehtari. 2021. “Pathfinder: Parallel Quasi-Newton Variational Inference.” https://doi.org/10.48550/ARXIV.2108.03782.