Bayesian Model fit and comparison
Educational Statistics and Research Methods
Bayesian methods for determining how well a model fits the data (absolute fit)
Bayesian methods for determining which model fits better (relative model fit)
In previous class…
We estimated the empty model and the full model (you can also try other constrained models between the empty and full models)
Empty model: a model without any covariates (predictors)
Full model: a model with all possible covariates, up to the highest-order interaction effects
We also made the Stan code more efficient by vectorizing the parameters and the data
The next question is how to determine whether the model is “good” enough
“good” has multiple meanings
a typical criterion is the degree to which the model can generate “simulated” data similar to the original data
or, the likelihood of the original data given the model’s parameters
Absolute model fit
PPMC
In SEM, RMSEA, chi-square, SRMR, and GFI
Relative model fit
Incremental model fit (not frequently used outside of SEM)
A special type of relative model fit: how a model’s fit compares with that of a baseline model
In SEM, the comparative fit index (CFI) and Tucker-Lewis index (TLI)
Posterior predictive model checking (Gelman, Meng, and Stern 1996) is a Bayesian model evaluation technique for determining whether a model fits the data.
Absolute model fit: “Does my model fit my data well?”
Recall that “model is a simplified version of the true data-generation process”
Thus, the model should be able to reproduce data similar to the observed data
In machine learning, this is also called “validation”, typically carried out on a separate “validation data set”
Overall idea: if a model fits the data well, then simulated data based on the model will resemble the observed data
Original data
Simulated data: generated from some or all posterior draws in the Markov chain
Both are summarized by the same set of statistics
The linear model from our example was:
\[ \begin{aligned} \text{WeightLB}_p = \ &\beta_0 + \beta_1 \text{HeightIN}_p + \beta_2 \text{Group2}_p + \beta_3\text{Group3}_p \\ &+ \beta_4 \text{HeightIN}_p\text{Group2}_p + \beta_5 \text{HeightIN}_p\text{Group3}_p + e_p \end{aligned} \]
with \(e_p \sim N\left(0, \sigma^2_e\right)\)
The PPMC Process can be summarized as follows:
For our linear model, let’s denote our observed dependent variable as \(Y\)
Step 1: assemble the posterior draws (a \(1000 \times 4 \times 7\) array: iterations \(\times\) chains \(\times\) parameters)
Step 2: draw one set of posterior parameter values at random with replacement
Step 3: use those draws of \(\boldsymbol{\beta}\) and \(\sigma\) to generate a predicted \(\hat{y}\):
\[ \hat{y}_i \sim N(\mathbf{X}\boldsymbol{\beta}_i, \sigma_i) \]
where \(i\) indexes the sampled posterior iteration
We then generate data based on this sampled iteration and our model distributional assumption:
# sample size
N = nrow(dat)
# beta: one sampled draw of the six regression coefficients
betaVector = matrix(posteriorSample[sampleIteration, 1:6], ncol = 1)
betaVector
# sigma: the residual SD from the same draw
sigma = posteriorSample[sampleIteration, 7]
# X: the model (design) matrix
FullModelFormula = as.formula("WeightLB ~ HeightIN60 + DietGroup + HeightIN60*DietGroup")
X = model.matrix(FullModelFormula, data = dat)
# simulate one replicated data set from the model
simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma)
head(simY)
Step 4: summarize and compare. Note that we do not want to compare the simulated data directly to the observed data.
Instead, we extract some characteristics of the simulated/observed data for comparison, using summary statistics.
This has some advantages: in this case, for example, we may be interested in whether the model captures the “location” or “scale” of WeightLB
We can repeat steps 2-4 for a set number of samples
In practice, we can simply use all the iterations from our Markov chains (\(I = 4000\))
I = nrow(posteriorSample)
# create empty simSD and simMean as placeholders
simSD = simMean = rep(NA, I)
for (i in 1:I) {
  # beta: regression coefficients from iteration i
  betaVector = matrix(posteriorSample[i, 1:6], ncol = 1)
  # sigma: residual SD from iteration i
  sigma = posteriorSample[i, 7]
  # simulate a replicated data set; store its mean and SD
  simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma)
  simMean[i] = mean(simY)
  simSD[i] = sd(simY)
}
par(mfrow = c(1, 2))
hist(simMean)
hist(simSD)
We can now compare our observed mean and standard deviation with those of the simulated values.
Blue line: the average of the predicted (simulated) means of WeightLB
Red line: the observed mean of WeightLB
The distribution of these predicted summary statistics of WeightLB is called the posterior predictive distribution
PPMC can be checked using visual inspection:
library(ggplot2)
ppp <- data.frame(
simMean = simMean,
simSD = simSD
)
ggplot(ppp) +
geom_histogram(aes(x = simMean), fill = "grey", col = "black") +
  geom_vline(xintercept = mean(dat$WeightLB), col = "red", size = 1.4) + # red line: observed mean of WeightLB
  geom_vline(xintercept = mean(simMean), col = "blue", size = 1.4, alpha = 0.5) # blue line: average of the simulated means
Similarly, let’s compare the observed SD to the posterior predictive distribution of the SD of WeightLB:
the observed SD is located near the center of the posterior predictive distribution (PPD)
the mean of the PPD is slightly higher than the observed SD
ggplot(ppp) +
geom_histogram(aes(x = simSD), fill = "grey", col = "black") +
  geom_vline(xintercept = sd(dat$WeightLB), col = "red", size = 1.4) + # red line: observed SD of WeightLB
  geom_vline(xintercept = mean(simSD), col = "blue", size = 1.4, alpha = 0.5) # blue line: average of the simulated SDs
PPMC methods are very useful
But there are some drawbacks to PPMC methods (e.g., the choice of summary statistics is subjective, and there is no formal cutoff for what counts as adequate fit)
Question: Can PPMC be used for models with maximum likelihood estimation or ordinary least squares?
We can summarize the PPMC using a type of “p-value”
Personally, I don’t like the name “p-value”; it sounds like we are trying to justify our results with significance testing
It differs from the frequentist p-value (the probability of observing data as extreme as or more extreme than ours, if the null hypothesis is true)
The PPP-value: the proportion of times the statistic from the simulated data exceeds that of the observed data
Useful to determine how far off a statistic is from its posterior predictive distribution
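For example, using the simMean values from the loop above, the PPP-value for the mean of WeightLB is simply a proportion:
# PPP-value for the mean: proportion of simulated means exceeding the observed mean
mean(simMean > mean(dat$WeightLB))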
We can use the generated quantities block of Stan to compute PPP-values for us:
generated quantities{
  // simulated data
  array[N] real weightLB_rep = normal_rng(X*beta, sigma);
  // posterior predictive distribution for mean and SD
  real mean_weightLB = mean(weightLB);
  real sd_weightLB = sd(weightLB);
  real mean_weightLB_rep = mean(to_vector(weightLB_rep));
  real<lower=0> sd_weightLB_rep = sd(to_vector(weightLB_rep));
  // ppp-values for mean and sd
  int<lower=0, upper=1> ppp_mean = (mean_weightLB_rep > mean_weightLB);
  int<lower=0, upper=1> ppp_sd = (sd_weightLB_rep > sd_weightLB);
}
Averaging these 0/1 indicators across posterior draws gives us the PPP-values:
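A minimal sketch of extracting them with cmdstanr, assuming a fitted model object fit_full_ppp (the full-model fit used later in this lecture); the posterior mean of each 0/1 indicator is its PPP-value:
# posterior means of the ppp_mean/ppp_sd indicators are the PPP-values
fit_full_ppp$summary(c("ppp_mean", "ppp_sd"))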
Pros:
normal_rng() simulates the replicated data within Stan, making PPP-values much faster to estimate
the stored replicated data (weightLB_rep) can be used directly with the bayesplot package
Cons:
PPP-values are not calibrated like frequentist p-values (they tend to be conservative), and the data are used twice (once to fit the model, once to check it)
Relative model fit: used to compare two or more competing models in terms of their model fit. Sometimes it is also called model selection.
In non-Bayesian models, Information Criteria are often used to make comparisons
AIC, BIC, DIC etc.
Typically, an IC is a function of the log-likelihood plus a penalty for model complexity
The model with the lowest IC is the model that fits best
Bayesian model fit is similar
Uses an index value
The model with the lowest index is the model that fits best
Recent advances in Bayesian model fit use indices that are tied to cross-validation predictions (inspired by machine learning):
Fit model leaving one observation out (LOO)
Calculate statistics related to prediction (for instance, log-likelihood of that observation conditional on model parameters)
Do for all observations
New Bayesian indices try to mirror these leave-one-out predictions, but approximate them, since refitting the model once per observation is computationally infeasible (see the sketch below)
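To make the idea concrete, here is a conceptual sketch of exact LOO-CV, using lm() with the full model formula instead of Stan purely for speed (a frequentist plug-in illustration, not the Bayesian computation itself):
# exact leave-one-out cross-validation, illustrated with lm()
loglik_loo = rep(NA, nrow(dat))
for (p in 1:nrow(dat)) {
  # refit the model without observation p
  fit_p = lm(WeightLB ~ HeightIN60 * DietGroup, data = dat[-p, ])
  # log-likelihood of the held-out observation under the refitted model
  mu_p = predict(fit_p, newdata = dat[p, ])
  loglik_loo[p] = dnorm(dat$WeightLB[p], mean = mu_p, sd = sigma(fit_p), log = TRUE)
}
sum(loglik_loo) # analogous to the elpd that LOO/WAIC approximate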
In the late 1990s and early 2000s, the Deviance Information Criterion (DIC) was popular for relative Bayesian model fit comparisons. It has since been shown to perform worse than LOO or WAIC, but let’s take a look:
\[ \text{DIC} = p_D + \overline{D(\theta)} \]
where \(p_D\) is the estimated (effective) number of parameters:
\[ p_D = \overline{D(\theta)} - D(\bar\theta) \]
and where
\[ D(\theta) = -2 \log(p(y \mid \theta)) + C \]
\(C\) is a constant that cancels out when model comparisons are made
Here:
\(\overline{D(\theta)}\) is the deviance of the data (\(y\)) averaged across all posterior samples of the parameters (\(\theta\))
\(D(\bar\theta)\) is the deviance of the data (\(y\)) computed at the average of the parameters (\(\bar\theta\)) across all samples
Some programs, like JAGS, provide DIC directly, but Stan does not provide a direct way of calculating DIC.
We can manually calculate an approximate DIC by replacing \(\bar\theta\) with the posterior median \(\dot\theta\), which is reasonable when the posterior distributions are approximately normal.
DIC for Full Model
# extract log-likelihood and parameters
posterior_draws <- fit_full_ppp$draws(c('log_lik', 'beta'), format = 'draws_matrix')
# total log-likelihood per draw (summed over the 30 observations)
log_like_draws <- rowSums(posterior_draws[, colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")])
# find draws whose parameter values equal the (rounded) posterior medians
theta_draws <- posterior_draws[, colnames(posterior_draws) %in% paste0("beta[", 1:6, "]")]
theta_ave_iteration = apply(round(theta_draws, 3), 2, \(x) which(x == median(x))) |> unlist() |> as.numeric()
# D(theta-hat): deviance at the (approximate) posterior median
log_like_draws_meantheta = (-2)*mean(log_like_draws[theta_ave_iteration])
# average deviance across all draws
mean_log_like_draws = (-2)*mean(log_like_draws)
# compute DIC = p_D + average deviance = 2*(average deviance) - D(theta-hat)
DIC_full = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_full
DIC for Empty Model
# extract log-likelihood and parameters
posterior_draws <- fit_empty_ppp$draws(c('log_lik', 'beta0'), format = 'draws_matrix')
log_like_draws <- rowSums(posterior_draws[, colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")])
# find draws whose beta0 value equals the (rounded) posterior median
theta_draws <- posterior_draws[, colnames(posterior_draws) == "beta0"]
theta_ave_iteration = which(round(theta_draws, 2) == median(round(theta_draws, 2)))
log_like_draws_meantheta = (-2)*mean(log_like_draws[theta_ave_iteration])
mean_log_like_draws = (-2)*mean(log_like_draws)
# compute DIC = 2*(average deviance) - D(theta-hat)
DIC_empty = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_empty
The DIC has fallen out of favor recently (it is not implemented in Stan)
Has issues when parameters are discrete
Not fully Bayesian (relies on a point estimate: the average of the parameter values)
Can give negative values for the estimated number of parameters in a model
WAIC (Widely applicable or Watanabe-Akaike information criterion, Watanabe, 2010) corrects some of the problems with DIC:
Fully Bayesian (uses entire posterior distribution)
Asymptotically equal to Bayesian cross-validation
Invariant to parameterization
One of the more frequently used model comparison indices for Bayesian analysis
Using the loo R package, we can calculate WAIC with the waic() function:
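A minimal sketch of this call, assuming the fitted cmdstanr object fit_full_ppp from earlier:
library(loo)
# pointwise log-likelihood draws (iterations x observations)
log_lik_full <- fit_full_ppp$draws("log_lik", format = "draws_matrix")
waic(log_lik_full)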
elpd_waic is the expected log pointwise predictive density for WAIC
p_waic is the WAIC-based estimate of the number of parameters, similar to \(p_D\) in DIC (a penalty to the likelihood for more parameters)
waic is the WAIC index used for model comparisons (lowest value is best fitting; equal to -2*elpd_waic)
Note: WAIC requires a log_lik variable to be defined in the model for the calculation to work correctly; cmdstan will then compute this variable automatically at each draw.
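A minimal Stan sketch of defining log_lik in the generated quantities block, assuming the same data (weightLB, X, N) and parameters (beta, sigma) as the earlier block:
generated quantities{
  // pointwise log-likelihood needed by waic()/loo()
  vector[N] log_lik;
  for (n in 1:N)
    log_lik[n] = normal_lpdf(weightLB[n] | X[n] * beta, sigma);
}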
Big picture:
Besides WAIC, other comparative fit indices include LOO via Pareto-Smoothed Importance Sampling (PSIS), available through Stan’s loo package
WAIC/LOO can be used for model comparison, with the lowest value suggesting the better-fitting model
Unlike DIC, LOO via PSIS attempts to “approximate” the process of leave-one-out cross-validation (LOO-CV) using a sampling-based approach
The details of computation of LOO are very technical, but are nicely compiled in Vehtari, Gelman, and Gabry (2017).
Using the loo package, we can calculate an efficient approximation to leave-one-out cross-validation (LOO-CV)
Full model’s LOO:
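A minimal sketch using cmdstanr’s built-in wrapper, which uses the stored log_lik draws:
# PSIS-LOO for the full model
loo_full <- fit_full_ppp$loo()
loo_full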
Here:
elpd_loo is the expected log pointwise predictive density for LOO (recall that the posterior predictive distribution has some uncertainty around the mean value…)
p_loo is the LOO estimate of the number of model parameters (a penalty to the likelihood for more parameters)
looic is the LOO index used for model comparisons; the lowest value suggests the best-fitting model (looic = -2*elpd_loo)
Alternatively, you can use the built-in function loo_compare() in the loo package to compare alternative models:
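A sketch of the comparison for the two models fitted earlier (both must define log_lik):
# compare the empty and full models on their elpd estimates
loo_compare(list(empty = fit_empty_ppp$loo(), full = fit_full_ppp$loo()))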
This function calculates the standard error of the difference in elpd (expected log pointwise predictive density) between models
elpd_diff is the difference in elpd between each model and the best model; elpd is looic divided by -2 (the log-likelihood scale, not the deviance scale)
If the interval elpd_diff ± 2*se_diff does not include 0, we can be fairly certain of the result
PSIS also estimates a vector of Pareto shape parameters \(k\), one per individual, which represents the reliability of the sampling:
\(k < .5\) (good): the estimate converges quickly
\(.5 < k < .7\) (ok): the estimate converges slowly
\(.7 < k < 1\) (bad): poor performance
\(k > 1\) (very bad): the estimate is unreliable
PSIS flags all cases with bad diagnostic values; the percentage of flagged cases tells us something about the reliability of the Bayesian estimation
Criterion: the estimated tail shape parameter \(k\)
This is a new area; please see Vehtari, Gelman, and Gabry (2016) for more details
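A sketch of inspecting these diagnostics with the loo package, using the loo_full object from the earlier sketch:
# tabulate the Pareto k values across observations
pareto_k_table(loo_full)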
Below are all three model comparison fit indices for the empty model and the full model.
All fit indices suggest the full model fits better than the empty model
For simple models, the results of DIC/WAIC/LOOIC should be consistent
In situations where they are inconsistent, prefer WAIC/LOO as the standard
MCMC sampling can be very computationally time-consuming.
For models with thousands or even millions of parameters, estimation can be a challenge.
Thus, approximation algorithms have been developed, such as the Laplace approximation, variational inference (Dhaka et al. 2020; Yao et al. 2018), and the Pathfinder method (Zhang et al. 2021).
Quoted from Gelman’s blog:
Advantages of Laplace:
- Relatively cheap to compute, given that we already have a mode-finder in Stan.
- Easy to understand, use, and communicate.
- Works reasonably well in many examples (wherever the posterior can be well approximated by a normal distribution).
- Easy to take draws from the normal approximation, also easy to compute importance ratios and use Pareto-smoothed importance sampling.
Limitations of Laplace:
Sometimes the normal approx is pretty bad (funnels, multimodal distributions, long tails).
Sometimes the joint mode is useless or does not even exist (funnels, etc.), in which case the model itself would need to be altered in some way to get a stable mode.
# Find the posterior mode using Stan's L-BFGS optimizer (jacobian = TRUE is required for the Laplace method)
fit_full_optim <- mod_full_ppp$optimize(data = data_full_new, seed = 1234, jacobian = TRUE)
# Laplace approximation: a normal approximation of the posterior around the mode
fit_full_laplace <- mod_full_ppp$laplace(data = data_full_new, mode = fit_full_optim, draws = 4000)
# Run 'variational' method to use ADVI to approximate the posterior
fit_full_vb <- mod_full_ppp$variational(data = data_full_new, seed = 1234, draws = 4000)
# Run 'pathfinder' method, a newer alternative to the variational method
fit_pf <- mod_full_ppp$pathfinder(data = data_full_new, seed = 1234, draws = 4000)
Though they converge much more quickly, the accuracy of these approximation algorithms largely depends on the complexity of the model (number of parameters, prior distributions, latent variables, etc.)
summ_list <- lapply(list(fit_full_ppp, fit_full_laplace, fit_full_vb, fit_pf), \(x)
  x$summary(c("beta", "sigma"))[c('variable', 'mean', "q5", "q95")])
summ_list[[1]]$Algorithm = "MCMC"
summ_list[[2]]$Algorithm = "Laplace Approx."
summ_list[[3]]$Algorithm = "Variational Inference"
summ_list[[4]]$Algorithm = "Pathfinder"
summ_forplot <- Reduce(rbind, summ_list)
summ_forplot$Algorithm = factor(summ_forplot$Algorithm, levels = c("MCMC", "Laplace Approx.", "Pathfinder", "Variational Inference"))
ggplot(summ_forplot) +
geom_col(aes(x = variable, y = mean, fill = Algorithm), position = position_dodge()) +
  geom_errorbar(aes(x = variable, ymax = q95, ymin = q5, group = Algorithm), position = position_dodge2(padding = 0.6)) +
labs(x = '', y = '') +
theme_classic()
library(tidyverse)
rbind(
cbind(data.frame(Algorithm = "Laplace Approx."), fit_full_laplace$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6]),
cbind(data.frame(Algorithm = "Variational Inference"), fit_full_vb$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6]),
cbind(data.frame(Algorithm = "Pathfinder"), fit_pf$draws(paste0("beta[", 1:6, "]"))[, 1:6]),
cbind(data.frame(Algorithm = "MCMC"), fit_full_ppp$draws(paste0("beta[", 1:6, "]"), format = "draws_df")[, 1:6])
) |>
mutate(Algorithm = factor(Algorithm, levels = c("MCMC", "Laplace Approx.", "Pathfinder", "Variational Inference"))) |>
pivot_longer(starts_with("beta"), names_to = "Parameter", values_to = "Draws") |>
ggplot() +
geom_density(aes(x = Draws, fill = Algorithm), alpha = 0.4) +
facet_wrap(~Parameter, scales = "free")
These three lectures using linear models were built to show nearly all the parts needed in a Bayesian analysis:
MCMC specifications
Prior specifications
Assessing MCMC convergence
Reporting MCMC results
Determining if a model fits the data (absolute fit)
Determining which model fits the data better (relative fit)
All of these topics will stay with us when we start building more complicated models in future lectures.