1. Select parameters from a single (sampling) iteration of the Markov chain

2. Use the selected parameters and the model to simulate a data set of the same size (same number of observations and variables)

3. From the simulated data set, calculate selected summary statistics (e.g., mean, SD)

4. Repeat steps 1-3 for a fixed number of iterations (perhaps across the whole chain)

5. When done, compare the position of the observed summary statistics to the distribution of summary statistics from the simulated data sets (the posterior predictive distribution)
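The five steps above can be sketched end-to-end in a few lines. This lecture's code is in R; the following is a minimal Python/NumPy illustration with a toy normal model, where the "posterior draws" are synthetic stand-ins rather than output from a real MCMC run:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy "observed" data and synthetic posterior draws for a simple normal model
# (all numbers here are illustrative, not from the lecture's actual fit)
y_obs = rng.normal(loc=171, scale=50, size=30)

I = 4000                                          # number of posterior draws
mu_draws = rng.normal(171, 9, size=I)             # stand-in draws of the mean
sigma_draws = np.abs(rng.normal(50, 6, size=I))   # stand-in draws of the SD

sim_mean = np.empty(I)
sim_sd = np.empty(I)
for i in range(I):
    # steps 1-2: take one draw and simulate a data set of the same size
    y_sim = rng.normal(mu_draws[i], sigma_draws[i], size=y_obs.size)
    # step 3: summary statistics of the simulated data
    sim_mean[i] = y_sim.mean()
    sim_sd[i] = y_sim.std(ddof=1)

# step 5: locate the observed statistics within the predictive distributions
ppp_mean = float(np.mean(sim_mean > y_obs.mean()))
ppp_sd = float(np.mean(sim_sd > y_obs.std(ddof=1)))
```

Values of `ppp_mean` and `ppp_sd` near .5 indicate that the simulated statistics are centered on the observed ones; the same logic is applied to the real model fit below.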

Step 1: Assemble the posterior draws

For our linear model, let’s denote our observed dependent variable as \(Y\)

Note that the independent variables are not modeled (they are not explained by the statistical formula), so we cannot examine them.

First, let’s assemble the posterior draws (\(1000 \times 4 \times 7\)):

posteriorSample = fit_full_new$draws(variables = c('beta', 'sigma'), format = 'draws_matrix')
posteriorSample

Note that we do not want to directly compare simulated data to the observed data.

Instead, we extract some characteristics of simulated/observed data for comparison using summary statistics.

There are some advantages:

We may have research interest in only some characteristics of the data (e.g., whether our model predicts the mean of the dependent variable)

We can quality-check (QC) specific aspects of the model-fitting process

In this case, for example, we may be interested in whether the model captures the “location” or “scale” of WeightLB

mean(dat$WeightLB)

[1] 171

(simMean =mean(simY))

[1] 174.6217

sd(dat$WeightLB)

[1] 49.51907

(simSD =sd(simY))

[1] 46.67779

Step 5: Looping across all posterior samples

We can repeat steps 2-4 for a set number of samples

In practice, we can optionally use all the iterations we have from the Markov chains (\(I = 4000\))

I = nrow(posteriorSample)
## create empty simSD and simMean as placeholders
simSD = simMean = rep(NA, I)
for (i in 1:I) {
  # beta
  betaVector = matrix(posteriorSample[i, 1:6], ncol = 1)
  # sigma
  sigma = posteriorSample[i, 7]
  # simulate data and store its summary statistics
  simY = rnorm(n = N, mean = X %*% betaVector, sd = sigma)
  simMean[i] = mean(simY)
  simSD[i] = sd(simY)
}
par(mfrow = 1:2)
hist(simMean)
hist(simSD)

Compare to the observed mean

We can now compare our observed mean and standard deviation with those of the simulated values.

Blue line: the average value of predicted WeightLB

Red line: the observed mean value of WeightLB

The PDF of the predicted values of a summary statistic of WeightLB is called the posterior predictive distribution

PPMC can be performed using visual inspection:

library(ggplot2)
ppp <- data.frame(simMean = simMean, simSD = simSD)
ggplot(ppp) +
  geom_histogram(aes(x = simMean), fill = "grey", col = "black") +
  # red line: observed mean of WeightLB
  geom_vline(xintercept = mean(dat$WeightLB), col = "red", size = 1.4) +
  # blue line: mean of the predicted (simulated) means of WeightLB
  geom_vline(xintercept = mean(simMean), col = "blue", size = 1.4, alpha = 0.5)

Compare to the observed SD

Similarly, let’s compare SD to the posterior predictive distribution of SD of WeightLB

the observed SD is located at the center of the posterior predictive distribution (PPD)

the average of the PPD is slightly higher than the observed SD

ggplot(ppp) +
  geom_histogram(aes(x = simSD), fill = "grey", col = "black") +
  # red line: observed SD of WeightLB
  geom_vline(xintercept = sd(dat$WeightLB), col = "red", size = 1.4) +
  # blue line: mean of the predicted (simulated) SDs of WeightLB
  geom_vline(xintercept = mean(simSD), col = "blue", size = 1.4, alpha = 0.5)

PPMC Characteristics

PPMC methods are very useful

They provide a visual way to determine if the model fits the observed data

They are the main method of assessing absolute fit in Bayesian models

Absolute fit assesses if a model fits the data instead of comparing to another model

But, there are some drawbacks to PPMC methods

Almost any statistic can be used

Some are better than others (mean and SD of outcomes are nice choices for linear regression)

There is no standard for determining how much misfit is too much

May be overwhelming to compute depending on your model

Question: Can PPMC be used for models with maximum likelihood estimation or ordinary least squares?

Posterior Predictive P-values

We can summarize the PPMC using a type of “p-value”

Personally, I don’t like the name “p-value”; it sounds like we are trying to justify our results using significance testing

It is different from the frequentist “p-value” (the probability, if the null hypothesis is true, of observing data at least as extreme as the observed data)

The PPP-value: the proportion of times the statistic from the simulated data exceeds that of the observed data

Useful to determine how far off a statistic is from its posterior predictive distribution

If these p-values are:

near 0 or 1: your model is far off from your data

near .5: your model fits your data in terms of the statistics you examined

The PPP-value for mean:

mean(simMean >mean(dat$WeightLB))

[1] 0.4985

The PPP-value for SD:

mean(simSD >sd(dat$WeightLB))

[1] 0.52075

Compute PPP-values within Stan

We can use the generated quantities block of Stan to compute PPP-values for us:

generated quantities {
  // simulated data
  array[N] real weightLB_rep = normal_rng(X * beta, sigma);
  // posterior predictive distribution for mean and SD
  real mean_weightLB = mean(weightLB);
  real sd_weightLB = sd(weightLB);
  real mean_weightLB_rep = mean(to_vector(weightLB_rep));
  real<lower=0> sd_weightLB_rep = sd(to_vector(weightLB_rep));
  // ppp-values for mean and sd
  int<lower=0, upper=1> ppp_mean = (mean_weightLB_rep > mean_weightLB);
  int<lower=0, upper=1> ppp_sd = (sd_weightLB_rep > sd_weightLB);
}

Relative model fit: used to compare two or more competing models in terms of their model fit. Sometimes it is also called model selection.

In non-Bayesian models, Information Criteria are often used to make comparisons

AIC, BIC, DIC etc.

Typically, an IC is a function of the log-likelihood plus a penalty

The model with the lowest IC is the model that fits best

Bayesian model fit is similar

Uses an index value

The model with the lowest index is the model that fits best

Recent advances in Bayesian model fit use indices that are tied to cross-validation predictions (inspired by machine learning):

Fit model leaving one observation out (LOO)

Calculate statistics related to prediction (for instance, log-likelihood of that observation conditional on model parameters)

Do for all observations

New Bayesian indices try to mirror these leave-one-out predictions (but approximate them due to computational constraints)
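To make the leave-one-out loop above concrete, here is a brute-force sketch. This lecture's code is in R; the following Python/NumPy toy uses maximum likelihood refits of a normal model for brevity, whereas a full Bayesian version would refit the whole posterior at every step:

```python
import numpy as np

rng = np.random.default_rng(7)
y = rng.normal(171, 50, size=30)   # toy data set

lppd_loo = 0.0
for i in range(len(y)):
    # step 1: fit the model with observation i left out
    y_minus_i = np.delete(y, i)
    mu_hat = y_minus_i.mean()
    sigma_hat = y_minus_i.std(ddof=1)
    # step 2: log predictive density of the held-out observation
    # under a normal model with the leave-one-out estimates
    lppd_loo += (-0.5 * np.log(2 * np.pi * sigma_hat**2)
                 - (y[i] - mu_hat) ** 2 / (2 * sigma_hat**2))
# step 3: summing over all observations gives the LOO log predictive density
```

The approximate indices discussed next (WAIC, PSIS-LOO) estimate this quantity without actually refitting the model \(N\) times.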

Deviance Information Indices

In the late 1990s and early 2000s, the Deviance Information Criterion (DIC) was popular for relative Bayesian model fit comparisons. It has since been shown to be inferior to LOO or WAIC, but let’s take a look:

\[
\text{DIC} = p_D + \overline{D(\theta)}
\]

where \(p_D\) is the estimated number of parameters as follows:

\[
p_D = \overline{D(\theta)} - D(\bar\theta)
\]

and where

\[
D(\theta) = -2 \log(p(y \mid \theta)) + C
\]

C is a constant that cancels out when model comparisons are made

Here:

\(\overline{D(\theta)}\) is the average of \(D(\theta)\) (i.e., \(-2\) times the log-likelihood of the data \(y\) given the parameters \(\theta\)), computed across all posterior samples

\(D(\bar\theta)\) is \(D(\theta)\) evaluated at the average of the parameters \(\theta\) across all posterior samples
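As a toy numerical illustration of these formulas (the lecture's code is in R; this is a hypothetical Python/NumPy sketch with synthetic posterior draws for a normal model):

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(171, 50, size=30)   # toy data

# synthetic "posterior" draws for (mu, sigma) of a normal model
mu = rng.normal(y.mean(), y.std(ddof=1) / np.sqrt(len(y)), size=4000)
sigma = np.abs(rng.normal(y.std(ddof=1), 5, size=4000))

def deviance(mu_val, sigma_val):
    # D(theta) = -2 * log p(y | theta) for a normal likelihood (taking C = 0)
    loglik = np.sum(-0.5 * np.log(2 * np.pi * sigma_val**2)
                    - (y - mu_val) ** 2 / (2 * sigma_val**2))
    return -2.0 * loglik

D_bar = np.mean([deviance(m, s) for m, s in zip(mu, sigma)])  # average deviance
D_at_mean = deviance(mu.mean(), sigma.mean())                 # deviance at the mean parameters
p_D = D_bar - D_at_mean                                       # effective number of parameters
DIC = p_D + D_bar
```

Note that \(p_D\) comes out positive here because the deviance at the averaged parameters is smaller than the average deviance over the draws.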

DIC in Stan

Some programs, like JAGS, provide DIC directly, but Stan does not provide a direct way of calculating it:

We can manually calculate an approximate DIC by replacing \(\bar\theta\) with the posterior median \(\dot\theta\), provided the posterior distributions are approximately normal.

DIC for Full Model

# extract log-likelihood and parameters
posterior_draws <- fit_full_ppp$draws(c('log_lik', 'beta'), format = 'draws_matrix')
log_like_draws <- rowSums(posterior_draws[, colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")])
# find draws of log-likelihood whose parameters equal the (rounded) median values
theta_draws <- posterior_draws[, colnames(posterior_draws) %in% paste0("beta[", 1:6, "]")]
theta_ave_iteration = apply(round(theta_draws, 3), 2, \(x) which(x == median(x))) |> unlist() |> as.numeric()
log_like_draws_meantheta = (-2) * mean(log_like_draws[theta_ave_iteration])
mean_log_like_draws = (-2) * mean(log_like_draws)
# compute DIC = p_D + Dbar = 2 * Dbar - D(theta_median)
DIC_full = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_full

[1] 211.6029

DIC for Empty Model

# extract log-likelihood and parameters
posterior_draws <- fit_empty_ppp$draws(c('log_lik', 'beta0'), format = 'draws_matrix')
log_like_draws <- rowSums(posterior_draws[, colnames(posterior_draws) %in% paste0("log_lik[", 1:30, "]")])
# find draws of log-likelihood whose parameters equal the (rounded) median values
theta_draws <- posterior_draws[, colnames(posterior_draws) == "beta0"]
theta_ave_iteration = which(round(theta_draws, 2) == median(round(theta_draws, 2)))
log_like_draws_meantheta = (-2) * mean(log_like_draws[theta_ave_iteration])
mean_log_like_draws = (-2) * mean(log_like_draws)
# compute DIC = p_D + Dbar = 2 * Dbar - D(theta_median)
DIC_empty = mean_log_like_draws - log_like_draws_meantheta + mean_log_like_draws
DIC_empty

[1] 322.1444

Issues with DIC

The DIC has fallen out of favor recently (not used in Stan)

Has issues when parameters are discrete

Not fully Bayesian (it relies on a point estimate: the average of the parameter values)

Can give negative values for estimated numbers of parameters in a model

WAIC (the widely applicable or Watanabe-Akaike information criterion; Watanabe, 2010) corrects some of the problems with DIC:

It is a more frequently used model comparison index for Bayesian analysis

Using the loo R package, we can calculate WAIC with the waic() function:

WAIC for the full model

loo::waic(fit_full_ppp$draws('log_lik'))

Computed from 4000 by 30 log-likelihood matrix
Estimate SE
elpd_waic -109.4 6.9
p_waic 6.7 2.6
waic 218.8 13.8
5 (16.7%) p_waic estimates greater than 0.4. We recommend trying loo instead.

WAIC for the empty model

loo::waic(fit_empty_ppp$draws('log_lik'))

Computed from 4000 by 30 log-likelihood matrix
Estimate SE
elpd_waic -161.0 2.7
p_waic 1.4 0.3
waic 322.0 5.3

Here:

elpd_waic is the expected log pointwise predictive density for WAIC

p_waic is the WAIC estimate of the effective number of parameters, similar to \(p_D\) in DIC; it is a penalty to the likelihood for additional parameters

waic is the WAIC index used for model comparisons (lowest value is best fitting; waic = -2*elpd_waic)

Note that WAIC needs a log_lik variable in the model analysis to be calculated correctly; cmdstan will automatically calculate this variable.

LOO: Approximation to Leave-one-out

Big picture:

Besides WAIC, other relative fit indices include LOO via Pareto-smoothed importance sampling (PSIS), available through the loo package

WAIC/LOO can be used for model comparison, with the lowest value suggesting the better model fit

Different from DIC, LOO via PSIS attempts to approximate the process of leave-one-out cross-validation (LOO-CV) using a sampling-based approach

Gives a finite-sample approximation

Implemented in Stan

Can quickly compare models

Gives warnings when it may be less reliable to use

The details of computation of LOO are very technical, but are nicely compiled in Vehtari, Gelman, and Gabry (2017).

LOO: Leave-one-out in Stan

Using the loo package, we can calculate an efficient approximation to leave-one-out cross-validation (LOO-CV)

elpd_loo is the expected log pointwise predictive density for LOO (recall that posterior predictive distribution has some uncertainty around the mean value…)

p_loo is the LOO calculation of number of model parameters (a penalty to the likelihood for more parameters)

looic is the LOO index used for model comparisons (lowest value is best fitting; looic = -2*elpd_loo)

LOO: Comparing Model with LOO

Alternatively, you can use the built-in loo_compare() function in the loo package to compare alternative models:

loo::loo_compare(list(empty = fit_empty_ppp$loo(), full = fit_full_ppp$loo()))

elpd_diff se_diff
full 0.0 0.0
empty -51.3 7.9

This function calculates the standard error of the difference in elpd (expected log pointwise predictive density) between models

The SE gives an indication of the standard error in the estimate (relative to the size)

Can use this to downweight the choice of models when the standard error is high (Open Question: how high is high?)
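One common informal heuristic (not from this lecture; roughly the guidance given in the loo package documentation) is to compare the magnitude of elpd_diff with se_diff. The arithmetic, using the numbers from the loo_compare output above, is trivial in any language:

```python
# numbers taken from the loo_compare output above
elpd_diff = -51.3   # empty model relative to the full (best) model
se_diff = 7.9

# heuristic: a difference of several standard errors suggests a clear
# preference for the higher-elpd model; ~2 SEs is a common informal cutoff
z = abs(elpd_diff) / se_diff
clearly_better = z > 2
```

Here the difference is more than six standard errors, so the full model is clearly preferred; this remains a rule of thumb, not a formal test.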

Sometimes the normal approx is pretty bad (funnels, multimodal distributions, long tails).

Sometimes the joint mode is useless or does not even exist (funnels, etc.), in which case the model itself would need to be altered in some way to get a stable mode.

Code

# Stan's LBFGS algorithm
fit_full_optim <- mod_full_ppp$optimize(data = data_full_new, seed = 1234, jacobian = TRUE)
fit_full_laplace <- mod_full_ppp$laplace(data = data_full_new, mode = fit_full_optim, draws = 4000)
# Run 'variational' method to use ADVI to approximate the posterior
fit_full_vb <- mod_full_ppp$variational(data = data_full_new, seed = 1234, draws = 4000)
# Run 'pathfinder' method, a new alternative to the variational method
fit_pf <- mod_full_ppp$pathfinder(data = data_full_new, seed = 1234, draws = 4000)

Approximation Algorithm vs. Full MCMC

Though they converge much more quickly, the accuracy of these approximation algorithms depends largely on the complexity of the model (number of parameters, prior distributions, latent variables, etc.)

The three lectures using linear models were built to show nearly all the parts needed in a Bayesian analysis:

MCMC specifications

Prior specifications

Assessing MCMC convergence

Reporting MCMC results

Determining if a model fits the data (absolute fit)

Determining which model fits the data better (relative fit)

All of these topics will stay with us when we begin building more complicated models in future lectures.

Next Class

Generalized measurement models

References

Dhaka, Akash Kumar, Alejandro Catalina, Michael Riis Andersen, Måns Magnusson, Jonathan H. Huggins, and Aki Vehtari. 2020. “Robust, Accurate Stochastic Optimization for Variational Inference.” arXiv. https://doi.org/10.48550/ARXIV.2009.00666.

Gelman, Andrew, Xiao-Li Meng, and Hal Stern. 1996. “Posterior Predictive Assessment of Model Fitness Via Realized Discrepancies.” Statistica Sinica.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. “Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC.” Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.

Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Yes, but Did It Work?: Evaluating Variational Inference.” arXiv. https://doi.org/10.48550/ARXIV.1802.02538.

Zhang, Lu, Bob Carpenter, Andrew Gelman, and Aki Vehtari. 2021. “Pathfinder: Parallel Quasi-Newton Variational Inference.” arXiv. https://doi.org/10.48550/ARXIV.2108.03782.