Generalized Linear Models
Educational Statistics and Research Methods (ESRM) Program
University of Arkansas
2024-09-24
Statistical models can be broadly organized as:
All models have fixed effects, and then:
“Linear” means the fixed effects predict the link-transformed DV through a linear combination:
$\hat{Y} = g^{-1}(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots)$
Substantive theory: what guides your study
Hypothetical causal process: what the statistical model is testing (attempting to falsify) when estimated
Observed outcomes: what you collect and evaluate based on your theory
Outcomes can take many forms:
Continuous variables (e.g., time, blood pressure, height)
Categorical variables (e.g., Likert-type response, ordered categories, nominal categories)
Combinations of continuous and categorical (e.g., either 0 or some other continuous number)
Generalized models map the substantive theory onto the sample space of the observed outcomes
The general idea is that the statistical model will not approximate the outcome well if the assumed distribution is not a good fit to the sample space of the outcome
The key to making all of this work is the use of differing statistical distributions for the outcome
Generalized models allow for different distributions for outcomes
Generalized Linear Models → General Linear Models whose residuals follow some non-normal distribution and in which a link-transformed Y is predicted instead of Y on its original scale
Many kinds of non-normally distributed outcomes have some kind of generalized linear model to go with them:
Distribution | Support of distribution | Typical uses | Link name | Link function $X\beta = g(\mu)$ | Mean function |
---|---|---|---|---|---|
Normal | real: $(-\infty, +\infty)$ | Linear-response data | Identity | $X\beta = \mu$ | $\mu = X\beta$ |
Exponential | real: $(0, +\infty)$ | Exponential-response data, scale parameters | Negative inverse | $X\beta = -\mu^{-1}$ | $\mu = -(X\beta)^{-1}$ |
Gamma | real: $(0, +\infty)$ | Exponential-response data, scale parameters | Negative inverse | $X\beta = -\mu^{-1}$ | $\mu = -(X\beta)^{-1}$ |
Inverse Gaussian | real: $(0, +\infty)$ | | Inverse squared | $X\beta = \mu^{-2}$ | $\mu = (X\beta)^{-1/2}$ |
Poisson | integer: $0, 1, 2, \ldots$ | count of occurrences in fixed amount of time/space | Log | $X\beta = \ln(\mu)$ | $\mu = \exp(X\beta)$ |
Bernoulli | integer: $\{0, 1\}$ | outcome of single yes/no occurrence | Logit | $X\beta = \ln\left(\frac{\mu}{1-\mu}\right)$ | $\mu = \frac{\exp(X\beta)}{1+\exp(X\beta)} = \frac{1}{1+\exp(-X\beta)}$ |
Binomial | integer: $0, 1, \ldots, N$ | count of # of "yes" occurrences out of $N$ yes/no occurrences | Logit | $X\beta = \ln\left(\frac{\mu}{n-\mu}\right)$ | |
Categorical | integer: $[0, K)$, or $K$-vector of integer: $[0, 1]$ | outcome of single $K$-way occurrence | Logit | $X\beta = \ln\left(\frac{\mu}{1-\mu}\right)$ | |
Multinomial | $K$-vector of integer: $[0, N]$ | count of occurrences of different types ($1, \ldots, K$) out of $N$ total $K$-way occurrences | | | |
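As a quick numerical sketch (using base R's qlogis()/plogis(), which implement the logit and its inverse), each link in the table maps a mean from its bounded sample space onto the real line, and the inverse link maps it back:

mu <- 0.25
c(log = log(mu), back = exp(log(mu)))             # log link and its inverse (Poisson)
c(logit = qlogis(mu), back = plogis(qlogis(mu)))  # logit link and its inverse (Bernoulli)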
Link Function g(⋅) (main difference from GLM):
How a non-normal outcome gets transformed into something we can predict that is more continuous (unbounded)
For outcomes that are already normal, general linear models are just a special case with an “identity” link function (the linear predictor is multiplied by 1, i.e., left unchanged); see the sketch below
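A minimal sketch of this special case: a glm() with a normal distribution and identity link reproduces lm() exactly (using R's built-in mtcars data purely for illustration).

fit_lm  <- lm(mpg ~ wt, data = mtcars)
fit_glm <- glm(mpg ~ wt, data = mtcars, family = gaussian(link = "identity"))
coef(fit_lm)   # identical estimates
coef(fit_glm)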
Model for the Means (“Structural Model”):
How the predictors are linearly related to the link-transformed outcome:
$\hat{Y}_p = g^{-1}(\beta_0 + \beta_1 X_p + \beta_2 Z_p + \beta_3 X_p Z_p)$
Model for the Variance (“Sampling/Stochastic Model”):
Generalized models work by providing a mapping of the theoretical portion of the model (the right hand side of the equation) to the sample space of the outcome (the left hand side of the equation)
The link function is a non-linear function that takes the linear model predictors, random/latent terms, and constants and maps them onto the sample space of the observed outcome variables
Link functions are typically expressed for the mean of the outcome variable (we will only focus on that)
$E(Y_p) = \hat{Y}_p = \mu_Y$
where $E(\cdot)$ stands for expectation,
… through a non-linear link function $g(\hat{Y}_p)$ when applied to the conditional mean of the outcome,
or its inverse link function $g^{-1}(\boldsymbol{\beta}\mathbf{X})$ when applied to the linear combination of predictors
The general form is:
$E(Y_p) = \hat{Y}_p = \mu_Y = g^{-1}(\beta_0 + \beta_1 X_p + \beta_2 Z_p + \beta_3 X_p Z_p)$
The expression inside $g^{-1}(\cdot)$ is the linear combination of predictors and their effects.
For the normal distribution, the inverse link is the identity function: $g^{-1}(\cdot) = I(\cdot) = 1 \times (\text{linear combination of predictors})$
The identity function does not alter the predicted values – they can be any real number
This matches the sample space of the normal distribution – the mean can be any real number
$E(Y_p) = \hat{Y}_p = \mu_Y = I(\beta_0 + \beta_1 X_p + \beta_2 Z_p + \beta_3 X_p Z_p) = \beta_0 + \beta_1 X_p + \beta_2 Z_p + \beta_3 X_p Z_p$
The other parameter of the normal distribution describes the variance of the outcome – called the error (residual) variance
We found that the model for the variance for the GLM was:
$Var(Y_p) = Var(\beta_0 + \beta_1 X_p + \beta_2 Z_p + \beta_3 X_p Z_p + e_p) = Var(e_p) = \sigma^2_e$
Using the ESRM64503 package (which provides the dataLogit example data):

library(ESRM64503)
library(tidyverse)
library(kableExtra)
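# Descriptive statistics for continuous GPA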
Desp_GPA <- dataLogit |>
summarise(
Variable = "GPA",
N = n(),
Mean = mean(GPA),
`Std Dev` = sd(GPA),
Minimum = min(GPA),
Maximum = max(GPA)
)
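# Frequency table for APPLY (3-category application intention: 0, 1, 2)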
Desp_Apply <- dataLogit |>
group_by(APPLY) |>
summarise(
Frequency = n(),
Percent = n() / nrow(dataLogit) * 100
) |>
ungroup() |>
mutate(
`Cumulative Frequency` = cumsum(Frequency),
`Cumulative Percent` = cumsum(Percent)
)
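# Frequency table for LLAPPLY (binary: likely to apply, 0/1)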
Desp_LLApply <- dataLogit |>
group_by(LLAPPLY) |>
summarise(
Frequency = n(),
Percent = n() / nrow(dataLogit) * 100
) |>
ungroup() |>
mutate(
`Cumulative Frequency` = cumsum(Frequency),
`Cumulative Percent` = cumsum(Percent)
)
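# Frequency table for PARED (parent has a graduate degree, 0/1)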
Desp_PARED <- dataLogit |>
group_by(PARED) |>
summarise(
Frequency = n(),
Percent = n() / nrow(dataLogit) * 100
) |>
ungroup() |>
mutate(
`Cumulative Frequency` = cumsum(Frequency),
`Cumulative Percent` = cumsum(Percent)
)
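# Frequency table for PUBLIC (student attends a public university, 0/1)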
Desp_PUBLIC <- dataLogit |>
group_by(PUBLIC) |>
summarise(
Frequency = n(),
Percent = n() / nrow(dataLogit) * 100
) |>
ungroup() |>
mutate(
`Cumulative Frequency` = cumsum(Frequency),
`Cumulative Percent` = cumsum(Percent)
)
APPLY | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 220 | 55 | 220 | 55 |
1 | 140 | 35 | 360 | 90 |
2 | 40 | 10 | 400 | 100 |
LLAPPLY | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 220 | 55 | 220 | 55 |
1 | 180 | 45 | 400 | 100 |
Variable | N | Mean | Std Dev | Minimum | Maximum |
---|---|---|---|---|---|
GPA | 400 | 2.998925 | 0.3979409 | 1.9 | 4 |
PARED | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 337 | 84.25 | 337 | 84.25 |
1 | 63 | 15.75 | 400 | 100.00 |
PUBLIC | Frequency | Percent | Cumulative Frequency | Cumulative Percent |
---|---|---|---|---|
0 | 343 | 85.75 | 343 | 85.75 |
1 | 57 | 14.25 | 400 | 100.00 |
But if $Y_p$ is binary and the link function is the identity link, then $e_p$ can only be two things:
$e_p = Y_p - \hat{Y}_p$
If $Y_p = 0$, then $e_p = (0 - \text{predicted probability})$
If $Y_p = 1$, then $e_p = (1 - \text{predicted probability})$
The mean of errors would still be 0 … by definition
But variance of errors can’t possibly be constant over levels of X like we assume in general linear models
The mean and variance of a binary outcome are dependent!
As shown shortly, mean $= p$ and variance $= p(1 - p)$, so they are tied together (see the sketch just below)
This means that because the conditional mean of $Y$ ($p$, the predicted probability that $Y = 1$) depends on $X$, the error variance depends on $X$ as well
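A minimal sketch of this dependence: the Bernoulli variance $p(1-p)$ changes with the mean $p$, peaking at $p = .5$ and shrinking toward the bounds.

p <- seq(0.1, 0.9, by = 0.1)
data.frame(p, variance = p * (1 - p))  # variance is a function of the mean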
How can we have a linear relationship between X & Y?
Probability of a 1 is bounded between 0 and 1, but predicted probabilities from a linear model aren’t bounded
Linear relationship needs to ‘shut off’ somehow → made nonlinear
library(ggplot2)
# Linear fit: the straight line escapes the [0, 1] probability bounds
ggplot(dataLogit) +
  aes(x = GPA, y = LLAPPLY) +
  geom_hline(yintercept = 1) +
  geom_hline(yintercept = 0) +
  geom_point(color = "tomato", alpha = .8) +
  geom_smooth(method = "lm", se = FALSE, fullrange = TRUE, linewidth = 1.3) +
  scale_x_continuous(limits = c(0, 8), breaks = 0:8) +
  scale_y_continuous(limits = c(-0.4, 1.4), breaks = seq(-0.4, 1.4, .1)) +
  labs(y = "Predicted Probability of Y") +
  theme_classic() +
  theme(text = element_text(size = 20))
# Logistic fit: predicted probabilities stay within (0, 1)
ggplot(dataLogit) +
  aes(x = GPA, y = LLAPPLY) +
  geom_hline(yintercept = 1) +
  geom_hline(yintercept = 0) +
  geom_point(color = "tomato", alpha = .8) +
  geom_smooth(method = "glm", se = FALSE, fullrange = TRUE,
              method.args = list(family = binomial(link = "logit")), color = "yellowgreen", linewidth = 1.5) +
  scale_x_continuous(limits = c(0, 8), breaks = 0:8) +
  scale_y_continuous(limits = c(-0.4, 1.4), breaks = seq(-0.4, 1.4, .1)) +
  labs(y = "Predicted Probability of Y") +
  theme_classic() +
  theme(text = element_text(size = 20))
Assumption violation problem: if we use the GLM (for a continuous, conditionally normal outcome) with a binary outcome, the residuals cannot be normally distributed
Restricted range problem (e.g., 0 to 1 for binary outcomes): predictors should not be linearly related to the observed outcome, because predicted values from a linear model are unbounded
Decision making problem: for the GLM, the predicted value is a continuous value on the same scale as the outcome (Y); how do we answer a question such as whether or not a student will apply, given a certain value of GPA?
The Bernoulli distribution has the following properties:
Notation: $Y_p \sim B(p_p)$ (where $p_p$ is the conditional probability of a 1 for person $p$)
Sample space: $Y_p \in \{0, 1\}$ ($Y_p$ can either be a 0 or a 1)
Probability density function (PDF): $f(Y_p) = p_p^{Y_p}(1 - p_p)^{1 - Y_p}$
Expected value (mean) of Y: $E(Y_p) = \mu_{Y_p} = p_p$
Variance of Y: $Var(Y_p) = \sigma^2_{Y_p} = p_p(1 - p_p)$
Note
$p_p$ is the only parameter – so we only need to provide a link function for it …
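These properties are easy to verify by simulation (a minimal sketch; the value p = .7 is arbitrary):

set.seed(64503)
y <- rbinom(1e5, size = 1, prob = 0.7)
mean(y)  # close to p = .7
var(y)   # close to p * (1 - p) = .21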
Rather than modeling the probability of a 1 directly, we need to transform it into a more continuous variable with a link function, for example:
We could transform the probability into odds (labeled the odds ratio, OR, in this lecture)
Odds ratio (OR): $\frac{p}{1-p} = \frac{\Pr(Y=1)}{\Pr(Y=0)}$
For example, if $p = .7$, then the odds of a 1 are $.7/.3 = 2.33$ and the odds of a 0 are $.3/.7 = 0.429$
The odds scale is heavily skewed, asymmetric, and ranges from 0 to $+\infty$
Take the natural log of the odds → called the “logit” link
$\text{logit}(p) = \log\left(\frac{p}{1-p}\right)$
For example, $\text{logit}(.7) = .847$ and $\text{logit}(.3) = -.847$
The logit scale is symmetric around $p = .5$, where $\text{logit}(.5) = 0$
The logit link is one of many used for the Bernoulli distribution
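Both transformations are easy to check numerically (a minimal sketch of the example values above):

p <- 0.7
p / (1 - p)       # odds = 2.333
log(p / (1 - p))  # logit = 0.847
log(0.3 / 0.7)    # logit of the complement = -0.847 (symmetric around p = .5)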
The link function for the logit is defined as:
$$g(E(Y)) = \log\left(\frac{P(Y=1)}{1 - P(Y=1)}\right) = \boldsymbol{\beta}\mathbf{X}^T \tag{1}$$
where $g$ is called the link function, $E(Y)$ is the expectation of $Y$, and $\boldsymbol{\beta}\mathbf{X}^T$ is the linear predictor.
A logit can be translated back to a probability with some algebra:
$$P(Y=1) = \frac{\exp(\boldsymbol{\beta}\mathbf{X}^T)}{1 + \exp(\boldsymbol{\beta}\mathbf{X}^T)} = \frac{\exp(\beta_0 + \beta_1 X + \beta_2 Z + \beta_3 XZ)}{1 + \exp(\beta_0 + \beta_1 X + \beta_2 Z + \beta_3 XZ)} = \left(1 + \exp\left(-1 \times (\beta_0 + \beta_1 X + \beta_2 Z + \beta_3 XZ)\right)\right)^{-1} \tag{2}$$
From Equation 1 and Equation 2, we can see that $g(E(Y))$ ranges over $(-\infty, +\infty)$, while $P(Y = 1)$ is bounded within $(0, 1)$.
# function to translate OR to Probability
OR_to_Prob <- function(OR){
p = OR / (1+OR)
return(p)
}
# function to translate Logit to Probability
Logit_to_Prob <- function(Logit){
OR <- exp(Logit)
p = OR_to_Prob(OR)
return(p)
}
Logit_to_Prob(Logit = -2.007) # p = .118
[1] 0.1184699
exp(-2.007) # odds = .134
[1] 0.1343912
Intercept $\beta_0$:
Logit: the predicted logit of $Y = 1$ for an individual when all predictors are zero; i.e., the average logit is -2.007
Probability: alternatively, the expected probability of $Y = 1$ is $\frac{\exp(\beta_0)}{1 + \exp(\beta_0)}$ when all predictors are zero; i.e., the average probability of applying to grad school is 0.1184699
Odds ratio: alternatively, the expected odds (ratio) of $Y = 1$ are $\exp(\beta_0)$ when all predictors are zero; the average odds of applying to grad school are $\exp(-2.007) = 0.13439$
Slope $\beta_1$:
Logit: the predicted increase in the logit of $Y = 1$ for a one-unit increase in $X$
Probability: the expected increase in the probability of $Y = 1$ is $\frac{\exp(\beta_0 + \beta_1)}{1 + \exp(\beta_0 + \beta_1)} - \frac{\exp(\beta_0)}{1 + \exp(\beta_0)}$ for a one-unit increase in $X$; note that this increase ($\Delta(\beta_0, \beta_1)$) is non-linear and changes with the value of $X$
Odds ratio: the expected odds of $Y = 1$ become $\exp(\beta_1)$ times larger for a one-unit increase in $X$ (hint: the new odds are $\exp(\beta_0 + \beta_1) = \exp(\beta_0)\exp(\beta_1)$ when $X = 1$ and the old odds are $\exp(\beta_0)$ when $X = 0$)
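A minimal sketch of all three interpretations in R; the intercept -2.007 comes from the example above, while the slope value 0.5 is hypothetical and chosen purely for illustration:

beta_0 <- -2.007
beta_1 <- 0.5                             # hypothetical slope, for illustration
beta_0 + beta_1                           # logit: additive change per unit of X
exp(beta_0) * exp(beta_1)                 # odds: multiplied by exp(beta_1)
plogis(beta_0 + beta_1) - plogis(beta_0)  # probability: non-linear change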
Model 0: The empty model for logistic regression with binary variable (applying for grad school) as the outcome
Model 1: The logistic regression model including centered GPA and two binary predictors (parent has a graduate degree; student attends a public university)
The statistical form of empty model:
$$P(Y_p = 1) = \frac{\exp(\beta_0)}{1 + \exp(\beta_0)}$$
or
$$\text{logit}(P(Y_p = 1)) = \beta_0$$
Takehome Note
Many generalized linear models don’t list an error term in their statistical form. This is because the error has a fixed mean and a fixed variance.
For the logit link, $e_p$ has a logistic distribution with a mean of zero and a variance of $\pi^2/3 \approx 3.29$.
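A quick numerical check of that constant (a minimal sketch using base R's rlogis(), which draws from the standard logistic distribution):

pi^2 / 3              # 3.289868
set.seed(64503)
var(rlogis(1e6))      # simulated logistic variance, close to 3.29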
Using the ordinal package and its clm() function, we can model categorical dependent variables:

library(ordinal)
# response variable must be a factor
dataLogit$LLAPPLY = factor(dataLogit$LLAPPLY, levels = 0:1)
# Empty model: likely to apply
model0 = clm(LLAPPLY ~ 1, data = dataLogit, control = clm.control(trace = 1))

The formula and data arguments are identical to lm(); the control = argument is used here only to show the iteration history of the ML algorithm.
iter: step factor: Value: max|grad|: Parameters:
0: 1.000000e+00: 277.259: 2.000e+01: 0
nll reduction: 2.00332e+00
1: 1.000000e+00: 275.256: 6.640e-02: 0.2
nll reduction: 2.22672e-05
2: 1.000000e+00: 275.256: 2.222e-06: 0.2007
nll reduction: 3.97904e-13
3: 1.000000e+00: 275.256: 2.307e-14: 0.2007
Optimizer converged! Absolute and relative convergence criteria were met
formula: LLAPPLY ~ 1
data: dataLogit
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 400 -275.26 552.51 3(0) 2.31e-14 1.0e+00
Threshold coefficients:
Estimate Std. Error z value
0|1 0.2007 0.1005 1.997
The clm() function outputs a threshold parameter (labeled $\tau_0$) rather than an intercept ($\beta_0$).
The relationship between $\beta_0$ and $\tau_0$ is $\beta_0 = -\tau_0$
Thus, the estimated $\beta_0$ is -0.2007 with SE = 0.1005 for Model 0
The predicted logit is -0.2007; the predicted probability of applying is .45, matching the sample proportion of LLAPPLY = 1
The log-likelihood is -275.26; AIC is 552.51
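The predicted probability can be confirmed with base R's plogis(), the inverse logit:

plogis(-0.2007)  # 0.45, the sample proportion of LLAPPLY = 1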
$$P(Y_p = 1) = \frac{\exp(\beta_0 + \beta_1\text{PARED}_p + \beta_2(\text{GPA}_p - 3) + \beta_3\text{PUBLIC}_p)}{1 + \exp(\beta_0 + \beta_1\text{PARED}_p + \beta_2(\text{GPA}_p - 3) + \beta_3\text{PUBLIC}_p)}$$
or
$$\text{logit}(P(Y_p = 1)) = \beta_0 + \beta_1\text{PARED}_p + \beta_2(\text{GPA}_p - 3) + \beta_3\text{PUBLIC}_p$$
# center GPA at 3 so the intercept refers to a student with a 3.0 GPA
dataLogit$GPA3 <- dataLogit$GPA - 3
model1 = clm(LLAPPLY ~ PARED + GPA3 + PUBLIC, data = dataLogit)
summary(model1)
formula: LLAPPLY ~ PARED + GPA3 + PUBLIC
data: dataLogit
link threshold nobs logLik AIC niter max.grad cond.H
logit flexible 400 -264.96 537.92 3(0) 3.71e-07 1.0e+01
Coefficients:
Estimate Std. Error z value Pr(>|z|)
PARED 1.0596 0.2974 3.563 0.000367 ***
GPA3 0.5482 0.2724 2.012 0.044178 *
PUBLIC -0.2006 0.3053 -0.657 0.511283
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Threshold coefficients:
Estimate Std. Error z value
0|1 0.3382 0.1187 2.849
LL(Model 1) = -264.96 (vs. LL(Model 0) = -275.26)
$\beta_0 = -\tau_0 = -0.3382$ (SE = 0.1187)
$\beta_1$ (SE) = 1.0596 (0.2974) with $p < .05$
$\beta_2$ (SE) = 0.5482 (0.2724) with $p < .05$
$\beta_3$ (SE) = -0.2006 (0.3053) with $p = .511$
Question #1: does Model 1 fit better than the empty model (Model 0)?
This question is equivalent to testing the following hypotheses:
$H_0: \beta_1 = \beta_2 = \beta_3 = 0$; $H_1$: at least one is not equal to 0
We can use the likelihood ratio test:
$-2\Delta = -2 \times (-275.26 - (-264.96)) = 20.586$
DF = 4 (# of params of Model 1) - 1 (# of params of Model 0) = 3
p-value: p = .0001283
anova(model0, model1)

Likelihood ratio tests of cumulative link models:
formula: link: threshold:
model0 LLAPPLY ~ 1 logit flexible
model1 LLAPPLY ~ PARED + GPA3 + PUBLIC logit flexible
no.par AIC logLik LR.stat df Pr(>Chisq)
model0 1 552.51 -275.26
model1 4 537.92 -264.96 20.586 3 0.0001283 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Or we can use chi-square distribution to calculate p-value
as.numeric(pchisq(-2 * (logLik(model0)-logLik(model1)), 3, lower.tail = FALSE))
[1] 0.0001282981
Conclusion: reject $H_0$; Model 1 is preferred over the empty model
Intercept $\beta_0 = -0.3382$ (SE = 0.1187):
Logit: the predicted logit of the probability of applying to grad school is -0.3382 for a person with a 3.0 GPA, parents without a graduate degree, and at a private university
Odds ratio: the predicted odds of applying to grad school are $\exp(-0.3382) = 0.7130$ for the same person (OR < 1: the probability of applying is less than the probability of not applying)
Probability: the predicted probability of applying to grad school is 41.6% for the same person
Slope of parents having a graduate degree: $\beta_1$ (SE) = 1.0596 (0.2974) with $p < .05$
Logit: the predicted logit of applying to grad school increases by 1.0596 for students whose parents have a graduate degree, controlling for the other predictors.
Odds ratio: the predicted odds increase from 0.7130 to 2.06 for students whose parents have a graduate degree, controlling for the other predictors – they have $\exp(1.0596) \approx 2.9$ times the odds of rating the item with a “likely to apply”
Probability: compared to those without a parental graduate degree, the predicted probability of “likely to apply” for students with a parental graduate degree increases from .416 to .673:
$$\frac{\exp(\beta_0 + \beta_1)}{1 + \exp(\beta_0 + \beta_1)} = .673$$
$$\frac{\exp(\beta_0)}{1 + \exp(\beta_0)} = .416$$
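These two probabilities can be reproduced with plogis():

plogis(-0.3382 + 1.0596)  # .673, parents with a graduate degree
plogis(-0.3382)           # .416, parents without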
Slope of students in public vs. private universities: $\beta_3$ (SE) = -0.2006 (0.3053) with $p = .511$
## Interpret output
OR_p <- function(logit_old, logit_new){
data.frame(
OR_old = exp(logit_old),
OR_new = exp(logit_new),
p_old = exp(logit_old) / (1 + exp(logit_old)),
p_new = exp(logit_new) / (1 + exp(logit_new))
)
}
beta_0 <- -0.3382
beta_3 <- -0.2006
Result <- OR_p(logit_old = beta_0, logit_new = (beta_0 + beta_3))
Result |> show_table()
OR_old | OR_new | p_old | p_new |
---|---|---|---|
0.7130527 | 0.583448 | 0.4162468 | 0.3684668 |
exp(beta_3)                  # odds multiplier for attending a public university
[1] 0.8182397
Result$p_new - Result$p_old  # change in predicted probability
[1] -0.04778001
Interpretation
Students at public universities have 0.82 times the odds of applying to grad school, compared to otherwise-identical students at private universities
The predicted probability of applying to grad school decreases by 4.78 percentage points (from 41.6% to 36.8%)
This change in the odds and probability of applying to grad school is not statistically significant ($p = .511$)
Slope of GPA3: $\beta_2$ (SE) = 0.5482 (0.2724) with $p < .05$
Interpretation
# predicted probabilities across GPA (0.0 to 4.0) for PARED = 0 and PUBLIC = 0
new_data <- data.frame(
  GPA3 = seq(-3, 1, .1),
  PARED = 0,
  PUBLIC = 0
)
# predict.clm() returns a matrix of predicted probabilities, one column per response category
Pred_prob <- predict(model1, newdata = new_data)$fit
as.data.frame(cbind(GPA = new_data$GPA3+3, P_Y_0 = Pred_prob[,1], P_Y_1 = Pred_prob[,2])) |>
pivot_longer(starts_with("P_Y")) |>
ggplot() +
aes(x = GPA, y = value) +
geom_path(aes(group = name, color = name), linewidth = 1.3) +
labs(y = "Predicted Probability") +
scale_x_continuous(breaks = seq(0, 4, .2)) +
scale_color_discrete(labels = c("P(Y = 0)", "P(Y = 1)"), name = "") +
theme_classic()
Takehome Note
Generalized linear models are models for outcomes with distributions that are not necessarily normal
The estimation process is largely the same: maximum likelihood is still the gold standard as it provides estimates with understandable properties
Learning about each type of distribution and link takes time:
Logistic regression is one of the more frequently used generalized models – binary outcomes are common