Experimental Design in Education
Educational Statistics and Research Methods (ESRM) Program
University of Arkansas
2025-04-09
ANOVA is Analysis of Variance
ANCOVA is Analysis of Covariance
What we will discuss today is a statistical control for reducing the variance due to error.
Statistical control is used when we know a subject’s score on an additional variable.
ANCOVA, by definition, is a general linear model that includes both a categorical independent variable (the factor) and a continuous covariate.
We are interested in comparing methods of instruction on students’ math problem-solving skills, as measured by a test score.
One-way, independent ANOVA: If we get a significant F statistic, we conclude that the methods of instruction differed in the mean number of math problems answered correctly.
But, performance on math word problems may be affected by things other than method of instruction and math ability.
The DV score results from the instructional method plus these other factors (e.g., motivation, verbal proficiency).
We really want to ask:
To what extent might we have obtained method of instruction difference on math scores had the groups been equivalent in their motivation levels? (or verbal proficiency?)
The inclusion of the covariate adjusts the groups’ means, as if they are the same on the covariate.
The analyses addressed different research questions!
If there is no random assignment to treatments, we should be cautious about using ANCOVA!
Each student had also completed a math pre-test (as an indicator of students’ ability), so the principal decided to include the pre-test scores in the analysis as the covariate.
In this example, we will simulate data for 200 students who were randomly assigned to one of three teaching methods.
The pre-test scores will be used as a covariate in the ANCOVA analysis.
The post-test scores will be generated based on the pre-test scores and the teaching method.
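The three simulation steps above can be sketched in Python (a minimal illustration; the means, SDs, and method effects below are assumed values, not the document’s actual simulation parameters):

```python
# Minimal sketch of the simulation described above.
# All numeric parameters here are assumptions for illustration.
import random

random.seed(123)

methods = ["lecture", "hands-on", "self-paced"]
method_effect = {"lecture": 0.0, "hands-on": 4.0, "self-paced": 2.5}  # assumed effects

students = []
for student_id in range(1, 201):  # 200 students
    method = random.choice(methods)    # random assignment to a teaching method
    pretest = random.gauss(65, 5)      # pre-test: indicator of math ability
    # post-test depends on the pre-test (covariate) plus the method effect
    posttest = 20 + 0.7 * pretest + method_effect[method] + random.gauss(0, 5)
    students.append({"student_id": student_id, "method": method,
                     "pretest": round(pretest, 1), "posttest": round(posttest, 1)})
```

In a real analysis the `students` records would then be assembled into a data frame like the one shown below.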
student_id | method | pretest | posttest |
---|---|---|---|
1 | lecture | 68.9 | 70.9 |
2 | hands-on | 66.1 | 62.8 |
3 | lecture | 61.6 | 53.5 |
4 | self-paced | 72.5 | 85.8 |
5 | hands-on | 64.9 | 73.1 |
Assumption Checks in ANCOVA:
➤ The three usual ANOVA assumptions apply: independence of observations, normality, and homogeneity of variance.
➤ Three additional data considerations: a linear relationship between the covariate and the DV, homogeneity of regression slopes across groups, and a reliably measured covariate.
Looking at the plot, we can see a generally linear relationship (i.e., a straight line) between the covariate (pretest) and the DV (mathtest).
# Plot
ggplot(df, aes(x = pretest, y = posttest)) +
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", color = "steelblue", se = FALSE) +
theme_minimal(base_size = 14) +
labs(x = "pretest", y = "mathtest") +
theme(legend.position = "bottom") +
guides(color = guide_legend(title = NULL))
import matplotlib.pyplot as plt
import numpy as np
# Generate x values (covariate)
x = np.linspace(0, 10, 100)
# Homogeneous regression slopes
y1_homo = 1.0 * x + 2
y2_homo = 1.0 * x + 4
y3_homo = 1.0 * x + 6
# Heterogeneous regression slopes
y1_hetero = -0.1 * x + 6
y2_hetero = 1.2 * x + 1
y3_hetero = 0.9 * x + 3
# Create the figure with two subplots
fig, axs = plt.subplots(1, 2, figsize=(12, 5), sharey=True)
# Plot for homogeneous regression slopes
axs[0].plot(x, y1_homo, label="Group 1")
axs[0].plot(x, y2_homo, label="Group 2")
axs[0].plot(x, y3_homo, label="Group 3")
axs[0].set_title("(a) Homogeneity of regression (slopes)")
axs[0].set_xlabel("Covariate (X)")
axs[0].set_ylabel("DV (Y)")
axs[0].legend()
# Plot for heterogeneous regression slopes
axs[1].plot(x, y1_hetero, label="Group 1")
axs[1].plot(x, y2_hetero, label="Group 2")
axs[1].plot(x, y3_hetero, label="Group 3")
axs[1].set_title("(b) Heterogeneity of regression (slopes)")
axs[1].set_xlabel("Covariate (X)")
axs[1].legend()
plt.tight_layout()
plt.show()
➤ This is equivalent to saying that the relationship between the DV and covariate has to be the same for each cell (a.k.a. “group”)
➤ The consequences of violating this assumption depend on whether the cells have equal sample sizes and whether the study is a true experiment
When we “eyeball” the three regression slopes (regression of the covariate predicting the DV), we see the relationship is approximately equal.
# Plot
ggplot(df, aes(x = pretest, y = posttest, color = method)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE) +
scale_color_manual(values = c("steelblue", "tomato", "seagreen4")) +
theme_minimal(base_size = 14) +
labs(x = "pretest", y = "mathtest") +
theme(legend.position = "bottom") +
guides(color = guide_legend(title = NULL))
➤ The formal/general method of checking homogeneity of regression slope:
\(SS_{total} = SS_{IV} + SS_{COV} + \color{red}{SS_{IV*COV}} + SS_{within}\)
\(SS_{total} = SS_{IV} + SS_{COV} + SS_{within}\)
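Dropping the interaction term folds \(SS_{IV*COV}\) into \(SS_{within}\), so the interaction SS equals the difference between the two models’ within-cell SS. A quick numeric sketch (values rounded from the ANOVA tables elsewhere in this document):

```python
# The interaction SS equals the increase in SS_within when the
# IV*COV term is dropped (values rounded from the document's tables).
ss_within_full = 4879.0     # SS_within with the IV*COV interaction in the model
ss_within_reduced = 4945.0  # SS_within without the interaction
ss_interaction = ss_within_reduced - ss_within_full
print(ss_interaction)  # → 66.0, close to the tabled 66.190 (rounding)
```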
Homogeneity of Regression Slope:
Run the model with the interaction term included to make sure it is negligible (a.k.a., not significant) → this is not the ANCOVA yet!
library(gt)
library(tidyverse)
res <- anova(lm(posttest~pretest*method, data=df))
res_tbl <- res |>
as.data.frame() |>
rownames_to_column("Coefficient")
res_gt_display <- gt(res_tbl) |>
fmt_number(
columns = `Sum Sq`:`Pr(>F)`,
suffixing = TRUE,
decimals = 3
)
res_gt_display|>
tab_style( # highlight the interaction row
style = list(
cell_fill(color = "royalblue"),
cell_text(color = "red", weight = "bold")
),
locations = cells_body(
columns = colnames(res_tbl),
rows = Coefficient == "pretest:method")
)
Coefficient | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
pretest | 1 | 3.774K | 3.774K | 150.059 | 0.000 |
method | 2 | 726.111 | 363.055 | 14.436 | 0.000 |
pretest:method | 2 | 66.190 | 33.095 | 1.316 | 0.271 |
Residuals | 194 | 4.879K | 25.148 | NA | NA |
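As a sanity check, the interaction F in the table can be reproduced from its sums of squares and degrees of freedom:

```python
# Reproducing the interaction F test from the table's SS and df:
ss_interaction, df_interaction = 66.190, 2
ss_residual, df_residual = 4879.0, 194   # "4.879K" in the table

ms_interaction = ss_interaction / df_interaction  # mean square = SS / df
ms_residual = ss_residual / df_residual
f_value = ms_interaction / ms_residual
print(round(f_value, 3))  # → 1.316, matching the table (p = 0.271, n.s.)
```

Because the interaction is not significant, the homogeneity-of-slopes assumption is tenable and we can drop the interaction term and fit the actual ANCOVA.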
res <- anova(lm(posttest~pretest+method, data=df))
res_tbl <- res |>
as.data.frame() |>
rownames_to_column("Coefficient")
res_gt_display <- gt(res_tbl) |>
fmt_number(
columns = `Sum Sq`:`Pr(>F)`,
suffixing = TRUE,
decimals = 3
)
res_gt_display|>
tab_style( # highlight the method row
style = list(
cell_fill(color = "royalblue"),
cell_text(color = "red", weight = "bold")
),
locations = cells_body(
columns = colnames(res_tbl),
rows = Coefficient == "method")
)
Coefficient | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
---|---|---|---|---|---|
pretest | 1 | 3.774K | 3.774K | 149.576 | 0.000 |
method | 2 | 726.111 | 363.055 | 14.390 | 0.000 |
Residuals | 196 | 4.945K | 25.230 | NA | NA |
➤ Reliability of Covariates
➤ The reliability of covariate scores is crucial with ANCOVA
True Experimental Design | Quasi-Experimental Design |
---|---|
- Relationship between covariate and DV underestimated, resulting in less adjustment than is necessary | - Relationship between covariate and DV underestimated, resulting in less adjustment than is necessary |
- Less powerful F test | - Group effects (IV) may be seriously biased |
[Example] Step #1
\[ \bar{Y}_{\text{adjusted}} = \bar{Y}_{\text{original}} - b (\bar{X}_{\text{cell}} - \bar{X}_{\text{grand}}) \]
b is the pooled slope from the regression of the DV on the covariate
X is the covariate (cell mean and grand mean)
Y is the dependent variable (adjusted and unadjusted cell means)
If b is zero (relationship is zero) then there is no adjustment.
The bigger b is (the stronger the covariate/DV relationship), the more of an adjustment.
➔ The further a cell mean is from the covariate grand mean (the bigger the deviation), the more the cell mean is adjusted.
Based on the ANCOVA adjusted means formula:
\[ \bar{Y}_{\text{adjusted}} = \bar{Y}_{\text{original}} - b (\bar{X}_{\text{cell}} - \bar{X}_{\text{grand}}) \] We can compute the adjusted means for each group using the following steps:
# Unadjusted means of posttest by group
unadjusted_means <- df %>%
group_by(method) %>%
summarise(posttest_mean = mean(posttest))
# Pretest means by group
pretest_means <- df %>%
group_by(method) %>%
summarise(pretest_mean = mean(pretest))
# Grand mean of pretest
grand_pretest_mean <- mean(df$pretest)
# Fit a linear model to get the regression slope
# (note: lm(posttest ~ pretest) gives the overall slope; the ANCOVA pooled
# within-group slope comes from lm(posttest ~ pretest + method))
model <- lm(posttest ~ pretest, data = df)
# Extract slope
pooled_slope <- coef(model)["pretest"]
# Combine into one table
results <- left_join(unadjusted_means, pretest_means, by = "method")
# Calculate adjusted means using the ANCOVA adjustment formula
results2 <- results |>
mutate(grand_pretest_mean = grand_pretest_mean) |>
mutate(pooled_slope = pooled_slope) |>
mutate(adjusted_mean = posttest_mean - pooled_slope * (pretest_mean - grand_pretest_mean))
# View results
gt(results2) |>
fmt_number(
columns = posttest_mean:adjusted_mean
)
method | posttest_mean | pretest_mean | grand_pretest_mean | pooled_slope | adjusted_mean |
---|---|---|---|---|---|
hands-on | 70.41 | 64.90 | 65.11 | 0.74 | 70.56 |
lecture | 65.96 | 65.25 | 65.11 | 0.74 | 65.86 |
self-paced | 69.10 | 65.20 | 65.11 | 0.74 | 69.03 |
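The adjusted means in the table can be verified by hand with the adjustment formula, using the rounded table values:

```python
# Verifying the adjusted means with Y_adj = Y - b * (X_cell - X_grand),
# using the rounded values from the table above.
pooled_slope = 0.74
grand_pretest = 65.11

rows = {  # method: (posttest_mean, pretest_mean)
    "hands-on":   (70.41, 64.90),
    "lecture":    (65.96, 65.25),
    "self-paced": (69.10, 65.20),
}
adjusted = {m: y - pooled_slope * (x - grand_pretest) for m, (y, x) in rows.items()}
for m, adj in sorted(adjusted.items()):
    print(m, round(adj, 2))
```

The hands-on value lands at 70.57 here because the inputs are rounded; the table’s 70.56 comes from the unrounded data.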
Based on the adjusted means, pretest, and posttest means, we can visualize the results using a bar plot.
used_colors <- c("steelblue", "tomato", "seagreen4")
used_group_labels <- c("Pretest", "Posttest", "Adjusted")
results2 |>
select(method, pretest_mean, posttest_mean, adjusted_mean) |>
pivot_longer(ends_with("_mean"), names_to = "type", values_to = "Mean") |>
mutate(type = factor(type, levels = paste0(c("pretest", "posttest", "adjusted"), "_mean"))) |>
ggplot(aes(y = method, x = Mean)) +
geom_col(aes(y = method, x = Mean, fill = type), position = position_dodge()) +
geom_text(aes(x = Mean + 5, label = round(Mean, 2), color = type), position = position_dodge(width = .85)) +
scale_color_manual(values = used_colors, labels = used_group_labels) +
scale_fill_manual(values = used_colors, labels = used_group_labels) +
labs(y = "", title = "Comparing Adjusted and Unadjusted Means") +
theme_minimal() +
theme(legend.position = "bottom")
➤ Categorical variable - ✓ Contain a finite number of categories or distinct groups. - ✓ Might not have a logical order. - ✓ Examples: gender, material type, and payment method.
➤ Discrete variable - ✓ Numeric variables that have a countable number of values between any two values. - ✓ Examples: number of customer complaints, number of items correct on an assessment, attempts on GRE. - ✓ It is common practice to treat discrete variables as continuous, as long as there are a large number of levels (e.g., 1–100 not 1–4).
➤ Continuous variable - ✓ Numeric variables that have an infinite number of values between any two values. - ✓ Examples: length, weight, time to complete an exam.
➔ We often assume the DV for ANCOVA is continuous, but we can sometimes “get away” with discrete, ordered outcomes if there are enough categories.
➤ Not related to this course, but categorical outcomes are commonly analyzed: - ✓ Examples: pass/fail a fitness test; pass/fail an academic test; retention (yes/no); on-time graduation (yes/no); proficiency (below, meeting, advanced), etc.
➔ These are not continuous, so we cannot use them as the DV in ANOVA/ANCOVA
➤ Instead: logistic regression (PROC LOGISTIC in SAS, or glm() with family = "binomial" in R) - ✓ Logistic regression can include both categorical and continuous IVs (and their interactions)
# Load libraries
library(ggplot2)
library(dplyr)
# Simulate data
set.seed(123)
n <- 100
weight <- rnorm(n, 140, 20)
prob_obese <- 1 / (1 + exp(-(0.1 * weight -15))) # logistic model
obese <- rbinom(n, size = 1, prob = prob_obese)
data <- data.frame(weight = weight, obese = obese)
# Linear model
lm_model <- lm(obese ~ weight, data = data)
# Logistic model
logit_model <- glm(obese ~ weight, data = data, family = "binomial")
# Prediction data
pred_data <- data.frame(weight = seq(min(weight), max(weight), length.out = 100))
pred_data$lm_pred <- predict(lm_model, newdata = pred_data)
pred_data$logit_pred <- predict(logit_model, newdata = pred_data, type = "response")
# Plot 1: Linear Regression
p1 <- ggplot(data, aes(x = weight, y = obese)) +
geom_point(color = "red", size = 2) +
geom_line(data = pred_data, aes(x = weight, y = lm_pred), color = "black") +
labs(title = "Linear Regression", x = "weight", y = "Obesity (0/1)") +
# ylim(0, 1.1) +
theme_minimal()
# Plot 2: Logistic Regression
p2 <- ggplot(data, aes(x = weight, y = obese)) +
geom_point(color = "red", size = 2) +
geom_line(data = pred_data, aes(x = weight, y = logit_pred), color = "black") +
labs(title = "Logistic Regression", x = "weight", y = "Predicted Probability") +
# ylim(0, 1.1) +
theme_minimal()
# Combine plots using patchwork
library(patchwork)
p1 + p2
In addition to the traditional degrees of freedom for an ANOVA, you now lose a degree of freedom for each covariate.
Degrees of Freedom → In our scenario, we have 1 IV with 3 groups and 1 covariate.
The \(df_{method}\) is the same as before: k - 1, where k represents the number of groups.
The \(df_{error}\) is different: \(df_{error} = N - k - c\), where \(c\) is the number of covariates (one additional degree of freedom is lost per covariate).
The \(df_{covariate}\) equals the number of covariates; here, 1.
If the principal in the scenario assigned a total of 200 students, the degrees of freedom for this analysis would be: \(df_{method} = 3 - 1 = 2\), \(df_{covariate} = 1\), and \(df_{error} = 200 - 3 - 1 = 196\).
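A quick check of these degrees-of-freedom calculations:

```python
# Degrees of freedom for the scenario: N = 200 students, k = 3 groups, 1 covariate.
N, k, n_covariates = 200, 3, 1

df_method = k - 1                 # 2
df_covariate = n_covariates       # 1
df_error = N - k - n_covariates   # 196: one df lost for the covariate
print(df_method, df_covariate, df_error)  # → 2 1 196
```

Note that 196 matches the Residuals df in the ANCOVA table above.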
Now we need to follow-up to see where the differences lie.
Planned and Pairwise comparisons
Post-hoc tests
 | True Experimental Design | Quasi-Experimental Design |
---|---|---|
Assignment to treatment | The researcher randomly assigns subjects to control and treatment groups. | Some other, non-random method is used to assign subjects to groups. |
Control over treatment | The researcher usually designs the treatment. | The researcher often does not have control over the treatment, but studies pre-existing groups. |
Use of control groups | Requires the use of control and treatment groups. | Control groups are not required (although they are commonly used). |
What is the risk/concern with quasi-experimental design?
➤ For example, the inclusion of the covariate adjusts the two groups’ means, as if they are the same on the covariate.
➤ Therefore, the two analyses are addressing two different research questions!
Biases the effect size of the IV | Values can’t be trusted |
---|---|
- Can remove real “effect variance” and attenuate the effect size | - Adjusted means are implausible values |
- If other variables involved, can make it look like there is an effect when there isn’t | - Interaction and slope values could just apply to the cells observed, not the population |
Use of covariates does not guarantee that groups will be “equivalent” —
even after using multiple covariates, there still may be some confounding variables operating that you are unaware of.
Best way to overcome differences between groups due to variables other than the IV
is to randomly assign subjects to groups.
Make sure that the covariate you are using is reliable!
ESRM 64103