Experimental Design in Education
Educational Statistics and Research Methods (ESRM) Program
University of Arkansas
2025-02-10
Family-Wise Error Rate
Durbin-Watson (DW) Test
Note
The Durbin-Watson test is primarily used for detecting autocorrelation in time-series data.
In the context of ANOVA with independent groups, residuals are generally assumed to be independent.
Important
lmtest::dwtest()
Performs the Durbin-Watson test for autocorrelation of disturbances.
The Durbin-Watson test has the null hypothesis that the autocorrelation of the disturbances is 0.
These results suggest there is no autocorrelation at the alpha level of .05.
Normality within Each Group
ANOVA assumes that the DV (Y) is normally distributed within each group
ANOVA is robust to minor violations of normality
shapiro.test()
set.seed(123) # For reproducibility
sample_normal_data <- rnorm(200, mean = 50, sd = 10) # Generate normal data
sample_nonnormal_data <- runif(200, min = 1, max = 10) # Generate non-normal data
shapiro.test(sample_normal_data)
Shapiro-Wilk normality test
data: sample_normal_data
W = 0.99076, p-value = 0.2298
shapiro.test(sample_nonnormal_data)

Shapiro-Wilk normality test
data: sample_nonnormal_data
W = 0.95114, p-value = 2.435e-06
# Perform Kolmogorov-Smirnov test against a normal distribution
ks.test(scale(sample_normal_data), "pnorm")
Asymptotic one-sample Kolmogorov-Smirnov test
data: scale(sample_normal_data)
D = 0.054249, p-value = 0.5983
alternative hypothesis: two-sided
ks.test(scale(sample_nonnormal_data), "pnorm")

Asymptotic one-sample Kolmogorov-Smirnov test

data: scale(sample_nonnormal_data)
D = 0.077198, p-value = 0.1843
alternative hypothesis: two-sided

Note that the KS test fails to flag the uniform sample here even though the Shapiro-Wilk test detects it easily; the Shapiro-Wilk test is generally more powerful for detecting non-normality at this sample size.
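Formal tests can also be supplemented with a visual check. A minimal sketch using base R's Q-Q plot functions on the two simulated samples (points should fall along the reference line if the data are normal):

# Visual normality check for both samples
qqnorm(sample_normal_data); qqline(sample_normal_data)
qqnorm(sample_nonnormal_data); qqline(sample_nonnormal_data)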
Practical Considerations
## Computes Levene's test for homogeneity of variance across groups.
car::leveneTest(outcome ~ group, data = data)
## Boxplots to visualize the variance by groups
boxplot(outcome ~ group, data = data)
## Brown-Forsythe test of equal means under unequal variances
## (the Brown-Forsythe variance test is Levene's test with center = median)
onewaytests::bf.test(outcome ~ group, data = data)
## Bartlett's Test
bartlett.test(outcome ~ group, data = data)
Decision Tree
library(ellmer)
chat = ellmer::chat_ollama(model = "llama3.2")
chat$chat("How large can be considered as large sample size in one-way ANOVA")
In one-way ANOVA (Analysis of Variance), the sample size is a critical factor
that affects the analysis's power and accuracy. While there isn't a universally
agreed-upon "large" sample size, here are some general guidelines:
1. **Minimum sample size**: The minimum sample size for one-way ANOVA can vary
depending on the effect size and level of significance desired. A common rule
of thumb is to have at least 10-15 observations per group (i.e., more than 5
groups).
2. **Moderate sample size**: For a moderate sample size, many researchers use
values around:
* 30-50 observations per group for small effect sizes (<0.1) and moderate
significance levels (e.g., α=0.05).
* 60-80 observations per group for medium effect sizes (≈ 0.1-0.3) and
modest significance levels.
* 100 observations per group for larger effect sizes (> 0.3) or more
stringent significance levels.
3. **Large sample size**: When there are:
* Very small cells (≈ 10 observations per group): often, the recommended
approach would be to conduct pairwise comparisons using post-hoc tests like
Bonferroni, Newman-Keuls, or Tukey's honestly significant difference (HSD)
test.
For larger samples (>=50 observations per group), you should consider
doing an ANOVA. To see if there is a "real" difference in mean values between 2
samples, use the F-statistic to check for significance:
4. **Effect size**: Keep in mind that no sample size is "large enough" for a
given effect size. The required sample size depends on the magnitude of the
expected effect.
Keep in mind that these are general guidelines and may not always hold true
depending on your specific research design, data distribution, and assumptions
(e.g., homogeneity of variance).
Always consult with a knowledgeable statistical consultant and consider various
sources if you're unsure about your study's sample size or need additional
guidance.
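Rather than relying on rules of thumb alone, the required sample size can be computed directly with a power analysis. A minimal sketch using the pwr package (an add-on package not used elsewhere in these notes; install with install.packages("pwr") if needed):

library(pwr)
# n per group for a one-way ANOVA with 4 groups, a medium effect size
# (Cohen's f = 0.25), alpha = .05, and 80% power
pwr.anova.test(k = 4, f = 0.25, sig.level = 0.05, power = 0.80)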
Robust to: minor violations of normality (especially with large samples) and moderate heterogeneity of variance (especially with equal or nearly equal group sizes).
Not robust to: violations of the independence of errors.
Robustness to assumption violations is something you should think about before and during data collection; it is not something you can fix after data collection has finished.
Robustness to Violations of Normality Assumption
ANOVA assumes that the residuals (errors) are normally distributed within each group.
However, ANOVA is generally robust to violations of normality, particularly when the sample size is large.
Theoretical Justification: This robustness is primarily due to the Central Limit Theorem (CLT), which states that, for sufficiently large sample sizes (typically \(n≥30\) per group), the sampling distribution of the mean approaches normality, even if the underlying population distribution is non-normal.
This means that, unless the data are heavily skewed or have extreme outliers, ANOVA results remain valid and Type I error rates are not severely inflated.
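This robustness is easy to verify by simulation. The sketch below is an illustration (not part of the original example): all groups are drawn from the same heavily skewed distribution, so the null hypothesis is true, and the empirical Type I error rate of the F-test should stay near the nominal .05.

# Simulate the Type I error rate of one-way ANOVA with skewed data
set.seed(123)
p_values <- replicate(2000, {
  d <- data.frame(
    g = rep(c("A", "B", "C"), each = 40),
    y = rexp(120, rate = 1) # exponential: heavily skewed, identical across groups
  )
  summary(aov(y ~ g, data = d))[[1]][["Pr(>F)"]][1]
})
mean(p_values < .05) # empirical Type I error rate; should be close to .05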
The homogeneity of variance (homoscedasticity) assumption states that all groups should have equal variances. ANOVA can tolerate moderate violations of this assumption, particularly when:
Sample sizes are equal (or nearly equal) across groups – When groups have equal sample sizes, the F-test remains robust to variance heterogeneity because the pooled variance estimate remains balanced.
The degree of variance heterogeneity is not extreme – If the largest group variance is no more than about four times the smallest variance, ANOVA results tend to remain accurate. A quick rule-of-thumb check is sketched below.
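A minimal sketch of this check, assuming a data frame data with columns outcome and group as in the earlier code snippets; if the ratio is large, Welch's ANOVA (which drops the equal-variance assumption) is a common fallback:

# Ratio of largest to smallest group variance (values below ~4 are usually tolerable)
group_vars <- tapply(data$outcome, data$group, var)
max(group_vars) / min(group_vars)
# Welch's ANOVA does not assume equal variances (the default in oneway.test)
oneway.test(outcome ~ group, data = data)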
The assumption of independence of errors means that observations within and between groups must be uncorrelated. Violations of this assumption severely compromise ANOVA’s validity because correlated errors bias the estimate of the error variance, typically downward, which inflates the F statistic and the Type I error rate.
Fisher’s LSD
Bonferroni Correction
Family-wise Error Rate (adjusted p-values)
Tukey’s HSD
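The first two approaches can be run directly with base R's pairwise.t.test(); a minimal sketch, again assuming a data frame data with columns outcome and group:

# Fisher's LSD: unadjusted pairwise t-tests (use only after a significant omnibus F)
pairwise.t.test(data$outcome, data$group, p.adjust.method = "none")
# Bonferroni correction: each p-value is multiplied by the number of comparisons
pairwise.t.test(data$outcome, data$group, p.adjust.method = "bonferroni")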
# Load tidyverse for data manipulation and plotting
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Set seed for reproducibility
set.seed(123)
# Generate synthetic data for 4 groups
data <- tibble(
group = rep(c("Control", "G1", "G2", "G3"), each = 30),
verbal_score = c(
rnorm(30, mean = 70, sd = 10), # Control group
rnorm(30, mean = 75, sd = 12), # G1
rnorm(30, mean = 80, sd = 10), # G2
rnorm(30, mean = 85, sd = 8) # G3
)
)
# View first few rows
head(data)
# A tibble: 6 × 2
group verbal_score
<chr> <dbl>
1 Control 64.4
2 Control 67.7
3 Control 85.6
4 Control 70.7
5 Control 71.3
6 Control 87.2
# Fit the ANOVA model
anova_model <- lm(verbal_score ~ group, data = data)
# Install lmtest package if not already installed
# install.packages("lmtest")
# Load the lmtest package
library(lmtest)
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
# Perform the Durbin-Watson test
dw_test_result <- dwtest(anova_model)
# View the test results
print(dw_test_result)
Durbin-Watson test
data: anova_model
DW = 2.0519, p-value = 0.5042
alternative hypothesis: true autocorrelation is greater than 0
The DW value is close to 2, and the p-value is greater than 0.05, indicating no significant autocorrelation in the residuals.

# Shapiro-Wilk normality test for each group
data %>%
group_by(group) %>%
summarise(
shapiro_p = shapiro.test(verbal_score)$p.value
)
# A tibble: 4 × 2
group shapiro_p
<chr> <dbl>
1 Control 0.797
2 G1 0.961
3 G2 0.848
4 G3 0.568
car::leveneTest(verbal_score ~ group, data = data)

Warning in leveneTest.default(y = y, group = group, ...): group coerced to
factor.
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 1.1718 0.3236
116
bartlett.test(verbal_score ~ group, data = data)

Bartlett test of homogeneity of variances
data: verbal_score by group
Bartlett's K-squared = 3.5385, df = 3, p-value = 0.3158
onewaytests::bf.test(verbal_score ~ group, data = data)

Brown-Forsythe Test (alpha = 0.05)
-------------------------------------------------------------
data : verbal_score and group
statistic : 14.32875
num df : 3
denom df : 109.9888
p.value : 6.030497e-08
Result : Difference is statistically significant.
-------------------------------------------------------------

Note that onewaytests::bf.test() compares group means while allowing unequal variances, so this significant result mirrors the ANOVA F-test below; it does not indicate a violation of the homogeneity-of-variance assumption.
# Refit with aov() to obtain the classic ANOVA table and enable TukeyHSD()
anova_fit <- aov(verbal_score ~ group, data = data)
summary(anova_fit)

             Df Sum Sq Mean Sq F value Pr(>F)
group 3 3492 1164.1 14.33 5.28e-08 ***
Residuals 116 9424 81.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Tukey's HSD pairwise comparisons (family-wise error rate controlled)
TukeyHSD(anova_fit)

              diff    lwr    upr p adj
G1-Control 7.611 1.545 13.677 0.008
G2-Control 10.715 4.649 16.782 0.000
G3-Control 14.720 8.654 20.786 0.000
G2-G1 3.104 -2.962 9.170 0.543
G3-G1 7.109 1.042 13.175 0.015
G3-G2 4.005 -2.062 10.071 0.318
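The Tukey intervals can also be inspected graphically; intervals that cross zero correspond to non-significant differences:

# Plot family-wise confidence intervals for each pairwise difference
plot(TukeyHSD(anova_fit))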
Other multiple-comparison methods allow you to choose how the p-values are adjusted.
# Load the multcomp package for general linear hypothesis tests (glht)
library(multcomp)

Loading required package: mvtnorm
Loading required package: survival
Loading required package: TH.data
Loading required package: MASS
Attaching package: 'MASS'
The following object is masked from 'package:dplyr':
select
Attaching package: 'TH.data'
The following object is masked from 'package:MASS':
geyser
# install.packages("multcomp")
### Set up the contrast matrix for all pairwise comparisons.
### Rows are weights on the model coefficients (treatment coding, Control = reference);
### inspect the coding with: head(model.matrix(anova_fit))
comprs <- rbind(
"G1 - Ctrl" = c(0, 1, 0, 0),
"G2 - Ctrl" = c(0, 0, 1, 0),
"G3 - Ctrl" = c(0, 0, 0, 1),
"G2 - G1" = c(0, -1, 1, 0),
"G3 - G1" = c(0, -1, 0, 1),
"G3 - G2" = c(0, 0, -1, 1)
)
cht <- glht(anova_fit, linfct = comprs)
summary(cht, test = adjusted("fdr"))
Simultaneous Tests for General Linear Hypotheses
Fit: aov(formula = verbal_score ~ group, data = data)
Linear Hypotheses:
Estimate Std. Error t value Pr(>|t|)
G1 - Ctrl == 0 7.611 2.327 3.270 0.00283 **
G2 - Ctrl == 0 10.715 2.327 4.604 3.20e-05 ***
G3 - Ctrl == 0 14.720 2.327 6.325 2.94e-08 ***
G2 - G1 == 0 3.104 2.327 1.334 0.18487
G3 - G1 == 0 7.109 2.327 3.055 0.00419 **
G3 - G2 == 0 4.005 2.327 1.721 0.10555
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Adjusted p values reported -- fdr method)
The differences between base R's TukeyHSD() method and the multcomp package stem from how each approach calculates and applies the multiple-comparison correction. Below is a summary of these differences:

Feature | TukeyHSD() (Base R) | multcomp::glht()
---|---|---
Distribution used | Studentized range (q-distribution) | t-distribution
Error rate control | Strong FWER control | Flexible error control
Simultaneous confidence intervals | Yes | Typically not (depends on method used)
Adjustment method | Tukey-Kramer adjustment | Single-step, Westfall, Holm, Bonferroni, etc.
P-value differences | More conservative (larger p-values) | Slightly different due to t-distribution
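If simultaneous confidence intervals are also wanted from the multcomp route, they can be obtained from the same glht object (single-step adjustment by default):

# Simultaneous confidence intervals from the glht object
confint(cht)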
Result
We first examined three assumptions of ANOVA as a preliminary analysis. According to the Durbin-Watson test, the Shapiro-Wilk normality test, and Bartlett’s test, the sample data meet all assumptions of the one-way ANOVA model.
A one-way ANOVA was then conducted to examine the effect of three intensive intervention methods, relative to a control group (Control, G1, G2, G3), on verbal acquisition scores. There was a statistically significant difference between groups, \(F(3,116)=14.33\), \(p<.001\).
To further examine which intervention method is most effective, we performed Tukey’s post-hoc comparisons. The results revealed that all three intervention methods produced significantly higher scores than the control group (G1-Control: p = .008; G2-Control: p < .001; G3-Control: p < .001). Among the three intervention methods, G3 appears to be the most effective: G3 showed significantly higher scores than G1 (p = .015), although no significant difference was found between G2 and G3 (p = .318).
Discussion
These findings suggest that higher intervention intensity improves verbal acquisition performance, which is consistent with prior literature [xxxx/references].
Research Question: Does sleep duration affect cognitive performance?
Study Design
# Set seed for reproducibility
set.seed(42)
# Generate synthetic data for sleep study
sleep_data <- tibble(
sleep_group = rep(c("Short", "Moderate", "Long"), each = 30),
cognitive_score = c(
rnorm(30, mean = 65, sd = 10), # Short sleep group (≤5 hrs)
rnorm(30, mean = 75, sd = 12), # Moderate sleep group (6-7 hrs)
rnorm(30, mean = 80, sd = 8) # Long sleep group (≥8 hrs)
)
)
# View first few rows
head(sleep_data)
# A tibble: 6 × 2
sleep_group cognitive_score
<chr> <dbl>
1 Short 78.7
2 Short 59.4
3 Short 68.6
4 Short 71.3
5 Short 69.0
6 Short 63.9
Go through all six steps.
# Step 5: Perform One-Way ANOVA
anova_sleep <- aov(cognitive_score ~ sleep_group, data = sleep_data)
summary(anova_sleep)
Df Sum Sq Mean Sq F value Pr(>F)
sleep_group 2 3764 1881.9 15.88 1.32e-06 ***
Residuals 87 10311 118.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
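Step 6, the post-hoc follow-up, can be sketched as follows; the write-up below reports these Tukey comparisons together with the group descriptives:

# Step 6: Tukey's HSD post-hoc comparisons for the sleep data
TukeyHSD(anova_sleep)
# Group means and SDs for the write-up
sleep_data %>%
  group_by(sleep_group) %>%
  summarise(M = mean(cognitive_score), SD = sd(cognitive_score))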
A one-way ANOVA was conducted to examine the effect of sleep duration on cognitive performance.
There was a statistically significant difference in cognitive test scores across sleep groups, \(F(2,87)=15.88\), \(p<.001\).
Tukey’s post-hoc test revealed that participants in the Long sleep group (M=81.52,SD=6.27) performed significantly better than those in the Short sleep group (M=65.68,SD=12.55), p<.01.
These results suggest that inadequate sleep is associated with lower cognitive performance.