Final Homework: Experimental Design (ESRM 64103)

Lectures 1–11 Comprehensive Review

Author: Jihong Zhang, Ph.D.

Affiliation: Educational Statistics and Research Methods (ESRM) Program, University of Arkansas

Published: April 21, 2026

Instructions

This homework covers Lectures 1–6 and 9–11 from ESRM 64103. Select the single best answer for each of the 25 multiple-choice questions. Questions 20 and 22 require numerical computation; all other questions are conceptual or interpretive.

Submission: Use the Google Form link provided by your instructor.


Questions

Question 1

(Lecture 1 — Introduction to Experimental Design in Education)

Research Scenario: A curriculum researcher wants to determine whether a new mathematics program causes higher test scores. She considers either randomly assigning classrooms to use the new program or simply comparing classrooms that already chose to adopt it with those that did not.

Which type of study design allows the researcher to draw causal conclusions about the program’s effect on test scores?

A. An observational study that controls for as many confounds as possible
B. A randomized experiment that assigns classrooms to conditions
C. A longitudinal study that tracks students over multiple years
D. A survey study using a large nationally representative sample

Key: B

Random assignment is the defining feature of a true experiment. Because it balances both known and unknown confounds across conditions in expectation, it supports causal conclusions that a comparison of self-selected classrooms cannot.


Question 2

(Lecture 1 — Introduction to Experimental Design in Education)

Research Scenario: A professor introduces variable classification at the first lecture of an experimental design course. In a study comparing reading fluency across instructional approaches, three variables are identified: the instructional method assigned to each class, the student’s prior reading level measured at intake, and the fluency score measured at the end of the semester.

Which classification correctly identifies these three variables in the order listed?

A. Independent variable, covariate, dependent variable
B. Dependent variable, independent variable, covariate
C. Dependent variable, covariate, independent variable
D. Covariate, moderating variable, outcome variable

Key: A

Instructional method is the manipulated independent variable, prior reading level is a covariate (measured pre-treatment), and fluency at semester end is the dependent (outcome) variable.


Question 3

(Lecture 2 — Hypothesis Testing)

Research Scenario: A researcher studies political attitudes comparing Democrats (n = 4), Republicans (n = 5), and Independents (n = 8). She reports an observed F-statistic of 20.31 and a p-value of .003 at α = .05.

What is the correct interpretation of p = .003 in this study?

A. There is a 0.3% probability that the null hypothesis is true
B. The probability that the observed group differences are due to chance is 0.3%
C. The effect is large enough to be practically meaningful at the 0.3% significance level
D. If the null hypothesis were true, there is a 0.3% probability of obtaining an F-statistic as extreme as or more extreme than 20.31

Key: D

The p-value is the probability of obtaining the observed (or a more extreme) test statistic if H₀ is true. It is NOT the probability that H₀ is true; reading it that way is a common misconception.


Question 4

(Lecture 2 — Hypothesis Testing)

Research Scenario: An evaluation researcher tests whether a reading intervention raises average scores above the district baseline of 70. After collecting data, she fails to reject the null hypothesis even though the intervention actually did raise scores.

Which combination correctly names this error and its definition?

A. Type I error: rejecting a true null hypothesis
B. Type I error: failing to reject a false null hypothesis
C. Type II error: failing to reject a false null hypothesis
D. Type II error: rejecting a false null hypothesis

Key: C

A Type II error (β error) occurs when the null hypothesis is false but we fail to reject it — the intervention was effective but the test did not detect it.


Question 5

(Lecture 3 — Introduction to One-Way ANOVA)

Research Scenario: A researcher compares test scores across four teaching methods. A colleague suggests conducting all pairwise t-tests instead of ANOVA at α = .05, but the researcher objects, citing a key statistical problem with that approach.

Why is conducting multiple pairwise t-tests instead of a one-way ANOVA problematic when comparing more than two groups?

A. Pairwise t-tests require equal sample sizes, which is rarely met in practice
B. The probability of making at least one Type I error across all comparisons increases beyond the nominal α level
C. Each additional t-test reduces statistical power, making it harder to detect any effect
D. Pairwise t-tests cannot detect differences between non-adjacent groups

Key: B

Running multiple tests at α = .05 inflates the family-wise error rate (FWER) well above .05; ANOVA protects against this inflation by testing all groups in a single omnibus F-test.
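To get a feel for the size of this inflation, the family-wise error rate for the four-method scenario can be approximated as below. This is a minimal sketch: it assumes the pairwise tests are independent (they are not strictly independent when they share groups, so the figure is only an approximation), and the function name `familywise_error_rate` is illustrative.

```python
from math import comb


def familywise_error_rate(n_groups: int, alpha: float = 0.05) -> float:
    """Approximate FWER for all pairwise tests, assuming independent tests."""
    n_comparisons = comb(n_groups, 2)  # number of distinct pairs of groups
    return 1 - (1 - alpha) ** n_comparisons


n_pairs = comb(4, 2)             # four teaching methods -> 6 pairwise t-tests
fwer = familywise_error_rate(4)  # 1 - 0.95**6
print(f"{n_pairs} comparisons, approximate FWER = {fwer:.3f}")
```

Under these assumptions, six comparisons at α = .05 give a family-wise rate of roughly .26 rather than .05.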


Question 6

(Lecture 3 — Introduction to One-Way ANOVA)

Research Scenario: A graduate student is learning ANOVA and asks why the procedure uses variance ratios rather than directly comparing mean differences across groups.

What does the F-ratio in a one-way ANOVA compare?

A. The variance due to treatment effects relative to the variance due to random error within groups
B. The variance of group means to the total variance across all participants
C. The sum of squares between groups to the sum of squares within groups, unadjusted for degrees of freedom
D. The pooled standard deviation of all groups to the standard deviation of the grand mean

Key: A

F = MS_Between / MS_Within; the numerator reflects systematic between-group variation and the denominator reflects unsystematic within-group (error) variation.
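The ratio can be traced step by step on a small data set. The sketch below uses hypothetical scores for three groups (the numbers are illustrative and do not come from the lecture) and computes the sums of squares, mean squares, and F by hand.

```python
from statistics import mean

# Hypothetical scores for three groups (illustrative data only).
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]

all_scores = [x for g in groups for x in g]
grand_mean = mean(all_scores)

# Between-groups SS: group size times the squared deviation of each group mean.
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# Within-groups SS: squared deviations of scores from their own group mean.
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

df_between = len(groups) - 1               # k - 1
df_within = len(all_scores) - len(groups)  # N - k

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within
print(f"MS_Between = {ms_between:.2f}, MS_Within = {ms_within:.2f}, F = {f_ratio:.2f}")
```

Dividing each sum of squares by its degrees of freedom is what separates the F-ratio from the unadjusted ratio described in option C.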


Question 7

(Lecture 3 — Introduction to One-Way ANOVA)

Research Scenario: A researcher compares verbal acquisition scores across three instructional groups. Before running ANOVA, she checks whether the group variances are approximately equal, which is one of the key assumptions of the standard one-way ANOVA.

Which test is used to check the homogeneity of variance assumption in one-way ANOVA?

A. Shapiro-Wilk test
B. Durbin-Watson test
C. Mauchly’s test
D. Levene’s test

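Key: D

Levene’s test checks the homogeneity of variance assumption. Among the other options, the Shapiro-Wilk test addresses normality, the Durbin-Watson test addresses independence (autocorrelation), and Mauchly’s test addresses sphericity.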


Question 8

(Lecture 4 — ANOVA Assumptions Checking)

Research Scenario: A researcher collects test scores from students within the same classroom over three consecutive weeks. A colleague recommends running a Durbin-Watson test on the residuals before interpreting the ANOVA results.

Which outcome of the Durbin-Watson test would indicate that the independence assumption is satisfied?

A. DW ≈ 2, indicating no autocorrelation in the residuals
B. DW ≈ 0, indicating no systematic pattern in residuals
C. DW ≈ 4, indicating strong negative autocorrelation in residuals
D. DW between 1 and 3, indicating moderate positive autocorrelation

Key: A

The Durbin-Watson statistic ranges from 0 to 4; DW ≈ 2 indicates no autocorrelation, supporting the independence assumption.
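The behaviour of the statistic across its range can be sketched with the standard formula: the sum of squared successive differences of the residuals divided by the sum of squared residuals. The residual series below are hypothetical and chosen only to show the three regions of the scale; the helper name `durbin_watson` is illustrative.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residual series (illustrative values only).
no_pattern = [1, -1, -1, 1, 1, -1, -1, 1]  # sign changes about half the time
positive_run = [1, 1, 1, 1]                # each residual repeats the previous one
alternating = [1, -1, 1, -1]               # each residual flips the previous one

dw_no_pattern = durbin_watson(no_pattern)
dw_positive = durbin_watson(positive_run)
dw_alternating = durbin_watson(alternating)
print(dw_no_pattern, dw_positive, dw_alternating)
```

Residuals that repeat their neighbours push the statistic toward 0, residuals that flip sign every step push it toward 4 (the short alternating series here reaches 3), and a series with no lag-1 pattern sits near 2.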


Question 9

(Lecture 4 — ANOVA Assumptions Checking)

Research Scenario: A researcher has n = 35 per group and finds that the Shapiro-Wilk test is significant (p = .03), suggesting mild non-normality in the outcome variable.

According to the lecture, which ANOVA assumption is the researcher LEAST justified in worrying about given the sample sizes?

A. Homogeneity of variance across groups
B. Independence of observations
C. Normality of residuals within each group
D. Equal sample sizes across conditions

Key: C

ANOVA is robust to moderate violations of normality when n ≥ 30 per group (by the Central Limit Theorem), but it is NOT robust to violations of independence, or to severe violations of homogeneity of variance (HOV) when group sizes are unequal.


Question 10

(Lecture 4 — ANOVA Assumptions Checking)

Research Scenario: A researcher compares vocabulary scores across four instructional methods (group means: 70, 75, 80, 85; n = 30 per group). A Tukey HSD post-hoc test shows that Method 1 vs. Method 4 is significant (p = .003), but Method 2 vs. Method 3 is not (p = .21).

What does the non-significant Tukey result for Methods 2 vs. 3 indicate?

A. Method 2 and Method 3 are identical in the population
B. The observed mean difference between Methods 2 and 3 does not exceed the minimum significant difference after controlling the family-wise error rate
C. The overall ANOVA F-test was likely non-significant as well
D. A post-hoc test cannot be interpreted meaningfully when only some pairwise comparisons are significant

Key: B

Tukey HSD controls FWER across all pairwise comparisons; a non-significant pair means the observed difference falls below the minimum significant difference threshold, not that the two means are truly equal.


Question 11

(Lecture 5 — ANOVA Comparisons and Contrasts)

Research Scenario: A researcher studying STEM achievement plans to compare STEM departments (Engineering, Chemistry: weight = +1/2 each) against non-STEM departments (Education, Political Science, Psychology: weight = −1/3 each) before collecting any data.

What property of this contrast confirms it is a valid planned contrast?

A. The contrast was derived after inspecting the observed group means to identify the largest difference
B. The absolute values of the positive weights equal the absolute values of the negative weights in each subset
C. All five groups have equal sample sizes, making the contrast orthogonal by default
D. The contrast weights sum to zero (+1/2 + 1/2 − 1/3 − 1/3 − 1/3 = 0), satisfying the required constraint

Key: D

A valid contrast requires that the sum of all weights equals zero; here 2(1/2) + 3(−1/3) = 1 − 1 = 0.
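The sum-to-zero check is easy to verify with exact fractions, which avoids floating-point rounding on the thirds. A minimal sketch, using the department weights from the scenario (the variable names are illustrative):

```python
from fractions import Fraction

# Contrast weights from the scenario: STEM departments vs. non-STEM departments.
weights = {
    "Engineering": Fraction(1, 2),
    "Chemistry": Fraction(1, 2),
    "Education": Fraction(-1, 3),
    "Political Science": Fraction(-1, 3),
    "Psychology": Fraction(-1, 3),
}

weight_sum = sum(weights.values())   # exact arithmetic, no rounding error
is_valid_contrast = weight_sum == 0
print(f"Sum of weights = {weight_sum}, valid contrast: {is_valid_contrast}")
```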


Question 12

(Lecture 5 — ANOVA Comparisons and Contrasts)

Research Scenario: A researcher reports that a new curriculum produced η² = 0.09 relative to a control condition. A reviewer asks whether this represents a meaningful effect size.

According to the effect size benchmarks in the lecture, how should η² = 0.09 be classified?

A. Small effect, below the conventional small threshold of η² = 0.01
B. Small-to-medium effect, between the small (0.01) and medium (0.06) benchmarks
C. Medium-to-large effect, between the medium (0.06) and large (0.14) benchmarks
D. Medium effect, at or near the medium benchmark of η² = 0.06

Key: C

The lecture defines η² benchmarks as small = 0.01, medium = 0.06, large = 0.14; η² = 0.09 falls between the medium and large thresholds.
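These benchmarks can be written as a small lookup. The sketch below is one way to encode them; the function name `classify_eta_squared` and the label wording are illustrative choices, not part of the lecture.

```python
# Benchmarks from the lecture, ordered from largest to smallest threshold.
BENCHMARKS = [(0.14, "large"), (0.06, "medium"), (0.01, "small")]


def classify_eta_squared(eta_sq: float) -> str:
    """Return the largest benchmark that eta_sq meets or exceeds."""
    for threshold, label in BENCHMARKS:
        if eta_sq >= threshold:
            return f"at least {label}"
    return "below small"


label = classify_eta_squared(0.09)
print(f"eta squared = 0.09 is {label}")
```

A value of 0.09 clears the medium threshold but not the large one, which is why it is described as medium-to-large.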


Question 13

(Lecture 5 — ANOVA Comparisons and Contrasts)

Research Scenario: A researcher designs two planned contrasts: Contrast 1 compares Group 1 against the average of Groups 2, 3, and 4 (weights: +3, −1, −1, −1); Contrast 2 compares Group 2 against the average of Groups 3 and 4 (weights: 0, +2, −1, −1). She wants to verify whether these two contrasts are orthogonal.

Which property must hold for two planned contrasts to be orthogonal?

A. Both contrasts must use the same groups, with weights of opposite sign
B. The sum of the products of their corresponding weights (dot product) must equal zero
C. They must test mutually exclusive hypotheses about entirely separate groups
D. The sum of squared weights in each contrast must equal the same value

Key: B

Two contrasts are orthogonal when the dot product of their weight vectors equals zero: (3)(0)+(−1)(2)+(−1)(−1)+(−1)(−1) = 0−2+1+1 = 0. Orthogonal contrasts provide non-redundant, independent tests.
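The same check can be run in a few lines. This sketch uses the two weight vectors from the scenario and assumes equal group sizes, which is the setting in which the plain dot product is the orthogonality criterion; the helper name `dot` is illustrative.

```python
def dot(u, v):
    """Sum of the products of corresponding weights."""
    return sum(a * b for a, b in zip(u, v))


contrast_1 = [3, -1, -1, -1]  # Group 1 vs. the average of Groups 2, 3, 4
contrast_2 = [0, 2, -1, -1]   # Group 2 vs. the average of Groups 3, 4

# Each contrast must sum to zero, and the pair must have a zero dot product.
sums = (sum(contrast_1), sum(contrast_2))
dot_product = dot(contrast_1, contrast_2)
print(f"Weight sums = {sums}, dot product = {dot_product}")
```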


Question 14

(Lecture 6 — Experimental Design and Validity)

Research Scenario: A district evaluates a new literacy program by testing students at the beginning and end of the school year without a control group. Students show significant gains. A colleague suggests the gains may simply reflect normal cognitive growth over the school year rather than the program’s effect.

Which internal validity threat does the colleague’s concern illustrate?

A. Maturation, because natural developmental changes over time could explain the observed gains
B. Selection bias, because students self-selected into the program
C. History, because an external event during the year could have affected literacy outcomes
D. Attrition, because students who dropped out may differ systematically from those who remained

Key: A

Maturation refers to changes within participants over time (e.g., normal cognitive development) that could produce observed gains independent of the treatment.


Question 15

(Lecture 6 — Experimental Design and Validity)

Research Scenario: A researcher worries that administering a pretest may sensitize students to the intervention, artificially inflating the apparent treatment effect. She reviews her lecture notes on experimental designs to find one that explicitly controls for this threat.

Which design controls for both the testing effect and the interaction of testing with the treatment?

A. Pre-Post-Test Control Group Design
B. Post-Test Only Control Group Design
C. Solomon Four-Group Design
D. Randomized Block Design

Key: C

The Solomon Four-Group Design includes conditions with and without a pretest in both treatment and control arms, isolating whether the pretest itself interacts with the treatment.


Question 16

(Lecture 6 — Experimental Design and Validity)

Research Scenario: A researcher evaluates a math tutoring program in three urban middle schools and finds a significant improvement in test scores. A colleague cautions that the results may not apply to rural schools, private schools, or students from different socioeconomic backgrounds.

Which type of validity does this concern address?

A. Internal validity: the study may have uncontrolled confounds affecting causal conclusions
B. External validity: the findings may not generalize beyond the specific sample, setting, or context
C. Construct validity: the math achievement test may not adequately measure the intended construct
D. Statistical conclusion validity: the sample may be too small to support reliable conclusions

Key: B

External validity refers to the degree to which study findings generalize to other populations, settings, and time points beyond the original study context.


Question 17

(Lecture 9 — Introduction to Two-Way ANOVA)

Research Scenario: A researcher reviews Fisher’s original arguments for factorial designs and wants to explain to a colleague why a 3 × 2 factorial design examining tutoring frequency and school type is preferable to running two separate one-way ANOVAs.

Which advantage is most directly tied to the ability to study whether one factor’s effect depends on the level of another factor?

A. A factorial design always requires fewer total participants than two separate one-way ANOVAs
B. A factorial design eliminates the need for post-hoc comparisons by controlling FWER through the F-ratio
C. A factorial design increases external validity by restricting sampling to one level of each factor
D. A factorial design allows testing whether the effect of tutoring frequency depends on school type, something separate one-way ANOVAs cannot detect

Key: D

Detecting interactions — whether one factor’s effect changes across levels of another — is uniquely possible in factorial designs and cannot be assessed using separate one-way ANOVAs.


Question 18

(Lecture 9 — Introduction to Two-Way ANOVA)

Research Scenario: A researcher designs a study with three tutoring frequencies (no tutor, once/week, daily) crossed with three school types (public, private-secular, private-religious), requiring n = 30 participants per cell.

How many cells does this factorial design have, and what is the minimum total number of participants required?

A. 6 cells; minimum 270 participants
B. 9 cells; minimum 180 participants
C. 9 cells; minimum 270 participants
D. 6 cells; minimum 180 participants

Key: C

A 3 × 3 design has 3 × 3 = 9 cells; with n = 30 per cell, the total sample required is 9 × 30 = 270 participants.
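Enumerating the cells makes the count concrete. A minimal sketch that crosses the two factors from the scenario (the variable names are illustrative):

```python
from itertools import product

tutoring = ["no tutor", "once/week", "daily"]
school_type = ["public", "private-secular", "private-religious"]
n_per_cell = 30

# Every combination of one tutoring level with one school type is a cell.
cells = list(product(tutoring, school_type))
total_n = len(cells) * n_per_cell

for cell in cells:
    print(cell)
print(f"{len(cells)} cells x {n_per_cell} per cell = {total_n} participants")
```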


Question 19

(Lecture 9 — Introduction to Two-Way ANOVA)

Research Scenario: Two researchers are planning factorial studies on tutoring and school type. The first assigns each participant to a single combination of tutoring frequency and school type. The second assigns school type between participants but measures all three tutoring conditions within the same participants.

Which statement correctly distinguishes the between-subjects (independent factorial) design from the split-plot design?

A. In a between-subjects factorial all participants are independently assigned to one treatment combination; in a split-plot design one factor is between-subjects and the other is within-subjects
B. A between-subjects factorial uses different participants for all cells; a split-plot design uses the same participants for every cell
C. A between-subjects factorial always has more power than a split-plot design because it eliminates carryover effects
D. A split-plot design requires random assignment to all factors, while a between-subjects design allows quasi-experimental assignment

Key: A

A split-plot (mixed) design has one between-subjects factor and one within-subjects factor; a fully between-subjects factorial assigns each participant to exactly one combination of all factor levels.


Question 20

(Lecture 10 — Two-Way ANOVA II)

Research Scenario: A researcher studying the effects of tutoring frequency (A: no tutor, once/week, daily) and school type (B: public, private-secular, private-religious) on GPA reports the following ANOVA summary: SS_A = 450, SS_B = 300, SS_A×B = 120, SS_Error = 630, SS_Total = 1500.

What is η² for the interaction effect (A × B)?

A. 0.30
B. 0.20
C. 0.16
D. 0.08

Key: D

η²_(A×B) = SS_(A×B) / SS_Total = 120 / 1500 = 0.08, which falls between the medium (0.06) and large (0.14) benchmarks in Cohen’s η² guidelines.
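The same division can be applied to every source in the summary table. The sketch below uses the sums of squares reported in the scenario; the dictionary keys are illustrative labels.

```python
# Sums of squares reported in the scenario.
sums_of_squares = {"A": 450, "B": 300, "AxB": 120, "Error": 630}
ss_total = 1500

# The components should add up to the total before any ratio is taken.
assert sum(sums_of_squares.values()) == ss_total

# Eta squared = SS for the effect divided by SS_Total.
eta_squared = {
    source: ss / ss_total
    for source, ss in sums_of_squares.items()
    if source != "Error"
}
for source, value in eta_squared.items():
    print(f"eta squared ({source}) = {value:.2f}")
```

The values for factors A and B come out to 0.30 and 0.20, which match distractors A and B. Distractor C (0.16) equals 120 / (120 + 630), which is the partial η² formula and a different statistic from the one asked for.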


Question 21

(Lecture 10 — Two-Way ANOVA II)

Research Scenario: A researcher plots marginal means for a tutoring frequency × school type study. The interaction plot shows three lines — one per school type — across the three tutoring frequency levels (no tutor, once/week, daily). All three lines are nearly parallel to each other.

What does this pattern indicate about the interaction effect?

A. The interaction is statistically significant and practically large
B. The effect of tutoring frequency is approximately the same across all school types, suggesting no interaction
C. Tutoring frequency has no main effect across any school type in this study
D. School type significantly moderates the relationship between tutoring frequency and GPA

Key: B

Parallel lines in an interaction plot indicate that the differences among tutoring levels are approximately consistent across school types — the hallmark of a negligible or absent interaction.


Question 22

(Lecture 10 — Two-Way ANOVA II)

Research Scenario: A researcher runs a two-way ANOVA examining tutoring frequency × school type effects on GPA. She reports the following mean squares: MS_Tutoring = 225, MS_SchoolType = 150, MS_Interaction = 30, MS_Error = 21. She wants to verify the F-ratio for the interaction effect.

What is the F-ratio for the Interaction effect?

A. 0.70
B. 0.14
C. 1.43
D. 10.71

Key: C

F_Interaction = MS_Interaction / MS_Error = 30 / 21 ≈ 1.43.
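Each effect in a between-subjects two-way ANOVA is tested against the same error term, so the three F-ratios can be computed together. A minimal sketch using the mean squares from the scenario (the dictionary keys are illustrative labels):

```python
# Mean squares reported in the scenario.
mean_squares = {"Tutoring": 225, "SchoolType": 150, "Interaction": 30}
ms_error = 21

# Each F-ratio is the effect's mean square divided by the error mean square.
f_ratios = {effect: ms / ms_error for effect, ms in mean_squares.items()}
for effect, f_value in f_ratios.items():
    print(f"F ({effect}) = {f_value:.2f}")
```

Distractor D (10.71) is the F-ratio for the tutoring main effect, and distractor A (0.70) is the inverted ratio 21 / 30.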


Question 23

(Lecture 11 — Repeated Measures ANOVA)

Research Scenario: A researcher analyzes front squat performance at three time points (RM1, RM4, RM21) in 54 CrossFit athletes using repeated-measures ANOVA. Mauchly’s test of sphericity yields W = .914, p = .095.

What does this result indicate, and what should the researcher do next?

A. Sphericity is violated (p < .05); the Greenhouse-Geisser correction must be applied to the degrees of freedom
B. Sphericity is violated (p < .05); the Huynh-Feldt correction must be applied because it is less conservative
C. The Mauchly test result is inconclusive; both Greenhouse-Geisser and Huynh-Feldt corrections should be reported regardless
D. Sphericity is not violated (p > .05); no correction is needed and the standard RM-ANOVA F-test is appropriate

Key: D

Mauchly’s W = .914 with p = .095 > .05 means we fail to reject the null hypothesis that sphericity holds; the standard RM-ANOVA F-test can be used without any correction.


Question 24

(Lecture 11 — Repeated Measures ANOVA)

Research Scenario: A researcher uses a repeated-measures ANOVA comparing front squat performance at three time points using the same 54 athletes throughout, rather than randomly assigning different groups to each time point. A colleague asks why this design tends to produce a larger F-statistic.

What is the primary reason repeated-measures designs tend to have more statistical power than between-subjects designs?

A. They remove individual differences variance from the error term, resulting in a smaller MS_Error and a larger F-ratio
B. They allow each participant to serve as their own control, which automatically doubles the effective sample size
C. They eliminate the need to check any ANOVA assumptions, simplifying the analysis
D. They require smaller effect sizes to achieve the same level of statistical significance

Key: A

In repeated-measures ANOVA, variability due to individual differences is partitioned out of the error term, reducing MS_Error and increasing the F-ratio compared to a between-subjects design with the same data.
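The effect of this partition can be seen by analyzing one data set both ways. The sketch below uses hypothetical scores for four participants at three time points (illustrative numbers, not the CrossFit data) and compares the F-ratio when subject variability stays in the error term with the F-ratio when it is removed.

```python
from statistics import mean

# Hypothetical scores: rows are participants, columns are three time points.
scores = [
    [10, 12, 14],
    [20, 21, 25],
    [30, 33, 34],
    [40, 42, 47],
]
n_subjects, n_times = len(scores), len(scores[0])

grand_mean = mean([x for row in scores for x in row])
time_means = [mean([row[j] for row in scores]) for j in range(n_times)]
subject_means = [mean(row) for row in scores]

ss_time = n_subjects * sum((m - grand_mean) ** 2 for m in time_means)
# Error term if time points were treated as independent groups.
ss_within = sum(
    (row[j] - time_means[j]) ** 2 for row in scores for j in range(n_times)
)
# Variability due to stable individual differences.
ss_subjects = n_times * sum((m - grand_mean) ** 2 for m in subject_means)
# Repeated-measures error: within-group SS with subject variability removed.
ss_error_rm = ss_within - ss_subjects

ms_time = ss_time / (n_times - 1)
f_between = ms_time / (ss_within / (n_times * (n_subjects - 1)))
f_repeated = ms_time / (ss_error_rm / ((n_times - 1) * (n_subjects - 1)))
print(f"F treating groups as independent = {f_between:.2f}")
print(f"F with subjects partitioned out  = {f_repeated:.2f}")
```

In this made-up example participants differ far more from one another than they change over time, so nearly all of the within-group variability is subject variability and the repeated-measures F is much larger.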


Question 25

(Lecture 11 — Repeated Measures ANOVA)

Research Scenario: In the CrossFit front squat study, the researcher added gender as a between-subjects factor. Male athletes showed means of RM1 = 403.65, RM4 = 378.04, RM21 = 247.08; female athletes showed means of RM1 = 262.32, RM4 = 246.18, RM21 = 162.25. The Time × Gender interaction was statistically significant (p < .05).

What does the significant Time × Gender interaction indicate?

A. Male and female athletes both declined from RM1 to RM21, but at identical rates
B. Gender has a significant main effect on front squat performance, regardless of time point
C. The pattern of change in front squat performance over time differs between male and female athletes
D. The sphericity assumption was violated separately within each gender group

Key: C

A significant Time × Gender interaction means the trajectory of performance change over time is not the same for males and females — the two groups show different patterns across the three time points.
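The differing trajectories are visible in the reported cell means. The sketch below computes the change between measurements for each gender; it is descriptive only, since the means alone cannot reproduce the significance test, and the dictionary layout is an illustrative choice.

```python
# Cell means reported in the scenario.
means = {
    "male": {"RM1": 403.65, "RM4": 378.04, "RM21": 247.08},
    "female": {"RM1": 262.32, "RM4": 246.18, "RM21": 162.25},
}

# Change between measurements, computed separately for each gender.
changes = {}
for gender, m in means.items():
    changes[gender] = {
        "RM1_to_RM4": round(m["RM4"] - m["RM1"], 2),
        "RM4_to_RM21": round(m["RM21"] - m["RM4"], 2),
        "RM1_to_RM21": round(m["RM21"] - m["RM1"], 2),
    }

for gender, deltas in changes.items():
    print(gender, deltas)
```

Both groups decline, but the drop in the male means is larger at every step, so lines drawn through these means would not be parallel.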


Answer Key Summary

Question Key Rationale (≤15 words)
1 B Randomization enables causal inference; observational designs cannot
2 A Assigned method = IV; pre-treatment reading level = covariate; end-of-semester fluency = DV
3 D p-value = P(data this extreme or more, given H₀ true), not P(H₀ is true)
4 C Type II error = failing to reject a false null hypothesis
5 B Multiple t-tests inflate FWER above nominal α; ANOVA prevents this
6 A F = MS_Between / MS_Within; treatment vs. error variance ratio
7 D Levene’s test checks the homogeneity of variance assumption
8 A DW ≈ 2 indicates no autocorrelation; independence assumption holds
9 C ANOVA robust to normality violations when n ≥ 30 (CLT)
10 B Non-significant pair falls below Tukey minimum significant difference
11 D Valid contrast requires contrast weights summing to zero
12 C η² = 0.09 falls between medium (0.06) and large (0.14) benchmarks
13 B Orthogonal contrasts: dot product of weight vectors equals zero
14 A Maturation = natural within-person change over time, not treatment
15 C Solomon Four-Group design isolates testing × treatment interaction
16 B External validity = generalizability beyond study sample and context
17 D Factorial designs uniquely detect interactions between two factors
18 C 3 × 3 = 9 cells; 9 × 30 = 270 total participants required
19 A Split-plot: one between, one within factor per participant
20 D η²_(A×B) = 120/1500 = 0.08 (between medium and large benchmarks)
21 B Parallel lines indicate consistent factor effects; no interaction
22 C F_Interaction = MS_Interaction / MS_Error = 30/21 ≈ 1.43
23 D Mauchly p = .095 > .05; sphericity holds, no correction needed
24 A RM designs partition individual differences out of error term
25 C Significant interaction = different performance trajectories by gender