Lecture 02: Hypothesis Testing

Experimental Design in Education

Jihong Zhang, Ph.D.

Educational Statistics and Research Methods (ESRM) Program

University of Arkansas

2025-08-18

Presentation Outline

  • Types of Statistics
    • Descriptive: Summarize data (central tendency, variability, shape)
    • Inferential: Makes population inferences from samples
  • Hypothesis Testing
    1. State \(H_0\) and \(H_A\)
    2. Set α level
    3. Compute test statistics
    4. Conduct test
  • ANOVA Fundamentals
    • One DV, one IV with multiple levels
    • Between/within designs
    • Interaction effects
  • Test Components
    • Error types (I & II)
    • Variance analysis (\(SS_T\), \(SS_B\), \(SS_W\))
    • F-statistics and critical values
  • Examples
    • Political attitudes (Democrats, Republicans, Independents)
  • Decision Making
    • Compare F-observed vs F-critical
    • Compare p-value vs α
    • Interpret at (1-α) confidence level

Types of Statistics

1. Descriptive Statistics

  • Definition: Describes and summarizes the collected data using numbers/values
    • Central tendency: mean, median, mode
    • Variability: range, interquartile range (IQR), variance, standard deviation
    • Shape of distribution: skewness, kurtosis
moments::skewness(c(1:10, 100))
[1] 2.793716
moments::skewness(rnorm(100, 0, 1))
[1] 0.1482851
  • Examples of skewness with two graphs (figures omitted here; see the sketch below):
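
The two graphs are not reproduced here. A minimal sketch (assuming ggplot2 is installed) regenerates comparable examples: the outlier-contaminated sample from above, which is right-skewed, next to a roughly symmetric standard-normal sample.

library(ggplot2)

set.seed(1)
# Re-create the two samples used above: one right-skewed (outlier at 100),
# one roughly symmetric (standard normal)
skew_data <- data.frame(
  x = c(c(1:10, 100), rnorm(100, 0, 1)),
  sample = rep(c("Right-skewed", "Roughly symmetric"), times = c(11, 100))
)
ggplot(skew_data, aes(x = x)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ sample, scales = "free") +
  labs(title = "Examples of Skewness")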

2. Inferential Statistics

  • Definition: Uses probability theory to infer/estimate population characteristics from a sample using hypothesis testing
  • Visual representation shows:
    • Population → Sampling → Sample
    • Sample → Inference → Population
      • Sample is analyzed using descriptive statistics
      • Inferential statistics used to make conclusions about population
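
A minimal sketch of this Population → Sampling → Sample → Inference loop, using a hypothetical population of exam scores (all numbers below are made up for illustration):

set.seed(123)
# Hypothetical population: 10,000 exam scores
population <- rnorm(10000, mean = 75, sd = 8)

# Sampling: draw one sample of 50 from the population
sample_scores <- sample(population, size = 50)

# Descriptive statistics summarize the sample
mean(sample_scores)
sd(sample_scores)

# Inferential statistics: 95% confidence interval for the population mean
t.test(sample_scores)$conf.int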

3. Predictive Statistics

  • Definition: Use observed data to produce the most accurate prediction possible for new data. Here, the primary goal is that the predicted values have the highest possible fidelity to the true value of the new data.

  • Example: A simple example would be for a book buyer to predict how many copies of a particular book should be shipped to their store for the next month.
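
A minimal sketch of the book-buyer example, assuming hypothetical monthly sales data: fit a simple trend model on the observed months and predict the next one.

set.seed(7)
# Hypothetical monthly sales of one title over the past year
sales <- data.frame(month = 1:12,
                    copies = round(50 + 2 * (1:12) + rnorm(12, mean = 0, sd = 5)))

# Fit a simple linear trend and predict copies needed for month 13
fit <- lm(copies ~ month, data = sales)
predict(fit, newdata = data.frame(month = 13))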

Which type of statistics to use?

  1. How many houses burned in the California wildfires in the first week?

    • Descriptive
  2. Which factor is most important in causing the fires?

    • Inferential
  3. How likely is it that a California wildfire will not happen again in the next 5 years?

    • Predictive
  4. How likely is it that humans will live on Mars?

    • Not statistics. Sci-fi.
  5. Which type of statistics is used by ChatGPT?

Statistical Hypothesis Testing Steps

To perform inferential statistics, we need to go through the following steps:

  1. State null hypothesis (H₀) and alternative hypothesis (\(H_A\))
    • Null hypothesis must be some statement that is statistically testable.
  2. Set alpha α (type I error rate) to determine significance levels
    • rejection region vs. p-value
  3. Compute the test statistic (e.g., the F-statistic)
  4. Conduct hypothesis testing:
    • Compare test statistics: critical value vs. observed value
    • Compare alpha and p-value

ANOVA Introduction

  • ANOVA is one of the most frequently used statistical tools for inferential statistics in experimental design.
  • Settings for Analysis of Variance (ANOVA):
    • One dependent variable (DV), “Outcome”
    • One independent variable (IV) with multiple levels, “Group”
  • Example question: “Are there mean differences in SAT math scores (outcome) for different high school program types (group)?”
  • Course covers advanced ANOVA topics:
    • Group comparisons (Group A vs. B vs. C)
    • Model comparisons
    • Between/within-subject design
    • Interaction effects

Types of ANOVA: Key Differences

  • One-Way ANOVA

    • Purpose: Tests one factor with three or more levels on a continuous outcome.

    • Use Case: Comparing means across multiple groups (e.g., diet types on weight loss).

  • Two-Way ANOVA

    • Purpose: Examines two factors and their interaction on a continuous outcome.

    • Use Case: Studying effects of diet and exercise on weight loss.

  • Repeated Measures ANOVA

    • Purpose: Tests the same subjects under different conditions or time points.

    • Use Case: Longitudinal studies measuring the same outcome over time (e.g., cognitive tests after varying sleep durations).

  • Mixed-Design ANOVA

    • Purpose: Combines between-subjects and within-subjects factors in one analysis.

    • Use Case: Evaluating treatment effects over time with control and experimental groups.

  • Multivariate Analysis of Variance (MANOVA)

    • Purpose: Assesses multiple continuous outcomes (dependent variables) influenced by independent variables.

    • Use Case: Impact of psychological interventions on anxiety, stress, and self-esteem.
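
In R, these designs differ mainly in the model formula passed to aov() or manova(). The sketch below uses hypothetical data (outcome y, within-subject factor A, between-subject factor B, subject identifier id) purely to illustrate the syntax, not a substantive analysis:

set.seed(1)
# Hypothetical long-format data: 20 subjects, each measured under two
# conditions of A; B is a between-subjects grouping
dat <- data.frame(
  id = factor(rep(1:20, each = 2)),
  A  = factor(rep(c("a1", "a2"), times = 20)),
  B  = factor(rep(c("b1", "b2"), each = 20)),
  y  = rnorm(40)
)

# One-way ANOVA: one between-subjects factor
summary(aov(y ~ B, data = dat))

# Two-way ANOVA: two factors and their interaction
summary(aov(y ~ A * B, data = dat))

# Repeated measures ANOVA: subjects as error strata for factor A
summary(aov(y ~ A + Error(id/A), data = dat))

# Mixed design: between-subjects B crossed with within-subjects A
summary(aov(y ~ A * B + Error(id/A), data = dat))

# MANOVA: multiple outcomes bound into a matrix
dat$y2 <- rnorm(40)
summary(manova(cbind(y, y2) ~ B, data = dat))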

1 Example 1: Political Study on Tax Reform Attitudes

Background

  • A political scientist is interested in politicians’ attitudes toward tax reform and conducts a survey of Republicans, Independents, and Democrats.
  • Political scientist study on tax reform attitudes:
    • Groups: Democrats, Republicans, Independents
    • Data shows attitude scores (higher score = greater concern for tax reform)
    • Analysis conducted at α = .05
    • Data includes:
      • party: political affiliation (Democrat, n = 5; Republican, n = 6; Independent, n = 5)
      • scores: attitudes scores for the survey respondents
        • The higher the score, the greater the concern for tax reform
remotes::install_github("JihongZ/ESRM64103")
library(ESRM64103)
library(dplyr)
exp_political_attitude
         party scores
1     Democrat      4
2     Democrat      3
3     Democrat      5
4     Democrat      4
5     Democrat      4
6   Republican      6
7   Republican      5
8   Republican      3
9   Republican      7
10  Republican      4
11  Republican      5
12 Independent      8
13 Independent      9
14 Independent      8
15 Independent      7
16 Independent      8

Descriptive Statistics: summary statistics

  • Standard deviations and variances for each group

  • Grand mean: 5.625

# Grand mean
mean(exp_political_attitude$scores)
[1] 5.625
exp_political_attitude$party <- factor(exp_political_attitude$party, levels = c("Democrat", "Republican", "Independent"))
mean_byGroup <- exp_political_attitude |> 
  group_by(party) |> 
  summarise(Mean = mean(scores),
            SD = round(sd(scores), 2),
            Vars = round(var(scores), 2),
            N = n())
mean_byGroup
# A tibble: 3 × 5
  party        Mean    SD  Vars     N
  <fct>       <dbl> <dbl> <dbl> <int>
1 Democrat        4  0.71   0.5     5
2 Republican      5  1.41   2       6
3 Independent     8  0.71   0.5     5

Descriptive Statistics: bar chart of group means

library(ggplot2)
ggplot(data = mean_byGroup) +
  geom_bar(mapping = aes(x = party, y = Mean, fill = party), stat = "identity", width = .5) +
  geom_label(aes(x = party, y = Mean, label = Mean), nudge_y = .3) +
  labs(title = "Attitudes Toward Tax Reform") +
  theme(text = element_text(size = 15))

Analysis Steps

  1. State the null hypothesis and alternative hypothesis:

    • \(H_0\): \(\mu_{dem} = \mu_{rep} = \mu_{ind}\) (the population means are equal)
    • \(H_A\): At least two group means are significantly different
    • Question: Why not test \(\sigma_{dem} = \sigma_{rep} = \sigma_{ind}\)?
    • Answer: You can; that is a test of homogeneity of variances.
  2. Set the significance level alpha = 0.05

  3. Quick Review of F-statistics:

    \[ F_{obs} = \frac{SS_b/df_b}{SS_w/df_w} \]

    • \(df_b\) = 3 (groups) - 1 = 2, \(df_w\) = 16 (samples) - 3 (groups) =13
    • \(SS_b\) = \(\Sigma n_j(\bar{Y}_j - \bar{Y})^2\) = 43.75; where \(n_j\) is the sample size of group j, \(\bar{Y}_j\) is the group mean, and \(\bar{Y}\) is the grand mean.
    • \(SS_w\) = \(\Sigma_{j=1}^{3} \Sigma_{i=1}^{n_j}(Y_{ij}-\bar{Y}_j)^2\) = 14.00; where \(Y_{ij}\) is individual i’s score in group j
    GrandMean <- mean(exp_political_attitude$scores)
    ## Between-group Sum of Squares
    sum(mean_byGroup$N * (mean_byGroup$Mean - GrandMean)^2)
    ## Within-group Sum of Squares
    SSw_dt <- exp_political_attitude |> 
      group_by(party) |> 
      mutate(GroupMean = mean(scores),
             Diff_sq = (scores - GroupMean)^2) 
    sum(SSw_dt$Diff_sq)
    • F_critical (df_num = 2, df_denom = 13) = 3.81
    • F_observed = 20.31
    mod1 <- lm(scores ~ party, data = exp_political_attitude)
    anova(mod1)
    Analysis of Variance Table
    
    Response: scores
              Df Sum Sq Mean Sq F value    Pr(>F)    
    party      2  43.75 21.8750  20.312 9.994e-05 ***
    Residuals 13  14.00  1.0769                      
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Results show rejection of H₀ (F_obs > F_critical)
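
Both decision quantities can be verified directly with base R's F-distribution functions, qf() for the critical value and pf() for the p-value:

alpha <- 0.05
# Critical value: the F value cutting off the upper alpha tail of F(2, 13)
qf(1 - alpha, df1 = 2, df2 = 13)  # 3.81, as reported above

# p-value: probability under H0 of an F at least as large as F_obs
pf(20.312, df1 = 2, df2 = 13, lower.tail = FALSE)  # ~9.99e-05, matching Pr(>F) in the ANOVA table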

Step 1: State the null hypothesis and alternative hypothesis

  1. Formulate the null hypothesis (𝐻₀) and the alternative hypothesis (𝐻ₐ)
    • Prior to any statistical tests, start with a working hypothesis based on an initial guess about the phenomenon.
    • Example: Investigating whether diet groups affect weight loss.
      • Research question: “Is there a difference in weight loss among different diet groups?”
      • Hypothesis: “Different diet groups will show varying weight loss.”
    • Operational Definitions:
      • Null hypothesis (𝐻₀): No observed difference or effect (“Something is something”).
        • Group A’s mean - Group B’s mean = 0
      • Alternative hypothesis (𝐻ₐ): Noticeable difference or effect, contrary to 𝐻₀. (“Something is not something”)
    • The adequacy of the data will dictate whether 𝐻₀ can be confidently rejected.

Step 2: Rejection region

The F-statistic has two degrees of freedom. Below is the density of the F-distribution with numerator and denominator degrees of freedom 2 and 13.

# Set degrees of freedom for the numerator and denominator
num_df <- 2  # Change this as per your specification
den_df <- 13  # Change this as per your specification

# Generate a sequence of F values
f_values <- seq(0, 8, length.out = 1000)

# Calculate the density of the F-distribution
f_density <- df(f_values, df1 = num_df, df2 = den_df)

# Create a data frame for plotting
data_to_plot <- data.frame(F_Values = f_values, Density = f_density)
data_to_plot$Reject05 <- data_to_plot$F_Values > 3.81
data_to_plot$Reject01 <- data_to_plot$F_Values > 6.70
# Plot the density using ggplot2
ggplot(data_to_plot) +
  geom_area(aes(x = F_Values, y = Density), fill = "grey", 
            data = filter(data_to_plot, !Reject05)) + # Draw the line
  geom_area(aes(x = F_Values, y = Density), fill = "yellow", 
            data = filter(data_to_plot, Reject05)) + # Draw the line
  geom_area(aes(x = F_Values, y = Density), fill = "tomato", 
            data = filter(data_to_plot, Reject01)) + # Draw the line
  geom_vline(xintercept = 3.81, linetype = "dashed", color = "red") +
  geom_label(label = "F_crit = 3.81 (alpha = .05)", x = 3.81, y = .5, color = "red") +
  geom_vline(xintercept = 6.70, linetype = "dashed", color = "royalblue") +
  geom_label(label = "F_crit = 6.70 (alpha = .01)", x = 6.70, y = .5, color = "royalblue") +
  ggtitle("Density of F-Distribution") +
  xlab("F values") +
  ylab("Density") +
  theme_classic()
  1. Set the alpha \(\alpha\) (i.e., the type I error rate), which defines the rejection region against which the p-value is compared

    • Alpha determines several quantities for the statistical hypothesis test: the critical value of the test statistic, the rejection region, etc.

    • Large sample sizes call for a lower alpha level (.01 or .001), i.e., a more restrictive rejection rate

  2. When we conduct a hypothesis test, four cases can occur:

Type I & II Error

| Decision | Reality: True \(H_0\) | Reality: False \(H_0\) |
|---|---|---|
| Fail to reject \(H_0\) | Correct decision | Error made: Type II error (\(\beta\)) |
| Reject \(H_0\) | Error made: Type I error (\(\alpha\)) | Correct decision (power) |
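
A small simulation can make the Type I error rate concrete. The sketch below (group sizes and replication count are arbitrary choices) generates many data sets in which \(H_0\) is true; the one-way ANOVA should wrongly reject about 5% of the time at α = .05.

set.seed(42)
# Simulate 2,000 data sets where H0 is TRUE: three groups from one population
p_values <- replicate(2000, {
  dat <- data.frame(group = factor(rep(c("A", "B", "C"), each = 5)),
                    y = rnorm(15))  # no true group differences
  anova(lm(y ~ group, data = dat))$`Pr(>F)`[1]
})

# Proportion of (false) rejections; should be close to alpha = .05
mean(p_values < 0.05)
hist(p_values)  # p-values are roughly uniform when H0 is true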

Step 3: Compute the test statistics

  • Investigate where the variability of the outcome comes from.

    • In this study, do people’s attitude scores differ because of political parties?

    • Imagine we have two factors, A and B; the variability of the outcome can then be partitioned into pieces attributable to A, to B, and to random error. [Figure omitted: partition of outcome variability]

F-statistics

  • Core idea of the F-statistic: compare the variance between groups with the variance within groups to determine whether the group means differ significantly from each other.

  • Logic: if the between-group variance (due to systematic differences caused by the independent variable) is significantly greater than the within-group variance (attributable to random error), the observed differences between group means are likely not due to chance.

  • F-statistics under 1-way ANOVA:

    \[ F_{obs} = \frac{SS_b/df_b}{SS_w/df_w}\]

    • \(df_b\) = 3 (groups) - 1 = 2, \(df_w\) = 16 (samples) - 3 (groups) =13
    • \(SS_b\) = \(\Sigma n_j(\bar{Y}_j - \bar{Y})^2\) = 43.75;
      • the variability in the differences between groups (weighted by the sample size of the group)
    • \(SS_w\) = \(\Sigma_{j=1}^{3} \Sigma_{i=1}^{n_j}(Y_{ij}-\bar{Y}_j)^2\) = 14.00; where \(Y_{ij}\) is individual i’s score in group j
      • Random error within groups: people differ in attitudes for unknown reasons
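
Plugging these pieces into the formula reproduces the observed F:

# F_obs = (SS_b / df_b) / (SS_w / df_w)
SSb <- 43.75; df_b <- 2
SSw <- 14.00; df_w <- 13
(SSb / df_b) / (SSw / df_w)  # 20.31, matching the ANOVA table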

Step 4: Conduct a hypothesis testing

  • In addition to comparing the critical value with the observed value of the test statistic, we can also compare the alpha with the p-value:

  • We determine F_crit by setting the α value.
    • α = (acceptable) type I error rate = the probability that we wrongly reject \(H_0\) when \(H_0\) is true
  • From the data, we obtain F_obs with its p-value.
    • p-value = the probability, if \(H_0\) were true, of obtaining an F-statistic larger than F_obs
  • If the F-statistic from the data (F_obs) is larger than the F critical value, you are in the rejection region: reject \(H_0\) and accept \(H_A\) with (1-α) level of confidence.
  • If the p-value obtained from the ANOVA is less than α, reject \(H_0\) and accept \(H_A\) with (1-α) level of confidence.

Step 5: Results Report

A one-way ANOVA was conducted to compare the level of concern for tax reform among three political groups: Democrats, Republicans, and Independents. There was a significant effect of political affiliation on tax reform concern at the p < .001 level for the three conditions [F(2, 13) = 20.31, p < .001]. This result indicates significant differences in the attitudes toward tax reform among the groups.

Note: Relationship Between P-values and Type I Error

  1. p-value: the probability of observing data as extreme as, or more extreme than, the observed data, under the assumption that the null hypothesis is true.
    • The lower the p-value, the less likely we are to see the observed data given that the null hypothesis is true
    • Question: Given that we already have the observed data, does a lower p-value mean the null hypothesis is unlikely to be true, which is our goal in inferential statistics?
    • Answer: No. p(observed data | \(H_0\) is true) is not equal to p(\(H_0\) is true | observed data). P-values are often misconstrued as the probability that the null hypothesis is true given the observed data, but this interpretation is incorrect.
  2. Type I error, also known as a “false positive,” occurs when the null hypothesis is incorrectly rejected when it is, in fact, true.
  3. The alpha level set before conducting a test (commonly α = 0.05) essentially defines the cut-off point for the p-value below which the null hypothesis will be rejected.
    • A p-value that is less than the alpha level suggests a low probability that the observed data would occur if the null hypothesis were true. Consequently, rejecting the null hypothesis in this context implies that there is a statistically significant difference likely not due to random chance.

Note: Limitations of p-values

Relying solely on p-values to reject the null hypothesis can be problematic for several reasons:

  • Binary Decision Making: The use of a threshold (e.g., α = 0.05) to determine whether to reject the null hypothesis reduces the complexity of the data and the underlying phenomena to a binary decision. This can oversimplify the interpretation and overlook the nuances in the data.

    • Remedies: report confidence intervals, or adopt Bayesian statistics and report the full posterior distribution.
  • Neglect of Effect Size: P-values do not convey the size or importance of an effect. A very small effect can produce a small p-value if the sample size is large enough, leading to a rejection of the null hypothesis even when the effect may not be practically significant.

    • Remedy: also report effect sizes, which are independent of sample size.
  • Probability of Extremes Under the Null: Since p-values quantify the extremeness of the observed data under the null hypothesis, they do not address whether similarly extreme data could also occur under alternative hypotheses. This can lead to an overemphasis on the null hypothesis and potentially disregard other plausible explanations for the data.

    • Remedies: explore theory, look for alternative explanations, and try varied models.

2 Example 2: the Effect of Sleep on Academic Performance (Simulation)

Background

  • A study investigates the effect of different sleep durations on the academic performance of university students. Three groups are defined based on nightly sleep duration: Less than 6 hours, 6 to 8 hours, and more than 8 hours.

  • We can simulate the data

# Set seed for reproducibility
set.seed(42)

# Generate data for three sleep groups
less_than_6_hours <- rnorm(30, mean = 65, sd = 10)
six_to_eight_hours <- rnorm(50, mean = 75, sd = 8)
more_than_8_hours <- rnorm(20, mean = 78, sd = 7)

# Combine data into a single data frame
sleep_data <- data.frame(
  Sleep_Group = factor(c(rep("<6 hours", 30), rep("6-8 hours", 50), rep(">8 hours", 20))),
  Exam_Score = c(less_than_6_hours, six_to_eight_hours, more_than_8_hours)
)

# View the first few rows of the dataset
head(sleep_data)

Descriptive Statistics

  • Groups:

    • Less than 6 hours: 30 students

    • 6 to 8 hours: 50 students

    • More than 8 hours: 20 students

  • Performance Metric: Average exam scores out of 100.

    • Less than 6 hours: Mean = 65, SD = 10

    • 6 to 8 hours: Mean = 75, SD = 8

    • More than 8 hours: Mean = 78, SD = 7
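
These targets can be checked against the simulated data; the simulated group means and SDs will be close to, but not exactly, the generating values:

# Quick check of the simulated groups (dplyr loaded earlier)
sleep_data |> 
  group_by(Sleep_Group) |> 
  summarise(N = n(),
            Mean = round(mean(Exam_Score), 2),
            SD = round(sd(Exam_Score), 2))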

Your turn:

F-test

  • Analysis: A one-way ANOVA was conducted to compare the average exam scores among the three groups.

  • Results: F_observed = XX.XX, p = 0.001
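
As in Example 1, the test can be run with lm() and anova(); fill in F_observed from the output:

mod2 <- lm(Exam_Score ~ Sleep_Group, data = sleep_data)
anova(mod2)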

Interpretation

  • Alpha Level: α = 0.05

  • P-value Interpretation: The p-value (0.001) is less than the alpha level (0.05), indicating a statistically significant difference in exam scores among the different sleep groups.

  • Conclusion: The results suggest that the amount of sleep significantly affects academic performance, with students getting 6 hours or more of sleep performing better on average than those with less sleep. The findings highlight the importance of adequate sleep among students for optimal academic outcomes.

Homework 1

Due on 02/03 at 5 PM.