Lecture 05: Maximum Likelihood Estimation and Generalized Univariate Models

Jihong Zhang, Ph.D.

Educational Statistics and Research Methods (ESRM) Program

University of Arkansas

2024-09-16

Today’s Class

  • Review last class - Random variables and their distribution
  • Review Homework System
  • Maximum Likelihood Estimation
    • The basics
    • ML-based Test (Likelihood ratio test, Wald test, Information criteria)
    • MLEs for GLMs

More Details about Homework Grade

  1. You should be able to check your current grade for homework here.
    • You need your 4-digit unique code to check your homework.
  2. Homework 0: 20 points; Homework 1: 21 points
  3. I adjusted everyone’s scores based on how many “key part” errors you have in your homework:
    • If the position within vcov() is incorrect (e.g., the answer is vcov(model)[1, 1] but the response is vcov(model)[1, 2]), you will lose 3 points
  4. If you find your score incorrect or have any suggestions, feel free to talk to me; I am happy to help at any time.

Today’s Example Data #1

  • To use the class example data, run the following R code: pak::pak("JihongZ/ESRM64503", upgrade = TRUE) or devtools::install_github("JihongZ/ESRM64503", upgrade = "always") to install or upgrade the ESRM64503 package

Data Information

  • Purpose: imagine an employer is looking to hire employees for a job where IQ is important

    • We will use 5 observations so as to show the math behind the estimation calculations
  • The employer collects two variables:

    • Predictor: IQ scores (labelled as iq)
    • Outcome: Job performance (labelled as perf)

Descriptive Statistics

# pak::pak("JihongZ/ESRM64503", upgrade = TRUE)
library(ESRM64503)
library(kableExtra)
library(tidyverse)

# Data for the class example
dat <- dataIQ |> 
  mutate(ID = 1:5) |> 
  rename(IQ = iq, Performance = perf) |> 
  relocate(ID)

show_table(dat)
ID  Performance   IQ
 1           10  112
 2           12  113
 3           14  115
 4           16  118
 5           12  114
dat |> 
  summarise(
    across(c(Performance, IQ), list(Mean = mean, SD = sd))
  ) |> 
  t() |> 
  as.data.frame() |> 
  rownames_to_column(var = "Variable") |> 
  separate(Variable, into = c("Variable", "Stat"), sep = "_") |> 
  pivot_wider(names_from = "Stat", values_from = "V1") |> 
  show_table()
Variable       Mean        SD
Performance    12.8  2.280351
IQ            114.4  2.302173
dat |> 
  select(IQ, Performance) |> 
  cov() |> 
  as.data.frame() |> 
  rownames_to_column("Covariance Matrix") |> 
  show_table()
Covariance Matrix    IQ  Performance
IQ                  5.3          5.1
Performance         5.1          5.2

How Estimation Works (More or Less)

We need models to test hypotheses, and models need an estimation method to provide estimates of effect sizes.

Most estimation routines do one of three things:

  1. Minimize Something: Typically found with names that have “least” in the title. Forms of least squares include “Generalized”, “Ordinary”, “Weighted”, “Diagonally Weighted”, “WLSMV”, and “Iteratively Reweighted.” Typically the estimator of last resort…
  2. Maximize Something: Typically found with names that have “maximum” in the title. Forms include “Maximum likelihood”, “ML”, “Residual Maximum Likelihood” (REML), “Robust ML”. Typically the gold standard of estimators
  3. Use Simulation to Sample from Something: more recent advances in simulation use resampling techniques. Names include “Bayesian Markov Chain Monte Carlo”, “Gibbs Sampling”, “Metropolis Hastings”, “Metropolis Algorithm”, and “Monte Carlo”. Used for complex models where ML is not available or for methods where prior values are needed.

Note

  • Loss functions in machine learning belong to the 1st type (minimizing something)

  • lm uses the QR decomposition, one algorithm for Ordinary Least Squares (OLS) estimation

  • Question: which type does Maximum Likelihood Estimation belong to? (Answer: the 2nd, maximizing something)

Note: the regression coefficient estimates provided by lm are obtained using the OLS estimation method.

summary(lm(perf ~ iq, data = dataIQ))

Call:
lm(formula = perf ~ iq, data = dataIQ)

Residuals:
      1       2       3       4       5 
-0.4906  0.5472  0.6226 -0.2642 -0.4151 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept) -97.2830    15.5176  -6.269  0.00819 **
iq            0.9623     0.1356   7.095  0.00576 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6244 on 3 degrees of freedom
Multiple R-squared:  0.9438,    Adjusted R-squared:  0.925 
F-statistic: 50.34 on 1 and 3 DF,  p-value: 0.005759
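
Although lm fits the coefficients by least squares, for a linear model with normally distributed errors the OLS coefficient estimates coincide with the maximum likelihood estimates. As a quick check, base R’s logLik() extracts the log-likelihood of the fitted model (a minimal sketch, assuming dataIQ is loaded):

# Log-likelihood of the fitted regression under the normality assumption
fit <- lm(perf ~ iq, data = dataIQ)
logLik(fit)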

A Brief Introduction to Maximum Likelihood Estimation (MLE)

  • MLE has several desirable statistical properties that make it the most commonly used estimator in multivariate statistics:

    1. Asymptotic Consistency: as the sample size increases, the estimator converges in probability to the parameters’ true values (see the simulation sketch after this list)

    2. Asymptotic Normality: as the sample size increases, the distribution of the estimator is normal (note: the variance is given by the “information” matrix)

    3. Efficiency: asymptotically, no other consistent estimator has a smaller standard error
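
A quick simulation makes the consistency property concrete. Below is a minimal sketch; the true values (mean 100, variance 5), the sample sizes, and the seed are arbitrary illustration choices, not part of the lecture materials. As n grows, the ML estimates approach the true values.

set.seed(1)  # arbitrary seed for reproducibility
for (n in c(10, 100, 10000)) {
  x <- rnorm(n, mean = 100, sd = sqrt(5))  # simulate from the "true" distribution
  mu_hat  <- mean(x)                       # ML estimate of the mean
  var_hat <- mean((x - mean(x))^2)         # ML estimate of the variance (divides by n)
  cat(sprintf("n = %5d: mu-hat = %7.3f, var-hat = %6.3f\n", n, mu_hat, var_hat))
}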

Terminology

Estimator: the algorithm used to obtain the estimates of parameters

Asymptotic: properties that hold when the sample size becomes sufficiently large

ML-based estimators: variants of maximum likelihood, such as robust maximum likelihood and full information maximum likelihood

Maximum Likelihood: Estimates Based on Statistical Distributions

  • Maximum likelihood estimates come from statistical distributions - assumed distributions of data

    • We will begin today with the univariate normal distribution but quickly move to other distributions
  • For a single random variable x, the univariate normal distribution is:

    \[ f(x) = \frac{1}{\sqrt{2\pi\color{tomato}{\sigma^2_x}}}\exp(-\frac{(x-\color{royalblue}{\mu_x})^2}{2\color{tomato}{\sigma^2_x}}) \]

    • Provides the height of the curve at a value of \(x\), given \(\mu_x\) and \(\sigma^2_x\) (see the sketch after this list)
  • Last week we pretended we knew \(\mu_x\) and \(\sigma_x^2\)

    • Today we will only know values of \(x\) and “guess” the parameters \(\mu_x\) and \(\sigma^2_x\)
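
To make the PDF concrete, here is a minimal sketch (the values of x, \(\mu_x\), and \(\sigma^2_x\) are arbitrary) showing that computing the formula by hand matches R’s built-in dnorm():

x <- 1; mu <- 0; sigma2 <- 1
# Height of the normal curve from the formula above
1 / sqrt(2 * pi * sigma2) * exp(-(x - mu)^2 / (2 * sigma2))  # 0.2419707
# Same value from R's density function; dnorm takes the SD, not the variance
dnorm(x, mean = mu, sd = sqrt(sigma2))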

Key Idea: Constructing a Likelihood Function

  1. A likelihood function provides a value of a likelihood for a set of statistical parameters given observed data (denoted as \(L(\mu_x, \sigma^2_x|x)\))

  2. If we observe the data (known) but do not know the mean and/or variance (unknown), then we call this the sample likelihood function

    • The process of finding the parameter values that maximize the sample likelihood function is called Maximum Likelihood Estimation
  3. A likelihood function is a probability density function (PDF) of a random variable, evaluated at the observed data values

    • Density functions are provided for each observation individually
    • Each observation has its own likelihood function.
    • Mark this: we can get the “whole” likelihood of the observed data by multiplying all of the case-level likelihoods.
  4. Rather than providing the height of the curve at any value of x, the likelihood function provides the likelihood for any possible values of the parameters (i.e., \(\mu_x\) and/or \(\sigma^2_x\))

    • The goal of maximizing the sample likelihood function is to find the optimal values of the parameters
  5. Assumptions:

    • All observed random variables follow certain probability density functions (e.g., x follows a normal dist. and y follows a Bernoulli dist.).
    • The sample likelihood can be thought of as the joint distribution of all the observations, simultaneously
    • In univariate statistics, observations are considered independent, so the joint likelihood for the sample is constructed through a product
  6. To demonstrate, let’s consider the likelihood function for one observation that is used to estimate the mean and the variance of x

Example: One-Observation Likelihood function

  • Let’s assume:

    • We just observed the first value of IQ (x = 112)

    • Our initial guess is “the IQ scores come from a normal distribution \(x \sim N(100, 5)\)”

    • Our goal is to test whether this guess needs to be adjusted

  • Step 1: For this one observation, the likelihood function takes its assumed distribution and uses its PDF:

    \[ L(x_1=112|\mu_x = 100, \sigma^2_x = 5) = f(x_1, \mu_x, \sigma^2_x)= \frac{1}{\sqrt{2\pi*\color{tomato}{5}}}\exp(-\frac{(112-\color{royalblue}{100})^2}{2*\color{tomato}{5}}) \]

    • If you want to know the value of \(L(x_1)\), you can easily use the R function dnorm(112, 100, sqrt(5)), which gives you \(9.94\times 10^{-8}\).

    • Because this value is so small, we typically use the log transformation of the likelihood, called the log likelihood. Here its value is -16.12, as verified below.
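
In R (dnorm takes the standard deviation, so we pass sqrt(5) rather than 5):

dnorm(112, mean = 100, sd = sqrt(5))              # likelihood: 9.94e-08
dnorm(112, mean = 100, sd = sqrt(5), log = TRUE)  # log likelihood: -16.12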

More Observations: Multiplication of Likelihoods

  • \(L(x_1 = 112|\mu_x = 100, \sigma^2_x = 5)\) can be interpreted as the likelihood of the first observation’s value being 112, given that the underlying distribution is normal with mean 100 and variance 5.

  • Step 2: Then, for the following observations, we calculate their likelihoods in the same way as for the first observation

    • Now, we have \(L(x_2)\), \(L(x_3)\), …, \(L(x_N)\)
  • Step 3: Now, we multiply them together: \(L(X|\mu_x=100, \sigma^2_x=5) = L(x_1)L(x_2)L(x_3)\cdots L(x_N)\). This product represents the likelihood of observing the whole data set, given that the underlying distribution is normal with mean 100 and variance 5 (the product form relies on the independence assumption above).

  • Step 4: Again, we log-transform the likelihood to make the scale easier to read: \(LL(X|\mu_x=100, \sigma^2_x=5) = \sum_{i=1}^{N}\log L(x_i)\), since the log of a product is the sum of the logs (see the sketch after this list)

  • Step 5: Finally, we change the values of the parameters \(\mu_x\) and \(\sigma^2_x\) and calculate the corresponding LL (e.g., \(LL(X|\mu_x=101, \sigma^2_x=6)\))

    • Eventually, we can get a 3-D plot whose x-axis and y-axis are the varied values of \(\mu_x\) and \(\sigma^2_x\), and whose z-axis is the corresponding log-likelihood

    • The pair of \(\mu_x\) and \(\sigma^2_x\) with the highest log-likelihood is the maximum likelihood estimate
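
As a quick illustration of Steps 2 through 4, here is a minimal sketch using the five IQ values from the class data:

iq <- c(112, 113, 115, 118, 114)
# Sum of per-observation log-likelihoods = log of the product of likelihoods
sum(dnorm(iq, mean = 100, sd = sqrt(5), log = TRUE))  # about -114.4
# A different parameter guess gives a different (here, higher) log-likelihood
sum(dnorm(iq, mean = 101, sd = sqrt(6), log = TRUE))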

Visualization of MLE

The values of \(\mu_x\) and \(\sigma_x^2\) that maximize \(L(\mu_x, \sigma^2_x|x)\) are \(\hat\mu_x = 114.4\) and \(\hat{\sigma}^2_x = 4.2\). We also know the sample mean of IQ is 114.4 and the sample variance is 5.3. Why does the ML estimate match the sample mean but not the sample variance? Answer: the MLE of the variance is biased; it divides the sum of squared deviations by \(N\) rather than \(N - 1\), as verified below.
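
A quick check of this bias using the class data loaded earlier:

var(dat$IQ)                      # sample variance, divides by N - 1: 5.3
mean((dat$IQ - mean(dat$IQ))^2)  # ML estimate of the variance, divides by N: 4.24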

Code
library(plotly)

# Log-likelihood of data x given mean mu and variance
cal_LL <- function(x, mu, variance) {
  sum(dnorm(x, mean = mu, sd = sqrt(variance), log = TRUE))
}

# Grid of candidate values for the mean and the variance
x <- seq(80, 150, .1)
y <- seq(3, 8, .1)
x_y <- expand.grid(x, y)
colnames(x_y) <- c("mean", "variance")

# Compute the log-likelihood of the IQ data for every (mean, variance) pair
# and flag the pair with the highest log-likelihood
dat2 <- x_y |> 
  rowwise() |> 
  mutate(LL = cal_LL(x = dat$IQ, mu = mean, variance = variance)) |> 
  ungroup() |> 
  mutate(highlight = ifelse(LL == max(LL), TRUE, FALSE))

# 3-D scatter of the log-likelihood surface; the maximum is highlighted
plot_ly(data = dat2, x = ~mean, y = ~variance, z = ~LL, 
        colors = c("royalblue", "tomato"), 
        color = ~highlight, stroke = "white", alpha_stroke = .5,
        alpha = .9, type = "scatter3d",
        width = 1600, height = 800)
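
Instead of a grid search, the maximum can also be found numerically. Here is a minimal sketch, not part of the lecture code, using base R’s optim() to minimize the negative log-likelihood:

# Negative log-likelihood; optim() minimizes, so we flip the sign
negLL <- function(par, x) {
  -sum(dnorm(x, mean = par[1], sd = sqrt(par[2]), log = TRUE))
}
# Start from the initial guess (mean = 100, variance = 5)
optim(par = c(100, 5), fn = negLL, x = dat$IQ)$par  # approximately 114.4 and 4.24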