Lecture 08: Path Analysis - Introduction

Multivariate Correlation Model

Jihong Zhang*, Ph.D.

Educational Statistics and Research Methods (ESRM) Program*

University of Arkansas

2024-10-09

Outline

Today's Class
  • Multivariate linear models: An introduction
  • How to form multivariate models in lavaan
  • What parameters mean
  • How they relate to the multivariate normal distribution

Multivariate Models: An Introduction

Multivariate Linear Models

  • The next set of lectures provides an overview of multivariate linear models
    • Models for more than one dependent/outcome variable
  • Our focus today will be on models where the DV is plausibly continuous
    • so we’ll use error terms that are multivariate-normally distributed
    • Not a necessity – generalized multivariate models are possible

Classical vs. Contemporary Methods

  • In “classical” multivariate textbooks and classes, multivariate linear models fall under the names of Multivariate ANOVA (MANOVA) and multivariate regression

  • These methods rely upon least squares estimation, which:

    1. Is inadequate with missing data
    2. Offers very limited methods of setting covariance matrix structures
    3. Does not allow for different sets of predictor variables for each outcome
    4. Does not give much information about model fit
    5. Does not provide adequate model comparison procedures
  • The classical methods have been subsumed into the modern (likelihood- or Bayes-based) multivariate methods

    • Subsume: include or absorb (something) in something else
    • Meaning: modern methods do what classical methods do (and more)

Contemporary Methods for Estimating Multivariate Linear Models

  • We will discuss three large classes of multivariate linear modeling methods:

    • Path analysis models (typically through structural equation modeling and path analysis software)
    • Linear mixed models (typically through linear models software)
    • Psychological network models (typically through psychological network software)
    • Bayesian networks (frequently not mentioned in the social sciences but subsume all we are doing)
  • The theory behind each is identical—the main difference is software

  • Some software does a lot (Mplus software is likely the most complete), but none (as of 2024) does it all

  • We will start with path analysis (via the lavaan package) as the modeling method is more direct, and then move to linear mixed-models software (via the nlme and lme4 packages)

  • Bayesian networks will be discussed in the Bayes section of the course and will use entirely different software

The “Curse” of Dimensionality: Shared Across Models

  • Having lots of parameters creates a number of problems

    • Estimation issues for small sample sizes
    • Power to detect effects
    • Model fit issues for large numbers of outcomes
  • Curse of dimensionality: for multivariate normal data, there is a quadratic increase in the number of parameters as the number of outcomes increases linearly

  • To be used as an analysis model, however, a model-implied covariance structure must “fit” nearly as well as the saturated/unstructured covariance matrix: the model-implied \(\boldsymbol{\Sigma}\) must closely reproduce the sample covariance matrix \(\mathbf{S}\)

    \[ \mathbf{S} \approx \boldsymbol{\Sigma} \]
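The quadratic growth is easy to verify by counting: a saturated model for \(p\) outcomes estimates \(p\) means, \(p\) variances, and \(p(p-1)/2\) covariances, for \(p(p+3)/2\) parameters in total. A quick arithmetic sketch (shown in Python, purely for the counting):

```python
def n_saturated_params(p):
    """Total parameters in a saturated MVN model for p outcomes:
    p means + p variances + p * (p - 1) / 2 covariances."""
    return p + p + p * (p - 1) // 2  # equals p * (p + 3) // 2

# Linear growth in outcomes, quadratic growth in parameters:
print([n_saturated_params(p) for p in (1, 2, 4, 8, 16)])  # [2, 5, 14, 44, 152]
```

With 1 outcome there are 2 parameters and with 2 outcomes there are 5, matching the parameter counts of the lavaan models estimated later in this lecture.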

Biggest Difference From Univariate Models: Model Fit

  1. In univariate linear models, the “model for the variance” wasn’t much of a model; there was one variance term possible (\(\sigma^2_e\)) and one term estimated
    • A saturated model: all variances/covariances are freely estimated without any constraints—model fit was always perfect
    • For example, a correlation matrix of 3 variables has at most 3 estimated parameters (the three pairwise correlations).
  2. In multivariate linear models, because of the number of variances/covariances, oftentimes models are not saturated for the variances
    • Therefore, model fit becomes an issue
  3. Any non-saturated model for the variances must be shown to fit the data before being used for interpretation
    • “Fit the data” has differing standards depending on the software used.

Today’s Example Data

  • Data are simulated based on the results reported in:

  • Sample of 350 undergraduates (229 women, 121 men)

    • In the simulation, about 10% of values were set to missing (using a missing completely at random mechanism).
  • Make sure you install the lavaan package

install.packages("lavaan")

Dictionary

  1. Prior Experience at High School Level (HSL)
  2. Prior Experience at College Level (CC)
  3. Perceived Usefulness of Mathematics (USE)
  4. Math Self-Concept (MSC)
  5. Math Anxiety (MAS)
  6. Math Self-Efficacy (MSE)
  7. Math Performance (PERF)
  8. Female (sex variable: 0 = male; 1 = female)
Code
library(ESRM64503)
library(kableExtra)
library(tidyverse)
library(DescTools) # Desc() allows you to quickly screen data
options(digits=3)
head(dataMath)
  id hsl cc use msc mas mse perf female
1  1  NA  9  44  55  39  NA   14      1
2  2   3  2  77  70  42  71   12      0
3  3  NA 12  64  52  31  NA   NA      1
4  4   6 20  71  65  39  84   19      0
5  5   2 15  48  19   2  60   12      0
6  6  NA 15  61  62  42  87   18      0

Desc() in the DescTools package allows you to quickly screen data

dim(dataMath)
Desc(dataMath[,2:8], plotit = FALSE, conf.level = FALSE)
[1] 350   9
────────────────────────────────────────────────────────────────────────────── 
Describe dataMath[, 2:8] (data.frame):

data frame: 350 obs. of  7 variables
        145 complete cases (41.4%)

  Nr  Class  ColName  NAs         Levels
  1   int    hsl      36 (10.3%)        
  2   int    cc       37 (10.6%)        
  3   int    use      24 (6.9%)         
  4   int    msc      39 (11.1%)        
  5   int    mas      46 (13.1%)        
  6   int    mse      34 (9.7%)         
  7   int    perf     60 (17.1%)        


────────────────────────────────────────────────────────────────────────────── 
1 - hsl (integer)

  length      n    NAs  unique    0s   mean  meanCI'
     350    314     36       8     0   4.92    4.92
          89.7%  10.3%          0.0%           4.92
                                                   
     .05    .10    .25  median   .75    .90     .95
    3.00   3.00   4.00    5.00  6.00   6.00    7.00
                                                   
   range     sd  vcoef     mad   IQR   skew    kurt
    7.00   1.32   0.27    1.48  2.00  -0.26   -0.41
                                                   

   value  freq   perc  cumfreq  cumperc
1      1     1   0.3%        1     0.3%
2      2    11   3.5%       12     3.8%
3      3    34  10.8%       46    14.6%
4      4    71  22.6%      117    37.3%
5      5    79  25.2%      196    62.4%
6      6    88  28.0%      284    90.4%
7      7    27   8.6%      311    99.0%
8      8     3   1.0%      314   100.0%

' 0%-CI (classic)

────────────────────────────────────────────────────────────────────────────── 
2 - cc (integer)

  length      n    NAs  unique     0s   mean  meanCI'
     350    313     37      28     21  10.31   10.31
          89.4%  10.6%           6.0%          10.31
                                                    
     .05    .10    .25  median    .75    .90     .95
    0.00   2.00   6.00   10.00  14.00  19.00   20.00
                                                    
   range     sd  vcoef     mad    IQR   skew    kurt
   27.00   5.89   0.57    5.93   8.00   0.17   -0.43
                                                    
lowest : 0 (21), 1 (5), 2 (9), 3 (9), 4 (12)
highest: 23, 24, 25, 26, 27

heap(?): remarkable frequency (8.6%) for the mode(s) (= 10, 13)

' 0%-CI (classic)

────────────────────────────────────────────────────────────────────────────── 
3 - use (integer)

  length      n    NAs  unique     0s   mean  meanCI'
     350    326     24      71      1  52.50   52.50
          93.1%   6.9%           0.3%          52.50
                                                    
     .05    .10    .25  median    .75    .90     .95
   25.50  31.00  42.25   52.00  64.00  71.50   77.75
                                                    
   range     sd  vcoef     mad    IQR   skew    kurt
  100.00  15.81   0.30   16.31  21.75  -0.11   -0.08
                                                    
lowest : 0, 13, 15, 16, 19 (2)
highest: 84 (2), 86, 92, 95, 100

' 0%-CI (classic)

────────────────────────────────────────────────────────────────────────────── 
4 - msc (integer)

  length      n    NAs  unique     0s   mean  meanCI'
     350    311     39      76      0  49.79   49.79
          88.9%  11.1%           0.0%          49.79
                                                    
     .05    .10    .25  median    .75    .90     .95
   24.00  29.00  38.00   49.00  61.00  72.00   80.50
                                                    
   range     sd  vcoef     mad    IQR   skew    kurt
   98.00  16.96   0.34   16.31  23.00   0.23   -0.09
                                                    
lowest : 5, 9, 12, 15, 16 (2)
highest: 87 (3), 89, 91, 94, 103

' 0%-CI (classic)

────────────────────────────────────────────────────────────────────────────── 
5 - mas (integer)

  length      n    NAs  unique     0s   mean  meanCI'
     350    304     46      57      1  31.50   31.50
          86.9%  13.1%           0.3%          31.50
                                                    
     .05    .10    .25  median    .75    .90     .95
   13.00  17.00  24.75   32.00  39.00  45.70   50.00
                                                    
   range     sd  vcoef     mad    IQR   skew    kurt
   62.00  11.32   0.36   10.38  14.25  -0.13   -0.17
                                                    
lowest : 0, 2, 3 (2), 5, 6
highest: 54 (2), 55 (2), 56, 58, 62

' 0%-CI (classic)

────────────────────────────────────────────────────────────────────────────── 
6 - mse (integer)

  length      n    NAs  unique     0s   mean  meanCI'
     350    316     34      56      0  73.41   73.41
          90.3%   9.7%           0.0%          73.41
                                                    
     .05    .10    .25  median    .75    .90     .95
   54.00  58.00  65.00   73.00  81.00  88.50   92.25
                                                    
   range     sd  vcoef     mad    IQR   skew    kurt
   60.00  11.89   0.16   11.86  16.00   0.06   -0.26
                                                    
lowest : 45 (2), 46, 47, 49 (2), 50 (4)
highest: 98, 100, 101 (2), 103, 105 (2)

' 0%-CI (classic)

────────────────────────────────────────────────────────────────────────────── 
7 - perf (integer)

  length      n    NAs  unique     0s   mean  meanCI'
     350    290     60      19      0  13.97   13.97
          82.9%  17.1%           0.0%          13.97
                                                    
     .05    .10    .25  median    .75    .90     .95
    9.45  10.00  12.00   14.00  16.00  18.00   19.00
                                                    
   range     sd  vcoef     mad    IQR   skew    kurt
   18.00   2.96   0.21    2.97   4.00   0.16    0.14
                                                    
lowest : 5, 6, 7 (2), 8 (3), 9 (8)
highest: 19 (10), 20 (4), 21 (3), 22 (2), 23

heap(?): remarkable frequency (13.8%) for the mode(s) (= 13)

' 0%-CI (classic)

Using MVN Likelihoods in lavaan

The lavaan package is developed by Dr. Yves Rosseel to provide useRs, researchers and teachers a free open-source, but commercial-quality package for latent variable modeling. You can use lavaan to estimate a large variety of multivariate statistical models, including path analysis, confirmatory factor analysis, structural equation modeling and growth curve models.

  • lavaan’s default model is a linear (mixed) model that uses ML with the multivariate normal distribution

  • ML is sometimes called a full-information method (as in FIML: full-information maximum likelihood)

    • Full information is the term used when each observation is used in a likelihood function
  • The contrast is limited information (not all observations are used; typically summary statistics are used)

install.packages('lavaan')
library(lavaan)

Revisiting Univariate Linear Regression

  • We will begin our discussion with perhaps the simplest model we will see: a univariate empty model
    • We will use the PERF variable from our example data: Performance on a mathematics assessment
  • The empty model for PERF is:
    • \(e_{i, PERF} \sim N(0, \sigma^2_{e,PERF})\)
    • Two parameters are estimated: \(\beta_{0,PERF}\) and \(\sigma^2_{e,PERF}\)

\[ \text{PERF}_i = \beta_{0,\color{tomato}{PERF}} + e_{i, \color{tomato}{PERF}} \]

  • Here, the additional subscript is added to denote these terms are part of the model for the PERF variable
    • We will need these when we get to multivariate models and path analysis

lavaan Syntax: General

  • Estimating a multivariate model using lavaan involves three steps:
    1. Model building: Build a string object including syntax for models (e.g., model01.syntax)
    2. Model fitting: Estimate the model with observed data and save it to a fit object (e.g., model01.fit <- cfa(model01.syntax, data))
    3. Model summary: Extract the results from the fit object using summary()
  • The lavaan package works by taking the typical R model syntax (as from lm()) and putting it into a quoted character variable
    • lavaan model syntax also includes other commands used to access other parts of the model (really, parts of the MVN distribution)
  • Here are rules of lavaan model syntax in Model building step:
    • ~~ indicates variance or covariance between variables on either side (perf ~~ perf estimates variance)
    • ~1 indicates intercept for one variable (perf~1)
    • ~ indicates LHS regresses on RHS (perf~use)
## Syntax for model01: the syntax is a string object containing our model specification
model01.syntax <- "
## Variances:
perf ~~ perf

## Means:
perf ~ 1
"
## Estimation for model01: cfa() estimates the model from the syntax and data;
## see ?lavOptions for the interpretation of the arguments
model01.fit <- cfa(model01.syntax, data=dataMath, mimic="MPLUS", fixed.x = TRUE, estimator = "MLR")

## Print output
summary(model01.fit)
lavaan 0.6-19 ended normally after 8 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         2

                                                  Used       Total
  Number of observations                           290         350
  Number of missing patterns                         1            

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
    perf             13.966    0.174   80.397    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    perf              8.751    0.756   11.581    0.000

Model Parameter Estimates and Assumptions

Summary of lavaan's cfa()
Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
    perf             13.966    0.174   80.397    0.000
Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    perf              8.751    0.756   11.581    0.000
  • Parameter Estimates:
    • \(\beta_{0,PERF} = 13.966\)
    • \(\sigma^2_{e,PERF} = 8.751\)
  • Using the model estimates, we know that:

\[ \text{PERF}_i \sim N(13.966, 8.751) \]

  • Based on our univariate empty model, PERF follows a normal distribution with mean 13.966 and variance 8.751.
mean(dataMath$perf, na.rm = TRUE)
var(dataMath$perf, na.rm = TRUE)
[1] 14
[1] 8.78
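The small gap between the two numbers is the estimator, not an error: ML divides the sum of squares by \(N\) (here the 290 complete cases), while var() divides by \(N-1\). A quick check with the rounded values from the output above (plain arithmetic, sketched in Python):

```python
n = 290               # complete cases on perf used by lavaan
sample_var = 8.78     # var() estimate (denominator n - 1)
ml_var = sample_var * (n - 1) / n  # rescale to the ML denominator n
print(round(ml_var, 2))  # 8.75, close to lavaan's 8.751
```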

Multivariate Correlation Models

Model 2: Adding One More Variable to the Model

  • Variables: We already know how to use lavaan to estimate a univariate model; let’s move on to model two variables as outcomes simultaneously:

    • Mathematics performance (perf)
    • Perceived Usefulness of Mathematics (use)
  • Assumption: We will assume both to be continuous variables (conditionally MVN)

  • Research Question: Are students’ performance and their perceived usefulness of mathematics significantly related?

  • Initially, we will only look at an empty model with these two variables:

    • Empty models are baseline models
    • We will use these to show how such models look based on the characteristics of the multivariate normal distribution
    • We will also show the bigger picture when modeling multivariate data: how we must be sure to model the covariance matrix correctly

Multivariate Empty Model: The Notation

  • The multivariate model for perf and use is given by two regression models, which are estimated simultaneously.

    • The model for the means:

\[ \text{PERF}_i = \beta_{0,\text{PERF}}+e_{i,\text{PERF}} \]

\[ \text{USE}_i = \beta_{0,\text{USE}}+e_{i,\text{USE}} \]

  • As there are two variables, the error terms have a joint distribution that will be multivariate normal

    • The model for the variances:

\[ \begin{bmatrix}e_{i,\text{PERF}}\\e_{i,\text{USE}}\end{bmatrix} \sim \mathbf{MVN} (\mathbf{0}=\begin{bmatrix}0\\0\end{bmatrix}, \mathbf{R}=\begin{bmatrix}\sigma^2_{e,\text{PERF}} & \sigma_{e,\text{PERF},\text{USE}}\\ \sigma_{e,\text{PERF},\text{USE}}&\sigma^2_{e,\text{USE}}\end{bmatrix}) \]

  • Each error term (diagonal elements in \(\mathbf{R}\)) has its own variance, but now there is a covariance between error terms
    • We will soon see that the overall \(\mathbf{R}\) matrix structure can be modified

Data Model

  • Before showing the syntax and the results, we must first describe what the multivariate empty model implies about how our data should look

  • Multivariate model with matrices: \[ \boxed{\begin{bmatrix}\text{PERF}_i\\\text{USE}_i\end{bmatrix} =\begin{bmatrix}1&0\\0&1\end{bmatrix}\begin{bmatrix}\beta_{0,\text{PERF}}\\\beta_{0,\text{USE}}\end{bmatrix} + \begin{bmatrix}e_{i,\text{PERF}}\\e_{i,\text{USE}}\end{bmatrix}} \]

\[ \boxed{\mathbf{Y_i = X_iB + e_i}} \]

  • Using expected values and linear combination rules, we can show that:

\[ \boxed{\begin{bmatrix}\text{PERF}_i\\\text{USE}_i\end{bmatrix} \sim \mathbf{MVN}_2 (\boldsymbol{\mu}_i=\begin{bmatrix}\beta_{0,\text{PERF}}\\\beta_{0,\text{USE}}\end{bmatrix}, \mathbf{V}=\begin{bmatrix}\sigma^2_{e,\text{PERF}} & \sigma_{e,\text{PERF},\text{USE}}\\ \sigma_{e,\text{PERF},\text{USE}}&\sigma^2_{e,\text{USE}}\end{bmatrix})} \]

\[ \boxed{\mathbf{Y}_i \sim MVN_2(\mathbf{X}_i\mathbf{B},\mathbf{V}_i)} \]
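One step in this derivation is worth making explicit: \(\mathbf{X}_i\mathbf{B}\) is a fixed (non-random) term, so it shifts the mean of \(\mathbf{Y}_i\) but contributes nothing to its covariance:

\[ E(\mathbf{Y}_i) = \mathbf{X}_i\mathbf{B} + E(\mathbf{e}_i) = \mathbf{X}_i\mathbf{B}, \qquad \text{Var}(\mathbf{Y}_i) = \text{Var}(\mathbf{X}_i\mathbf{B} + \mathbf{e}_i) = \text{Var}(\mathbf{e}_i) \]

This is why the model-implied covariance matrix \(\mathbf{V}\) in the empty model is exactly the error covariance matrix \(\mathbf{R}\).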

Model 2: Multivariate Correlation Model with 2 variables

Aim: To estimate the simple correlation between perf and use. Note that no predictors are used to explain the variances of the two variables.

Parameters: The number of parameters is 5: the two variables’ means plus three variance/covariance components.

model02.syntax = "
# Residual Variances:
perf ~~ perf   # variance: sigma^2_{e,PERF}
use ~~ use     # variance: sigma^2_{e,USE}

# Residual Covariance:
perf ~~ use    # covariance: sigma_{e,PERF,USE}

# Means:
perf ~ 1       # intercept: beta_{0,PERF}
use ~ 1        # intercept: beta_{0,USE}
"

Properties of this model:

  • This model is said to be saturated: all possible parameters are estimated. It is also called an unstructured covariance matrix.

  • No other structure for the covariance matrix can fit better (only as well)

  • The model for the means consists of two simple regression models without any predictors

Multivariate Correlation Model Results

Output of lavaan

## Estimation for model02
model02.fit <- cfa(model02.syntax, data=dataMath, mimic="MPLUS", fixed.x = TRUE, estimator = "MLR") 

## Print output
summary(model02.fit)
lavaan 0.6-19 ended normally after 29 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         5

                                                  Used       Total
  Number of observations                           348         350
  Number of missing patterns                         3            

Model Test User Model:
                                              Standard      Scaled
  Test Statistic                                 0.000       0.000
  Degrees of freedom                                 0           0

Parameter Estimates:

  Standard errors                             Sandwich
  Information bread                           Observed
  Observed information based on                Hessian

Covariances:
                   Estimate  Std.Err  z-value  P(>|z|)
  perf ~~                                             
    use               6.847    2.850    2.403    0.016

Intercepts:
                   Estimate  Std.Err  z-value  P(>|z|)
    perf             13.959    0.174   80.442    0.000
    use              52.440    0.872   60.140    0.000

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)
    perf              8.742    0.754   11.596    0.000
    use             249.245   19.212   12.973    0.000

Question: Why is the sample size different from Model 1?

Model 1:


                                                  Used       Total
  Number of observations                           290         350
  Number of missing patterns                         1      

Model 2:

                                                 Used       Total
  Number of observations                           348         350
  Number of missing patterns                         3           

Because mimic = "MPLUS" turns on full-information ML for missing data (note the “Number of missing patterns” line in the output), lavaan::cfa() drops a case only if it is missing on every modeled variable. In Model 1, the 290 cases with a non-NA value of perf remain. In Model 2, the 348 cases with at least one of perf and use observed remain; only the 2 cases missing both are dropped.

table(is.na(dataMath$perf))

FALSE  TRUE 
  290    60 
table(is.na(dataMath$perf) & is.na(dataMath$use))

FALSE  TRUE 
  348     2 
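The two “Used” counts follow directly from these tables: under full-information ML, a case is dropped only when it is missing on every modeled variable. As plain arithmetic (sketched in Python, using the counts above):

```python
n_total = 350
n_missing_perf = 60   # cases with NA on perf (the only variable in Model 1)
n_missing_both = 2    # cases with NA on both perf and use
print(n_total - n_missing_perf)  # 290 used in Model 1
print(n_total - n_missing_both)  # 348 used in Model 2
```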

Result: Correlation Between PERF and USE

Code
# Covariance Estimates
cat("Covariance Estimates: \n")
parameterestimates(model02.fit)[1:3,]

# Standardized Parameters
cat("Standardized Parameters (correlation): \n")
standardizedsolution(model02.fit)[1:3,]
Covariance Estimates: 
   lhs op  rhs    est     se    z pvalue ci.lower ci.upper
1 perf ~~ perf   8.74  0.754 11.6  0.000     7.26     10.2
2  use ~~  use 249.25 19.212 13.0  0.000   211.59    286.9
3 perf ~~  use   6.85  2.850  2.4  0.016     1.26     12.4
Standardized Parameters (correlation): 
   lhs op  rhs est.std   se    z pvalue ci.lower ci.upper
1 perf ~~ perf   1.000 0.00   NA     NA    1.000    1.000
2  use ~~  use   1.000 0.00   NA     NA    1.000    1.000
3 perf ~~  use   0.147 0.06 2.44  0.015    0.029    0.265

We can verify that the correlation matrix can be calculated from \(S\) (the covariance matrix) and a diagonal matrix \(D\) of standard deviations:

\[ R = D^{-1}SD^{-1} \]

D = matrix(c(sqrt(8.742), 0, 0,  sqrt(249.245)), nrow = 2)
S = matrix(c(8.742, 6.847, 6.847,  249.245), nrow = 2)
solve(D) %*% S %*% solve(D)
      [,1]  [,2]
[1,] 1.000 0.147
[2,] 0.147 1.000
  • Result of the z test: Students’ math performance is significantly correlated with their perceived usefulness of math (r = .147, z = 2.44, p = .015), but the correlation is weak.
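The standardized covariance can also be verified with a single scalar computation, \(r = \sigma_{e,\text{PERF},\text{USE}} / \sqrt{\sigma^2_{e,\text{PERF}}\,\sigma^2_{e,\text{USE}}}\). A quick check with the rounded estimates (plain arithmetic, sketched in Python):

```python
import math

cov_perf_use = 6.847
var_perf, var_use = 8.742, 249.245
r = cov_perf_use / math.sqrt(var_perf * var_use)
print(round(r, 3))  # 0.147, matching the standardized solution
```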

Result: Comparing Model with Data

library(mvtnorm)
x = expand.grid(
  perf = seq(0, 25, .1),
  use = seq(0, 120, .1)
)
sim_dat <- cbind(
  x,
  density = dmvnorm(x, mean = c(13.959, 52.440), sigma = S)
)

  
mod2_density <- ggplot() +
  geom_contour_filled(aes(x = perf, y = use, z = density), data = sim_dat) +
  geom_point(aes(x = perf, y = use), data = dataMath, colour = "white", alpha = .5) +
  labs( x= "PERF", y="USE") +
  theme_classic() +
  theme(text = element_text(size = 20)) 
mod2_density

  • We can use our estimated parameters to simulate the model-implied density plot for perf and use, which can be compared to the observed data for perf and use.

Multivariate normal density function

\[ \begin{bmatrix}\text{PERF}_i\\\text{USE}_i\end{bmatrix} \sim \mathbf{N}_2 (\boldsymbol{\mu}_i=\begin{bmatrix}13.959\\52.440\end{bmatrix}, \mathbf{V}_i =\mathbf{R}=\begin{bmatrix}8.742 & 6.847\\ 6.847 & 249.245 \end{bmatrix}) \]

Code
library(plotly)
density_surface <- sim_dat |> 
  pivot_wider(names_from = use, values_from = density) |> 
  select(-perf) |> 
  as.matrix()
plot_ly(z = density_surface, x = seq(0, 120, .1), y = seq(0, 25, .1), colors = "PuRd", color = "green",
        width = 1600, height = 800) |> add_surface() 

Multivariate Regression Model Estimated Density

A Different Model for the data

  • To demonstrate how models may vary in terms of model fit (and to set up a discussion of model fit and model comparisons), we will estimate a model where we set the covariance between PERF and USE to zero.
    • Zero covariance implies zero correlation – which is unlikely to be true given our previous analysis
  • You likely would not use this model in a real data analysis
    • If anything, you may start with a zero covariance and then estimate one
  • But this will help to introduce some concepts needed to assess the quality of the multivariate model

Model 3: Multivariate Empty Model with Zero Covariance

One way to double-check whether the “correlation” is important is to deliberately fix the correlation to 0 and then compare this model to the previous model to see if it gets much worse.

model03.syntax = "
# Variances:
perf ~~ perf 
use ~~ use   

# Covariance:
# Fix the residual covariance to zero (sigma_{e,PERF,USE} = 0) via the 0* modifier
perf ~~ 0*use

# Means:
perf ~ 1  
use ~ 1   
"
## Estimation for model03
model03.fit <- cfa(model03.syntax, data=dataMath, mimic="MPLUS", fixed.x = TRUE, estimator = "MLR") 

## Print output
parameterestimates(model03.fit) |> show_table()
lhs op rhs est se z pvalue ci.lower ci.upper
perf ~~ perf 8.75 0.756 11.6 0 7.27 10.2
use ~~ use 249.20 19.212 13.0 0 211.55 286.9
perf ~~ use 0.00 0.000 NA NA 0.00 0.0
perf ~1 13.97 0.174 80.4 0 13.62 14.3
use ~1 52.50 0.874 60.0 0 50.79 54.2

Model Assumption

  • The zero covariance now leads to the following assumptions about the data:

    \[ \begin{bmatrix}\text{PERF}_i\\\text{USE}_i\end{bmatrix} \sim \mathbf{N}_2 (\boldsymbol{\mu}_i=\begin{bmatrix}13.959\\52.440\end{bmatrix}, \mathbf{V}_i =\mathbf{R}=\begin{bmatrix} 8.750 & \color{tomato}{0} \\ \color{tomato}{0} & 249.200 \end{bmatrix}) \]

  • Because these are MVN, we are assuming PERF is independent of USE (they have zero correlation/covariance)

Question

Is our new model (zero covariance) better- or worse-fitting compared to our previous model (freely estimated covariance)?

Answer

I think Model 3 fits worse because it has fewer parameters and stronger assumptions. We don’t yet know whether the degree of “worse” is statistically significant.

anova(model02.fit, model03.fit)

Scaled Chi-Squared Difference Test (method = "satorra.bentler.2001")

lavaan->lavTestLRT():  
   lavaan NOTE: The "Chisq" column contains standard test statistics, not the 
   robust test that should be reported per model. A robust difference test is 
   a function of two standard (not robust) statistics.
            Df  AIC  BIC Chisq Chisq diff Df diff Pr(>Chisq)  
model02.fit  0 4180 4199  0.00                                
model03.fit  1 4184 4200  6.06       5.57       1      0.018 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
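The reported p-value can be reproduced by hand. Under the null hypothesis of zero covariance, the scaled difference statistic (5.57) is approximately chi-square with 1 df, and for 1 df the upper-tail probability reduces to \(\text{erfc}(\sqrt{x/2})\). A quick check with the rounded statistic (plain arithmetic, sketched in Python):

```python
import math

chisq_diff = 5.57  # scaled chi-square difference, 1 df
# For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2))
p = math.erfc(math.sqrt(chisq_diff / 2))
print(round(p, 3))  # 0.018, matching the lavTestLRT output
```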

Model Results

parameterestimates(model02.fit) |> show_table()
lhs op rhs est se z pvalue ci.lower ci.upper
perf ~~ perf 8.74 0.754 11.6 0.000 7.26 10.2
use ~~ use 249.25 19.212 13.0 0.000 211.59 286.9
perf ~~ use 6.85 2.850 2.4 0.016 1.26 12.4
perf ~1 13.96 0.174 80.4 0.000 13.62 14.3
use ~1 52.44 0.872 60.1 0.000 50.73 54.1
parameterestimates(model03.fit) |> show_table()
lhs op rhs est se z pvalue ci.lower ci.upper
perf ~~ perf 8.75 0.756 11.6 0 7.27 10.2
use ~~ use 249.20 19.212 13.0 0 211.55 286.9
perf ~~ use 0.00 0.000 NA NA 0.00 0.0
perf ~1 13.97 0.174 80.4 0 13.62 14.3
use ~1 52.50 0.874 60.0 0 50.79 54.2

Examining Model/Data Fit

Code
sim_dat_2 <- cbind(
  x,
  density = dmvnorm(x, mean = c(13.959, 52.440), 
                    sigma = matrix(c(8.750535, 0, 0, 249.200921), nrow = 2))
)

  
mod3_density <- ggplot() +
  geom_contour_filled(aes(x = perf, y = use, z = density), data = sim_dat_2) +
  geom_point(aes(x = perf, y = use), data = dataMath, colour = "white", alpha = .5) +
  labs( x= "PERF", y="USE", title = "Model 3's Density") +
  theme_classic() +
  theme(text = element_text(size = 20)) 

mod2_density

Code
mod3_density

Multivariate correlation model with three variables

  • Aim: the mutual relationships among three variables are of interest.

    • Math Performance (PERF)

    • Math Self-Concept (MSC)

    • Math Self-Efficacy (MSE)

head(dataMath[, c("perf", "msc", "mse")])
  perf msc mse
1   14  55  NA
2   12  70  71
3   NA  52  NA
4   19  65  84
5   12  19  60
6   18  62  87

Diagram

model04.syntax = "
# Residual Variances:
perf ~~ perf 
msc ~~ msc   
mse ~~ mse   

# Residual Covariance:
perf ~~ msc
perf ~~ mse
msc ~~ mse

# Means:
perf ~ 1  
msc ~ 1   
mse ~ 1
"
model04.fit <- cfa(model04.syntax, data=dataMath, mimic="MPLUS", fixed.x = TRUE, estimator = "MLR")

Interpret the results

What are the estimated correlations, standard errors, and p-values for the mutual correlations among the three variables?

standardizedsolution(model04.fit)
   lhs op  rhs est.std    se    z pvalue ci.lower ci.upper
1 perf ~~ perf   1.000 0.000   NA     NA    1.000    1.000
2  msc ~~  msc   1.000 0.000   NA     NA    1.000    1.000
3  mse ~~  mse   1.000 0.000   NA     NA    1.000    1.000
4 perf ~~  msc   0.610 0.035 17.3      0    0.541    0.679
5 perf ~~  mse   0.734 0.026 28.3      0    0.683    0.784
6  msc ~~  mse   0.653 0.032 20.6      0    0.591    0.715
7 perf ~1        4.670 0.202 23.1      0    4.274    5.067
8  msc ~1        2.941 0.119 24.7      0    2.708    3.174
9  mse ~1        6.218 0.234 26.6      0    5.760    6.676

Exercise

Assume that researchers want to investigate whether there is a correlation between self-efficacy (MSE) and self-concept (MSC). If the correlation is weak enough, can we assume that these two have no relationship?

  • Which model do you prefer: the new constrained model or the previous one?
  • What is the estimated relationship between MSC and PERF in the new model?
  • What is the estimated relationship between MSE and PERF in the new model?
model04b.syntax = "
# Residual Variances:
perf ~~ perf 
msc ~~ msc   
mse ~~ mse   

# Residual Covariance:
____ ~~ ____
____ ~~ ____
____ ~~ ____

# Means:
perf ~ 1  
msc ~ 1   
mse ~ 1
"
model04b.fit <- cfa(model04b.syntax, data=dataMath, mimic="MPLUS", fixed.x = TRUE, estimator = "MLR")
model04b.syntax = "
# Residual Variances:
perf ~~ perf 
msc ~~ msc   
mse ~~ mse   

# Residual Covariance:
perf ~~ msc
perf ~~ mse
msc ~~ 0*mse

# Means:
perf ~ 1  
msc ~ 1   
mse ~ 1
"
model04b.fit <- cfa(model04b.syntax, data=dataMath, mimic="MPLUS", fixed.x = TRUE, estimator = "MLR")

standardizedsolution(model04b.fit)
   lhs op  rhs est.std    se     z pvalue ci.lower ci.upper
1 perf ~~ perf   1.000 0.000    NA     NA    1.000    1.000
2  msc ~~  msc   1.000 0.000    NA     NA    1.000    1.000
3  mse ~~  mse   1.000 0.000    NA     NA    1.000    1.000
4 perf ~~  msc   0.241 0.062  3.88      0    0.119    0.363
5 perf ~~  mse   0.656 0.044 14.81      0    0.569    0.743
6  msc ~~  mse   0.000 0.000    NA     NA    0.000    0.000
7 perf ~1        5.045 0.216 23.39      0    4.623    5.468
8  msc ~1        2.951 0.120 24.63      0    2.716    3.185
9  mse ~1        6.204 0.233 26.62      0    5.747    6.661
anova(model04.fit, model04b.fit)

Scaled Chi-Squared Difference Test (method = "satorra.bentler.2001")

lavaan->lavTestLRT():  
   lavaan NOTE: The "Chisq" column contains standard test statistics, not the 
   robust test that should be reported per model. A robust difference test is 
   a function of two standard (not robust) statistics.
             Df  AIC  BIC Chisq Chisq diff Df diff Pr(>Chisq)    
model04.fit   0 6190 6225     0                                  
model04b.fit  1 6353 6384   165        240       1     <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We have two pieces of evidence to prefer the previous model:
    1. The AIC and BIC for Model 04b are larger than for Model 04.
    2. According to the likelihood ratio test, Model 04 fits significantly better than Model 04b, which means the correlation between MSE and MSC is likely nonzero. We prefer the model with a freely estimated correlation.

Wrapping Up

  1. This lecture was an introduction to the estimation of multivariate linear models for multivariate outcomes using the path analysis/SEM package lavaan

  2. We saw that the model for continuous data uses the multivariate normal distribution in its likelihood function

Next Class

  1. Does each model fit the data well (absolute model fit)?
  2. If not, how can we improve model fit?
  3. Which model fits better (relative model fit)?
  4. Answers are given in the following lectures…