ESRM 64503 - PCA and EFA
Educational Statistics and Research Methods (ESRM) Program
University of Arkansas
2024-10-09
Course | Course Name | Survey Open | Survey Close |
---|---|---|---|
ESRM 64503 - 001 | APPLIED MULTIVARIATE STATS | Nov 21 | Dec 6 |
Methods for exploratory analysis
\[ \mathbf{\Lambda\Lambda^T = \Lambda^T\Lambda = I} \]
A matrix \(\mathbf{\Lambda}\) satisfying this condition is an orthogonal matrix
Orthogonal matrices are characterized by two properties: the rows and columns each form an orthonormal set of vectors, and the transpose equals the inverse (\(\mathbf{\Lambda}^T = \mathbf{\Lambda}^{-1}\))
The matrix above is also called orthonormal
Orthonormal matrices are used in principal components and exploratory factor analysis
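As a quick numerical illustration (a sketch; the 45-degree rotation matrix is hypothetical, chosen only to demonstrate the two properties):

# a 2 x 2 rotation matrix (rotation by 45 degrees) is orthogonal
theta = pi / 4
Lambda = matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
round(Lambda %*% t(Lambda), 10)  # identity matrix
round(t(Lambda) %*% Lambda, 10)  # identity matrix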
\[ \mathbf{\Sigma e} = \lambda \mathbf{e} \]
Use the eigen() function to get eigenvalues and eigenvectors of a square matrix (e.g., a correlation matrix):

library(ESRM64503)
## Correlation matrix of SAT-Verbal and SAT-Math
sat_corrmat = cor(dataSAT[, c("SATV", "SATM")])
## eigenvalues and eigenvectors of correlation matrix:
sat_eigen = eigen(x = sat_corrmat, symmetric = TRUE)
sat_eigen$values # eigenvalues
[1] 1.7752238 0.2247762
sat_eigen$vectors # eigenvectors
          [,1]       [,2]
[1,] 0.7071068 -0.7071068
[2,] 0.7071068  0.7071068
\[ \lambda_1 = 1.775; \quad \lambda_2 = 0.225 \]
\[ \mathbf{e}_1 = \begin{bmatrix}0.707 \\ 0.707\end{bmatrix}; \mathbf{e}_2 = \begin{bmatrix}-0.707 \\ 0.707\end{bmatrix} \]
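We can verify the eigenvalue equation \(\mathbf{\Sigma e} = \lambda \mathbf{e}\) numerically with the objects above (a minimal check):

# Sigma %*% e_1 returns the same vector as lambda_1 * e_1
sat_corrmat %*% sat_eigen$vectors[, 1]
sat_eigen$values[1] * sat_eigen$vectors[, 1]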
\[ \mathbf{\Sigma =}\sum_{i=1}^{p} \lambda_i\mathbf{e}_i \mathbf{e}_i^T \]
where \(p\) is the number of rows/columns of the square matrix and \(i\) indexes the eigenvalue/eigenvector pairs
\[ \mathbf{R}_1 = \lambda_1 \mathbf{e}_1 \mathbf{e}_1^T = 1.775 \begin{bmatrix}0.707 \\ 0.707\end{bmatrix} \begin{bmatrix}0.707 & 0.707\end{bmatrix} = \begin{bmatrix}0.888 & 0.888\\ 0.888 & 0.888\end{bmatrix} \]
\[ \begin{align} \mathbf{R}_2 &= \mathbf{R}_1 +\lambda_2 \mathbf{e}_2 \mathbf{e}_2^T \\ &= \begin{bmatrix}0.888 & 0.888\\ 0.888 & 0.888\end{bmatrix} + 0.225 \begin{bmatrix}-0.707 \\ 0.707\end{bmatrix} \begin{bmatrix}-0.707 & 0.707\end{bmatrix} \\ &= \begin{bmatrix}1.000 & 0.775\\ 0.775 & 1.000 \end{bmatrix} \end{align} \]
Note: tcrossprod(e) computes \(\mathbf{e}\mathbf{e}^T\).

# spectral decomposition: rank-1 matrix from the first eigenpair
corr_rank1 = sat_eigen$values[1] * tcrossprod(sat_eigen$vectors[,1])
corr_rank1
[,1] [,2]
[1,] 0.8876119 0.8876119
[2,] 0.8876119 0.8876119
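Adding the second rank-1 matrix recovers the original correlation matrix exactly; a minimal check using the objects above:

# rank-2 (full) reconstruction: sum over both eigenpairs
corr_rank2 = corr_rank1 + sat_eigen$values[2] * tcrossprod(sat_eigen$vectors[,2])
all.equal(corr_rank2, sat_corrmat, check.attributes = FALSE)  # TRUE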
\[ tr(\mathbf\Sigma) = \sum_{i=1}^{p}\lambda_i \]
Note

The transformation of data: geometrically, an eigenvector points in a direction that is only stretched (not rotated) by the transformation, and its eigenvalue is the factor by which that direction is stretched.
\[ |\mathbf\Sigma| = \prod_{i=1}^p \lambda_i \]
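Both identities can be checked in R with the SAT correlation matrix (a minimal sketch):

# trace equals the sum of eigenvalues; determinant equals their product
sum(diag(sat_corrmat))  # 2
sum(sat_eigen$values)   # 2
det(sat_corrmat)        # ~0.399
prod(sat_eigen$values)  # same value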
Principal Components Analysis (PCA) is a method for re-expressing the covariance (or often correlation) between a set of variables
PCA has two objectives:
Data reduction - Moving from many original variables down to a few “components”
Interpretation - Determining which original variables contribute most to the new “components”
\[ \begin{array}{c} Z_{p 1}=\mathbf{e}_{1}^{T} \mathbf{Y}=e_{11} Y_{p 1}+e_{21} Y_{p 2}+\cdots+e_{V 1} Y_{p V} \\ Z_{p 2}=\mathbf{e}_{2}^{T} \mathbf{Y}=e_{12} Y_{p 1}+e_{22} Y_{p 2}+\cdots+e_{V 2} Y_{p V} \\ \vdots \\ Z_{p V}=\mathbf{e}_{V}^{T} \mathbf{Y}=e_{1 V} Y_{p 1}+e_{2 V} Y_{p 2}+\cdots+e_{V V} Y_{p V} \end{array} \]
The components (\(Z\)) are formed by the weights of the eigenvectors of the covariance or correlation matrix of the original data
Using the eigenvalues and eigenvectors in this way means the total variance is preserved:
\[ \sum_{v=1}^{V} \operatorname{Var}\left(Z_{v}\right)=\operatorname{tr} \mathbf{\Sigma}=\sum_{v=1}^{V} \lambda_{v} \]
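Component scores can be computed by hand from the eigenvectors; a minimal sketch using the SAT objects defined above:

# components by hand: Z = (standardized data) %*% (matrix of eigenvectors)
Z = scale(dataSAT[, c("SATV", "SATM")]) %*% sat_eigen$vectors
apply(Z, 2, var)       # component variances equal the eigenvalues
sum(apply(Z, 2, var))  # total variance = p = 2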
Use the prcomp() function to conduct a PCA:
data01 <- dataSAT[, c("SATV", "SATM")]
# PCA of correlation matrix
sat_pca_corr = prcomp(x = data01, scale. = TRUE)
# show the results
sat_pca_corr
Standard deviations (1, .., p=2):
[1] 1.3323753 0.4741057
Rotation (n x k) = (2 x 2):
PC1 PC2
SATV -0.7071068 0.7071068
SATM -0.7071068 -0.7071068
Importance of components:
PC1 PC2
Standard deviation 1.3324 0.4741
Proportion of Variance 0.8876 0.1124
Cumulative Proportion 0.8876 1.0000
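Note that the standard deviations reported by prcomp() are the square roots of the eigenvalues of the correlation matrix, which we can confirm against the eigen() results from earlier:

# squared component standard deviations recover the eigenvalues
sat_pca_corr$sdev^2  # 1.7752238 0.2247762, matching sat_eigen$values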
# same analysis but with the covariance matrix (for the visual): scale. = FALSE
sat_pca_cov = prcomp(x = data01, scale. = FALSE)
#create augmented data matrix for plot
data01a = data01
data01a$type = "Raw"
data01b = data.frame(SATV = sat_pca_cov$x[,1], SATM = sat_pca_cov$x[,2], type="PC")
data01c = rbind(data01a, data01b)
plot(x = data01c$SATV, y = data01c$SATM, ylab = "SATM/PC2", xlab = "SATV/PC1", cex.main=1.5, frame.plot=FALSE, col=ifelse(data01c$type=="Raw", "red", "blue"))
legend(0, 400, pch=1, col=c("red", "blue"), c("Data", "PCs"), bty="o", box.col="darkgreen", cex=1.5)
data02 = read.csv(file="gambling_lecture12.csv",header=TRUE)
#listwise removal of missing data (common in PCA -- but still a problem)
data02a = data02[complete.cases(data02),]
dim(data02a)
[1] 1333 10
Item | Criterion | Question |
---|---|---|
GRI1 | 3 | I would like to cut back on my gambling. |
GRI3 | 6 | If I lost a lot of money gambling one day, I would be more likely to want to play again the following day. |
GRI5 | 2 | I find it necessary to gamble with larger amounts of money (than when I first gambled) for gambling to be exciting. |
GRI9 | 4 | I feel restless when I try to cut down or stop gambling. |
GRI10 | 1 | It bothers me when I have no money to gamble. |
GRI13 | 3 | I find it difficult to stop gambling. |
GRI14 | 2 | I am drawn more by the thrill of gambling than by the money I could win. |
GRI18 | 9 | My family, coworkers, or others who are close to me disapprove of my gambling. |
GRI21 | 1 | It is hard to get my mind off gambling. |
GRI23 | 5 | I gamble to improve my mood. |
#analysis of covariance matrix of gambling data items
gambling_pca_cov = prcomp(x = data02a, scale. = FALSE)
# gambling_pca_cov
summary(gambling_pca_cov)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0485 1.3229 1.0883 0.83608 0.75096 0.73365 0.68616
Proportion of Variance 0.4043 0.1686 0.1141 0.06736 0.05434 0.05186 0.04537
Cumulative Proportion 0.4043 0.5730 0.6871 0.75447 0.80881 0.86067 0.90604
PC8 PC9 PC10
Standard deviation 0.64836 0.58230 0.46438
Proportion of Variance 0.04051 0.03267 0.02078
Cumulative Proportion 0.94655 0.97922 1.00000
prop_var = t(summary(gambling_pca_cov)$importance[2:3,])
#creating a scree plot and a proportion of variance plot
par(mfrow = c(1,2))
plot(gambling_pca_cov, type="l", main = "Scree Plot of PCA Eigenvalues", lwd = 5)
matplot(prop_var, type="l", main = "Proportion of Variance Explained by Component", lwd = 5)
legend(x=5, y=.5, legend = c("Component Variance", "Cumulative Variance"), lty = 1:2, lwd=5, col=1:2)
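The component weights below are the eigenvectors that prcomp() stores in its rotation element; they can be displayed (rounded to three decimals for readability) with:

# component weights (eigenvectors) for each item
round(gambling_pca_cov$rotation, 3)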
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
X1 -0.313 0.105 0.166 -0.844 0.365 -0.090 0.090 0.002 0.028 0.040
X3 -0.226 0.078 0.178 -0.134 -0.525 0.680 0.343 -0.192 -0.015 0.026
X5 -0.320 0.094 0.229 0.026 -0.492 -0.626 -0.090 -0.395 -0.044 0.187
X9 -0.243 0.087 0.160 0.031 -0.117 -0.062 -0.153 0.152 0.045 -0.917
X10 -0.280 0.087 0.235 0.085 -0.195 -0.026 -0.093 0.842 -0.027 0.307
X13 -0.329 0.156 0.186 0.220 0.326 0.295 -0.505 -0.225 -0.524 0.101
X14 -0.449 -0.862 -0.231 0.006 -0.004 0.026 -0.037 -0.001 -0.004 0.002
X18 -0.347 0.405 -0.834 -0.028 -0.100 -0.019 0.009 0.053 -0.074 0.002
X21 -0.272 0.137 0.062 0.198 0.179 0.155 -0.250 -0.143 0.842 0.128
X23 -0.326 0.098 0.139 0.416 0.386 -0.147 0.718 -0.025 -0.071 -0.033
Factor analysis works by hypothesizing that a set of latent factors helps to determine a person’s response to a set of variables
\[ \begin{array}{c} Y_{p 1}=\mu_{y_{1}}+\lambda_{11} F_{p 1}+\lambda_{12} F_{p 2}+\cdots+\lambda_{1 Q} F_{p Q}+e_{p 1} \\ Y_{p 2}=\mu_{y_{2}}+\lambda_{21} F_{p 1}+\lambda_{22} F_{p 2}+\cdots+\lambda_{2 Q} F_{p Q}+e_{p 2} \\ \vdots \\ Y_{p V}=\mu_{y_{V}}+\lambda_{V 1} F_{p 1}+\lambda_{V 2} F_{p 2}+\cdots+\lambda_{V Q} F_{p Q}+e_{p V} \end{array} \]
\[ \mathbf{Y}_p = \boldsymbol{\mu}_Y + \boldsymbol{\Lambda} \mathbf{F}_p + \mathbf{e}_p \]
where \(\mathbf{F}_p\) is the \(Q \times 1\) vector of factor scores for person \(p\)
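To make the matrix form concrete, here is a minimal simulation sketch; the loading values, the number of items, and all other settings are hypothetical:

# simulate N responses from a 2-factor model: Y_p = mu_Y + Lambda F_p + e_p
set.seed(1)
N = 1000
Lambda = matrix(c(0.8, 0.7, 0.0, 0.0,    # loadings on factor 1 (hypothetical)
                  0.0, 0.0, 0.6, 0.9),   # loadings on factor 2 (hypothetical)
                nrow = 4)
mu_Y = rep(0, 4)                                  # item intercepts
F_mat = matrix(rnorm(N * 2), ncol = 2)            # uncorrelated factor scores
e_mat = matrix(rnorm(N * 4, sd = 0.5), ncol = 4)  # unique factors
Y = sweep(F_mat %*% t(Lambda) + e_mat, 2, mu_Y, "+")
round(cov(Y), 2)  # approximates Lambda %*% t(Lambda) + 0.25 * diag(4)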
The covariance matrix is modeled based on how it would look if a set of hypothetical (latent) factors had caused the data
For an analysis measuring \(Q\) factors, each item in the EFA has a loading on every factor, along with an intercept and a unique variance
The initial estimation of factor loadings is conducted based on the assumption of uncorrelated factors
\[ \boldsymbol{\Sigma}_{Y}=\boldsymbol{\Lambda} \mathbf{\Lambda}^{T}+\boldsymbol{\Psi} \]
such that \(\boldsymbol{\Psi}\) is a diagonal matrix of unique variances
This puts \(\frac{Q(Q-1)}{2}\) constraints on the model for \(Q\) factors (that many fewer parameters to estimate).
This constraint is not well known, and how it functions is hard to describe
Note: the other methods of EFA “extraction” avoid this constraint by not being statistical models in the first place
The EFA constraints provide some detailed assumptions about the nature of the factor model and how it pertains to the data
For example, take a 2-factor model (one constraint):
\[ \sum_{v=1}^{V} \psi_{v}^{2} \prod_{f=1}^{Q=2} \lambda_{v f}=0 \]
In short, some combinations of factor loadings and unique variances (across and within items) cannot happen
The parameters of the EFA model under ML retain the same benefits and consequences of any model (i.e., CFA)
Furthermore, the same types of model fit indices are available in EFA as are in CFA
As with CFA, though, an EFA model must be a close approximation to the saturated model covariance matrix if the parameters are to be believed
The factanal() Function

The base R program has the factanal() function, which conducts ML-based EFA

Although the function uses ML, you still cannot have missing data in the analysis because of the limitations of the factanal() function

We will also not use a rotation method at first, to show how ridiculous the default constraints in EFA with ML are

The factanal() function provides a rudimentary test of model fit
Call:
factanal(x = data02a, factors = 1, rotation = "none")
Uniquenesses:
X1 X3 X5 X9 X10 X13 X14 X18 X21 X23
0.677 0.728 0.550 0.417 0.527 0.488 0.857 0.816 0.538 0.573
Loadings:
Factor1
X1 0.569
X3 0.521
X5 0.670
X9 0.764
X10 0.688
X13 0.715
X14 0.378
X18 0.429
X21 0.680
X23 0.653
Factor1
SS loadings 3.828
Proportion Var 0.383
Test of the hypothesis that 1 factor is sufficient.
The chi square statistic is 161.14 on 35 degrees of freedom.
The p-value is 4.23e-18
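To compare solutions with different numbers of factors, we can wrap this model-fit test in a small helper function:

# helper: prints the chi-square model-fit test stored in a factanal object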
chi_sq_test <- function(mod) {
num_factor <- mod$factors
chi_stat <- round(mod$STATISTIC, 3)
dof <- mod$dof
p_value <- round(mod$PVAL, 4)
print(glue::glue("{num_factor}-factor model: The chi square statistic is {chi_stat} on {dof} degrees of freedom. The p-value is {p_value}"))
}
#one-factor model
EFA_1factor = factanal(x = data02a, factors = 1, rotation = "none")
chi_sq_test(EFA_1factor)
#two-factor model
EFA_2factor = factanal(x = data02a, factors = 2, rotation = "none")
chi_sq_test(EFA_2factor)
#three-factor model
EFA_3factor = factanal(x = data02a, factors = 3, rotation = "none")
chi_sq_test(EFA_3factor)
#four-factor model
EFA_4factor = factanal(x = data02a, factors = 4, rotation = "none")
chi_sq_test(EFA_4factor)
1-factor model: The chi square statistic is 161.138 on 35 degrees of freedom. The p-value is 0
2-factor model: The chi square statistic is 56.428 on 26 degrees of freedom. The p-value is 5e-04
3-factor model: The chi square statistic is 31.235 on 18 degrees of freedom. The p-value is 0.027
4-factor model: The chi square statistic is 10.2 on 11 degrees of freedom. The p-value is 0.5125
As the four-factor solution was the only one to show acceptable fit (a non-significant chi-square), we will interpret it
Unrotated solution of factor loadings:
Loadings:
Factor1 Factor2 Factor3 Factor4
X1 0.356 0.457 0.127 0.346
X3 0.322 0.403
X5 0.452 0.480 0.174
X9 0.468 0.630 0.196 -0.144
X10 0.458 0.502 0.186
X13 0.509 0.494
X14 0.292 0.227
X18 0.316 0.294 -0.141 0.128
X21 0.491 0.548 -0.408
X23 0.997
Factor1 Factor2 Factor3 Factor4
SS loadings 2.542 1.933 0.327 0.183
Proportion Var 0.254 0.193 0.033 0.018
Cumulative Var 0.254 0.447 0.480 0.499
Transformations of the factor loadings are possible as the matrix of factor loadings is only unique up to an orthogonal transformation
Historically, rotations use the properties of matrix algebra to adjust the factor loadings to more interpretable numbers
Modern versions of rotations/transformations rely on “target functions” that specify what a “good” solution should look like
Multiple types of rotations exist but two broad categories seem to dominate how they are discussed:
Orthogonal rotations: rotations that force the factor correlation to zero (orthogonal factors). The name orthogonal relates to the angle between axes of factor solutions being 90 degrees. The most prevalent is the varimax rotation.
Oblique rotations: rotations that allow for non-zero factor correlations. The name oblique relates to the angles between the axes of factor solutions not being 90 degrees. The most prevalent is the promax rotation.
\[ \mathbf{\Lambda}^{*}=\boldsymbol{\Lambda} \mathbf{T} \]
where: \(\mathbf {TT^T=T^TT = I}\)
\[ \begin{array}{l} \boldsymbol{\Sigma}_{Y}=\boldsymbol{\Lambda}^{*} \boldsymbol{\Lambda}^{* T}+\boldsymbol{\Psi}=\boldsymbol{\Lambda} \mathbf{T}(\boldsymbol{\Lambda} \mathbf{T})^{T}+\boldsymbol{\Psi}=\boldsymbol{\Lambda} \mathbf{T T}^{T} \boldsymbol{\Lambda}^{T}+\boldsymbol{\Psi} \\ =\boldsymbol{\Lambda} \boldsymbol{\Lambda}^{T}+\boldsymbol{\Psi} \end{array} \]
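This invariance can be confirmed numerically with the unrotated four-factor solution from earlier and the stats varimax() function (a sketch; equality holds up to numerical rounding):

# an orthogonal rotation leaves the model-implied covariance matrix unchanged
L = loadings(EFA_4factor)             # unrotated loadings (10 x 4)
L_star = varimax(L)$loadings          # varimax-rotated loadings
Psi = diag(EFA_4factor$uniquenesses)  # diagonal unique variances
all.equal(tcrossprod(L) + Psi, tcrossprod(L_star) + Psi,
          check.attributes = FALSE)   # TRUE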
Given a target function, rotation algorithms seek a rotated solution that comes as close as possible to simple structure: each item loading strongly on few factors and near zero on the rest
#varimax rotation
EFA_4factor_varimax = factanal(x = data02a, factors = 4, rotation = "varimax")
EFA_4factor_varimax$loadings
Loadings:
Factor1 Factor2 Factor3 Factor4
X1 0.301 0.209 0.114 0.569
X3 0.360 0.213 0.113 0.307
X5 0.532 0.224 0.202 0.301
X9 0.714 0.291 0.157 0.239
X10 0.597 0.231 0.202 0.234
X13 0.414 0.461 0.236 0.268
X14 0.224 0.129 0.162 0.242
X18 0.147 0.346 0.144 0.246
X21 0.300 0.746 0.184 0.166
X23 0.266 0.270 0.904 0.188
Factor1 Factor2 Factor3 Factor4
SS loadings 1.772 1.255 1.086 0.873
Proportion Var 0.177 0.125 0.109 0.087
Cumulative Var 0.177 0.303 0.411 0.499
Because varimax is an orthogonal rotation, the factor correlations are constrained to zero
Each factor explains a different set of items
#promax rotation
EFA_4factor_promax = factanal(x = data02a, factors = 4, rotation = "promax")
EFA_4factor_promax$loadings
Loadings:
Factor1 Factor2 Factor3 Factor4
X1 -0.104 0.861
X3 0.249 0.308
X5 0.530 0.171
X9 0.885
X10 0.699
X13 0.254 0.368 0.101
X14 0.274
X18 -0.126 0.314 0.269
X21 0.906 -0.106
X23 1.037
Factor1 Factor2 Factor3 Factor4
SS loadings 1.712 1.102 1.072 1.040
Proportion Var 0.171 0.110 0.107 0.104
Cumulative Var 0.171 0.281 0.389 0.493
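Because promax allows correlated factors, the implied factor correlation matrix can be recovered from the rotation matrix that factanal() stores (a common idiom; a sketch assuming the fitted object above):

# implied factor correlations: Phi = (T^T T)^{-1} for oblique rotation matrix T
T_mat = EFA_4factor_promax$rotmat
round(solve(t(T_mat) %*% T_mat), 3)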
Today we discussed the world of exploratory analysis: eigenvalues and eigenvectors, principal components analysis for data reduction, and ML-based exploratory factor analysis with orthogonal and oblique rotations