5 functions to do Principal Components Analysis in R
Principal Component Analysis (PCA ) is a multivariate technique that allows us to summarize the systematic patterns of variations in the data. From a data analysis standpoint, PCA is used for studying one table of observations and variables with the main idea of transforming the observed variables into a set of new variables, the principal components, which are uncorrelated and explain the variation in the data. For this reason, PCA allows to reduce a “complex” data set to a lower dimension in order to reveal the structures or the dominant types of variations in both the observations and the variables.
PCA in R
In R, there are several functions from different packages that allow us to perform PCA. In this post I’ll show you 5 different ways to do a PCA using the following functions (with their corresponding packages in parentheses):
In R, there are several functions from different packages that allow us to perform PCA. In this post I’ll show you 5 different ways to do a PCA using the following functions (with their corresponding packages in parentheses):
- prcomp (stats)
- princomp (stats)
- PCA (FactoMineR)
- dudi.pca (ade4)
- acp (amap)
Brief note: It is no coincidence that the three external packages (FactoMineR, ade4, and amap) have been developed by French data analysts, which have a long tradition and preference for PCA and other related exploratory techniques.
No matter what function you decide to use, the typical PCA results should consist of a set of eigenvalues, a table with the scores or Principal Components (PCs), and a table of loadings (or correlations between variables and PCs). The eigenvalues provide information of the variability in the data. The scores provide information about the structure of the observations. The loadings (or correlations) allow you to get a sense of the relationships between variables, as well as their associations with the extracted PCs.
The Data
To make things easier, we’ll use the dataset USArrests that already comes in R. It’s a data frame with 50 rows (USA states) and 4 columns containing information about violent crime rates by US State. Since most of the times the variables are measured in different scales, the PCA must be performed with standardized data (mean=0, variance=1). The good news is that all of the functions that perform PCA come with parameters to specify that the analysis must be applied on standardized data.
To make things easier, we’ll use the dataset USArrests that already comes in R. It’s a data frame with 50 rows (USA states) and 4 columns containing information about violent crime rates by US State. Since most of the times the variables are measured in different scales, the PCA must be performed with standardized data (mean=0, variance=1). The good news is that all of the functions that perform PCA come with parameters to specify that the analysis must be applied on standardized data.
Option 1: using prcomp
The function prcomp comes with the default stats package, which means that you don’t have to install anything. It is perhaps the quickest way to do a PCA if you don’t want to install other packages.
The function prcomp comes with the default stats package, which means that you don’t have to install anything. It is perhaps the quickest way to do a PCA if you don’t want to install other packages.
# PCA with function prcomp pca1 = prcomp(USArrests, scale. = TRUE) pca1$sdev # sqrt of eigenvalues pca1$rotation # loadings pca1$x # PCs (aka scores)
Option 2: using princomp
The function princomp also comes with the default stats package, and it is very similar to her cousin prcomp. What I don’t like of princomp is that sometimes it won’t display all the values for the loadings, but this is a minor detail.
The function princomp also comes with the default stats package, and it is very similar to her cousin prcomp. What I don’t like of princomp is that sometimes it won’t display all the values for the loadings, but this is a minor detail.
# PCA with function princomp pca2 = princomp(USArrests, cor=TRUE) pca2$sdev # sqrt of eigenvalues pca2$loadings # loadings pca2$scores # PCs (aka scores)
Option 3: using PCA
A highly recommended option, especially if you want more detailed results and assessing tools, is the PCA function from the package FactoMineR . It is by far the best PCA function in R and it comes with a number of parameters that allow you to tweak the analysis in a very nice way.
A highly recommended option, especially if you want more detailed results and assessing tools, is the PCA function from the package FactoMineR . It is by far the best PCA function in R and it comes with a number of parameters that allow you to tweak the analysis in a very nice way.
# PCA with function PCA require(FactoMineR) pca3 = PCA(USArrests, graph=FALSE) pca3$eig # matrix with eigenvalues pca3$var$coord # correlations between variables and PCs pca3$ind$coord # PCs (aka scores)
Option 4: using dudi.pca
Another option is to use the dudi.pca function from the package ade4 which has a huge amount of other methods as well as some interesting graphics.
Another option is to use the dudi.pca function from the package ade4 which has a huge amount of other methods as well as some interesting graphics.
# PCA with function dudi.pca require(ade4) pca4 = dudi.pca(USArrests, nf=5, scannf=FALSE) pca4$eig # eigenvalues pca4$c1 # loadings pca4$co # correlations between variables and PCs pca4$li # PCs
Option 5: using acp
A fifth possibility is the acp function from the package amap .
A fifth possibility is the acp function from the package amap .
# PCA with function acp require(amap) pca5 = acp(USArrests) pca5$sdev # sqrt of eigenvalues pca5$loadings # loadings pca5$scores # scores
Of course these are not the only options to do a PCA, but I’ll leave the other approaches for another post.
PCA plots
Everybody uses PCA to visualize the data, and most of the discussed functions come with their own plot functions. But you can also make use of the great graphical displays of ggplot2. Just to show you a couple of plots, let’s take the basic results from prcomp
Everybody uses PCA to visualize the data, and most of the discussed functions come with their own plot functions. But you can also make use of the great graphical displays of ggplot2. Just to show you a couple of plots, let’s take the basic results from prcomp
# load ggplot2 require(ggplot2) # create data frame with scores scores = as.data.frame(pca1$x) # plot of observations ggplot(data=scores, aes(x=PC1, y=PC2, label=rownames(scores))) + geom_hline(yintercept=0, colour="gray65") + geom_vline(xintercept=0, colour="gray65") + geom_text(colour="tomato", alpha=0.8, size=4) + opts(title="PCA plot of USA States - Crime Rates")
# function to create a circle circle <- 2="" aes="" and="" arrows="data.frame(x1=c(0,0,0,0)," b="" between="" center="c(0,0)," circles="" colour="gray65" coordinates="" corcir="circle(c(0,0)," correlations="as.data.frame(cor(USArrests," cos="" create="" data.frame="" data="correlations," do="" frame="" function="" geom_hline="" geom_path="" geom_segment="" geom_text="" geom_vline="" ggplot="" label="rownames(correlations)))" length="npoints)" npoints="100)" open="" pca1="" pcs="" pi="" r="" return="" sin="" style="border: 0px; font-family: inherit; font-size: inherit; font-style: inherit; font-variant: inherit; line-height: inherit; margin: 0px; padding: 0px; vertical-align: baseline;" tt="" variables="" will="" with="" x2="correlations$PC1," x="PC1," xend="x2," xintercept="0," xlim="" xx="center[1]" y1="c(0,0,0,0)," y2="correlations$PC2)" y="PC2," yend="y2)," yintercept="0," yy="center[1]">.->
No comments:
Post a Comment
Thank you