-
Ghislain Durif authoredGhislain Durif authored
- Introduction
- Loading the libraries
- loading the data
- Performing your first PCA
- Computing your own PCA
- Scaling
- Data projection
- Evaluation of the projection
- Optimal PCA projection
- PCA base
- Comparison with prcomp
- PCA evaluation
- Explained inertia
- Scree plot
- Variable contribution
- Quality of the projection
- Adding new individuals
- Comparison with SVD decomposition
# https://www.gnu.org/licenses/agpl-3.0.txt
title: "Practice: Introduction to Principal Component Analysis"
author: "Ghislain Durif, Laurent Modolo, Franck Picard"
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License
if (!require("tidyverse"))
install.packages("tidyverse")
library(tidyverse) # to manipule data and make plot
if (!require("factoextra"))
install.packages("factoextra")
library(factoextra) # manipulate pca results
if (!require("palmerpenguins"))
install.packages("palmerpenguins")
library(palmerpenguins) # we load the data
if (!require("fontawesome"))
install.packages("fontawesome")
library(fontawesome)
rm(list = ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
if("conflicted" %in% .packages()) conflicted::conflicts_prefer(dplyr::filter)
first_pc_projection_code <- function(line_slope, x, y){
a <- c(x, y)
b <- c(1, line_slope)
scaled_b <- b / c(sqrt(sum(b^2)))
c(a %*% scaled_b) * scaled_b
}
Introduction
One of the most widely used tools in big data analysis is the principal component analysis or PCA method. PCA applications are multiples, it can be used for data visualization, data exploration or as a preprocessing step to reduce the dimension of your data before applying other methods.
Loading the libraries
library(tidyverse) # to manipule data and make plot
library(factoextra) # manipulate pca results
library(palmerpenguins) # we load the data
loading the data
We are going to work on the famous Palmer penguins dataset. This dataset is an integrative study of the breeding ecology and population structure of Pygoscelis penguins along the western Antarctic Peninsula. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program (part of the US Long Term Ecological Research Network).
The palmerpenguins
data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
The palmerpenguins
library load the penguins
dataset into your R environment. If you are not familiar with tibble
, you just have to know that they are equivalent to data.frame
(but easier to work with).
penguins
We have r ncol(penguins)
variables for r nrow(penguins)
individuals:
dim(penguins)
The data is tidy:
- Each variable has its own column.
- Each observation has its own row.
- Each value must have its own cell.
Meeting these 3 criteria for your data will simplify your data processing and analysis as most of the algorithm expect this format.
summary(penguins)
Solution
We first are going to drop the missing values (`NA`)
data <- penguins %>% drop_na()
dim(data)
If you are not familiar with the %>%
operator or pipe in R: It takes the output of the function on the left and pass it as the first argument of the function on the right.
For the sake of this practical, we are going to focus on the continuous variables in the data bill_length_mm
, bill_depth_mm
, flipper_length_mm
and body_mass_g
.
The function pairs
renders scatter plots of each possible pairs of variables
data %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
pairs()
To explore the PCA algorithm we are first going to focus on 2 two continuous variable in this data set: the bill length and depth (bill_length_mm
, bill_depth_mm
) for the female penguins (sex
).
data %>%
filter(sex == "female") %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))
Solution
```{r} data_f <- data %>% filter(sex == "female") %>% # we select the row where sexe is female select(c(bill_length_mm, bill_depth_mm)) # we select the two columns ```
Performing your first PCA
The prcomp
and princomp
functions are implementations of the PCA methods
data_f_pca <- prcomp(data_f, scale = T)
You can use the str()
function to explore the data_f_pca
object.
Solution
We display the detail of the data_f_pca
object with str()
str(data_f_pca)
data_f_pca$center
data_f_pca$scale
We can compare these values with the summary()
of the data_f
table
map(data_f, mean)
map(data_f, sd)
map
apply a function to each element of a list or vector.
The package factoextra
provides us with functions to manipulate and plot the output of the prcomp
function. The most common usage of the PCA results is to display the individuals on the first factorial plan.
You can do this with the fviz_pca_ind
function.
species_f <- data %>% filter(sex == "female") %>% pull(species)
Solution
```{r} fviz_pca_ind(data_f_pca, geom = "point", col.ind = species_f ) ```
We have a rotation of the point coordinates
Solution
- 71.3 % of total variation is explained by the first principal component
- 28.7 % of total variation is explained by the second principal component
You can compute these percentages with the following formula
data_f_pca$sdev^2 / sum(data_f_pca$sdev^2)
You can reproduce the same plot by using the new individual's coordinates in the data_f_pca
x
slot.
as_tibble(data_f_pca$x) %>%
bind_cols(data %>% filter(sex == "female")) %>% # we add all the other variables to color by species
ggplot() +
geom_point(aes(x = PC1, y = PC2, color = species))
Computing your own PCA
Now that we have access to the data_f_pca
results for our two variables, you are going to try to mirror the computation of prcomp
step by step as described this morning.
Scaling
Solution
```{r} diy_data_f <- data_f %>% mutate( bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm), bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm), ) ```
Solution
- centering set the empirical mean to 0, if the data are not center the distance matrix is not the covariance matrix
- scaling put every variable on the same scale
Data projection
Then we are going to try to manually find the best line on which to project our point (the first axis or first principal component of our PCA).
In the following code, we define a slope (line_slope
) to draw a line passing through the origin.
We then compute the orthogonal projection of the points on the resulting line.
line_slope <- 2 # to change
point_projection <- function(line_slope, x, y){
results <- first_pc_projection_code(line_slope, x, y) # to replace with your code
return(list(x = results[1], y = results[2]))
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
projection_x = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$x,
projection_y = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$y
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = line_slope, color = "red") +
geom_point(aes(x = projection_x, y = projection_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = projection_x,
yend = projection_y), color = "red", linewidth = 0.1) +
coord_equal()
For this you need to:
- create a vector passing through the origin with the angle `line_slope`
- scale this vector by its length
- use the `%*%` product between the scaled vector and your point coordinate to compute the point projection
- create the vector with the right angle and right length
Solution
```{r, eval=F} first_pc_projection_code <- function(line_slope, x, y){ a <- c(x, y) b <- c(1, line_slope) scaled_b <- b / c(sqrt(sum(b^2))) c(a %*% scaled_b) * scaled_b } ```
Evaluation of the projection
We can minimize the distance from each point to the red line. Or we can maximize the distance from each point to the point of origin.
The following code is the same as before but with the
-
SS
: Sum-square of distances of the projected point to the origin -
SR
: Sum of residuals, the distances of the point to their projection
line_slope <- 0.2
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
projection_x = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$x,
projection_y = point_projection(
line_slope = line_slope,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = ,# right formula
Residuals = # right formula
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = line_slope, color = "red") +
geom_point(aes(x = projection_x, y = projection_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = projection_x,
yend = projection_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()
Solution
```{r eval=F} S_dist = projection_x^2 + projection_y^2, Residuals = sqrt((projection_x - bill_length_mm)^2 + (projection_y - bill_depth_mm)^2) ```
Optimal PCA projection
You have seen this morning that the best line on which to project the point can be found with the covariance matrix of your data.
Solution
```{r} diy_cov <- diy_data_f %>% as.matrix() %>% cov() diy_cov ```
Solution
```{r} eigen(diy_cov) ```
Then you will need to compute the slope
value for the geom_abline
function from the diy_cov
results.
point_projection <- function(diy_cov, x, y){
a <- c(x, y)
b <- eigen(diy_cov)$vector[, 1]
results <- c(a %*% b) * b
list(x = results[1], y = results[2])
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$x,
pc1_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = pc1_x^2 + pc1_y^2,
Residuals = sqrt((pc1_x - bill_length_mm)^2 + (pc1_y - bill_depth_mm)^2)
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
geom_point(aes(x = pc1_x, y = pc1_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc1_x,
yend = pc1_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()
point_projection <- function(diy_cov, x, y){
# your code
list(x = results[1], y = results[2])
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$x,
pc1_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = , # right formula
Residuals = # right formula
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = , color = "red") + # missing slope value here
geom_point(aes(x = pc1_x, y = pc1_y), color = "red") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc1_x,
yend = pc1_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()
Solution
For the projection function:
point_projection <- function(diy_cov, x, y){
a <- c(x, y)
b <- eigen(diy_cov)$vector[, 1]
results <- c(a %*% b) * b
list(x = results[1], y = results[2])
}
For the slope value:
geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
We are now going to plot the second principal component and the projection of the data point on it
point_projection <- function(diy_cov, x, y){
a <- c(x, y)
b <- eigen(diy_cov)$vector[, 2]
results <- c(a %*% b) * b
return(list(x = results[1], y = results[2]))
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc2_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$x,
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = pc2_x^2 + pc2_y^2,
Residuals = sqrt((pc2_x - bill_length_mm)^2 + (pc2_y - bill_depth_mm)^2)
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
geom_abline(slope = eigen(diy_cov)$vector[2, 2] / eigen(diy_cov)$vector[1, 2], color = "blue") +
geom_point(aes(x = pc2_x, y = pc2_y), color = "blue") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc2_x,
yend = pc2_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()
point_projection <- function(diy_cov, x, y){
# you corde
return(list(x = results[1], y = results[2]))
}
diy_pca <- diy_pca %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc2_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$x,
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm)$y,
S_dist = , # your code
Residuals = # your code
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = , color = "red") + # slope of the PC1
geom_abline(slope = , color = "blue") + # slope of the PC2
geom_point(aes(x = pc2_x, y = pc2_y), color = "blue") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc2_x,
yend = pc2_y), color = "red", linewidth = 0.1) +
labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
", SR = ", round(sum(diy_pca$Residuals), 2))) +
coord_equal()
Solution
For the projection function:
point_projection <- function(diy_cov, x, y){
a <- c(x, y)
b <- eigen(diy_cov)$vector[, 2]
results <- c(a %*% b) * b
return(list(x = results[1], y = results[2]))
}
For the slope value:
geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
geom_abline(slope = eigen(diy_cov)$vector[2, 2] / eigen(diy_cov)$vector[1, 2], color = "blue") +
For the PCA construction we want, a PC2 orthogonal to the PC1.
point_projection <- function(diy_cov, x, y, PC){
a <- c(x, y)
b <- eigen(diy_cov)$vector[, PC]
results <- c(a %*% b) * b
return(list(x = results[1], y = results[2]))
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm, 1)$x,
pc1_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm, 1)$y,
pc2_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm, 2)$x,
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm, 2)$y
)
diy_pca %>%
ggplot() +
geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
geom_abline(slope = eigen(diy_cov)$vector[2, 2] / eigen(diy_cov)$vector[1, 2], color = "blue") +
geom_point(aes(x = pc1_x, y = pc1_y), color = "red") +
geom_point(aes(x = pc2_x, y = pc2_y), color = "blue") +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc1_x,
yend = pc1_y,), color = "red", linewidth = 0.1) +
geom_segment(
aes(x = bill_length_mm,
y = bill_depth_mm,
xend = pc2_x,
yend = pc2_y,), color = "red", linewidth = 0.1) +
coord_equal()
PCA base
Now that we know how to compute our first 2 PCs, we want to represent the data point in the PCs coordinates.
point_projection <- function(diy_cov, x, y, PC){
# your code
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 1),
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 2),
)
diy_pca %>%
bind_cols(
data %>% select(-colnames(diy_pca)[1:2]) %>% filter(sex == "female")
) %>%
ggplot() +
geom_point(aes(x = pc1_x, y = pc2_y, color = species))
point_projection <- function(diy_cov, x, y, PC){
a <- c(x, y)
b <- eigen(diy_cov)$vector[, PC]
a %*% b
}
diy_pca <- diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 1),
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 2),
)
diy_pca %>%
bind_cols(
data %>% select(-colnames(diy_pca)[1:2]) %>% filter(sex == "female")
) %>%
ggplot() +
geom_point(aes(x = pc1_x, y = pc2_y, color = species))
Solution
```{r eval=F} point_projection <- function(diy_cov, x, y, PC){ a <- c(x, y) b <- eigen(diy_cov)$vector[, PC] a %*% b } ```
prcomp
Comparison with In the prcomp
output you can directly get the coordinates in PCs space from the $x
slot.
diy_data_f %>%
rowwise() %>% # perform the subsequent opperation row by row
mutate(
pc1_x = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 1),
pc2_y = point_projection(
diy_cov = diy_cov,
x = bill_length_mm,
y = bill_depth_mm,
PC = 2),
) %>%
ungroup() %>%
mutate(
pc1_x_ref = data_f_pca$x[,1],
pc2_y_ref = data_f_pca$x[,2]
) %>%
bind_cols(
data %>% select(-colnames(diy_pca)[1:2]) %>% filter(sex == "female")
) %>%
ggplot() +
geom_point(aes(x = pc1_x, y = pc2_y, color = species), alpha = 0.5) +
geom_point(aes(x = pc1_x_ref, y = pc2_y_ref, color = species), size = 0.5)
PCA evaluation
Solution
```{r} data_f_pca <- data %>% filter(sex == "female") %>% select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% prcomp(scale = TRUE) ```
species_f <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = species_f
)
Explained inertia
You have seen this morning that there are different criteria to evaluate the PCA projection.
-
SS(PC1)
: Eigenvalue for PC1 -
sqrt(SS(PC1))
: Singular value for PC1 -
SS(PC1)/(n-1)
: Variation for PC1 -
Var(PC1)/sum(Var(PCx))
: PC1 accounts for x Variation around the PCs
Solution
Using the get_eigenvalue
function
get_eigenvalue(data_f_pca)
using the data_f_pca
slot
data_f_pca$sdev^2
data %>%
filter(sex == "female") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
scale() %>%
cov() %>%
eigen()
(pc_var <- data_f_pca$sdev^2 / (ncol(data_f_pca$x) - 1))
pc_var / sum(pc_var)
Scree plot
The fviz_eig
function create a scree plot of your PCA.
fviz_eig(data_f_pca)
Solution
tibble(
pc = 1:4,
var = pc_var / sum(pc_var)
) %>%
ggplot() +
geom_point(aes(x = pc, y = var))
Variable contribution
The axis of the PCA are linear combinations of the variable axis.
Solution
It means that if `bill_depth_mm` contribute for 1 to PC1, `bill_length_mm` contribute for 0.5
As the number of variables increases, so is the complexity of the linear combinations for each PC. We can represent the variable axis in the new PCA axis, this representation is called the correlation circle.
fviz_pca_var(data_f_pca, col.var = "contrib")
Finally, we can use the fviz_pca_biplot
function to display the individuals and variable information on the same plot.
fviz_pca_biplot(
data_f_pca, geom = "point",
col.ind = (data %>% filter(sex == "female") %>% pull(species)),
)
Quality of the projection
You have seen this morning different metrics to check the quality of the representation for individuals and for variables.
The get_pca_var()
and get_pca_ind()
are two symmetrical functions to get relevant information for the individuals and the variables.
Solution
```{r} # results per variables res_var <- get_pca_var(data_f_pca) res_var$coord # coordinates res_var$contrib # contribution to the axes res_var$cos2 # quality of the representation ```
# results per individuals
res_ind <- get_pca_ind(data_f_pca)
species_f <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = (res_ind$contrib[, 3])
)
# results per individuals
res_ind <- get_pca_ind(data_f_pca)
species_f <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
geom = "point",
col.ind = (res_ind$cos2[, 3])
)
Adding new individuals
We can compute the coordinates of new individuals as follows:
- scale the new individual variables using the scaling information from the PCA
- Compute the coordinates by multipling by the eigenvalues of the principal components.
data_m_scale <- data %>%
filter(sex == "male") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
scale(
center = data_f_pca$center,
scale = data_f_pca$scale
)
coord_func <- function(ind, loadings){
r <- loadings * ind
apply(r, 2, sum)
}
data_m_pca <- t(apply(data_m_scale, 1, coord_func, data_f_pca$rotation ))
as_tibble(data_f_pca$x) %>%
bind_cols(
data %>% filter(sex == "female")
) %>%
bind_rows(
as_tibble(data_m_pca) %>%
bind_cols(
data %>% filter(sex == "male")
)
) %>%
ggplot() +
geom_point(aes(x = PC1, y = PC2, color = species, shape = sex))
Comparison with SVD decomposition
data_f_svd <- data %>%
filter(sex == "female") %>%
select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>%
mutate(
bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm),
bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm)) / sd(flipper_length_mm),
body_mass_g = (body_mass_g - mean(body_mass_g)) / sd(body_mass_g),
) %>%
svd()
The .Rmd file corresponding to this page is available here under the AGPL3 Licence