Skip to content
Snippets Groups Projects
Forked from LBMC / Hub / formations / ENS M1 ML
13 commits behind the upstream repository.
# https://www.gnu.org/licenses/agpl-3.0.txt
title: "Practice: Introduction to Principal Component Analysis"
author: "Ghislain Durif, Laurent Modolo, Franck Picard"

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License

if (!require("tidyverse"))
  install.packages("tidyverse")
library(tidyverse) # to manipule data and make plot
if (!require("factoextra"))
  install.packages("factoextra")
library(factoextra) # manipulate pca results
if (!require("palmerpenguins"))
  install.packages("palmerpenguins")
library(palmerpenguins) # we load the data
if (!require("fontawesome"))
  install.packages("fontawesome")
library(fontawesome)
rm(list = ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)

if("conflicted" %in% .packages()) conflicted::conflicts_prefer(dplyr::filter)

first_pc_projection_code <- function(line_slope, x, y){
  a <- c(x, y)
  b <- c(1, line_slope)
  scaled_b <- b / c(sqrt(sum(b^2)))
  c(a %*% scaled_b) * scaled_b
}

Introduction

One of the most widely used tools in big data analysis is the principal component analysis or PCA method. PCA applications are multiples, it can be used for data visualization, data exploration or as a preprocessing step to reduce the dimension of your data before applying other methods.

Loading the libraries

library(tidyverse) # to manipule data and make plot
library(factoextra) # manipulate pca results
library(palmerpenguins) # we load the data

loading the data

We are going to work on the famous Palmer penguins dataset. This dataset is an integrative study of the breeding ecology and population structure of Pygoscelis penguins along the western Antarctic Peninsula. These data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program (part of the US Long Term Ecological Research Network).

The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

The palmerpenguins library load the penguins dataset into your R environment. If you are not familiar with tibble, you just have to know that they are equivalent to data.frame (but easier to work with).

penguins

We have r ncol(penguins) variables for r nrow(penguins) individuals:

dim(penguins)

The data is tidy:

  • Each variable has its own column.
  • Each observation has its own row.
  • Each value must have its own cell.

Meeting these 3 criteria for your data will simplify your data processing and analysis as most of the algorithm expect this format.

summary(penguins)
What are the continuous and categorial variables in this table ? What is going to be the problem with the raw `penguins` table ?
Solution

We first are going to drop the missing values (`NA`)

data <- penguins %>% drop_na()
dim(data)

If you are not familiar with the %>% operator or pipe in R: It takes the output of the function on the left and pass it as the first argument of the function on the right.

For the sake of this practical, we are going to focus on the continuous variables in the data bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g.

The function pairs renders scatter plots of each possible pairs of variables

data %>%
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% 
  pairs()
What can you say about the covariation structure of the data ?

To explore the PCA algorithm we are first going to focus on 2 two continuous variable in this data set: the bill length and depth (bill_length_mm, bill_depth_mm) for the female penguins (sex).

data %>% 
  filter(sex == "female") %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm, color = species))
Using the `filter` and `select` functions, create a `data_f` data set that meet the `sex == "female"` condition and contains only the `c(bill_length_mm, bill_depth_mm)` variables
Solution

```{r} data_f <- data %>% filter(sex == "female") %>% # we select the row where sexe is female select(c(bill_length_mm, bill_depth_mm)) # we select the two columns ```

Performing your first PCA

The prcomp and princomp functions are implementations of the PCA methods

data_f_pca <- prcomp(data_f, scale = T)

You can use the str() function to explore the data_f_pca object.

Compare the `center` and `scale` slot with the moments of the `data_f` table
Solution

We display the detail of the data_f_pca object with str()

str(data_f_pca)
data_f_pca$center
data_f_pca$scale

We can compare these values with the summary() of the data_f table

map(data_f, mean)
map(data_f, sd)

map apply a function to each element of a list or vector.

The package factoextra provides us with functions to manipulate and plot the output of the prcomp function. The most common usage of the PCA results is to display the individuals on the first factorial plan.

You can do this with the fviz_pca_ind function.

Compare the `fviz_pca_ind` output with the `bill_length_mm` and `bill_depth_mm` scatter plot (you need the species variable)
species_f <- data %>% filter(sex == "female") %>% pull(species)
Solution

```{r} fviz_pca_ind(data_f_pca, geom = "point", col.ind = species_f ) ```

We have a rotation of the point coordinates

What are the percentages in the Dim1 and Dim2 axes ?
Solution
  • 71.3 % of total variation is explained by the first principal component
  • 28.7 % of total variation is explained by the second principal component

You can compute these percentages with the following formula

data_f_pca$sdev^2 / sum(data_f_pca$sdev^2)

You can reproduce the same plot by using the new individual's coordinates in the data_f_pca x slot.

as_tibble(data_f_pca$x) %>% 
  bind_cols(data %>% filter(sex == "female")) %>% # we add all the other variables to color by species
  ggplot() +
  geom_point(aes(x = PC1, y = PC2, color = species))

Computing your own PCA

Now that we have access to the data_f_pca results for our two variables, you are going to try to mirror the computation of prcomp step by step as described this morning.

Scaling

The first step is to scale the data. With the `mutate()` function create a `diy_pca` tibble with the scaled version of the `bill_length_mm` and `bill_depth_mm` variables.
Solution

```{r} diy_data_f <- data_f %>% mutate( bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm), bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm), ) ```

Explain the importance of the centering and scaling steps of the data
Solution

  • centering set the empirical mean to 0, if the data are not center the distance matrix is not the covariance matrix
  • scaling put every variable on the same scale

Data projection

Then we are going to try to manually find the best line on which to project our point (the first axis or first principal component of our PCA).

In the following code, we define a slope (line_slope) to draw a line passing through the origin. We then compute the orthogonal projection of the points on the resulting line.

Play with the `line_slope` to see how this projection works
line_slope <- 2 # to change

point_projection <- function(line_slope, x, y){
  results <- first_pc_projection_code(line_slope, x, y) # to replace with your code
  return(list(x = results[1], y = results[2]))
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    projection_x = point_projection(
      line_slope = line_slope,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    projection_y = point_projection(
      line_slope = line_slope,
      x = bill_length_mm,
      y = bill_depth_mm)$y
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = line_slope, color = "red") +
  geom_point(aes(x = projection_x, y = projection_y), color = "red") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = projection_x,
        yend = projection_y), color = "red", linewidth = 0.1) +
  coord_equal()
Now you are going to replace the ```{r, eval=F} results <- first_pc_projection_code(line_slope, x, y) ``` with your own code.

For this you need to:

  • create a vector passing through the origin with the angle `line_slope`
  • scale this vector by its length
  • use the `%*%` product between the scaled vector and your point coordinate to compute the point projection
  • create the vector with the right angle and right length
Solution

```{r, eval=F} first_pc_projection_code <- function(line_slope, x, y){ a <- c(x, y) b <- c(1, line_slope) scaled_b <- b / c(sqrt(sum(b^2))) c(a %*% scaled_b) * scaled_b } ```

Evaluation of the projection

We can minimize the distance from each point to the red line. Or we can maximize the distance from each point to the point of origin.

Explain why these two approaches are equivalent.

The following code is the same as before but with the

  • SS : Sum-square of distances of the projected point to the origin
  • SR : Sum of residuals, the distances of the point to their projection
Write the formula to compute the `S_dist` and `Residuals` variables in the `mutate()` function
line_slope <- 0.2 
diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    projection_x = point_projection(
      line_slope = line_slope,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    projection_y = point_projection(
      line_slope = line_slope,
      x = bill_length_mm,
      y = bill_depth_mm)$y,
    S_dist = ,# right formula
    Residuals = # right formula
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = line_slope, color = "red") +
  geom_point(aes(x = projection_x, y = projection_y), color = "red") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = projection_x,
        yend = projection_y), color = "red", linewidth = 0.1) +
  labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
                     ", SR = ", round(sum(diy_pca$Residuals), 2))) +
  coord_equal()
Solution

```{r eval=F} S_dist = projection_x^2 + projection_y^2, Residuals = sqrt((projection_x - bill_length_mm)^2 + (projection_y - bill_depth_mm)^2) ```

Optimal PCA projection

You have seen this morning that the best line on which to project the point can be found with the covariance matrix of your data.

You can use the `cov()` function to perform this computation
Solution

```{r} diy_cov <- diy_data_f %>% as.matrix() %>% cov() diy_cov ```

You can use the `eigen()` function to get the eigenvalues and vectors of the covariance matrix.
Solution

```{r} eigen(diy_cov) ```

To obtain the following figure you will need to write the body of the `point_projection` function, taking into parameters the `diy_cov` results instead of the previous `line_slope`.

Then you will need to compute the slope value for the geom_abline function from the diy_cov results.

point_projection <- function(diy_cov, x, y){
  a <- c(x, y)
  b <- eigen(diy_cov)$vector[, 1]
  results <- c(a %*% b) * b
  list(x = results[1], y = results[2])
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc1_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    pc1_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$y,
    S_dist = pc1_x^2 + pc1_y^2,
    Residuals = sqrt((pc1_x - bill_length_mm)^2 + (pc1_y - bill_depth_mm)^2)
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
  geom_point(aes(x = pc1_x, y = pc1_y), color = "red") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = pc1_x,
        yend = pc1_y), color = "red", linewidth = 0.1) +
  labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
                     ", SR = ", round(sum(diy_pca$Residuals), 2))) +
  coord_equal()
point_projection <- function(diy_cov, x, y){
  # your code
  list(x = results[1], y = results[2])
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc1_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    pc1_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$y,
    S_dist = , # right formula
    Residuals = # right formula
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = , color = "red") + # missing slope value here
  geom_point(aes(x = pc1_x, y = pc1_y), color = "red") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = pc1_x,
        yend = pc1_y), color = "red", linewidth = 0.1) +
  labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
                     ", SR = ", round(sum(diy_pca$Residuals), 2))) +
  coord_equal()
Solution

For the projection function:

point_projection <- function(diy_cov, x, y){
  a <- c(x, y)
  b <- eigen(diy_cov)$vector[, 1]
  results <- c(a %*% b) * b 
  list(x = results[1], y = results[2])
}

For the slope value:

geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
Do you have the same results as your neighbors ?

We are now going to plot the second principal component and the projection of the data point on it

Adapt your previous code to perform the computation on the PC2
point_projection <- function(diy_cov, x, y){
  a <- c(x, y)
  b <- eigen(diy_cov)$vector[, 2]
  results <- c(a %*% b) * b
  return(list(x = results[1], y = results[2]))
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc2_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    pc2_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$y,
    S_dist = pc2_x^2 + pc2_y^2,
    Residuals = sqrt((pc2_x - bill_length_mm)^2 + (pc2_y - bill_depth_mm)^2)
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
  geom_abline(slope = eigen(diy_cov)$vector[2, 2] / eigen(diy_cov)$vector[1, 2], color = "blue") +
  geom_point(aes(x = pc2_x, y = pc2_y), color = "blue") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = pc2_x,
        yend = pc2_y), color = "red", linewidth = 0.1) +
  labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
                     ", SR = ", round(sum(diy_pca$Residuals), 2))) +
  coord_equal()
point_projection <- function(diy_cov, x, y){
  # you corde
  return(list(x = results[1], y = results[2]))
}

diy_pca <- diy_pca %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc2_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$x,
    pc2_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm)$y,
    S_dist = , # your code
    Residuals = # your code
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = , color = "red") + # slope of the PC1
  geom_abline(slope = , color = "blue") + # slope of the PC2
  geom_point(aes(x = pc2_x, y = pc2_y), color = "blue") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = pc2_x,
        yend = pc2_y), color = "red", linewidth = 0.1) +
  labs(title = str_c("SS = ", round(sum(diy_pca$S_dist), 2),
                     ", SR = ", round(sum(diy_pca$Residuals), 2))) +
  coord_equal()
Solution

For the projection function:

point_projection <- function(diy_cov, x, y){
  a <- c(x, y)
  b <- eigen(diy_cov)$vector[, 2]
  results <- c(a %*% b) * b
  return(list(x = results[1], y = results[2]))
}

For the slope value:

  geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
  geom_abline(slope = eigen(diy_cov)$vector[2, 2] / eigen(diy_cov)$vector[1, 2], color = "blue") +

For the PCA construction we want, a PC2 orthogonal to the PC1.

How many lines are compatible with the orthogonality constraint for 2 variables ? For 3 variables ?
You can merge your previous computation to plot the projection on the 2 first PCs
point_projection <- function(diy_cov, x, y, PC){
  a <- c(x, y)
  b <- eigen(diy_cov)$vector[, PC]
  results <- c(a %*% b) * b
  return(list(x = results[1], y = results[2]))
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc1_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm, 1)$x,
    pc1_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm, 1)$y,
     pc2_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm, 2)$x,
    pc2_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm, 2)$y
  )

diy_pca %>% 
  ggplot() +
  geom_point(aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_abline(slope = eigen(diy_cov)$vector[2, 1] / eigen(diy_cov)$vector[1, 1], color = "red") +
  geom_abline(slope = eigen(diy_cov)$vector[2, 2] / eigen(diy_cov)$vector[1, 2], color = "blue") +
  geom_point(aes(x = pc1_x, y = pc1_y), color = "red") +
  geom_point(aes(x = pc2_x, y = pc2_y), color = "blue") +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = pc1_x,
        yend = pc1_y,), color = "red", linewidth = 0.1) +
  geom_segment(
    aes(x = bill_length_mm,
        y = bill_depth_mm,
        xend = pc2_x,
        yend = pc2_y,), color = "red", linewidth = 0.1) +
  coord_equal()

PCA base

Now that we know how to compute our first 2 PCs, we want to represent the data point in the PCs coordinates.

Modify the following code to plot your data in the PCs space.
point_projection <- function(diy_cov, x, y, PC){
  # your code
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc1_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 1),
    pc2_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 2),
  )

diy_pca %>% 
  bind_cols(
    data %>% select(-colnames(diy_pca)[1:2]) %>%  filter(sex == "female")
  ) %>% 
  ggplot() +
  geom_point(aes(x = pc1_x, y = pc2_y, color = species))
point_projection <- function(diy_cov, x, y, PC){
  a <- c(x, y)
  b <- eigen(diy_cov)$vector[, PC]
  a %*% b
}

diy_pca <- diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc1_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 1),
    pc2_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 2),
  )

diy_pca %>% 
  bind_cols(
    data %>% select(-colnames(diy_pca)[1:2]) %>%  filter(sex == "female")
  ) %>% 
  ggplot() +
  geom_point(aes(x = pc1_x, y = pc2_y, color = species))
Solution

```{r eval=F} point_projection <- function(diy_cov, x, y, PC){ a <- c(x, y) b <- eigen(diy_cov)$vector[, PC] a %*% b } ```

Comparison with prcomp

In the prcomp output you can directly get the coordinates in PCs space from the $x slot.

diy_data_f %>% 
  rowwise() %>% # perform the subsequent opperation row by row
  mutate(
    pc1_x = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 1),
    pc2_y = point_projection(
      diy_cov = diy_cov,
      x = bill_length_mm,
      y = bill_depth_mm,
      PC = 2),
  ) %>% 
  ungroup() %>% 
  mutate(
    pc1_x_ref = data_f_pca$x[,1],
    pc2_y_ref = data_f_pca$x[,2]
  ) %>% 
  bind_cols(
    data %>% select(-colnames(diy_pca)[1:2]) %>%  filter(sex == "female")
  ) %>% 
  ggplot() +
  geom_point(aes(x = pc1_x, y = pc2_y, color = species), alpha = 0.5) +
  geom_point(aes(x = pc1_x_ref, y = pc2_y_ref, color = species), size = 0.5) 
What could be the problem when comparing these two PCA results ?

PCA evaluation

Compute the PCA (`data_f_pca`) of the `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm` and `body_mass_g` for the `female` penguins
Solution

```{r} data_f_pca <- data %>% filter(sex == "female") %>% select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% prcomp(scale = TRUE) ```

species_f <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
             geom = "point",
             col.ind = species_f
             )
What are the main differences with the `bill_length_mm` and `bill_depth_mm` PCA ?

Explained inertia

You have seen this morning that there are different criteria to evaluate the PCA projection.

  • SS(PC1): Eigenvalue for PC1
  • sqrt(SS(PC1)): Singular value for PC1
  • SS(PC1)/(n-1): Variation for PC1
  • Var(PC1)/sum(Var(PCx)): PC1 accounts for x Variation around the PCs
From the `data_f_pca` try to compute these 4 criteria
Solution

Using the get_eigenvalue function

get_eigenvalue(data_f_pca)

using the data_f_pca slot

data_f_pca$sdev^2
data %>%
  filter(sex == "female") %>% 
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% 
  scale() %>% 
  cov() %>% 
  eigen()
(pc_var <- data_f_pca$sdev^2 / (ncol(data_f_pca$x) - 1))
pc_var / sum(pc_var)
Why are these metric important when analyzing PCA outputs ?

Scree plot

The fviz_eig function create a scree plot of your PCA.

fviz_eig(data_f_pca)
From your previous code, create your own scree plot
Solution

tibble(
  pc = 1:4,
  var = pc_var / sum(pc_var)
) %>% 
  ggplot() +
  geom_point(aes(x = pc, y = var))

Variable contribution

The axis of the PCA are linear combinations of the variable axis.

What does it mean to find a slope of `0.5` for PC1 in the `bill_length_mm`, `bill_depth_mm` plot ?
Solution

It means that if `bill_depth_mm` contribute for 1 to PC1, `bill_length_mm` contribute for 0.5

As the number of variables increases, so is the complexity of the linear combinations for each PC. We can represent the variable axis in the new PCA axis, this representation is called the correlation circle.

fviz_pca_var(data_f_pca, col.var = "contrib")
With the `fviz_pca_var()` function, which are the variables that contribute the most to PC1 ? To PC2 ? To both ?
Use the `str()` function to find this information in the `data_f_pca` object

Finally, we can use the fviz_pca_biplot function to display the individuals and variable information on the same plot.

fviz_pca_biplot(
  data_f_pca, geom = "point",
  col.ind = (data %>% filter(sex == "female") %>% pull(species)),
  )
Remember that we scaled all the variables before computing the PCA. What are the results of the scaling in terms of variable unit contribution ?

Quality of the projection

You have seen this morning different metrics to check the quality of the representation for individuals and for variables.

The get_pca_var() and get_pca_ind() are two symmetrical functions to get relevant information for the individuals and the variables.

Explore the results of these two functions.
What is the variable contributing the most to PC4 ?
How do you interpret the `cos2` slot
Solution

```{r} # results per variables res_var <- get_pca_var(data_f_pca) res_var$coord # coordinates res_var$contrib # contribution to the axes res_var$cos2 # quality of the representation ```

Modify the following code to display the individuals that contribute the most to PC1 and PC2. What can you say about these individuals ?
# results per individuals
res_ind <- get_pca_ind(data_f_pca)
species_f <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
             geom = "point",
             col.ind = (res_ind$contrib[, 3])
             )
With the same modifications for the following code, which species are better represented by PC1 and which by PC2 ?
# results per individuals
res_ind <- get_pca_ind(data_f_pca)
species_f <- data %>% filter(sex == "female") %>% pull(species) # we get the species variable
fviz_pca_ind(data_f_pca,
             geom = "point",
             col.ind = (res_ind$cos2[, 3])
             )

Adding new individuals

We can compute the coordinates of new individuals as follows:

  • scale the new individual variables using the scaling information from the PCA
  • Compute the coordinates by multipling by the eigenvalues of the principal components.
data_m_scale <- data %>%
  filter(sex == "male") %>%
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% 
   scale(
      center = data_f_pca$center,
      scale = data_f_pca$scale
    )
coord_func <- function(ind, loadings){
  r <- loadings * ind
  apply(r, 2, sum)
}
data_m_pca <- t(apply(data_m_scale, 1, coord_func, data_f_pca$rotation ))

as_tibble(data_f_pca$x) %>% 
  bind_cols(
    data %>% filter(sex == "female")
  ) %>% 
  bind_rows(
    as_tibble(data_m_pca) %>% 
    bind_cols(
      data %>% filter(sex == "male")
    )
  ) %>% 
  ggplot() +
  geom_point(aes(x = PC1, y = PC2, color = species, shape = sex))

Comparison with SVD decomposition

Compare the results of your `data_f_pca` to the one obtained with the following code.
data_f_svd <- data %>%
  filter(sex == "female") %>% 
  select(c(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)) %>% 
  mutate(
    bill_length_mm = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm),
    bill_depth_mm = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
    flipper_length_mm = (flipper_length_mm - mean(flipper_length_mm)) / sd(flipper_length_mm),
    body_mass_g = (body_mass_g - mean(body_mass_g)) / sd(body_mass_g),
  ) %>% 
  svd()

The .Rmd file corresponding to this page is available here under the AGPL3 Licence