Practical_a.Rmd
title: "Introduction to Principal Component Analysis"
author: "Ghislain Durif, Laurent Modolo, Franck Picard"
output:
  rmdformats::downcute:
    self_contained: true
    use_bookdown: true
    default_style: "light"
    lightbox: true
    css: "../www/style_Rmd.css"
if (!require("tidyverse"))
  install.packages("tidyverse")
library(tidyverse) # to manipulate data and make plots
if (!require("factoextra"))
  install.packages("factoextra")
library(factoextra) # to manipulate PCA results
if (!require("palmerpenguins"))
  install.packages("palmerpenguins")
library(palmerpenguins) # provides the penguins dataset
if (!require("fontawesome"))
  install.packages("fontawesome")
library(fontawesome)
rm(list = ls())
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
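# Orthogonal projection of the point (x, y) onto the line through the origin with slope line_slope
first_pc_projection_code <- function(line_slope, x, y) {
  a <- c(x, y)                   # point to project
  b <- c(1, line_slope)          # direction vector of the line
  scaled_b <- b / sqrt(sum(b^2)) # unit-length direction
  c(a %*% scaled_b) * scaled_b   # scalar projection times the unit direction
}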
if (!require("klippy")) {
  install.packages("remotes")
  remotes::install_github("rlesur/klippy")
}
klippy::klippy(
  position = c('top', 'right'),
  color = "white",
  tooltip_message = 'Click to copy',
  tooltip_success = 'Copied !')
Introduction
One of the most widely used tools in big data analysis is principal component analysis (PCA). PCA has many applications: it can be used for data visualization, for data exploration, or as a preprocessing step to reduce the dimension of your data before applying other methods.
Working with penguins, a toy dataset
Loading the libraries
library(tidyverse) # to manipulate data and make plots
library(factoextra) # to manipulate PCA results
library(palmerpenguins) # provides the penguins dataset
Loading the data
We are going to work on the well-known Palmer penguins dataset. It comes from an integrative study of the breeding ecology and population structure of Pygoscelis penguins along the western Antarctic Peninsula. The data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network.
The `palmerpenguins` data contain size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
The `palmerpenguins` library loads the `penguins` dataset into your R environment. If you are not familiar with `tibble` objects, you just need to know that they are broadly equivalent to `data.frame` objects.
penguins
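To make the relationship between `tibble` and `data.frame` concrete, here is a minimal sketch on a small made-up table. The `toy` object and its values are illustrative assumptions and are not taken from the penguins data. It suggests that a tibble can generally be handled like a regular `data.frame`:

```r
library(tibble)

# Hypothetical toy table, used only for illustration
toy <- tibble(
  species = c("A", "B", "C"),
  mass_g  = c(3750, 5000, 4200)
)

# A tibble inherits from data.frame
is.data.frame(toy)
class(toy)

# The usual data.frame accessors work on a tibble
dim(toy)
toy$mass_g

# Explicit conversion to a base data.frame, if a function requires one
toy_df <- as.data.frame(toy)
class(toy_df)
```

The same calls should work on `penguins` once `palmerpenguins` is loaded.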
We have `r ncol(penguins)` variables for `r nrow(penguins)` individuals:
dim(penguins)
The data is tidy:
- Each variable has its own column.
- Each observation has its own row.
- Each value has its own cell.
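
As a contrast with the three rules above, here is a minimal sketch of a table that is not tidy and one possible way to reshape it. The `untidy` table, its column names and its values are made up for illustration. It uses `pivot_longer()` from `tidyr`, which is part of the tidyverse:

```r
library(tidyr)

# Hypothetical untidy table: the variable "year" is hidden in the column names
untidy <- data.frame(
  island = c("X", "Y"),
  `2007` = c(10, 20),
  `2008` = c(12, 18),
  check.names = FALSE
)

# Reshape so that each variable (island, year, count) has its own column
# and each observation (one island in one year) has its own row
tidy <- pivot_longer(
  untidy,
  cols = c(`2007`, `2008`),
  names_to = "year",
  values_to = "count"
)

tidy
```

The `penguins` dataset already has this shape, so no reshaping is needed before the analysis.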