Practical_a.Rmd

    title: "Introduction to Principal Component Analysis"
    author: "Ghislain Durif, Laurent Modolo, Franck Picard"
    output:
      rmdformats::downcute:
        self_contained: true
        use_bookdown: true
        default_style: "light"
        lightbox: true
        css: "../www/style_Rmd.css"
    
    if (!require("tidyverse"))
      install.packages("tidyverse")
    library(tidyverse) # to manipulate data and make plots
    if (!require("factoextra"))
      install.packages("factoextra")
    library(factoextra) # to manipulate PCA results
    if (!require("palmerpenguins"))
      install.packages("palmerpenguins")
    library(palmerpenguins) # provides the penguins dataset
    if (!require("fontawesome"))
      install.packages("fontawesome")
    library(fontawesome)
    rm(list = ls())
    knitr::opts_chunk$set(echo = TRUE)
    knitr::opts_chunk$set(comment = NA)
    
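    # Orthogonal projection of the point (x, y) onto the line through the
    # origin with slope line_slope (x and y are expected to be scalars).
    first_pc_projection_code <- function(line_slope, x, y){
      a <- c(x, y)                    # point to project
      b <- c(1, line_slope)           # direction vector of the line
      scaled_b <- b / sqrt(sum(b^2))  # unit-length direction
      c(a %*% scaled_b) * scaled_b    # scalar projection times the direction
    }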
    if (!require("klippy")) {
      install.packages("remotes")
      remotes::install_github("rlesur/klippy")
    }
    klippy::klippy(
      position = c('top', 'right'),
      color = "white",
      tooltip_message = 'Click to copy',
      tooltip_success = 'Copied !')

    Introduction

    Principal component analysis (PCA) is one of the most widely used tools in big data analysis. It has many applications: it can be used for data visualization, for data exploration, or as a preprocessing step to reduce the dimension of your data before applying other methods.

    Working with penguins, a toy dataset

    Loading the libraries

    library(tidyverse) # to manipulate data and make plots
    library(factoextra) # to manipulate PCA results
    library(palmerpenguins) # provides the penguins dataset

    Loading the data

    We are going to work on the famous Palmer penguins dataset. This dataset comes from an integrative study of the breeding ecology and population structure of Pygoscelis penguins along the western Antarctic Peninsula. The data were collected from 2007 to 2009 by Dr. Kristen Gorman with the Palmer Station Long Term Ecological Research Program, part of the US Long Term Ecological Research Network.

    The palmerpenguins data contains size measurements for three penguin species observed on three islands in the Palmer Archipelago, Antarctica.

    The palmerpenguins library loads the penguins dataset into your R environment. The dataset is stored as a tibble. If you are not familiar with tibbles, all you need to know for now is that they can be used like a data.frame.
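As a minimal sketch of what "used like a data.frame" means in practice, the snippet below checks the class of `penguins` and shows one behavioral difference. The object names `penguins_df`, `species_tbl` and `species_vec` are introduced here for illustration only, and the tibble package is assumed to be installed (it comes with the tidyverse).

```r
library(palmerpenguins) # provides the penguins tibble
library(tibble)         # registers the tibble methods (part of the tidyverse)

# A tibble carries extra classes on top of "data.frame"
class(penguins)
is.data.frame(penguins)

# The usual data.frame accessors work unchanged
head(penguins$species, 3)
penguins[1:3, c("species", "island")]

# If a function insists on a plain data.frame, convert explicitly
penguins_df <- as.data.frame(penguins)
class(penguins_df)

# One difference: selecting a single column with [ keeps a tibble,
# whereas a plain data.frame drops down to a vector (here a factor)
species_tbl <- penguins[, "species"]
species_vec <- penguins_df[, "species"]
is.data.frame(species_tbl)
is.factor(species_vec)
```

For the rest of this practical the distinction rarely matters, since the tidyverse functions accept both kinds of object.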

    penguins

    We have `r ncol(penguins)` variables for `r nrow(penguins)` individuals:

    dim(penguins)

    The data is tidy:

    • Each variable has its own column.
    • Each observation has its own row.
    • Each value has its own cell.
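One informal way to look at this structure, offered as a sketch and not as a formal test of tidiness, is to list the type of each column and to count the empty cells. The names `col_types` and `na_per_column` are introduced here for illustration.

```r
library(palmerpenguins) # provides the penguins tibble
library(dplyr)          # for glimpse()

# One line per variable (column), with its type and first values
glimpse(penguins)

# Each column holds a single kind of value: keep the first class of each
col_types <- vapply(penguins, function(col) class(col)[1], character(1))
col_types

# A cell can still be empty: count the missing values (NA) per variable
na_per_column <- colSums(is.na(penguins))
na_per_column
```

A tidy layout says nothing about missing values, so it is worth knowing which variables contain `NA` before going further.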