TDD Project classification

Compatible with: Linux, Windows, macOS

Powered by: Rstudio, Shiny, TinyTex


This repository contains all the R scripts and example files needed to generate Random Forest models that predict Translation Independent Degradation (TID) and Translation Dependent Degradation (TDD).

This repository was tested and validated on this minimal configuration:

System: Ubuntu, CPU, RAM, SWAP, R, Rstudio

R packages: caret, grid, gridExtra, ggplot2, knitr, randomForest, R Markdown, SHAPforxgboost, Shiny, tidyverse, tinytex

A Shiny-powered R Markdown file is provided to facilitate the generation of Random Forest models.

The final PDF output contains the logs of the model tuning as well as the associated figures.

For any question or problem concerning this repository, please contact: david.cluet@ens-lyon.fr

Our complete processed database and additional filter files are available as a 25MB file (RMI2_tdd_project_databases.zip) upon request: emiliano.ricci@ens-lyon.fr

Degradation Indexes

Our reference degradation indexes are computed for the 3h time point as follows, using normalized read counts.

Absolute Degradation Fold

This metric corresponds to the total degradation fold of each transcript at 3h. It is expressed as a ratio to the 0h counts, which normalizes the differences in abundance between transcripts.

$$Absolute\ Degradation\ Fold = 1 - \frac{counts(t3h, Trip)}{counts(t0, Trip)}$$

Absolute TID index

This metric corresponds to the degradation of the transcripts when translation is blocked. It therefore reveals the contribution of the Translation Independent Degradation to the total degradation of each transcript.

$$Absolute\ TID_{index} = \frac{counts(t0, Trip) - counts(t3h, TripCHX)}{counts(t0, Trip)}$$

Absolute TDD index

The proportion of Translation Dependent Degradation for each transcript can be extracted by combining the information obtained from the two previous metrics: it is the Absolute Degradation Fold minus the Absolute TID index.

$$Absolute\ TDD_{index} = \frac{counts(t3h, TripCHX) - counts(t3h, Trip)}{counts(t0, Trip)}$$
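
The three indexes can be checked by hand on a single transcript. The sketch below is only an illustration in Python, not the code used in this repository: the function name, the argument names and the read counts are hypothetical. It also shows that the Absolute Degradation Fold is the sum of the TID and TDD indexes.

```python
def degradation_indexes(trip_t0, trip_t3h, trip_chx_t3h):
    """Return the three absolute indexes for one transcript.

    All arguments are normalized read counts (hypothetical names):
      trip_t0      -- Trip library at 0h (the reference)
      trip_t3h     -- Trip library at 3h
      trip_chx_t3h -- TripCHX library at 3h
    """
    if trip_t0 <= 0:
        raise ValueError("the 0h reference count must be positive")
    return {
        "degradation_fold": 1 - trip_t3h / trip_t0,
        "tid_index": (trip_t0 - trip_chx_t3h) / trip_t0,
        "tdd_index": (trip_chx_t3h - trip_t3h) / trip_t0,
    }


# Made-up counts, for illustration only.
indexes = degradation_indexes(trip_t0=1000, trip_t3h=400, trip_chx_t3h=700)

# The total degradation splits into its TID and TDD parts.
total = indexes["tid_index"] + indexes["tdd_index"]
assert abs(indexes["degradation_fold"] - total) < 1e-12
print(indexes)
```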

Installation Procedure

Repository

Clone this repository (in your home folder by default).

cd ~/
git clone git@gitbio.ens-lyon.fr:LBMC/RMI2/tdd_project.git

Dependencies

On Ubuntu 20.04, some libraries have to be installed on your system before installing the R packages.

To install them execute:

sudo apt update
sudo apt -y upgrade
sudo apt -y install libcurl4-openssl-dev
sudo apt -y install libxml2-dev
sudo apt -y install libssl-dev
sudo add-apt-repository -y ppa:cran/poppler
sudo apt-get update
sudo apt-get install -y libpoppler-cpp-dev

Now the R packages can be installed with Rstudio:

install.packages('rmarkdown', dependencies = TRUE)
install.packages('knitr')
install.packages('tinytex')
library(tinytex)
tinytex::install_tinytex()
install.packages('shiny')
install.packages('ggplot2')
install.packages('tidyverse')
install.packages('randomForest')
install.packages('caret')
install.packages('doParallel')
install.packages('gridExtra')
# 'grid' ships with base R and does not need to be installed
install.packages('SHAPforxgboost')

Database and filter files

These files are available upon request (david.cluet@ens-lyon.fr or emiliano.ricci@ens-lyon.fr).

  • Save the RMI2_tdd_project_databases.zip file into the src/database folder
  • Extract the files and move them directly into src/database folder

The final directory tree should be:

  • src
    • database
      • 2023-07-17_Subset_Data_processed_Merge.csv
      • filtred_genes_Lympho_Resting.csv
      • filtred_genes_Lympho_Activated.csv
      • DESeqresults_lympho_activation_untreated.csv

NOTA BENE

The filtred_genes_Lympho_Resting.csv and filtred_genes_Lympho_Activated.csv files are used to focus the machine learning process on pre-selected transcripts.

This selection was based on the gene normalized read counts of the 3h Triptolide libraries (~5000 genes in Resting and ~6000 genes in Activated T-cells). Of these, only genes with complete observations in all biological replicates (including ribosome profiling libraries) and for all transcript features used to build the model, and with at least 15% of observed degradation at 3h, were kept for further analysis.

The DESeqresults_lympho_activation_untreated.csv file is used to add the log2FoldChange to the database. This column is used as a parameter for the deltaTDD and deltaTID scores.

Because we had many experimental conditions and time points, each score name has to state "explicitly" how the score was obtained. This gives optimal tracking of which experimental data were used as input, and how (we initially had several metrics: relative, absolute, ...). I therefore generated "complex" column names that make it possible to select precisely the correct column, and that can be handled without any modification by the Python pandas library. For example, the TDD_{index} column name is initially:

Abs(TDD)>Lympho_Resting>Trip_CHX>Ref_Trip_0h>3h

This means that the Absolute TDD score has been computed for (>) the Lympho cells in the Resting status, using (>) the Trip_CHX treatment condition, with the Trip treatment condition at 0h as reference (Ref), and computed at (>) t = 3h.
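
As an illustration of this naming scheme, the sketch below splits such a column name back into its parts. It is a minimal Python example, not part of the repository: the function name and the dictionary keys are hypothetical, and it assumes that every score column has exactly five fields separated by ">".

```python
def parse_score_column(name):
    """Split a '>'-separated score column name into labelled fields."""
    fields = name.split(">")
    if len(fields) != 5:
        raise ValueError(f"expected 5 '>'-separated fields, got {len(fields)}")
    metric, cell_state, treatment, reference, time_point = fields
    # The reference field is written 'Ref_<condition>_<time>'.
    if reference.startswith("Ref_"):
        reference = reference[len("Ref_"):]
    return {
        "metric": metric,          # e.g. Abs(TDD)
        "cell_state": cell_state,  # cell type and activation status
        "treatment": treatment,    # treatment condition
        "reference": reference,    # reference condition and time
        "time_point": time_point,  # time of the measurement
    }


parsed = parse_score_column("Abs(TDD)>Lympho_Resting>Trip_CHX>Ref_Trip_0h>3h")
print(parsed)
```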

Nevertheless, R restricts the characters allowed in column names, and some characters such as ( ) and > are not accepted. Emmanuel Labaronne therefore had to change these names to perform the Random Forest computations with R. This column is now called:

Abs.TDD..Lympho_Resting.Trip_CHX.Ref_Trip_0h.3h

Once the reviewing process is over, I will change the initial Python scripts to take the downstream R limitations into account.
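
The renaming can be sketched in Python as follows. This is an assumption about the rule, not the code actually used: it replaces every character that is not a letter, a digit, a dot or an underscore with a dot, which is what R does when it makes a name syntactically valid. The function name is hypothetical, and the sketch ignores other R rules (for example names that start with a digit).

```python
import re


def r_safe_column_name(name):
    """Replace each character that R rejects in a column name with '.'."""
    return re.sub(r"[^A-Za-z0-9._]", ".", name)


original = "Abs(TDD)>Lympho_Resting>Trip_CHX>Ref_Trip_0h>3h"
converted = r_safe_column_name(original)
print(converted)
```

Applied to the example above, this gives the R column name shown in this section. Note that ")" and ">" are adjacent in the original name, which is why two consecutive dots follow "TDD".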

Generate a Random Forest model

  • Change your working directory in Rstudio by typing in the console:
setwd('~/tdd_project')
  • Open the RMI2_Random_Forest.Rmd
  • Use the menu option Knit/Knit with parameters...
  • A Shiny GUI then lets you choose the Cell type (only Lymphocyte for now), the Activation status (Resting or Activated), and the mRNA degradation index.
  • Click on the Knit button to generate the model.

Nota Bene

For some reason, after 3 executions the computations start to be really slow. I have observed that the swap seems to be saturated by Rstudio (Rhistory?), even when only 6 out of 64 GB of RAM are used. A reboot is necessary to speed the computations up again.

Flowchart of Random Forest model generation

The Random Forest model is generated using the xgbTree method, with a fine tuning of the hyperparameters (nrounds, max_depth, and colsample_bytree).

To permit reproducibility, the Training and Validation sets are generated after setting the seed to 1043.
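
The principle of a seeded split can be sketched as below, using the 80% / 20% proportions of the flowchart in this section. This is a hedged Python illustration only: the function name and the transcript identifiers are hypothetical, and Python's random generator does not reproduce the sets produced by the R script, even with the same seed.

```python
import random


def split_training_validation(items, training_fraction=0.8, seed=1043):
    """Shuffle with a fixed seed, then cut into training and validation sets."""
    rng = random.Random(seed)  # private generator, so the split is repeatable
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_training = round(len(shuffled) * training_fraction)
    return shuffled[:n_training], shuffled[n_training:]


# Made-up identifiers, for illustration only.
transcripts = [f"transcript_{i}" for i in range(10)]
training_set, validation_set = split_training_validation(transcripts)
print(len(training_set), len(validation_set))
```

Calling the function twice with the same seed returns the same two sets, which is the property the fixed seed is meant to guarantee.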

flowchart TD

  subgraph results
    model(Cell_type_Activation_Score_model.rda)
    training_file(training_set.csv)
    validation_file(validation_set.csv)
  end 

  subgraph src
    pdf(RMI2_Random_Forest.pdf)
  end 

  subgraph tuned
    nrounds[nrounds\n100, 200]
    max_depth[max_depth\n3, 5, 10, 15, 20]
    colsample_bytree[colsample_bytree\n0.5, 0.6, 0.7, 0.8,0.9]
  end

  subgraph set
    eta[eta\n0.1]
    gamma[gamma\n0]
    min_child_weight[min_child_weight\n1]
    subsample[subsample\n1]
  end

  subgraph R
    tuned
    set
    load[Load the database]
    add_columns[Add columns]
    loaded_database[Database]
    filtering[Filter validated transcripts]
    filtered_database[Filtered\nDatabase]
    sampling[Prepare Training and Validation sets]
    training_set[Training set]
    validation_set[Validation set]
    tuning[Fine grid model tuning]
    models[Ensemble of models]
    apply_best_model[Apply the best model]
    figures[Generate the figures]
  end

  subgraph database
    db('2023-07-17_Subset_Data_processed_Merge.csv')
    DESeq('DESeqresults_lympho_activation_untreated.csv')
    filter_resting('filtred_genes_Lympho_Resting.csv')
    filter_activated('filtred_genes_Lympho_Activated.csv')  
  end 

  db --> |Input| load
  load --> loaded_database
  load --> add_columns
  DESeq --> |Input| add_columns
  add_columns --> |Add| loaded_database
  add_columns --> filtering
  loaded_database --> |Input| filtering
  filter_resting --> |Input| filtering
  filter_activated --> |Input| filtering
  filtering --> |Output| filtered_database
  filtering --> sampling
  filtered_database --> |Input| sampling
  sampling --> |80%| training_set
  sampling --> |20%| validation_set
  sampling --> tuning
  training_set --> |Input| tuning
  tuned --> |Parameters combination| tuning
  set --> |Parameters| tuning
  tuning --> models
  models --> |Input| apply_best_model
  training_set --> |Input| apply_best_model
  validation_set --> |Input| apply_best_model
  apply_best_model --> |Save| training_file
  apply_best_model --> |Save| validation_file
  apply_best_model --> |Save| model
  apply_best_model --> figures
  figures --> |Save| pdf


  style db fill: #ff6600
  style filter_resting fill: #ff6600
  style filter_activated fill: #ff6600
  style DESeq fill: #ff6600

  style model fill: #99ccff
  style training_file fill: #99ccff
  style validation_file fill: #99ccff

  style training_set fill: #69b3a2
  style validation_set fill: #404080

  style nrounds fill: #ffcc00
  style max_depth fill: #ffcc00
  style colsample_bytree fill: #ffcc00

  style eta fill: #99ffcc
  style gamma fill: #99ffcc
  style min_child_weight fill: #99ffcc
  style subsample fill: #99ffcc

  style loaded_database fill: #0099cc
  style filtered_database fill: #0099cc
  
  style pdf fill: #ff9999

  style models fill: #ffffcc
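
To give an idea of the size of the tuning step, the sketch below enumerates the parameter combinations listed in the flowchart (tuned and set subgraphs). It is a Python illustration of a full grid, not the R code of this repository, and the names TUNED, FIXED and parameter_grid are hypothetical.

```python
from itertools import product

# Values copied from the "tuned" subgraph of the flowchart.
TUNED = {
    "nrounds": [100, 200],
    "max_depth": [3, 5, 10, 15, 20],
    "colsample_bytree": [0.5, 0.6, 0.7, 0.8, 0.9],
}
# Values copied from the "set" subgraph of the flowchart.
FIXED = {"eta": 0.1, "gamma": 0, "min_child_weight": 1, "subsample": 1}


def parameter_grid(tuned, fixed):
    """Return one dict per combination of tuned values, plus the fixed ones."""
    names = list(tuned)
    grid = []
    for values in product(*(tuned[name] for name in names)):
        combination = dict(zip(names, values))
        combination.update(fixed)
        grid.append(combination)
    return grid


grid = parameter_grid(TUNED, FIXED)
print(len(grid))  # 2 * 5 * 5 = 50 combinations
```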

Nota Bene

The DESeqresults_lympho_activation_untreated.csv file is used to add the log2FoldChange, which is used as a parameter for the deltaTDD and deltaTID scores.

The filtred_genes_Lympho_Resting.csv and filtred_genes_Lympho_Activated.csv files are used to filter the validated transcripts for the Resting and Activated status, respectively.

The Cell_type_Activation_Score_model.rda file can be used to apply the model to a new data set.