This repository contains all the R scripts and example files needed to generate Random Forest models that predict Translation Independent Degradation (TID) and Translation Dependent Degradation (TDD).
This repository was tested and validated on this minimal configuration:
A Shiny-powered R Markdown file is provided to facilitate the generation of Random Forest models. The final
For any information or problem concerning this repository, please contact: david.cluet@ens-lyon.fr
Our complete processed database and additional filter files are available as a 25MB file (RMI2_tdd_project_databases.zip) upon request: emiliano.ricci@ens-lyon.fr
Degradation Indexes
Our reference degradation indexes are computed for the 3h time point as follows, using normalized read counts.
Absolute Degradation Fold
This metric corresponds to the total degradation fold of each transcript at 3h. It is expressed as a ratio to normalize the differences in abundance between transcripts.
Absolute TID index
This metric corresponds to the degradation of the transcripts when translation is blocked. It therefore reveals the contribution of Translation Independent Degradation to the total degradation of each transcript.
Absolute TDD index
The proportion of Translation Dependent Degradation for each transcript can be extracted by combining the information obtained from the two previous metrics.
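As a minimal sketch of how the three indexes relate to each other, the Python snippet below computes them from normalized read counts. The formulas are an assumption inferred from the descriptions above (each index expressed as a fraction of the Trip 0h reference count, and TDD taken as the total degradation minus TID). The function and argument names are hypothetical and are not taken from the repository scripts.

```python
def degradation_indexes(trip_0h, trip_3h, trip_chx_3h):
    """Return (abs_deg_fold, abs_tid, abs_tdd) for one transcript.

    ASSUMPTION: every index is a fraction of the Trip 0h normalized count.
    trip_0h     -- normalized reads at 0h (reference)
    trip_3h     -- normalized reads at 3h, translation active (Trip)
    trip_chx_3h -- normalized reads at 3h, translation blocked (Trip_CHX)
    """
    if trip_0h <= 0:
        raise ValueError("the reference count must be positive")
    # Total degradation at 3h, relative to the starting abundance.
    abs_deg_fold = (trip_0h - trip_3h) / trip_0h
    # Degradation that still occurs when translation is blocked.
    abs_tid = (trip_0h - trip_chx_3h) / trip_0h
    # Remaining share, attributed to translation dependent degradation.
    abs_tdd = abs_deg_fold - abs_tid
    return abs_deg_fold, abs_tid, abs_tdd


# Illustrative counts only, not real data.
deg, tid, tdd = degradation_indexes(1000.0, 400.0, 700.0)
print(deg, tid, tdd)
```

With these illustrative counts, about 60% of the transcript is degraded in total, about 30% independently of translation, and the remaining 30% or so in a translation dependent manner.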
Installation Procedure
Repository
Clone this repository (in your home folder by default).
cd ~/
git clone git@gitbio.ens-lyon.fr:LBMC/RMI2/tdd_project.git
Dependencies
On Ubuntu 20.04, some libraries have to be installed on your system before installing the R packages.
To install them, execute:
sudo apt update
sudo apt -y upgrade
sudo apt -y install libcurl4-openssl-dev
sudo apt -y install libxml2-dev
sudo apt -y install libssl-dev
sudo add-apt-repository -y ppa:cran/poppler
sudo apt-get update
sudo apt-get install -y libpoppler-cpp-dev
The R packages can now be installed from RStudio:
install.packages('rmarkdown', dep = TRUE)
install.packages('knitr')
install.packages('tinytex')
library(tinytex)
tinytex::install_tinytex()
install.packages('shiny')
install.packages('ggplot2')
install.packages('tidyverse')
install.packages('randomForest')
install.packages('caret')
install.packages('doParallel')
install.packages('gridExtra')
# 'grid' ships with base R, so it does not need to be installed from CRAN
install.packages('SHAPforxgboost')
Database and filter files
These files are available upon request (david.cluet@ens-lyon.fr or emiliano.ricci@ens-lyon.fr).
- Save the RMI2_tdd_project_databases.zip file into the src/database folder.
- Extract the files and move them directly into the src/database folder.
The final directory tree should be:
- src
  - database
    - 2023-07-17_Subset_Data_processed_Merge.csv
    - filtred_genes_Lympho_Resting.csv
    - filtred_genes_Lympho_Activated.csv
    - DESeqresults_lympho_activation_untreated.csv
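As a quick sanity check before knitting, the following Python sketch lists any expected file that is absent from src/database. The file names come from the tree above. The helper name and the `~/tdd_project` location (the default clone location from the installation procedure) are assumptions made for illustration.

```python
from pathlib import Path

# File names as listed in the expected directory tree.
EXPECTED_FILES = [
    "2023-07-17_Subset_Data_processed_Merge.csv",
    "filtred_genes_Lympho_Resting.csv",
    "filtred_genes_Lympho_Activated.csv",
    "DESeqresults_lympho_activation_untreated.csv",
]


def missing_database_files(repo_root):
    """Return the expected files that are absent from <repo_root>/src/database."""
    database_dir = Path(repo_root).expanduser() / "src" / "database"
    return [name for name in EXPECTED_FILES
            if not (database_dir / name).is_file()]


if __name__ == "__main__":
    # ASSUMPTION: the repository was cloned into the home folder.
    missing = missing_database_files("~/tdd_project")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All database files found.")
```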
NOTA BENE
The filtred_genes_Lympho_Resting.csv and filtred_genes_Lympho_Activated.csv files are used to focus the machine learning process on pre-selected transcripts. This selection was based on the gene normalized read counts of the 3h Triptolide libraries (~5000 genes in Resting and ~6000 genes in Activated T-cells). Of these, only genes with complete observations in all biological replicates (including ribosome profiling libraries) and for all transcript features used to build the model, as well as at least 15% of observed degradation at 3h, were kept for further analysis.
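The selection rule described above can be sketched in Python as follows. This is only an illustration of the two criteria (complete observations and at least 15% degradation at 3h). The record layout, the field names and the example values are hypothetical and do not reflect the actual columns of the filter files.

```python
MIN_DEGRADATION_3H = 0.15  # at least 15% of observed degradation at 3h


def keep_gene(record, required_fields):
    """Return True if a gene passes both selection criteria.

    ASSUMPTION: `record` is a dict holding one value per replicate or
    transcript feature, plus a hypothetical "degradation_3h" fraction.
    A missing observation is stored as None.
    """
    degradation = record.get("degradation_3h")
    if degradation is None:
        return False
    complete = all(record.get(field) is not None for field in required_fields)
    return complete and degradation >= MIN_DEGRADATION_3H


# Made-up records, for illustration only.
genes = [
    {"gene": "A", "rep1": 10.0, "rep2": 12.0, "utr_length": 350, "degradation_3h": 0.40},
    {"gene": "B", "rep1": 8.0, "rep2": None, "utr_length": 120, "degradation_3h": 0.55},
    {"gene": "C", "rep1": 5.0, "rep2": 6.0, "utr_length": 900, "degradation_3h": 0.05},
]
required = ["rep1", "rep2", "utr_length"]
selected = [g["gene"] for g in genes if keep_gene(g, required)]
print(selected)  # ['A']
```

Gene B is dropped because one replicate is missing, and gene C because its degradation at 3h is below the threshold.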
The DESeqresults_lympho_activation_untreated.csv file is used to add the log2FoldChange to the database. This column is used as a parameter for the deltaTDD and deltaTID scores.
As we had many experimental conditions and time points, the scores needed to indicate explicitly how they were obtained, so that we could track precisely which experimental data were used as input and how (we initially had several metrics: relative, absolute, ...). I therefore generated "complex" column names that identify the correct column unambiguously and can be handled without any modification by the Python pandas library. For example, the TDD index column name is initially:
Abs(TDD)>Lympho_Resting>Trip_CHX>Ref_Trip_0h>3h
This means that the Absolute TDD score was computed for (>) the Lympho cell type in the Resting status, using (>) the Trip_CHX treatment condition, with the Trip treatment condition at 0h as the reference (Ref), and at (>) t = 3h.
However, due to restrictions on column names in R, some characters such as () and > are not allowed. Emmanuel Labaronne therefore had to change these names to perform the Random Forest computations with R. This column is now called:
Abs.TDD..Lympho_Resting.Trip_CHX.Ref_Trip_0h.3h
Once the review process is over, I will change the initial Python scripts to take the downstream R limitations into account.
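The renaming described above can be reproduced with a short Python sketch. It assumes that every character outside letters, digits, dots and underscores is replaced by a dot, which is similar in spirit to what R's `make.names()` does. Whether the repository actually relies on that function is not stated here, and both helper names are hypothetical.

```python
import re


def to_r_column_name(name):
    """Replace every character outside [A-Za-z0-9._] with a dot."""
    return re.sub(r"[^A-Za-z0-9._]", ".", name)


def parse_score_column(name):
    """Split a '>'-separated score column name into its labelled parts."""
    score, cell_state, treatment, reference, time_point = name.split(">")
    cell_type, status = cell_state.split("_", 1)
    if reference.startswith("Ref_"):
        reference = reference[len("Ref_"):]
    return {
        "score": score,
        "cell_type": cell_type,
        "status": status,
        "treatment": treatment,
        "reference": reference,
        "time": time_point,
    }


python_name = "Abs(TDD)>Lympho_Resting>Trip_CHX>Ref_Trip_0h>3h"
print(to_r_column_name(python_name))
# Abs.TDD..Lympho_Resting.Trip_CHX.Ref_Trip_0h.3h
print(parse_score_column(python_name)["treatment"])
# Trip_CHX
```

Note that the conversion is one-way: once `(`, `)` and `>` have all become dots, the original separators can no longer be told apart, so parsing is best done on the original Python-style name.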
Generate a Random Forest model
- Change your working directory in RStudio by typing in the console:
setwd('~/tdd_project')
- Open the RMI2_Random_Forest.Rmd file.
- Use the menu option Knit/Knit with parameters...
- A Shiny GUI then lets you choose the Cell type (only Lymphocyte for now), the Activation status (Resting or Activated), and the mRNA degradation index.
- Click on the Knit button to generate the model.
Nota Bene
For some reason, after 3 executions the computations start to become really slow. I have observed that the swap seems to be saturated by RStudio (Rhistory?), even when only 6 out of 64GB of RAM are used. A reboot is necessary to speed the computations up again.
Flowchart of Random Forest model generation
The Random Forest model is generated using the xgbTree method, with fine tuning of the hyper parameters (nrounds, max_depth, and colsample_bytree).
To ensure reproducibility, the Training and Validation sets are generated after setting the seed to 1043.
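To make the size of the search explicit, the Python sketch below enumerates the tuning grid and performs a seeded 80%/20% split. The parameter values and the split proportions are those listed in the flowchart of this section. The function names are hypothetical, and because Python's random generator differs from R's, seed 1043 will not reproduce the exact split produced by the R Markdown file.

```python
import itertools
import random

# Tuned hyper parameters and their candidate values (from the flowchart).
TUNED = {
    "nrounds": [100, 200],
    "max_depth": [3, 5, 10, 15, 20],
    "colsample_bytree": [0.5, 0.6, 0.7, 0.8, 0.9],
}
# Hyper parameters kept at a single fixed value (from the flowchart).
FIXED = {"eta": 0.1, "gamma": 0, "min_child_weight": 1, "subsample": 1}


def parameter_grid():
    """Return one dict per combination of tuned values, plus the fixed ones."""
    names = list(TUNED)
    return [dict(zip(names, values), **FIXED)
            for values in itertools.product(*TUNED.values())]


def split_rows(rows, train_fraction=0.8, seed=1043):
    """Shuffle with a fixed seed, then cut into training and validation sets."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


grid = parameter_grid()
print(len(grid))  # 50 combinations: 2 x 5 x 5

training, validation = split_rows(range(100))
print(len(training), len(validation))  # 80 20
```

Each of the 50 combinations corresponds to one candidate model in the "Ensemble of models" step, from which the best one is then applied to the training and validation sets.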
flowchart TD
subgraph results
model(Cell_type_Activation_Score_model.rda)
training_file(training_set.csv)
validation_file(validation_set.csv)
end
subgraph src
pdf(RMI2_Random_Forest.pdf)
end
subgraph tuned
nrounds[nrounds\n100, 200]
max_depth[max_depth\n3, 5, 10, 15, 20]
colsample_bytree[colsample_bytree\n0.5, 0.6, 0.7, 0.8,0.9]
end
subgraph set
eta[eta\n0.1]
gamma[gamma\n0]
min_child_weight[min_child_weight\n1]
subsample[subsample\n1]
end
subgraph R
tuned
set
load[Load the database]
add_columns[Add columns]
loaded_database[Database]
filtering[Filter validated transcripts]
filtered_database[Filtered\nDatabase]
sampling[Prepare Training and Validation sets]
training_set[Training set]
validation_set[Validation set]
tuning[Fine grid model tuning]
models[Ensemble of models]
apply_best_model[Apply the best model]
figures[Generate the figures]
end
subgraph database
db('2023-07-17_Subset_Data_processed_Merge.csv')
DESeq('DESeqresults_lympho_activation_untreated.csv')
filter_resting('filtred_genes_Lympho_Resting.csv')
filter_activated('filtred_genes_Lympho_Activated.csv')
end
db --> |Input| load
load --> loaded_database
load --> add_columns
DESeq --> |Input| add_columns
add_columns --> |Add| loaded_database
add_columns --> filtering
loaded_database --> |Input| filtering
filter_resting --> |Input| filtering
filter_activated --> |Input| filtering
filtering --> |Output| filtered_database
filtering --> sampling
filtered_database --> |Input| sampling
sampling --> |80%| training_set
sampling --> |20%| validation_set
sampling --> tuning
training_set --> |Input| tuning
tuned --> |Parameters combination| tuning
set --> |Parameters| tuning
tuning --> models
models --> |Input| apply_best_model
training_set --> |Input| apply_best_model
validation_set --> |Input| apply_best_model
apply_best_model --> |Save| training_file
apply_best_model --> |Save| validation_file
apply_best_model --> |Save| model
apply_best_model --> figures
figures --> |Save| pdf
style db fill: #ff6600
style filter_resting fill: #ff6600
style filter_activated fill: #ff6600
style DESeq fill: #ff6600
style model fill: #99ccff
style training_file fill: #99ccff
style validation_file fill: #99ccff
style training_set fill: #69b3a2
style validation_set fill: #404080
style nrounds fill: #ffcc00
style max_depth fill: #ffcc00
style colsample_bytree fill: #ffcc00
style eta fill: #99ffcc
style gamma fill: #99ffcc
style min_child_weight fill: #99ffcc
style subsample fill: #99ffcc
style loaded_database fill: #0099cc
style filtered_database fill: #0099cc
style pdf fill: #ff9999
style models fill: #ffffcc
Nota Bene
The DESeqresults_lympho_activation_untreated.csv file is used to add the log2FoldChange, which is used as a parameter for the deltaTDD and deltaTID scores.
The filtred_genes_Lympho_Resting.csv and filtred_genes_Lympho_Activated.csv files are used to filter the validated transcripts for the Resting and Activated statuses, respectively.
The Cell_type_Activation_Score_model.rda file can be used to apply the model to a new set.