Newer
Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
---
title: "RNAseq analysis"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{RNAseq analysis}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
```{r setup}
library(HTRfit)
```
In RNAseq, we employ Generalized Linear Models (GLM) to unravel how genes respond to various experimental conditions. These models assist in deciphering the specific impacts of experimental variables on gene expression.HTRfit can be utilized to analyze such RNAseq data, providing a robust framework for exploring and interpreting the intricate relationships between genes and experimental conditions.
## Input data
HTRfit analysis necessitates a count matrix and sample metadata, in the form of dataframes.
Notice that gene_id have to be specified as rownames of `count_matrix`.
In this example we use **HTRfit** to analyse a common RNAseq data with 2 genotypes and 2 environments.
```{r create_data , include = FALSE}
## -- hided in vignette
## -- simulate small example to prevent excessively lengthy vignette construction
list_var <- init_variable( name = "genotype", sd = 0.2, level = 2) %>%
init_variable( name = "environment", sd = 0.43, level = 2) %>%
add_interaction( between_var = c("genotype", "environment"), sd = 0.2)
N_GENES = 100
MIN_REPLICATES = 4
MAX_REPLICATES = 4
BASAL_EXPR = 3
mock_data <- mock_rnaseq(list_var, N_GENES,
min_replicates = MIN_REPLICATES,
max_replicates = MAX_REPLICATES,
basal_expression = BASAL_EXPR)
########################
## -- data from simulation or real data
count_matrix <- mock_data$counts
metaData <- mock_data$metadata
##############################
```
```{r display_input }
## -- gene count matrix
count_matrix[1:4, 1:2]
## -- samples metadata
head(metaData)
```
## Prepare data for fitting
The `prepareData2fit()` function serves the purpose of converting the counts matrix and sample metadata into a dataframe that is compatible with downstream **HTRfit** functions designed for model fitting. This function also includes an option to perform median ratio normalization and transformation on the data counts.
```{r example-prepareData, warning = FALSE, message = FALSE}
## -- convert counts matrix and samples metadatas in a data frame for fitting
data2fit = prepareData2fit(
countMatrix = count_matrix,
metadata = metaData,
normalization = NULL,
response_name = "kij")
## -- median ratio normalization
data2fit = prepareData2fit(
countMatrix = count_matrix,
metadata = metaData,
normalization = 'MRN',
response_name = "kij")
## -- apply + 1 transformation on each counts
data2fit = prepareData2fit(
countMatrix = count_matrix,
metadata = metaData,
normalization = 'MRN',
transform = "x+1",
response_name = "kij")
```
## Fit model from your data
The `fitModelParallel()` function enables independent model fitting for each gene. The number of threads used for this process can be controlled by the `n.cores` parameter.
```{r example-fitModelParallel, warning = FALSE, message = FALSE}
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1)
```
## Use mixed effect in your model
HTRfit uses the **glmmTMB** functions for model fitting algorithms. This choice allows for the utilization of random effects within your formula design. For further details on how to specify your model, please refer to the [mixed model documentation](https://rdrr.io/cran/glmmTMB/man/glmmTMBControl.html).
```{r example-fitModelParallel_mixed, warning = FALSE, message = FALSE}
l_tmb <- fitModelParallel(
formula = kij ~ genotype + ( 1 | environment ),
data = data2fit,
group_by = "geneID",
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1)
```
## Additional settings
The function provides precise control over model settings for fitting optimization, including options for specifying the [model family](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/family) and [model control setting](https://rdrr.io/cran/glmmTMB/man/glmmTMBControl.html). By default, a Gaussian family model is fitted, but for RNA-seq data, it is highly recommended to specify `family = glmmTMB::nbinom2(link = "log")`.
```{r example-fitModelParallel_addSet, warning = FALSE, message = FALSE}
l_tmb <- fitModelParallel(
formula = kij ~ genotype + environment + genotype:environment,
data = data2fit,
group_by = "geneID",
n.cores = 1,
family = glmmTMB::nbinom2(link = "log"),
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e5,
eval.max=1e5)))
```
## Extracts a tidy result table from a list tmb object
The tidy_results function extracts a data frame containing estimates of ln(fold changes), standard errors, test statistics, p-values, and adjusted p-values for fixed effects. Additionally, it provides access to correlation terms and standard deviations for random effects, offering a detailed view of HTRfit modeling results.
```{r example-tidyRes, warning = FALSE, message = FALSE}
## -- get tidy results
my_tidy_res <- tidy_results(l_tmb, coeff_threshold = 0.1,
alternative_hypothesis = "greaterAbs")
## -- head
my_tidy_res[1:3,]
```
## Update fit
The `updateParallel()` function updates and re-fits a model for each gene. It offers options similar to those in `fitModelParallel()`. In addition, it is possible to modify the reference level of the categorical variable used in your model in order to use different contrast.
```{r example-update, warning = FALSE, message = FALSE}
## -- update your fit modifying the model family
l_tmb <- updateParallel(
formula = kij ~ genotype + environment + genotype:environment,
list_tmb = l_tmb ,
family = gaussian(),
n.cores = 1)
## -- update fit using additional model control settings
l_tmb <- updateParallel(
formula = kij ~ genotype + environment + genotype:environment ,
list_tmb = l_tmb ,
family = gaussian(),
n.cores = 1,
control = glmmTMB::glmmTMBControl(optCtrl=list(iter.max=1e3,
eval.max=1e3)))
## -- update your model formula and your family model
l_tmb <- updateParallel(
formula = kij ~ genotype + environment ,
list_tmb = l_tmb ,
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1)
## -- modif reference levels
## -- genotype reference = "genotype2"
## -- environment reference = "environment2"
l_tmb <- updateParallel(
formula = kij ~ genotype + environment ,
list_tmb = l_tmb ,
family = glmmTMB::nbinom2(link = "log"),
n.cores = 1,
reference_labels = c("genotype2", "environment2"))
```
#### Struture of list tmb object
```{r example-str_obj_l_tmb, warning = FALSE, message = FALSE}
str(l_tmb$gene1, max.level = 1)
```
## Plot fit metrics
Visualizing fit metrics is essential for evaluating your models. Here, we show you how to generate various plots to assess the quality of your models. You can explore all metrics or focus on specific aspects like dispersion and log-likelihood.
```{r example-plotMetrics, warning = FALSE, message = FALSE, fig.align = 'center', fig.height = 4, fig.width = 8}
## -- plot all metrics
diagnostic_plot(list_tmb = l_tmb)
```
```{r example-plotMetricsFocus, warning = FALSE, message = FALSE, fig.align = 'center', fig.height = 3, fig.width = 8}
## -- Focus on metrics
diagnostic_plot(list_tmb = l_tmb, focus = c("AIC"))
```
## Select best fit based on diagnostic metrics
The identifyTopFit function allows for selecting genes that best fit models, based on various metrics such as AIC, BIC, LogLik, deviance, dispersion . This function provides the flexibility to employ multiple filtering methods to identify genes most relevant for further analysis. We implemented a filtering method based on the Median Absolute Deviation (MAD). For example, using the MAD method with a tolerance threshold of 3, the function identifies genes whose metric values are greater ("top") or lower ("low") than `median(metric) - 3 * MAD(metric)`. The keep option allows choosing whether to retain genes with metric values above or below the MAD threshold.
```{r example-identifytopfit, warning = FALSE, message = TRUE}
identifyTopFit(list_tmb = l_tmb, metric = "AIC",
filter_method = "mad", keep = "top",
sort = TRUE, decreasing = TRUE,
mad_tolerance = 3)
```
## Anova to select the best model
Utilizing the `anovaParallel()` function enables you to perform model selection by assessing the significance of the fixed effects. You can also include additional parameters like type. For more details, refer to [car::Anova](https://rdrr.io/cran/car/man/Anova.html).
```{r example-anova, warning = FALSE, message = FALSE}
## -- update your fit modifying the model family
l_anova <- anovaParallel(list_tmb = l_tmb)
## -- additional settings
l_anova <- anovaParallel(list_tmb = l_tmb, type = "III" )
```