# Contributing

When contributing to this repository, please first discuss the change you wish to make via issues, email, or on the [ENS-Bioinfo channel](https://matrix.to/#/#ens-bioinfo:matrix.org) before making a change.

## Forking

In git, the [action of forking](https://git-scm.com/book/en/v2/GitHub-Contributing-to-a-Project) means that you are going to make your own private copy of a repository. You can then modify your copy and, if your changes are of interest to the source repository (here [LBMC/nextflow](https://gitbio.ens-lyon.fr/LBMC/nextflow)), create a merge request. Merge requests are sent to the source repository to ask its maintainers to integrate your modifications.

![merge request button](./doc/img/merge_request.png)

## Project organization

The `LBMC/nextflow` project is structured as follows (see the sketch after the list):
- all the code is in the `src/` folder
- scripts downloading external tools should download them into the `bin/` folder
- all the documentation (including this file) can be found in the `doc/` folder
- the `data` and `results` folders contain the data and results of your pipelines and are ignored by `git`
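
Schematically (a sketch of the layout; the repository root name is illustrative):

```
nextflow/
├── src/      # all the code
├── bin/      # external tools downloaded by scripts
├── doc/      # documentation (including this file)
├── data/     # data of your pipelines (ignored by git)
└── results/  # results of your pipelines (ignored by git)
```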

## Code structure

The `src/` folder is where we want to save the pipeline (`.nf`) scripts. This folder also contains:
- the `src/install_nextflow.sh` script to install the nextflow executable at the root of the project
- some pipeline examples (like the one built during the nf_practical session)
- the `src/nextflow.config` global configuration file, which contains the `docker`, `singularity`, `psmn` and `ccin2p3` profiles
- the `src/nf_modules` folder, which contains per-tool `main.nf` modules with predefined processes that users can import into their projects with the [DSL2](https://www.nextflow.io/docs/latest/dsl2.html)

It also contains some hidden folders that users don't need to see when building their pipelines:
- the `src/.docker_modules` folder, which contains the recipes for the `docker` containers used in the `src/nf_modules/<tool_name>/main.nf` files
- the `src/.singularity_in2p3` and `src/.singularity_psmn` folders, which are symbolic links to the shared folders where the singularity images are downloaded on the CCIN2P3 and the PSMN

# Proposing a new tool

Each tool named `<tool_name>` must have two dedicated folders (shown below for `fastp`):
- [`src/nf_modules/<tool_name>`](./src/nf_modules/fastp/) where users can find `.nf` files to include
- [`src/.docker_modules/<tool_name>/<version_number>`](./src/.docker_modules/fastp/0.20.1/) where we have the [`Dockerfile`](./src/.docker_modules/fastp/0.20.1/Dockerfile) to construct the container used in the `main.nf` file
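
For `fastp` version `0.20.1`, this corresponds to the following files (paths taken from the examples in this guide):

```
src/nf_modules/fastp/main.nf
src/.docker_modules/fastp/0.20.1/Dockerfile
src/.docker_modules/fastp/0.20.1/docker_init.sh
```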

## `src/nf_modules` guidelines

We are going to take the [`fastp` module](./src/nf_modules/fastp/) as an example.

The [`src/nf_modules/<tool_name>`](./src/nf_modules/fastp/) folder should contain a [`main.nf`](./src/nf_modules/fastp/main.nf) file that describes at least one process using `<tool_name>`.

### container information

The first two lines of [`main.nf`](./src/nf_modules/fastp/main.nf) should define two variables:
```Groovy
version = "0.20.1"
container_url = "lbmc/fastp:${version}"
```

We can then use the `container_url` definition in the `container` directive of each `process`.
In addition to the `container` directive, each `process` should have one of the following `label` attributes (defined in the `src/nextflow.config` file):
- `big_mem_mono_cpus`
- `big_mem_multi_cpus`
- `small_mem_mono_cpus`
- `small_mem_multi_cpus`

```Groovy
process fastp {
  container = "${container_url}"
  label "big_mem_multi_cpus"
  ...
}
```

### process options

Before each process, you should declare at least two `params.` variables:
- A `params.<process_name>` variable defaulting to `""` (empty string), to allow users to add more command-line options to your process without rewriting the process definition
- A `params.<process_name>_out` variable defaulting to `""` (empty string), that defines the `results/` subfolder where the process output should be copied if the user wants to save it

```Groovy
params.fastp = ""
params.fastp_out = ""
process fastp {
  container = "${container_url}"
  label "big_mem_multi_cpus"
  if (params.fastp_out != "") {
    publishDir "results/${params.fastp_out}", mode: 'copy'
  }
  ...
  script:
"""
fastp --thread ${task.cpus} \
${params.fastp} \
...
"""
}
```

The user can then change the value of these variables (example below):
- from the command line: `--fastp "--trim_head1=10"`
- with the `include` command within their pipeline: `include { fastp } from "./nf_modules/fastp/main" addParams(fastp_out: "QC/fastp/")`
- by defining the variable within their pipeline: `params.fastp_out = "QC/fastp/"`
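
For example, a minimal pipeline using the `include` option could look like the following sketch (the module path and the `params.fastq` glob are illustrative, not part of the module itself):

```Groovy
nextflow.enable.dsl = 2

// import the fastp process and save its outputs in results/QC/fastp/
include { fastp } from "./nf_modules/fastp/main" addParams(fastp_out: "QC/fastp/")

workflow {
  channel
    .fromFilePairs( params.fastq, size: -1 )
    .set { fastq_files }
  fastp(fastq_files)
}
```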

### `input` and `output` format

You should always use `tuple` for the input and output channel formats, with at least:
- a `val` containing variable(s) related to the item
- a `path` for the file(s) that you want to process

for example:

```Groovy
process fastp {
  container = "${container_url}"
  label "big_mem_multi_cpus"
  tag "$file_id"
  if (params.fastp_out != "") {
    publishDir "results/${params.fastp_out}", mode: 'copy'
  }

  input:
  tuple val(file_id), path(reads)

  output:
    tuple val(file_id), path("*.fastq.gz"), emit: fastq
    tuple val(file_id), path("*.html"), emit: html
    tuple val(file_id), path("*.json"), emit: report
...
```

Here `file_id` can be anything from a simple identifier to a List of several variables.
In the latter case, the first item of the List should be usable as a file prefix.
You have to keep that in mind if you want to use it to define output file names (you can test for this with `file_id instanceof List`).
In some cases, `file_id` may be a Map, which gives cleaner access to its content through explicit keywords.
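
For example (with hypothetical values), items with a List or a Map `file_id` could look like:

```Groovy
// a List file_id: the first element serves as the file prefix
[["sample_1", "wt"], [read_1_file, read_2_file]]
// a Map file_id: the content is accessed by explicit keywords
[[id: "sample_1", group: "wt"], [read_1_file, read_2_file]]
```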

If you want to use information within the `file_id` to name outputs in your `script` section, you can use the following snippet:

```Groovy
  script:
  switch(file_id) {
    case {it instanceof List}:
      file_prefix = file_id[0]
      break
    case {it instanceof Map}:
      file_prefix = file_id.values()[0]
      break
    default:
      file_prefix = file_id
      break
  }
```

You can then use the `file_prefix` variable in the rest of your `script` block.

This also means that channels emitting `path` items should be transformed with at least the following `map` function:

```Groovy
.map { it -> [it.simpleName, it]}
```

for example:

```Groovy
channel
  .fromPath( params.fasta )
  .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
  .map { it -> [it.simpleName, it]}
  .set { fasta_files }
```


The rationale behind taking a `file_id` and emitting the same `file_id` is to facilitate complex channel operations in pipelines without having to rewrite the `process` blocks.
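
For example (a hypothetical sketch), two output channels of the `fastp` process defined above can be re-associated by their common `file_id` with the [`join`](https://www.nextflow.io/docs/latest/operator.html#join) operator:

```Groovy
fastp.out.fastq
  .join(fastp.out.report) // join matches items on their first element, the file_id
  .set { fastq_with_report }
```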

### dealing with paired-end and single-end data

When opening fastq files with `channel.fromFilePairs( params.fastq )`, items in the channel have the following shape:

```Groovy
[file_id, [read_1_file, read_2_file]]
```

To make this call more generic, we can use the `size: -1` option to accept an arbitrary number of associated fastq files:

```Groovy
channel.fromFilePairs( params.fastq, size: -1 )
```

This will thus give `[file_id, [read_1_file, read_2_file]]` for paired-end data and `[file_id, [read_1_file]]` for single-end data.

You can then use tests on `reads.size()` to define conditional `script` blocks:

```Groovy
...
  script:
  if (file_id instanceof List){
    file_prefix = file_id[0]
  } else {
    file_prefix = file_id
  }
  if (reads.size() == 2)
  """
  fastp --thread ${task.cpus} \
    ${params.fastp} \
    --in1 ${reads[0]} \
    --in2 ${reads[1]} \
    --out1 ${file_prefix}_R1_trim.fastq.gz \
    --out2 ${file_prefix}_R2_trim.fastq.gz \
    --html ${file_prefix}.html \
    --json ${file_prefix}_fastp.json \
    --report_title ${file_prefix}
  """
  else
  """
  fastp --thread ${task.cpus} \
    ${params.fastp} \
    --in1 ${reads[0]} \
    --out1 ${file_prefix}_trim.fastq.gz \
    --html ${file_prefix}.html \
    --json ${file_prefix}_fastp.json \
    --report_title ${file_prefix}
  """
...
```

### Complex processes

Sometimes you want to write complex processes; for example, for `fastp` we want to have predefined `fastp` processes for different protocols, orders of adapter trimming, and read clipping.
We can then use the fact that a `process` and a named `workflow` can be imported interchangeably with the [DSL2](https://www.nextflow.io/docs/latest/dsl2.html#workflow-composition).

With the following example, the user can simply include the `fastp` step without knowing that it's a named `workflow` instead of a `process`.
By specifying `params.fastp_protocol`, the `fastp` step will transparently switch between the different `fastp` `process`es.
Here these are `fastp_default` and `fastp_accel_1splus`; other protocols can be added later, and pipelines will be able to handle them by simply updating from the `upstream` repository, without changing their code.

```Groovy
params.fastp_protocol = ""
workflow fastp {
  take:
    fastq

  main:
  switch(params.fastp_protocol) {
    case "accel_1splus":
      fastp_accel_1splus(fastq)
      fastp_accel_1splus.out.fastq.set{res_fastq}
      fastp_accel_1splus.out.report.set{res_report}
    break;
    default:
      fastp_default(fastq)
      fastp_default.out.fastq.set{res_fastq}
      fastp_default.out.report.set{res_report}
    break;
  }
  emit:
    fastq = res_fastq
    report = res_report
}
```
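
A pipeline can then select the protocol when including the `fastp` step (the module path is illustrative):

```Groovy
include { fastp } from "./nf_modules/fastp/main" addParams(fastp_protocol: "accel_1splus")
```

or from the command line with `--fastp_protocol accel_1splus`.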

## `src/.docker_modules` guidelines

We are going to take the [`fastp` `.docker_modules`](./src/.docker_modules/fastp/0.20.1/) as an example.

The [`src/.docker_modules/<tool_name>/<version_number>`](./src/.docker_modules/fastp/0.20.1/) folder should contain a [`Dockerfile`](./src/.docker_modules/fastp/0.20.1/Dockerfile) and a [`docker_init.sh`](./src/.docker_modules/fastp/0.20.1/docker_init.sh) script.

### `Dockerfile`

The [`Dockerfile`](./src/.docker_modules/fastp/0.20.1/Dockerfile) should contain a `docker` recipe to build an image with `<tool_name>` installed in a system-wide binary folder (`/bin`, `/usr/local/bin/`, etc.).
This way, the tool is easily accessible from within the container.

This recipe should have (see the sketch after the list):

- an easily changeable `<version_number>` to be able to update the corresponding image to a newer version of the tool
- the `ps` executable (package `procps` in debian)
- a default `bash` command (`CMD ["bash"]`)
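
A hypothetical sketch of such a recipe (the base image, the `tool` name, and the download URL are placeholders, not the actual `lbmc/fastp` recipe):

```Dockerfile
FROM debian:bullseye-slim

# easily changeable version number
ENV TOOL_VERSION=0.20.1

# install the ps executable (procps) and the tool in a system-wide binary folder
RUN apt-get update \
  && apt-get install -y procps wget \
  && wget -O /usr/local/bin/tool "https://example.com/tool-${TOOL_VERSION}" \
  && chmod +x /usr/local/bin/tool \
  && rm -rf /var/lib/apt/lists/*

# default bash command
CMD ["bash"]
```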

### `docker_init.sh`

The [`docker_init.sh`](./src/.docker_modules/fastp/0.20.1/docker_init.sh) script is a small `sh` script with the following content:

```sh
#!/bin/sh
docker pull lbmc/fastp:0.20.1
docker build src/.docker_modules/fastp/0.20.1 -t 'lbmc/fastp:0.20.1'
docker push lbmc/fastp:0.20.1
```

We want to be able to execute the `src/.docker_modules/fastp/0.20.1/docker_init.sh` script from the root of the project (see the example after the list) to:

- try to download the corresponding container if it exists on the [Docker Hub](https://hub.docker.com/repository/docker/lbmc/)
- if not, build the container from the corresponding [`Dockerfile`](./src/.docker_modules/fastp/0.20.1/Dockerfile), with the same name as the one we would get from the `docker pull` command
- push the container to the [Docker Hub](https://hub.docker.com/repository/docker/lbmc/) (only [laurent.modolo@ens-lyon.fr](mailto:laurent.modolo@ens-lyon.fr) can do this step for the group **lbmc**)
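
For example:

```sh
# from the root of the project
sh src/.docker_modules/fastp/0.20.1/docker_init.sh
```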