-
Laurent Modolo authoredLaurent Modolo authored
Contributing
When contributing to this repository, please first discuss the change you wish to make via issue, email, or on the ENS-Bioinfo channel before making a change.
Project organisation
The LBMC/nextflow
project is structured as follow:
- all the code is in the
src/
folder - scripts downloading external tools should download them in the
bin/
folder - all the documentation (including this file) can be found int he
doc/
folder - the
data
andresults
folders contain the data and results of your piplines and are ignored bygit
Code structure
The src/
folder is where we want to save the pipline (.nf
) script. This folder also contains:
- the
src/install_nextflow.sh
to install the nextflow executable at the root of the project. - some pipelines examples (like the one build during the nf_pratical)
- the
src/nextflow.config
global configuration file which contains thedocker
,singularity
,psmn
andccin2p3
profiles. - the
src/nf_modules
folder contains per toolsmain.nf
modules with predefined process that users can imports in their projects with the DSL2
But also some hidden folders that users don't need to see when building their pipeline:
- the
src/.docker_modules
contains the recipies for thedocker
containers used in thesrc/nf_modules/<tool_names>/main.nf
files - the
src/.singularity_in2p3
andsrc/.singularity_psmn
are symbolic links to the shared folder where the singularity images are downloaded on the PSMN and CCIN2P3
Proposing a new tool
Each tool named <tool_name>
must have two dedicated folders:
-
src/nf_modules/<tool_name>
where users can find.nf
files to include -
src/.docker_modules/<tool_name>/<version_number>
where we have the.Dockerfile
to construct the container used in themain.nf
file
src/nf_module
guide lines
We are going to take the fastp
, nf_module
as an example.
The src/nf_modules/<tool_name>
should contain a main.nf
file that describe at least one process using <tool_name>
container informations
The first two lines of main.nf
should define two variables
version = "0.20.1"
container_url = "lbmc/fastp:${version}"
we can then use the container_url
definition in each process
in the container
attribute.
In addition to the container
directive, each process
should have one of the following label
attributes (defined in the src/nextflow.config
file)
big_mem_mono_cpus
big_mem_multi_cpus
small_mem_mono_cpus
small_mem_multi_cpus
process fastp {
container = "${container_url}"
label = "big_mem_multi_cpus"
...
}
process options
Before each process, you shoud declare at least two params.
variables:
- A
params.<process_name>
defaulting to""
(empty string) to allow user to add more commmand line option to your process without rewritting the process definition - A
params.<process_name>_out
defaulting to""
(empty string) that define theresults/
subfolder where the process output should be copied if the user want to save the process output
params.fastp = ""
params.fastp_out = ""
process fastp {
container = "${container_url}"
label "big_mem_multi_cpus"
if (params.fastp_out != "") {
publishDir "results/${params.fastp_out}", mode: 'copy'
}
...
script:
"""
fastp --thread ${task.cpus} \
${params.fastp} \
...
"""
}
The user can then change the value of these variables:
- from the command line `--fastp "--trim_head1=10"``
- with the
include
command within their pipeline: `include { fastq } from "nf_modules/fastq/main" addParams(fastq_out: "QC/fastq/") - by defining the variable within their pipeline: `params.fastq_out = "QC/fastq/"
input
and output
format
You should always use tuple
for input and output channel format with at least:
- a
val
containing variable(s) related to the item - a
path
for the file(s) that you want to process
for example:
process fastp {
container = "${container_url}"
label "big_mem_multi_cpus"
tag "$file_id"
if (params.fastp_out != "") {
publishDir "results/${params.fastp_out}", mode: 'copy'
}
input:
tuple val(file_id), path(reads)
output:
tuple val(file_id), path("*.fastq.gz"), emit: fastq
tuple val(file_id), path("*.html"), emit: html
tuple val(file_id), path("*.json"), emit: report
...
Here file_id
can be anything from a simple identifier to a list of several variables.
So you have to keep that in mind if you want to use it to define output file names (you can test for that with file_id instanceof List
).
If you want to use information within the file_id
to name outputs in your script
section, you can use the following snipet:
script:
if (file_id instanceof List){
file_prefix = file_id[0]
} else {
file_prefix = file_id
}
and use the file_prefix
variable.
This also means that channel emitting path
item should be transformed with at least the following map function:
.map { it -> [it.simpleName, it]}
for example:
channel
.fromPath( params.fasta )
.ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
.map { it -> [it.simpleName, it]}
.set { fasta_files }
The rational behind taking a file_id
and emitting the same file_id
is to facilitate complex channel operations in pipelines without having to rewrite the process
blocks.
dealing with paired-end and single-end data
Fastq files opened with channel.fromFilePairs( params.fastq )
create item of the following shape:
[file_id, [read_1_file, read_2_file]]
To make this call more generic, we can use the size: -1
option, and accept arbitrary number of associated fastq file:
channel.fromFilePairs( params.fastq, size: -1 )
will thus give [file_id, [read_1_file, read_2_file]]
for paired-end data and [file_id, [read_1_file]]
for single-end data
...
script:
if (file_id instanceof List){
file_prefix = file_id[0]
} else {
file_prefix = file_id
}
if (reads.size() == 2)
"""
fastp --thread ${task.cpus} \
${params.fastp} \
--in1 ${reads[0]} \
--in2 ${reads[1]} \
--out1 ${file_prefix}_R1_trim.fastq.gz \
--out2 ${file_prefix}_R2_trim.fastq.gz \
--html ${file_prefix}.html \
--json ${file_prefix}_fastp.json \
--report_title ${file_prefix}
"""
else if (reads.size() == 1)
"""
fastp --thread ${task.cpus} \
${params.fastp} \
--in1 ${reads[0]} \
--out1 ${file_prefix}_trim.fastq.gz \
--html ${file_prefix}.html \
--json ${file_prefix}_fastp.json \
--report_title ${file_prefix}
"""
...