Commit 2a7ac06c authored by Laurent Modolo

first commit

README.html: README.md
pandoc -s --highlight-style pygments --mathjax README.md -o README.html
README.md 0 → 100644
---
title: Nextflow DSL2
---
# Nextflow DSL2
![cc-by-sa](https://licensebuttons.net/l/by-sa/4.0/88x31.png)
First, the [Wikipedia definition of DSL:](https://en.wikipedia.org/wiki/Domain-specific_language)
> A **domain-specific language** (**DSL**) is a [computer language](https://en.wikipedia.org/wiki/Computer_language) specialized to a particular application [domain](https://en.wikipedia.org/wiki/Domain_(software_engineering)).
Nextflow DSL2 was [announced on 24/07/2020](https://www.nextflow.io/blog/2020/dsl2-is-here.html) and is now well documented. It is defined as:
> a major evolution of the Nextflow language and makes it possible to scale and modularise your data analysis pipeline while continuing to use the Dataflow programming paradigm that characterises the Nextflow processing model.
This means that we can now split our pipeline between different files, instead of having one huge unreadable file.
## Enabling DSL2
DSL2 is supported by every version of Nextflow `>= 20.**.**`. You can update your version of Nextflow with the following command:
```bash
nextflow self-update
```
DSL2 is not enabled by default for now, so you need to add the following line to your main `.nf` script:
```groovy
nextflow.enable.dsl=2
```
## Nextflow modules
Nextflow modules are merely generic `process` definitions in which neither the `input` `from` nor the `output` `into` channel names are specified.
### `samtools sort` process definition
```groovy
Channel
.fromPath( params.bam )
.map { it -> [it.simpleName, it]}
.set { bam_files }
process sort_bam {
tag "$file_id"
input:
set file_id, file(bam) from bam_files
output:
set file_id, "*_sorted.bam" into sorted_bam_files
script:
"""
samtools sort -@ ${task.cpus} -O BAM -o ${file_id}_sorted.bam ${bam}
"""
}
```
### `samtools sort` module definition
```groovy
process sort_bam {
tag "$file_id"
input:
tuple val(file_id), path(bam)
output:
tuple val(file_id), path("*.bam*")
script:
"""
samtools sort -@ ${task.cpus} -O BAM -o ${bam.simpleName}_sorted.bam ${bam}
"""
}
```
We save this module definition in `src/nf_modules/samtools/main.nf`.
You can now include your module with the following code:
```groovy
include { sort_bam } from './nf_modules/samtools/main.nf'
```
Mind the `./` at the start of the path.
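As a minimal sketch, the included module can then be called from the main script. This assumes that the main script sits next to the `nf_modules` folder, that `params.bam` is passed on the command line (for example `--bam "data/*.bam"`, a hypothetical path), and it uses the `workflow` block described in the next section:

```groovy
nextflow.enable.dsl=2

include { sort_bam } from './nf_modules/samtools/main.nf'

// build the (file_id, bam) tuples expected by the module input
channel
  .fromPath( params.bam )
  .map { it -> [it.simpleName, it] }
  .set { bam_files }

workflow {
  // the channel is passed as an argument instead of `from bam_files`
  sort_bam(bam_files)
  // the output channel replaces `into sorted_bam_files`
  sort_bam.out.view()
}
```

The channel names that were hard-coded in the DSL1 process now live only in the calling script, which is what makes the module reusable.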
## Workflow
With **modules** you don't have the channel information to chain one process after another. Nextflow DSL2 introduces the **workflow**.
A **workflow** is a new block. With a **workflow** you can write [the RNA quantification pipeline from the nextflow practical for experimental biologists](./solution_RNASeq.nf) as follows:
```groovy
log.info "fastq files : ${params.fastq}"
log.info "fasta file : ${params.fasta}"
log.info "bed file : ${params.bed}"
channel // same as Channel
.fromPath( params.fasta )
.ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
.set { fasta_files }
channel
.fromPath( params.bed )
.ifEmpty { error "Cannot find any bed files matching: ${params.bed}" }
.set { bed_files }
channel
.fromFilePairs( params.fastq )
.ifEmpty { error "Cannot find any fastq files matching: ${params.fastq}" }
.set { fastq_files }
include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
include { trimming_pairedend } from './nf_modules/urqt/main'
include { fasta_from_bed } from './nf_modules/bedtools/main'
include { index_fasta; mapping_fastq_pairedend } from './nf_modules/kallisto/main'
workflow {
adaptor_removal_pairedend(fastq_files)
trimming_pairedend(adaptor_removal_pairedend.out.fastq)
fasta_from_bed(fasta_files, bed_files)
index_fasta(fasta_from_bed.out.fasta)
mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}
```
### Modules outputs
By default, module outputs are accessible through `module_name.out`. If the module has several outputs, `module_name.out` is a list whose items can be accessed by index (for example `module_name.out[0]`).
You can also have named outputs with the `emit` definition. For example, in the RNA quantification pipeline, the `adaptor_removal_pairedend` module is defined as follows:
```groovy
process adaptor_removal_pairedend {
tag "$pair_id"
publishDir "results/fastq/adaptor_removal/", mode: 'copy'
input:
tuple val(pair_id), path(reads)
output:
tuple val(pair_id), path("*_cut_R{1,2}.fastq.gz"), emit: fastq
path "*_report.txt", emit: report
script:
"""
cutadapt -a ${adapter_3_prim} -g ${adapter_5_prim} -A ${adapter_3_prim} -G ${adapter_5_prim} \
-o ${pair_id}_cut_R1.fastq.gz -p ${pair_id}_cut_R2.fastq.gz \
${reads[0]} ${reads[1]} > ${pair_id}_report.txt
"""
}
```
Here, the `adaptor_removal_pairedend` module emits two named items: `fastq` and `report`.
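The two ways of accessing these outputs can be sketched as follows. This assumes a `fastq_files` channel built with `fromFilePairs` as in the pipeline above; the `view` calls are only there for illustration:

```groovy
include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'

workflow {
  adaptor_removal_pairedend(fastq_files)

  // access by name, using the `emit` definitions
  adaptor_removal_pairedend.out.fastq.view()
  adaptor_removal_pairedend.out.report.view()

  // access by position: out[0] is the first declared output (the fastq tuple)
  adaptor_removal_pairedend.out[0].view()
}
```

Named outputs are usually easier to maintain, because adding or reordering outputs in the module does not change the meaning of `out.fastq`.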
### Modules variable scope
In the `src/nf_modules/cutadapt/main.nf` we have the following variable definition:
```groovy
adapter_3_prim = "AGATCGGAAGAG"
adapter_5_prim = "CTCTTCCGATCT"
trim_quality = "20"
```
These variables are used in the `adaptor_removal_pairedend` module. When the module is included, they are initialized. However, we can overwrite their values by redefining them in the **workflow** file:
```groovy
include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
include { trimming_pairedend } from './nf_modules/urqt/main'
include { fasta_from_bed } from './nf_modules/bedtools/main'
include { index_fasta; mapping_fastq_pairedend } from './nf_modules/kallisto/main'
adapter_3_prim = "other_adaptor"
workflow {
adaptor_removal_pairedend(fastq_files)
trimming_pairedend(adaptor_removal_pairedend.out.fastq)
fasta_from_bed(fasta_files, bed_files)
index_fasta(fasta_from_bed.out.fasta)
mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}
```
### Implicit channel forking
With DSL2, the operator `into` is no longer defined, because channels are duplicated automatically!
We can easily add FastQC steps to our pipeline:
```groovy
include { fastqc_fastq_pairedend } from './nf_modules/fastqc/main'
workflow {
adaptor_removal_pairedend(fastq_files)
fastqc_fastq_pairedend(fastq_files) // does not cause an error!
trimming_pairedend(adaptor_removal_pairedend.out.fastq)
fasta_from_bed(fasta_files, bed_files)
index_fasta(fasta_from_bed.out.fasta)
mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}
```
Channels are implicitly forked, but this is not the case for **modules**. We can use `as` in the `include` command to rename **modules** and use the same **module** at different points of the workflow:
```groovy
include {
fastqc_fastq_pairedend as fastqc_raw; // mind the ";" !
fastqc_fastq_pairedend as fastqc_clipped;
fastqc_fastq_pairedend as fastqc_trimmed;
} from './nf_modules/fastqc/main'
workflow {
fastqc_raw(fastq_files)
adaptor_removal_pairedend(fastq_files)
fastqc_clipped(adaptor_removal_pairedend.out.fastq)
trimming_pairedend(adaptor_removal_pairedend.out.fastq)
fastqc_trimmed(trimming_pairedend.out.fastq)
fasta_from_bed(fasta_files, bed_files)
index_fasta(fasta_from_bed.out.fasta)
mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
}
```
## Sub-workflow
A sub-workflow can be seen as a **workflow** declared like a **module**. Sub-**workflows** are **workflows** that `take` inputs and `emit` outputs. We can split our RNASeq quantification pipeline in the following way:
```groovy
workflow read_processing {
take:
fastq_files
main:
fastqc_raw(fastq_files)
adaptor_removal_pairedend(fastq_files)
fastqc_clipped(adaptor_removal_pairedend.out.fastq)
trimming_pairedend(adaptor_removal_pairedend.out.fastq)
fastqc_trimmed(trimming_pairedend.out.fastq)
emit:
fastq = trimming_pairedend.out.fastq
report = fastqc_raw.out.report
.mix(fastqc_clipped.out.report)
.mix(fastqc_trimmed.out.report)
}
workflow {
read_processing(fastq_files)
fasta_from_bed(fasta_files, bed_files)
index_fasta(fasta_from_bed.out.fasta)
mapping_fastq_pairedend(index_fasta.out.index.collect(), read_processing.out.fastq)
}
```
Nested workflow execution determines an implicit scope. Therefore the same process can be invoked in two different workflow scopes.
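A minimal sketch of this scoping rule, assuming the behaviour described above and using two hypothetical sub-workflows (`qc_raw` and `qc_clipped`) that both call the same FastQC module without renaming it:

```groovy
nextflow.enable.dsl=2

include { fastqc_fastq_pairedend } from './nf_modules/fastqc/main'
include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'

channel
  .fromFilePairs( params.fastq )
  .set { fastq_files }

// hypothetical sub-workflow: quality control of the raw reads
workflow qc_raw {
  take:
  reads
  main:
  fastqc_fastq_pairedend(reads)
}

// hypothetical sub-workflow: adaptor removal followed by quality control
workflow qc_clipped {
  take:
  reads
  main:
  adaptor_removal_pairedend(reads)
  fastqc_fastq_pairedend(adaptor_removal_pairedend.out.fastq)
}

workflow {
  // each call to fastqc_fastq_pairedend lives in its own workflow scope
  qc_raw(fastq_files)
  qc_clipped(fastq_files)
}
```

Within a single workflow scope, the `as` renaming shown earlier is still needed to call the same module more than once.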
## DSL2 migration notes
- Process inputs or outputs of type `set` have to be replaced with [tuple](https://www.nextflow.io/docs/latest/process.html#process-input-tuple).
- Process output option `mode flatten` is not available any more.
- Use `path` instead of `file` (can interpret string as path)
- The use of unqualified value and file elements in input tuples is not allowed anymore. Replace:
```groovy
input:
tuple X, 'some-file.bam'
```
with:
```groovy
input:
tuple val(X), path('some-file.bam')
```
- Operator [bind](https://www.nextflow.io/docs/latest/channel.html#channel-bind1) has been deprecated by DSL2 syntax.
- Operator [<<](https://www.nextflow.io/docs/latest/channel.html#channel-bind2) has been deprecated by DSL2 syntax.
- Operator [choice](https://www.nextflow.io/docs/latest/operator.html#operator-choice) has been deprecated by DSL2 syntax. Use [branch](https://www.nextflow.io/docs/latest/operator.html#operator-branch) instead.
- Operator [close](https://www.nextflow.io/docs/latest/operator.html#operator-close) has been deprecated by DSL2 syntax.
- Operator [create](https://www.nextflow.io/docs/latest/channel.html#channel-create) has been deprecated by DSL2 syntax.
- Operator `countBy` has been deprecated by DSL2 syntax.
- Operator [into](https://www.nextflow.io/docs/latest/operator.html#operator-into) has been deprecated by DSL2 syntax since it’s not needed anymore.
- Operator `fork` has been renamed to [multiMap](https://www.nextflow.io/docs/latest/operator.html#operator-multimap).
- Operator `groupBy` has been deprecated by DSL2 syntax. Replace it with [groupTuple](https://www.nextflow.io/docs/latest/operator.html#operator-grouptuple).
- Operators `print` and `println` have been deprecated by DSL2 syntax. Use [view](https://www.nextflow.io/docs/latest/operator.html#operator-view) instead.
- Operator [merge](https://www.nextflow.io/docs/latest/operator.html#operator-merge) has been deprecated by DSL2 syntax. Use [join](https://www.nextflow.io/docs/latest/operator.html#operator-join) instead.
- Operator [separate](https://www.nextflow.io/docs/latest/operator.html#operator-separate) has been deprecated by DSL2 syntax.
- Operator [spread](https://www.nextflow.io/docs/latest/operator.html#operator-spread) has been deprecated with DSL2 syntax. Replace it with [combine](https://www.nextflow.io/docs/latest/operator.html#operator-combine).
- Operator `route` has been deprecated by DSL2 syntax.
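As an illustration of two of these replacements, the sketch below uses `branch` where a DSL1 script would have used `choice`, and `view` where it would have used `println`. The values and the `small`/`large` branch names are made up for the example:

```groovy
nextflow.enable.dsl=2

channel
  .of( 1, 2, 3, 40, 50 )
  .branch {
    // each item goes to the first branch whose condition is true
    small: it < 10
    large: it >= 10
  }
  .set { numbers }

// `view` replaces the deprecated `print` and `println` operators
numbers.small.view { "$it is small" }
numbers.large.view { "$it is large" }
```

The result of `branch` is a multi-channel object, so each branch is read by name, in the same way as the named outputs of a module.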
To see all the changes you can read the [DSL2 section of the documentation](https://www.nextflow.io/docs/latest/dsl2.html#) and re-read the [full nextflow documentation...](https://www.nextflow.io/docs/latest/index.html)
img/cc_by_sa.png

1.48 KiB

log.info "fastq files : ${params.fastq}"
log.info "fasta file : ${params.fasta}"
log.info "bed file : ${params.bed}"
Channel
.fromPath( params.fasta )
.ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
.set { fasta_files }
Channel
.fromPath( params.bed )
.ifEmpty { error "Cannot find any bed files matching: ${params.bed}" }
.set { bed_files }
Channel
.fromFilePairs( params.fastq )
.ifEmpty { error "Cannot find any fastq files matching: ${params.fastq}" }
.set { fastq_files }
process adaptor_removal {
tag "$pair_id"
publishDir "results/fastq/adaptor_removal/", mode: 'copy'
input:
set pair_id, file(reads) from fastq_files
output:
set pair_id, "*_cut_R{1,2}.fastq.gz" into fastq_files_cut
script:
"""
cutadapt -a AGATCGGAAGAG -g CTCTTCCGATCT -A AGATCGGAAGAG -G CTCTTCCGATCT \
-o ${pair_id}_cut_R1.fastq.gz -p ${pair_id}_cut_R2.fastq.gz \
${reads[0]} ${reads[1]} > ${pair_id}_report.txt
"""
}
process trimming {
tag "${reads}"
publishDir "results/fastq/trimming/", mode: 'copy'
input:
set pair_id, file(reads) from fastq_files_cut
output:
set pair_id, "*_trim_R{1,2}.fastq.gz" into fastq_files_trim
script:
"""
UrQt --t 20 --m ${task.cpus} --gz \
--in ${reads[0]} --inpair ${reads[1]} \
--out ${pair_id}_trim_R1.fastq.gz --outpair ${pair_id}_trim_R2.fastq.gz \
> ${pair_id}_trimming_report.txt
"""
}
process fasta_from_bed {
tag "${bed.baseName}"
publishDir "results/fasta/", mode: 'copy'
input:
file fasta from fasta_files
file bed from bed_files
output:
file "*_extracted.fasta" into fasta_files_extracted
script:
"""
bedtools getfasta -name \
-fi ${fasta} -bed ${bed} -fo ${bed.baseName}_extracted.fasta
"""
}
process index_fasta {
tag "$fasta.baseName"
publishDir "results/mapping/index/", mode: 'copy'
input:
file fasta from fasta_files_extracted
output:
file "*.index*" into index_files
file "*_kallisto_report.txt" into index_files_report
script:
"""
kallisto index -k 31 --make-unique -i ${fasta.baseName}.index ${fasta} \
2> ${fasta.baseName}_kallisto_report.txt
"""
}
process mapping_fastq {
tag "$reads"
publishDir "results/mapping/quantification/", mode: 'copy'
input:
set pair_id, file(reads) from fastq_files_trim
file index from index_files.collect()
output:
file "*" into counts_files
script:
"""
mkdir ${pair_id}
kallisto quant -i ${index} -t ${task.cpus} \
--bias --bootstrap-samples 100 -o ${pair_id} \
${reads[0]} ${reads[1]} &> ${pair_id}/kallisto_report.txt
"""
}