first commit

Unverified Commit 2a7ac06c authored 4 years ago by Laurent Modolo
--- a/Makefile
+++ b/Makefile
+README.html: README.md
+	pandoc -s --highlight-style pygments --mathjax README.md -o README.html
--- a/README.md
+++ b/README.md
+---
+title: Nextflow DSL2
+---
+# Nextflow DSL2
+![cc-by-sa](https://licensebuttons.net/l/by-sa/4.0/88x31.png)
+First, the [Wikipedia definition of DSL:](https://en.wikipedia.org/wiki/Domain-specific_language)
+> A **domain-specific language** (**DSL**) is a [computer language](https://en.wikipedia.org/wiki/Computer_language) specialized to a particular application [domain](https://en.wikipedia.org/wiki/Domain_(software_engineering)).
+Nextflow DSL2 was [announced on 24/07/2020](https://www.nextflow.io/blog/2020/dsl2-is-here.html) and is now well documented. It is defined as:
+> a major evolution of the Nextflow language and makes it possible to scale and modularise your data analysis pipeline while continuing to use the Dataflow programming paradigm that characterises the Nextflow processing model.
+This means that we can now split our pipeline between different files, instead of having one huge unreadable file.
+## Enabling DSL2
+DSL2 is supported by every version of nextflow `>= 20.**.**`. You can update your version of nextflow with the following command:
+```bash
+nextflow self-update
+```
+DSL2 is not enabled by default for now, so you need to add the following line to your main `.nf` script:
+```groovy
+nextflow.enable.dsl=2
+```
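+To check that DSL2 is active, a minimal sketch of a complete DSL2 script could look like the following. The process name `say_hello` is hypothetical and is not part of the pipeline described in this document:
+```groovy
+// enable the DSL2 syntax for this script
+nextflow.enable.dsl=2
+// a process without any channel names: only the output type is declared
+process say_hello {
+  output:
+    stdout
+  script:
+"""
+echo "hello DSL2"
+"""
+}
+// the workflow block calls the process and displays what it emits
+workflow {
+  say_hello()
+  say_hello.out.view()
+}
+```
+If the script runs and prints the message, the installed version of nextflow understands the DSL2 syntax.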
+## Nextflow modules
+Nextflow modules are merely generic `process` definitions in which the `input` and `output` declarations no longer name channels with `from` and `into`.
+### `samtools sort` process definition
+```groovy
+Channel
+  .fromPath( params.bam )
+  .map { it -> [it.simpleName, it]}
+  .set { bam_files }
+process sort_bam {
+  tag "$file_id"
+  input:
+    set file_id, file(bam) from bam_files
+  output:
+    set file_id, "*_sorted.bam" into sorted_bam_files
+  script:
+"""
+samtools sort -@ ${task.cpus} -O BAM -o ${file_id}_sorted.bam ${bam}
+"""
+}
+```
+### `samtools sort` module definition
+```groovy
+process sort_bam {
+  tag "$file_id"
+  input:
+    tuple val(file_id), path(bam)
+  output:
+    tuple val(file_id), path("*.bam*")
+  script:
+"""
+samtools sort -@ ${task.cpus} -O BAM -o ${bam.simpleName}_sorted.bam ${bam}
+"""
+}
+```
+We save this module definition in `src/nf_modules/samtools/main.nf`.
+You can now include your module with the following code:
+```groovy
+include { sort_bam } from './nf_modules/samtools/main.nf'
+```
+Mind the `./` at the start of the path: a relative module path is resolved from the including script and must begin with `./` or `../`.
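+Once included, the module is called like a function and receives its input channel as an argument. The following is a minimal sketch, assuming the script sits in `src/` next to the `nf_modules` folder and that `params.bam` is given on the command line as in the DSL1 example above:
+```groovy
+nextflow.enable.dsl=2
+include { sort_bam } from './nf_modules/samtools/main.nf'
+workflow {
+  // same channel construction as in the DSL1 version
+  channel
+    .fromPath( params.bam )
+    .map { it -> [it.simpleName, it] }
+    .set { bam_files }
+  // the channel is passed as an argument instead of using `from`
+  sort_bam(bam_files)
+  // the result is read from `.out` instead of using `into`
+  sort_bam.out.view()
+}
+```
+The channel wiring that lived inside the DSL1 process now lives in the calling script, which is what keeps the module generic.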
+## Workflow
+With **modules** you don't have the channel information to chain one process after another. Nextflow DSL2 introduces the **workflow**.
+A **workflow** is a new block. With a **workflow** you can write [the RNA quantification pipeline from the nextflow practical for experimental biologists](./solution_RNASeq.nf) as follows:
+```groovy
+log.info "fastq files : ${params.fastq}"
+log.info "fasta file : ${params.fasta}"
+log.info "bed file : ${params.bed}"
+channel // same as Channel
+  .fromPath( params.fasta )
+  .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
+  .set { fasta_files }
+channel
+  .fromPath( params.bed )
+  .ifEmpty { error "Cannot find any bed files matching: ${params.bed}" }
+  .set { bed_files }
+channel
+  .fromFilePairs( params.fastq )
+  .ifEmpty { error "Cannot find any fastq files matching: ${params.fastq}" }
+  .set { fastq_files }
+include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
+include { trimming_pairedend } from './nf_modules/urqt/main'
+include { fasta_from_bed } from './nf_modules/bedtools/main'
+include { index_fasta; mapping_fastq_pairedend } from './nf_modules/kallisto/main'
+workflow {
+    adaptor_removal_pairedend(fastq_files)
+    trimming_pairedend(adaptor_removal_pairedend.out.fastq)
+    fasta_from_bed(fasta_files, bed_files)
+    index_fasta(fasta_from_bed.out.fasta)
+    mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
+}
+```
+### Modules outputs
+By default, the outputs of a module are accessible with `module_name.out`. If the module declares several outputs, `module_name.out` is a list and each output is accessed by its index (`module_name.out[0]`, `module_name.out[1]`, etc.).
+You can also name outputs with the `emit` option. For example, in the RNA quantification pipeline, the `adaptor_removal_pairedend` module is defined as follows:
+```groovy
+process adaptor_removal_pairedend {
+  tag "$pair_id"
+  publishDir "results/fastq/adaptor_removal/", mode: 'copy'
+  input:
+  tuple val(pair_id), path(reads)
+  output:
+  tuple val(pair_id), path("*_cut_R{1,2}.fastq.gz"), emit: fastq
+  path "*_report.txt", emit: report
+  script:
+  """
+  cutadapt -a ${adapter_3_prim} -g ${adapter_5_prim} -A ${adapter_3_prim} -G ${adapter_5_prim} \
+  -o ${pair_id}_cut_R1.fastq.gz -p ${pair_id}_cut_R2.fastq.gz \
+  ${reads[0]} ${reads[1]} > ${pair_id}_report.txt
+  """
+}
+```
+Here, the `adaptor_removal_pairedend` module emits two named outputs: `fastq` and `report`.
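+The two ways of reading the outputs can be sketched as follows. This assumes that `fastq_files` is the channel defined in the workflow script above; the `view` calls are only there for illustration:
+```groovy
+include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
+workflow {
+  adaptor_removal_pairedend(fastq_files)
+  // access by name, made possible by `emit: fastq`
+  adaptor_removal_pairedend.out.fastq.view()
+  // access by position: the report is the second declared output
+  adaptor_removal_pairedend.out[1].view()
+}
+```
+Named outputs are usually easier to maintain, because adding or reordering outputs in the module does not change their names.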
+### Modules variable scope
+In the `src/nf_modules/cutadapt/main.nf` we have the following variable definition:
+```groovy
+adapter_3_prim = "AGATCGGAAGAG"
+adapter_5_prim = "CTCTTCCGATCT"
+trim_quality = "20"
+```
+These variables are used in the `adaptor_removal_pairedend` module. When the module is included, they are initialized. However, we can overwrite their values by redefining them in the **workflow** file.
+```groovy
+include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
+include { trimming_pairedend } from './nf_modules/urqt/main'
+include { fasta_from_bed } from './nf_modules/bedtools/main'
+include { index_fasta; mapping_fastq_pairedend } from './nf_modules/kallisto/main'
+adapter_3_prim = "other_adaptor"
+workflow {
+    adaptor_removal_pairedend(fastq_files)
+    trimming_pairedend(adaptor_removal_pairedend.out.fastq)
+    fasta_from_bed(fasta_files, bed_files)
+    index_fasta(fasta_from_bed.out.fasta)
+    mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
+}
+```
+### Implicit channel forking
+With DSL2, the `into` operator is no longer defined, because channels are duplicated automatically!
+We can easily add FastQC steps to our pipeline:
+```groovy
+include { fastqc_fastq_pairedend } from './nf_modules/fastqc/main'
+workflow {
+    adaptor_removal_pairedend(fastq_files)
+    fastqc_fastq_pairedend(fastq_files) // does not cause an error!
+    trimming_pairedend(adaptor_removal_pairedend.out.fastq)
+    fasta_from_bed(fasta_files, bed_files)
+    index_fasta(fasta_from_bed.out.fasta)
+    mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
+}
+```
+Channels are implicitly forked, but this is not the case for **modules**: a given **module** can only be called once in a workflow. We can use `as` in the `include` command to rename a **module** and use the same **module** at different points of the workflow:
+```groovy
+include { 
+  fastqc_fastq_pairedend as fastqc_raw; // mind the ";" !
+  fastqc_fastq_pairedend as fastqc_clipped;
+  fastqc_fastq_pairedend as fastqc_trimmed;
+} from './nf_modules/fastqc/main'
+workflow {
+    fastqc_raw(fastq_files)
+    adaptor_removal_pairedend(fastq_files)
+    fastqc_clipped(adaptor_removal_pairedend.out.fastq)
+    trimming_pairedend(adaptor_removal_pairedend.out.fastq)
+    fastqc_trimmed(trimming_pairedend.out.fastq)
+    fasta_from_bed(fasta_files, bed_files)
+    index_fasta(fasta_from_bed.out.fasta)
+    mapping_fastq_pairedend(index_fasta.out.index.collect(), trimming_pairedend.out.fastq)
+}
+```
+## Sub-workflow
+A sub-workflow can be seen as a **workflow** declared like a **module**. Sub-workflows are named **workflows** that `take` inputs and `emit` outputs. We can split our RNASeq quantification pipeline in the following way:
+```groovy
+workflow read_processing {
+    take:
+      fastq_files
+    main:
+      fastqc_raw(fastq_files)
+      adaptor_removal_pairedend(fastq_files)
+      fastqc_clipped(adaptor_removal_pairedend.out.fastq)
+      trimming_pairedend(adaptor_removal_pairedend.out.fastq)
+      fastqc_trimmed(trimming_pairedend.out.fastq)
+    emit:
+      fastq = trimming_pairedend.out.fastq
+      report = fastqc_raw.out.report
+                 .mix(fastqc_clipped.out.report)
+                 .mix(fastqc_trimmed.out.report)
+}
+workflow {
+    read_processing(fastq_files)
+    fasta_from_bed(fasta_files, bed_files)
+    index_fasta(fasta_from_bed.out.fasta)
+    mapping_fastq_pairedend(index_fasta.out.index.collect(), read_processing.out.fastq)
+}
+```
+Nested workflow execution determines an implicit scope. Therefore the same process can be invoked in two different workflow scopes.
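+As an illustration of this scoping rule, the sketch below calls the same FastQC module from two sub-workflows without renaming it with `as`. The sub-workflow names `qc_raw` and `qc_clipped` are hypothetical, and `fastq_files` is assumed to be the channel defined in the workflow script above:
+```groovy
+include { fastqc_fastq_pairedend } from './nf_modules/fastqc/main'
+include { adaptor_removal_pairedend } from './nf_modules/cutadapt/main'
+// first scope: quality control of the raw reads
+workflow qc_raw {
+    take:
+      reads
+    main:
+      fastqc_fastq_pairedend(reads)
+}
+// second scope: the same process, called on the clipped reads
+workflow qc_clipped {
+    take:
+      reads
+    main:
+      fastqc_fastq_pairedend(reads)
+}
+workflow {
+    qc_raw(fastq_files)
+    adaptor_removal_pairedend(fastq_files)
+    qc_clipped(adaptor_removal_pairedend.out.fastq)
+}
+```
+Each call lives in its own workflow scope, so the two executions should be reported under distinct qualified names that include the name of the enclosing sub-workflow.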
+## DSL2 migration notes
+- Process inputs or outputs of type `set` have to be replaced with [tuple](https://www.nextflow.io/docs/latest/process.html#process-input-tuple).
+- Process output option `mode flatten` is not available any more.
+- Use `path` instead of `file` (`path` can interpret a string as a path).
+- The use of unqualified value and file elements in input tuples is not allowed anymore. Replace the first form below with the second:
+  ```groovy
+  input:
+    tuple X, 'some-file.bam'
+  ```
+  ```groovy
+  input:
+    tuple val(X), path('some-file.bam')
+  ```
+- Operator [bind](https://www.nextflow.io/docs/latest/channel.html#channel-bind1) has been deprecated by DSL2 syntax.
+- Operator [<<](https://www.nextflow.io/docs/latest/channel.html#channel-bind2) has been deprecated by DSL2 syntax.
+- Operator [choice](https://www.nextflow.io/docs/latest/operator.html#operator-choice) has been deprecated by DSL2 syntax. Use [branch](https://www.nextflow.io/docs/latest/operator.html#operator-branch) instead.
+- Operator [close](https://www.nextflow.io/docs/latest/operator.html#operator-close) has been deprecated by DSL2 syntax.
+- Operator [create](https://www.nextflow.io/docs/latest/channel.html#channel-create) has been deprecated by DSL2 syntax.
+- Operator `countBy` has been deprecated by DSL2 syntax.
+- Operator [into](https://www.nextflow.io/docs/latest/operator.html#operator-into) has been deprecated by DSL2 syntax since it’s not needed anymore.
+- Operator `fork` has been renamed to [multiMap](https://www.nextflow.io/docs/latest/operator.html#operator-multimap).
+- Operator `groupBy` has been deprecated by DSL2 syntax. Replace it with [groupTuple](https://www.nextflow.io/docs/latest/operator.html#operator-grouptuple).
+- Operators `print` and `println` have been deprecated by DSL2 syntax. Use [view](https://www.nextflow.io/docs/latest/operator.html#operator-view) instead.
+- Operator [merge](https://www.nextflow.io/docs/latest/operator.html#operator-merge) has been deprecated by DSL2 syntax. Use [join](https://www.nextflow.io/docs/latest/operator.html#operator-join) instead.
+- Operator [separate](https://www.nextflow.io/docs/latest/operator.html#operator-separate) has been deprecated by DSL2 syntax.
+- Operator [spread](https://www.nextflow.io/docs/latest/operator.html#operator-spread) has been deprecated with DSL2 syntax. Replace it with [combine](https://www.nextflow.io/docs/latest/operator.html#operator-combine).
+- Operator `route` has been deprecated by DSL2 syntax.
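+Two of these replacements, `branch` instead of `choice` and `view` instead of `println`, can be sketched as follows. The values and the branch names `small` and `large` are made up for the example:
+```groovy
+nextflow.enable.dsl=2
+workflow {
+    // `branch` routes each item to the first branch whose condition is true
+    channel
+      .of(1, 2, 3, 40, 50)
+      .branch {
+        small: it < 10
+        large: it >= 10
+      }
+      .set { numbers }
+    // `view` replaces the `print` and `println` operators
+    numbers.small.view { "small: $it" }
+    numbers.large.view { "large: $it" }
+}
+```
+Each branch is a regular channel, so it can be passed to a module or a sub-workflow like any other input.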
+To see all the changes you can read the [DSL2 section of the documentation](https://www.nextflow.io/docs/latest/dsl2.html#) and re-read the [full nextflow documentation...](https://www.nextflow.io/docs/latest/index.html)
--- a/img/cc_by_sa.png
+++ b/img/cc_by_sa.png
--- a/solution_RNASeq.nf
+++ b/solution_RNASeq.nf
+log.info "fastq files : ${params.fastq}"
+log.info "fasta file : ${params.fasta}"
+log.info "bed file : ${params.bed}"
+Channel
+  .fromPath( params.fasta )
+  .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
+  .set { fasta_files }
+Channel
+  .fromPath( params.bed )
+  .ifEmpty { error "Cannot find any bed files matching: ${params.bed}" }
+  .set { bed_files }
+Channel
+  .fromFilePairs( params.fastq )
+  .ifEmpty { error "Cannot find any fastq files matching: ${params.fastq}" }
+  .set { fastq_files }
+process adaptor_removal {
+  tag "$pair_id"
+  publishDir "results/fastq/adaptor_removal/", mode: 'copy'
+  input:
+  set pair_id, file(reads) from fastq_files
+  output:
+  set pair_id, "*_cut_R{1,2}.fastq.gz" into fastq_files_cut
+  script:
+  """
+  cutadapt -a AGATCGGAAGAG -g CTCTTCCGATCT -A AGATCGGAAGAG -G CTCTTCCGATCT \
+  -o ${pair_id}_cut_R1.fastq.gz -p ${pair_id}_cut_R2.fastq.gz \
+  ${reads[0]} ${reads[1]} > ${pair_id}_report.txt
+  """
+}
+process trimming {
+  tag "${reads}"
+  publishDir "results/fastq/trimming/", mode: 'copy'
+  input:
+  set pair_id, file(reads) from fastq_files_cut
+  output:
+  set pair_id, "*_trim_R{1,2}.fastq.gz" into fastq_files_trim
+  script:
+"""
+UrQt --t 20 --m ${task.cpus} --gz \
+--in ${reads[0]} --inpair ${reads[1]} \
+--out ${pair_id}_trim_R1.fastq.gz --outpair ${pair_id}_trim_R2.fastq.gz \
+> ${pair_id}_trimming_report.txt
+"""
+}
+process fasta_from_bed {
+  tag "${bed.baseName}"
+  publishDir "results/fasta/", mode: 'copy'
+  input:
+  file fasta from fasta_files
+  file bed from bed_files
+  output:
+  file "*_extracted.fasta" into fasta_files_extracted
+  script:
+"""
+bedtools getfasta -name \
+-fi ${fasta} -bed ${bed} -fo ${bed.baseName}_extracted.fasta
+"""
+}
+process index_fasta {
+  tag "$fasta.baseName"
+  publishDir "results/mapping/index/", mode: 'copy'
+  input:
+    file fasta from fasta_files_extracted
+  output:
+    file "*.index*" into index_files
+    file "*_kallisto_report.txt" into index_files_report
+  script:
+"""
+kallisto index -k 31 --make-unique -i ${fasta.baseName}.index ${fasta} \
+2> ${fasta.baseName}_kallisto_report.txt
+"""
+}
+process mapping_fastq {
+  tag "$reads"
+  publishDir "results/mapping/quantification/", mode: 'copy'
+  input:
+  set pair_id, file(reads) from fastq_files_trim
+  file index from index_files.collect()
+  output:
+  file "*" into counts_files
+  script:
+"""
+mkdir ${pair_id}
+kallisto quant -i ${index} -t ${task.cpus} \
+--bias --bootstrap-samples 100 -o ${pair_id} \
+${reads[0]} ${reads[1]} &> ${pair_id}/kallisto_report.txt
+"""
+}