building_your_pipeline.md: add details

10d46cb0 · Laurent Modolo · d160652a · 10d46cb0
Verified Commit 10d46cb0 authored 3 years ago by Laurent Modolo
--- a/doc/building_your_pipeline.md
+++ b/doc/building_your_pipeline.md
@@ -212,6 +212,8 @@ head ${fasta} > ${fasta.simpleName}_sample.fasta
 Add this to your `src/fasta_sampler.nf` file with the WebIDE and commit it to your repository before pulling your modifications locally.
 You can run your pipeline again and check the content of the folder `results/sampling`.
+Congratulations, you built your first one-step Nextflow pipeline!
 # Build your own RNASeq pipeline
@@ -229,7 +231,7 @@ nextflow.enable.dsl=2
 The first step of the pipeline is to remove any Illumina adaptors left in your read files and to trim your reads by quality.
-The [LBMC/nextflow](https://gitbio.ens-lyon.fr/LBMC/nextflow) template provide you with many tools for which you can find a predefined `process` block.
+The [LBMC/nextflow](https://gitbio.ens-lyon.fr/LBMC/nextflow) template provides you with many tools, for which you can find a predefined `process` block.
 You can find a list of these tools in the [`src/nf_modules`](./src/nf_modules) folder.
 You can also ask for a new tool by creating a [new issue for it](https://gitbio.ens-lyon.fr/LBMC/nextflow/-/issues/new) in the [LBMC/nextflow](https://gitbio.ens-lyon.fr/LBMC/nextflow) project.
@@ -238,11 +240,12 @@ We are going to include the [`src/nf_modules/fastp/main.nf`](./src/nf_modules/fa
 ```Groovy
 include { fastp } from "./nf_modules/fastp/main.nf"
 ```
+The path `./nf_modules/fastp/main.nf` is relative to the `src/RNASeq.nf` file, which is why we don’t include the `src/` part of the path.
 With this line we can call the `fastp` block in our future `workflow` without having to write it !
-If we check the content of the file [`src/nf_modules/fastp/main.nf`](./src/nf_modules/fastp/main.nf), we can see that by including `fastp`, we are including a sub-`workflow` (we will come back on this object latter).
+If we check the content of the file [`src/nf_modules/fastp/main.nf`](./src/nf_modules/fastp/main.nf), we can see that by including `fastp`, we are including a sub-`workflow` (we will come back to this object later). Sub-`workflow`s can be used like `process`es.
-This `sub-workflow` takes a `fastq` `channel`. We need to make one
-The `./nf_modules/fastp/main.nf` is relative to the `src/RNASeq.nf` file, this is why we don’t include the `src/` part of the path.
+This `sub-workflow` takes a `fastq` `channel`. We need to make one:
 ```Groovy
 channel
@@ -250,8 +253,9 @@ channel
  .set { fastq_files }
 ```
-The `.fromFilePairs()` can create a `channel` of pair of fastq files. Therefore, the items emitted by the `fastq_files` channel are going to be pairs of fastq for paired-end data.
+The `.fromFilePairs()` function creates a `channel` of pairs of fastq files. Therefore, the items emitted by the `fastq_files` channel are going to be pairs of fastq files for paired-end data.
-The option `size: -1` allows arbitrary number of associated files. Therefore, we can use the same `channel` creation for single-end data.
+The option `size: -1` allows for arbitrary numbers of associated files. Therefore, we can use the same `channel` creation for single-end data.
 We can now add the `workflow` definition, which passes the `fastq_files` `channel` to `fastp`, to our `src/RNASeq.nf` file.
@@ -274,13 +278,15 @@ What is happening ?
 Nextflow tells you the following error: `fastp: command not found`. You don’t have `fastp` installed on your computer.
 Installing tools can be a tedious process, and reinstalling old versions of those tools to reproduce old analyses can be very difficult.
-Containers technologies like [Docker](https://www.docker.com/) or [Singularity](https://sylabs.io/singularity/) allows to create small virtual environments where we can install software in a given version with all it’s dependencies. This environment can be saved, and share, to have access to this exact working version of the software.
+Container technologies like [Docker](https://www.docker.com/) or [Singularity](https://sylabs.io/singularity/) create small virtual environments where we can install software in a given version with all its dependencies. This environment can be saved and shared to give access to this exact working version of the software.
 > Why two different systems ?
-> Docker is easy to use and can be installed on Windows / MacOS / GNU/Linux but need admin rights
-> Singularity can only be used on GNU/Linux but don’t need admin rights, and can be used on shared environment
-The [LBMC/nextflow](https://gitbio.ens-lyon.fr/LBMC/nextflow) template provide you with [4 different `-profile`s to run your pipeline](https://gitbio.ens-lyon.fr/LBMC/nextflow/-/blob/master/doc/getting_started.md#nextflow-profile).
+> Docker is easy to use and can be installed on Windows / MacOS / GNU/Linux but needs admin rights.
+> Singularity can only be used on GNU/Linux but doesn’t need admin rights, and can be used on shared environments.
+The [LBMC/nextflow](https://gitbio.ens-lyon.fr/LBMC/nextflow) template provides you with [4 different `-profile`s to run your pipeline](https://gitbio.ens-lyon.fr/LBMC/nextflow/-/blob/master/doc/getting_started.md#nextflow-profile).
 Profiles are defined in the [`src/nextflow.config`](./src/nextflow.config), which is the default configuration file for your pipeline (you don’t have to edit this file).
 To run the pipeline locally you can use the profile `singularity` or `docker`
@@ -289,11 +295,11 @@ To run the pipeline locally you can use the profile `singularity` or `docker`
 ./nextflow src/RNASeq.nf -profile singularity
 ```
-The `fastp` `singularity` or `docker` image is downloaded automatically and the fastq files are processed.
+The `singularity` or `docker` image for `fastp` is downloaded automatically and the fastq files are processed.
 ## Pipeline `--` arguments
-We have defined the fastq file path within our `src/RNASeq.nf` file.
+We have defined the path of the fastq files within our `src/RNASeq.nf` file.
 But what if we want to share our pipeline with someone who doesn’t want to analyze the `tiny_dataset` but other fastq files?
 We can define a variable instead of fixing the path.
@@ -305,7 +311,10 @@ channel
 ```
 We declare a variable that contains the path of the fastq file to look for. The advantage of using `params.fastq` is that the option `--fastq` is now a parameter of your pipeline.
-Thus, you can call your pipeline with the `--fastq` option:
+Thus, you can call your pipeline with the `--fastq` option.
+You can commit your `src/RNASeq.nf` file, `pull` your modifications locally and run your pipeline with the command:
 ```sh
 ./nextflow src/RNASeq.nf -profile singularity --fastq "data/tiny_dataset/fastq/*_R{1,2}.fastq"
@@ -321,9 +330,9 @@ This line simply displays the value of the variable
 ## BEDtools
-We need the sequences of the transcripts that need to be quantified. We are going to extract these sequences from the reference `data/tiny_dataset/fasta/tiny_v2.fasta` with the `bed` annotation `data/tiny_dataset/annot/tiny.bed`.
+We need the sequences of the transcripts that need to be quantified. We are going to extract these sequences from the reference `data/tiny_dataset/fasta/tiny_v2.fasta` with the `bed` annotation file `data/tiny_dataset/annot/tiny.bed`.
-You include the `fasta_from_bed` process from the [src/nf_modules/bedtools/main.nf](https://gitbio.ens-lyon.fr/LBMC/nextflow/blob/master/src/nf_modules/bedtools/main.nf) file to your `src/RNASeq.nf` file.
+You can include the `fasta_from_bed` `process` from the [src/nf_modules/bedtools/main.nf](https://gitbio.ens-lyon.fr/LBMC/nextflow/blob/master/src/nf_modules/bedtools/main.nf) file in your `src/RNASeq.nf` file.
 You need to be able to input a `fasta_files` `channel` and a `bed_files` `channel`.
@@ -345,7 +354,7 @@ channel
 We introduce 2 new operators:
 - `.ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }` to throw an error if the path of the file is not right
-- `.map { it -> [it.simpleName, it]}` to transform our `channel` to a format compatible with the [`CONTRIBUTING`](../CONTRIBUTING.md) rules
+- `.map { it -> [it.simpleName, it]}` to transform our `channel` to a format compatible with the [`CONTRIBUTING`](../CONTRIBUTING.md) rules. Items in the `channel` have the following shape `[file_id, [file]]`, like the ones emitted by the `.fromFilePairs(..., size: -1)` function.
 We can add the `fasta_from_bed` step to our `workflow`
@@ -364,7 +373,7 @@ Commit your work and test your pipeline with the following command:
 ## Kallisto
-Kallisto run in two steps: the indexation of the reference and the quantification on this index.
+Kallisto runs in two steps: the indexing of the reference and the quantification of the transcripts on this index.
 You can include two `process`es with the following syntax:
@@ -372,8 +381,9 @@ You can include two `process`es with the following syntax:
 include { index_fasta; mapping_fastq } from './nf_modules/kallisto/main.nf'
 ```
-The `index_fasta` process needs to take as input the output of your `fasta_from_bed` `process`.
+The `index_fasta` process needs to take as input the output of your `fasta_from_bed` `process`, which has the shape `[fasta_id, [fasta_file]]`.
-The input of your `mapping_fastq` `process` needs to take as input and the output of your `index_fasta` `process` and the `fastp` `process`.
+Your `mapping_fastq` `process` needs to take as input both the output of your `index_fasta` `process` and the output of the `fastp` `process`, of shapes `[index_id, [index_file]]` and `[fastq_id, [fastq_r1_file, fastq_r2_file]]` respectively.
 The output of a `process` is accessible through `<process_name>.out`.
 In cases where we have an `emit: <channel_name>`, we can access the corresponding channel with `<process_name>.out.<channel_name>`.
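
The `emit:` and `.out` mechanism described in the last lines of this diff can be illustrated with a minimal, self-contained sketch. The `shout` process and its `shouted` channel name are hypothetical: they are not part of the LBMC/nextflow template and only stand in for `fastp`, `index_fasta` and the other modules. The script assumes a Nextflow version with DSL2 support.

```Groovy
nextflow.enable.dsl=2

// Hypothetical toy process (not in the LBMC/nextflow template):
// it upper-cases each value it receives.
process shout {
  input:
    val word

  output:
    // `emit: shouted` names this output channel
    val upper, emit: shouted

  exec:
    upper = word.toUpperCase()
}

workflow {
  // Feed a small value channel to the process
  shout(channel.of("fastp", "kallisto"))

  // Named access: <process_name>.out.<channel_name>
  shout.out.shouted.view()
}
```

Saved under a name of your choice (for example a hypothetical `src/toy_emit.nf`), it can be run like the other scripts with `./nextflow src/toy_emit.nf`. The same `<process_name>.out.<channel_name>` pattern is what lets the output of one module be passed as the input of the next one in `src/RNASeq.nf`.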