    Building the pipeline

    The goal of this guide is to keep the major steps from Laurent's documentation and to adapt them to our purposes.

    Initialize your own project

    You are going to build a pipeline for yourself or your team, so the first step is to create your own project.

    Forking

    Instead of reinventing the wheel, you can use LBMC/nextflow as a template. To do so, go to the LBMC/nextflow repository and click on the fork button (you need to log in).


    In git, the action of forking means that you are going to make your own private copy of a repository. This repository will keep a link with the original LBMC/nextflow project, from which you will be able to pull updates.

    Project organization

    This project (and yours) follows the LBMC guide of good practices.

    You are now on the main page of your fork of LBMC/nextflow. You can explore this project; all the code in it is under the CeCILL licence (see the LICENCE file).

    The README.md file contains instructions to run your pipeline and test its installation.

    The CONTRIBUTING.md file contains guidelines if you want to contribute to the LBMC/nextflow.

    The data folder will be the place where you store the raw data for your analysis. The results folder will be the place where you store the results of your analysis.

    The content of the data and results folders should never be committed to git.

    The doc folder contains the documentation and this guide.

    And most interestingly for you, the src folder contains the code to wrap tools. It contains one visible subdirectory, nf_modules, some pipeline examples, and other hidden folders and files.

    Nextflow pipeline

    A pipeline is a succession of processes. Each process has data input(s) and optional data output(s). Data flows are modeled as channels.
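
    As a minimal sketch (the process and channel names here are illustrative, not part of the template), a process and a channel look like this:

    process say_hello {
      input:
        val name
      output:
        stdout
      script:
        """
        echo "Hello ${name}"
        """
    }

    workflow {
      names = channel.of( "world", "nextflow" )  // a channel emitting two items
      say_hello(names)                           // one task per item
    }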

    Build your own RNASeq pipeline

    In this section you are going to build your own pipeline for RNASeq analysis from the code available in the src/nf_modules folder.

    Open Atom and create a src/RNASeq.nf file.

    The first line that you are going to add is:

    nextflow.enable.dsl=2

    fastp

    The first step of the pipeline is to remove any Illumina adaptors left in your read files and to trim your reads by quality.

    The LBMC/nextflow template provides you with many tools, for which you can find a predefined process block. You can find a list of these tools in the src/nf_modules folder. You can also ask for a new tool by creating an issue for it in the LBMC/nextflow project.

    You are going to include src/nf_modules/fastp/main.nf in your src/RNASeq.nf file:

    include { fastp } from "./nf_modules/fastp/main.nf"

    The path ./nf_modules/fastp/main.nf is relative to the src/RNASeq.nf file, which is why you don’t include the src/ part of the path.

    With this line you can call the fastp block in your future workflow without having to write it! If you check the content of the file src/nf_modules/fastp/main.nf, you can see that by including fastp, you are including a sub-workflow (we will come back to this object later). Sub-workflows can be used like processes.
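
    We do not reproduce the content of src/nf_modules/fastp/main.nf here, but a named sub-workflow generally has the following shape (a sketch with illustrative names, not the actual fastp module):

    workflow fastp {
      take:
        fastq  // input channel(s) declared with take:
      main:
        fastp_default(fastq)  // call one or more processes on the input
      emit:
        fastq = fastp_default.out.fastq  // outputs exposed to the caller
    }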

    This sub-workflow takes a fastq channel. You need to make one:

    channel
      .fromFilePairs( "data/tiny_dataset/fastq/*_R{1,2}.fastq", size: -1)
      .set { fastq_files }

    The .fromFilePairs() function creates a channel of pairs of fastq files. Therefore, the items emitted by the fastq_files channel are going to be pairs of fastq files for paired-end data.

    The option size: -1 allows for arbitrary numbers of associated files. Therefore, you can use the same channel creation for single-end data.
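
    You can check what the channel emits by adding a .view() call. With paired-end data each item has the shape [file_id, [R1, R2]] (the file names below are illustrative):

    fastq_files.view()
    // paired-end: [tiny, [data/tiny_dataset/fastq/tiny_R1.fastq, data/tiny_dataset/fastq/tiny_R2.fastq]]
    // single-end: [tiny, [data/tiny_dataset/fastq/tiny_R1.fastq]]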

    You can now add the workflow definition, passing the fastq_files channel to fastp, to your src/RNASeq.nf file.

    workflow {
      fastp(fastq_files)
    }

    You can commit your src/RNASeq.nf file, pull your modifications locally, and run your pipeline with the command below.
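
    For example, assuming the nextflow launcher has been installed at the root of the repository (see src/install_nextflow.sh in the PSMN section below):

    ./nextflow run src/RNASeq.nf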

    Pipeline -- arguments

    We have defined the fastq files path within our src/RNASeq.nf file. But what if we want to share our pipeline with someone who wants to analyze fastq files other than the tiny_dataset? You can define a variable instead of fixing the path.

    params.fastq = "data/fastq/*_{1,2}.fastq"
    channel
      .fromFilePairs( params.fastq, size: -1)
      .set { fastq_files }

    Here you declare a variable using params: the name of the variable is fastq and it contains the path of the fastq files to look for. The advantage of using params.fastq is that the option --fastq is now a parameter of your pipeline.

    Thus, you can call your pipeline with the --fastq option.
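
    For example, to run on the tiny_dataset path used above:

    ./nextflow run src/RNASeq.nf --fastq "data/tiny_dataset/fastq/*_R{1,2}.fastq"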

    You can commit your src/RNASeq.nf file and pull your modifications locally.

    You can also add the following line:

    log.info "fastq files: ${params.fastq}"

    This line simply displays the value of the variable.

    Kallisto

    Kallisto runs in two steps: the indexing of the reference and the quantification of the transcripts on this index.

    You can include two processes with the following syntax:

    include { index_fasta; mapping_fastq } from './nf_modules/kallisto/main.nf'

    The index_fasta process needs to take as input a fasta file corresponding to the sequences of the transcripts. In our case it is
    c_elegans.PRJNA13758.WS278.all_transcripts.fa, which contains transcript and pseudogene sequences.

    You need to be able to input a fasta_files channel. For this you first need to declare a new fasta file variable using params.fasta; the corresponding option is then --fasta.

    log.info "fasta file : ${params.fasta}"  # to display the value of the variable
    
    channel
      .fromPath( params.fasta )
      .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
      .map { it -> [it.simpleName, it]}
      .set { fasta_file }

    We introduce two new operators:

    • .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" } to throw an error if the path of the file is not right
    • .map { it -> [it.simpleName, it]} to transform our channel to a format compatible with the CONTRIBUTING rules. Items in the channel have the following shape [file_id, [file]], like the ones emitted by the .fromFilePairs(..., size: -1) function (see the sketch after this list).
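
    For reference, .simpleName strips the directory and everything after the first dot of the file name; a hypothetical check with .view() would print something like:

    channel
      .fromPath( "data/c_elegans.PRJNA13758.WS278.all_transcripts.fa" )
      .map { it -> [it.simpleName, it] }
      .view()
    // [c_elegans, /path/to/data/c_elegans.PRJNA13758.WS278.all_transcripts.fa]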

    You can add the index_fasta step to your workflow

    workflow {
      fastp(fastq_files)
      index_fasta(fasta_file)  
    }

    Your mapping_fastq process needs to take as input the output of your index_fasta process and of the fastp process, of shapes [index_id, [index_file]] and [fastq_id, [fastq_r1_file, fastq_r2_file]] respectively.

    The output of a process is accessible through <process_name>.out. In the cases where we have an emit: <channel_name>, we can access the corresponding channel with <process_name>.out.<channel_name>.
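
    For illustration, here is a hypothetical process with a named output (not part of the template) and how to reach it:

    process count_reads {
      input:
        tuple val(file_id), path(fastq)
      output:
        tuple val(file_id), path("${file_id}.txt"), emit: counts
      script:
        """
        wc -l ${fastq} > ${file_id}.txt
        """
    }

    workflow {
      count_reads(fastq_files)
      count_reads.out.counts.view()  // the channel named by emit:
    }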

    We can add the mapping_fastq step to our workflow

    workflow {
      fastp(fastq_files)
      index_fasta(fasta_file)
      mapping_fastq(index_fasta.out.index.collect(), fastp.out.fastq)  //.collect to reuse the same index for each fastq file  
    }
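
    The .collect() operator gathers every item of a channel into a single-item value channel, which can be consumed by every task; this is what lets the same index be matched against each fastq pair. A minimal illustration:

    channel.of( 1, 2, 3 ).collect().view()  // emits a single item: [1, 2, 3]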

    Commit your work

    For single-end RNAseq, Kallisto needs a mean fragment size (-l) together with a standard deviation value (-s). These parameters are available from the library preparation analysis and have to be added to the kallisto parameters:

    include { index_fasta; mapping_fastq } from './nf_modules/kallisto/main.nf' addParams( mapping_fastq: " -l mean_frag_size -s sd_value ")

    Bowtie2

    Bowtie2 runs in two steps: the indexing of the reference genome and the mapping of the sequences on this index. It returns bam files that can be viewed with the IGV browser.

    You can include two processes with the following syntax:

    include { index_fasta; mapping_fastq } from './nf_modules/bowtie2/main.nf' 

    The index_fasta process needs to take as input a fasta file corresponding to the reference genome. In our case it is:
    c_elegans.PRJNA13758.WS278.genomic.fa

    You need to be able to input a fasta_files channel. As before, declare a new fasta file variable using params.fasta; the corresponding option is then --fasta.

    log.info "fasta file : ${params.fasta}"  # to display the value of the variable
    
    channel
      .fromPath( params.fasta )
      .ifEmpty { error "Cannot find any fasta files matching: ${params.fasta}" }
      .map { it -> [it.simpleName, it]}
      .set { fasta_file }

    You can add the index_fasta step to your workflow

    workflow {
      fastp(fastq_files)
      index_fasta(fasta_file)  
    }

    As for Kallisto, we also add the mapping_fastq step to our workflow, which takes index_fasta.out as input:

    workflow {
      fastp(fastq_files)
      index_fasta(fasta_file)
      mapping_fastq(index_fasta.out.index.collect(), fastp.out.fastq) 
    }

    Commit your work

    Returning results

    By default, none of the processes defined in src/nf_modules use the publishDir directive. You can specify their publishDir directory by setting:

    params.<process_name>_out = "path"

    Where "path" will describe a path within the results folder

    Therefore you can either:

    • call your pipeline with the parameter --mapping_fastq_out "quantification/" for Kallisto or --mapping_fastq_out "alignment/" for Bowtie2
    • add the following lines to your src/RNASeq.nf file to publish the output of the fastp and mapping_fastq processes:

    include { fastp } from './nf_modules/fastp/main.nf' addParams(fastp_out: "fastQC/")
    include { index_fasta; mapping_fastq } from './nf_modules/kallisto/main.nf' addParams(mapping_fastq: " -l mean_frag_size -s sd_value ", mapping_fastq_out: "quantification/")

    or, for Bowtie2:

    include { index_fasta; mapping_fastq } from './nf_modules/bowtie2/main.nf' addParams(mapping_fastq_out: "alignment/")

    Commit your work

    Combining Kallisto and Bowtie2 in the same pipeline

    Since Kallisto and Bowtie2 take the same kind of variables as input, you first have to rename the input fasta variables, for example:

      log.info "fasta file : ${params.cds_fasta}" for kallisto which uses c_elegans.PRJNA13758.WS278.all_transcripts.fa  
      log.info "fasta file : ${params.genomic_fasta}" for bowtie2 which uses c_elegans.PRJNA13758.WS278.genomic.fa

    The corresponding options will be --cds_fasta and --genomic_fasta.

    Then you need to add a channel for each of these variables:

    channel
      .fromPath( params.cds_fasta )
      .ifEmpty { error "Cannot find any fasta files matching: ${params.cds_fasta}" }
      .map { it -> [it.simpleName, it]}
      .set { cds_fasta_file }
    
    channel
      .fromPath( params.genomic_fasta )
      .ifEmpty { error "Cannot find any fasta files matching: ${params.genomic_fasta}" }
      .map { it -> [it.simpleName, it]}
      .set { genomic_fasta_file }

    Finally, since the kallisto and bowtie2 modules define processes with the same names, you have to create aliases for both index_fasta and mapping_fastq:

    include {
      index_fasta as kallisto_index_fasta; 
      mapping_fastq as kallisto_mapping_fastq
     } from './nf_modules/kallisto/main.nf' addParams( mapping_fastq_out: "quantif/", mapping_fastq: " -l 459.8 -s 198.5 ")
    
    include {
      index_fasta as bowtie_index_fasta;
      mapping_fastq as bowtie_mapping_fastq
    } from './nf_modules/bowtie2/main.nf' addParams(mapping_fastq_out: "align/")

    You can add the index_fasta and mapping_fastq steps to your workflow

    workflow {
      fastp(fastq_files)
      kallisto_index_fasta(cds_fasta_file)
      kallisto_mapping_fastq(kallisto_index_fasta.out.index.collect(), fastp.out.fastq)  // .collect to reuse the same index for each fastq file
      bowtie_index_fasta(genomic_fasta_file)
      bowtie_mapping_fastq(bowtie_index_fasta.out.index.collect(), fastp.out.fastq)
    }

    Commit your work: your pipeline is ready.

    Run your RNASeq pipeline on the PSMN

    First you need to connect to the PSMN:

    ssh <login>@allo-psmn

    Then, once connected to allo-psmn, you can connect to cl6242comp2:

    ssh <login>@cl6242comp2

    Set your environment

    Create and go to your scratch folder:

    mkdir -p /scratch/Bio/<login>
    cd /scratch/Bio/<login>

    Then you need to clone your pipeline and get the data:

    git clone https://gitbio.ens-lyon.fr/<usr_name>/nextflow.git
    cd nextflow/data
    git clone https://gitbio.ens-lyon.fr/LBMC/hub/tiny_dataset.git
    cd ..

    Run nextflow

    It is better to write the command in a .sh script to launch nextflow. Here is the RNAseq_script.sh for the combined pipeline:

    #!/bin/bash
    ./nextflow run src/RNAseq_cec.nf \
            -profile psmn \
            --fastq "data/*.fastq" \
            --cds_fasta "data/c_elegans.PRJNA13758.WS278.all_transcripts.fa" \
            --genomic_fasta "data/c_elegans.PRJNA13758.WS278.genomic.fa"
    

    Don't forget to make the script executable using: chmod 744 src/RNAseq_script.sh

    You also need the nextflow launcher at the root of the repository; you can install it with:

    src/install_nextflow.sh

    Then you can open a detachable terminal session using screen:

      screen -S rnaseq # to create the rnaseq screen 
      src/RNAseq_script.sh # to launch the script

    Hold CTRL+A and then press D to detach the screen.
    Use:

      screen -r rnaseq #to reattach the screen

    You just ran your pipeline on the PSMN!