Introduction
This project
MARS-seq pipeline
This pipeline generates a counts matrix from fastq files produced by Illumina sequencing with the adapted MARS-seq scRNA-seq protocol.
Software:
- nextflow (v19.10)
- fastqc (v0.11.5) and MultiQC (v1.9) for quality control analysis
- cutadapt (v2.1) for trimming remaining adapters, plate demultiplexing, mRNA filtering, cell demultiplexing and quality trimming
- umi_tools (v1.0.0) for cell barcode and UMI sequence detection, cell barcode whitelist generation, read sequence extraction, and read counting
- bowtie2 (v2.3.4.1) for indexing fasta files and mapping reads
- samtools (v1.7) for extracting and indexing bam files
- subread (v1.6.4)
- R (v4) to generate a histogram of cell barcode frequencies, convert transcript names to gene names and merge all cells into a single cells x genes matrix
- Python () to calculate cell barcode frequencies and to handle the whitelist file
Install nextflow
Run install_nextflow.sh
Files
The pipeline takes as input:
- Mars_seq.nf, the nextflow pipeline
- Mars_seq.config, the nextflow configuration file
- fastq files R1 and R2. To specify both the read 1 and read 2 files, put "1" and "2" between braces, separated by a comma ("{1,2}"). Read files can be gzipped.
- a tag.fa file containing the plate barcodes in the following format:
>Plate1
^ATGC
>Plate2
^CATG
...
- a fasta reference transcriptome
- a GTF file matching the transcriptome
- the expected whitelist of cell barcodes in txt format
- a gene map file used for transcripts to genes conversion after mapping
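To illustrate the tag.fa layout listed above, here is a minimal Python sketch that reads such a file into a plate-to-barcode mapping. The function name parse_tag_fasta is hypothetical and is not part of the pipeline. The leading "^" is treated here as an anchoring marker in the style of cutadapt's anchored 5' adapters, which is an assumption about how the file is consumed.

```python
import io


def parse_tag_fasta(handle):
    """Return {plate_name: barcode} from a tag.fa-style file handle.

    Hypothetical helper for illustration only; not part of the pipeline.
    """
    barcodes = {}
    name = None
    for raw in handle:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]  # header line: plate name
        elif name is not None:
            # Assumption: a leading "^" only marks anchoring, so drop it
            # to keep the bare barcode sequence.
            barcodes[name] = line.lstrip("^").upper()
            name = None
    return barcodes


# In-memory example mirroring the format shown above.
example = io.StringIO(">Plate1\n^ATGC\n>Plate2\n^CATG\n")
plates = parse_tag_fasta(example)
print(plates)
```

With a real file, the same function could be called as parse_tag_fasta(open("tag.fa")).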
Results
The pipeline outputs a cells x genes counts matrix and QC files.
Generating metadata files
Software
Python scripts are used to generate metadata files from the scRNASeq data.
- get_reads_nb.py generates a QC matrix with, for each cell, the number of reads, the number of mapped reads and the percentage of mapped reads.
- mapping_ratio.py generates a QC matrix with, for each cell, the number of reads mapped to the transcriptome, the number of reads mapped to ERCC, the ratio of the two and the percentage of reads mapped to ERCC.
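The per-cell metrics described above can be sketched as follows. This is not the code of get_reads_nb.py or mapping_ratio.py: the function names, the direction of the ratio (ERCC over transcriptome) and the denominator of the ERCC percentage (all mapped reads) are assumptions made for illustration.

```python
def mapping_stats(total_reads, mapped_reads):
    """Read count, mapped read count and percentage mapped for one cell."""
    percent = 100.0 * mapped_reads / total_reads if total_reads else 0.0
    return {
        "reads": total_reads,
        "mapped": mapped_reads,
        "percent_mapped": percent,
    }


def ercc_stats(transcriptome_reads, ercc_reads):
    """ERCC-related metrics for one cell.

    Assumptions: ratio = ERCC / transcriptome, and the ERCC percentage
    is taken over all mapped reads (transcriptome + ERCC).
    """
    mapped = transcriptome_reads + ercc_reads
    ratio = ercc_reads / transcriptome_reads if transcriptome_reads else None
    percent_ercc = 100.0 * ercc_reads / mapped if mapped else 0.0
    return {
        "transcriptome": transcriptome_reads,
        "ercc": ercc_reads,
        "ratio": ratio,
        "percent_ercc": percent_ercc,
    }


# Made-up numbers for a single hypothetical cell.
cell = mapping_stats(total_reads=1000, mapped_reads=800)
spike = ercc_stats(transcriptome_reads=760, ercc_reads=40)
print(cell["percent_mapped"], spike["percent_ercc"])
```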
Files
- *_mapping files from the conrol_qual directory, to use with get_reads_nb.py
- *_geneassigned files from the plateX directory, to use with mapping_ratio.py
scRNAseq data analysis
R scripts
The R scripts must be run in an R console.
- QC_cleaning.R: takes the counts matrix and the QC matrix as input. This script filters cells based on their number of reads, percentage of mapped reads, percentage of reads mapped to ERCC, number of detected genes and number of counts per cell. It also filters genes.
- Data_normalization.R: takes the QC-filtered counts matrix as input. This script normalizes the data with SCTransform from the Seurat package.
- Analysis_UMAP.R: takes the normalized and log1p-transformed matrix as input. This script performs dimensionality reduction and projection with UMAP, to compare the biological conditions all together and pairwise.
- DE_analysis.R: takes the normalized matrix as input. This script performs pairwise differential expression analysis between biological conditions using the Seurat package.
- Gene_distribution_analysis.R: takes the normalized and log1p-transformed matrix as input. This script plots histograms of the expression values of specific genes for each condition, to compare expression patterns.
- Gini_index_boot_strap.R: takes the normalized and log1p-transformed matrix as input. This script performs 100 bootstraps of the Wasserstein distance computation for each gene between pairs of conditions and computes the Gini index for each comparison. Finally, it generates a boxplot of the Gini index values for each pairwise comparison.
- 3genes comparison_analysis.R: takes the normalized and log1p-transformed matrix as input. This script generates boxplots of the expression of 3 genes in each biological condition.
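The bootstrap idea behind Gini_index_boot_strap.R can be sketched in Python as below. This is an illustration under stated assumptions, not a port of the R script: all function names are hypothetical, cells are resampled with replacement to a common size, the Wasserstein distance uses the equal-sample-size shortcut (mean absolute difference of sorted values), and the Gini index is computed over the per-gene distances of each bootstrap.

```python
import random


def wasserstein_1d(x, y):
    """1-D Wasserstein-1 distance between two equally sized samples."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equally sized samples")
    return sum(abs(a - b) for a, b in zip(sorted(x), sorted(y))) / len(x)


def gini(values):
    """Gini index of non-negative values (0 means all values are equal)."""
    v = sorted(values)
    n = len(v)
    total = sum(v)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * val for i, val in enumerate(v))
    return (2.0 * weighted) / (n * total) - (n + 1.0) / n


def bootstrap_gini(cond_a, cond_b, n_boot=100, seed=0):
    """Return one Gini index per bootstrap.

    cond_a / cond_b: {gene: [expression value per cell]}, with every gene
    of a condition covering the same cells in the same order (assumption).
    """
    rng = random.Random(seed)
    genes = sorted(set(cond_a) & set(cond_b))
    n_a = len(cond_a[genes[0]])
    n_b = len(cond_b[genes[0]])
    size = min(n_a, n_b)
    results = []
    for _ in range(n_boot):
        # Resample cells (not genes) with replacement in each condition.
        idx_a = [rng.randrange(n_a) for _ in range(size)]
        idx_b = [rng.randrange(n_b) for _ in range(size)]
        distances = [
            wasserstein_1d([cond_a[g][i] for i in idx_a],
                           [cond_b[g][i] for i in idx_b])
            for g in genes
        ]
        results.append(gini(distances))
    return results


# Made-up expression values for two hypothetical conditions.
cond_a = {"geneA": [0.0, 1.0, 2.0, 3.0], "geneB": [1.0, 1.0, 1.0, 1.0]}
cond_b = {"geneA": [2.0, 3.0, 4.0, 5.0], "geneB": [1.0, 1.0, 2.0, 1.0]}
ginis = bootstrap_gini(cond_a, cond_b)
print(len(ginis))
```

The resulting list holds one Gini index per bootstrap, which is the kind of value a boxplot per pairwise comparison would summarize.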
scRT-qPCR data analysis
Sparse PLS analysis
The R script 3328_2.R takes two files as input:
- an expression matrix
- a table to define classes