Commit 1cc0a18c authored by Nicolas Servant, committed by GitHub

Merge pull request #155 from jzohren/master

Fixed some spelling mistakes in the Output documentation
parents b2a52352 fa42adcd
@@ -23,7 +23,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [MultiQC](#multiqc) - aggregate report and quality controls, describing
results of the whole pipeline
- [Export](#exprot) - additionnal export for compatibility with downstream
analysis tool and visualization
analysis tool and visualisation
## From raw data to valid pairs
@@ -36,7 +36,7 @@ For details about the workflow, see
#### Reads alignment
Using Hi-C data, each reads mate has to be independantly aligned on the
Using Hi-C data, each reads mate has to be independently aligned on the
reference genome.
The current workflow implements a two steps mapping strategy. First, the reads
are aligned using an end-to-end aligner.
@@ -63,7 +63,7 @@ are available ;
- `*.mapstat` - mapping statistics per read mate
Usually, a high fraction of reads is expected to be aligned on the genome
(80-90%). Among them, we usually observed a few percent (around 10%) of step 2
(80-90%). Among them, we usually observe a few percent (around 10%) of step 2
aligned reads. Those reads are chimeric fragments for which we detect a
ligation junction. An abnormal level of chimeric reads can reflect a ligation
issue during the library preparation.
@@ -142,9 +142,9 @@ removed (see `--keep_dups` to disable duplicates filtering).
Additional quality controls such as fragment size distribution can be extracted
from the list of valid interaction products.
We usually expect to see a distribution centered around 300 pb which correspond
We usually expect to see a distribution centered around 300 bp which corresponds
to the paired-end insert size commonly used.
The fraction of dplicates is also presented. A high level of duplication
The fraction of duplicates is also presented. A high level of duplication
indicates a poor molecular complexity and a potential PCR bias.
Finally, an important metric is to look at the fraction of intra and
inter-chromosomal interactions, as well as long range (>20kb) versus short
@@ -176,15 +176,15 @@ All results are available in `results/hicpro/stats`.
#### Contact maps
Intra et inter-chromosomal contact maps are build for all specified resolutions.
The genome is splitted into bins of equal size. Each valid interaction is
Intra and inter-chromosomal contact maps are built for all specified resolutions.
The genome is split into bins of equal size. Each valid interaction is
associated with the genomic bins to generate the raw maps.
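
The binning step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the HiC-Pro or `cooler` implementation: the `(chrom1, pos1, chrom2, pos2)` tuple layout and the names `bin_index` and `build_raw_map` are hypothetical, and positions are assumed to be 0-based.

```python
from collections import Counter


def bin_index(position, bin_size):
    """Return the 0-based index of the equal-sized bin containing a 0-based position."""
    return position // bin_size


def build_raw_map(valid_pairs, bin_size):
    """Count valid interactions per pair of genomic bins.

    valid_pairs: iterable of (chrom1, pos1, chrom2, pos2) tuples (hypothetical layout).
    Each key is sorted so that a contact and its mirror image fall in the same cell.
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in valid_pairs:
        a = (chrom1, bin_index(pos1, bin_size))
        b = (chrom2, bin_index(pos2, bin_size))
        counts[tuple(sorted((a, b)))] += 1
    return counts


# Toy input: two intra-chromosomal contacts hitting the same pair of bins,
# and one inter-chromosomal contact.
pairs = [
    ("chr1", 1_200, "chr1", 45_000),
    ("chr1", 44_000, "chr1", 3_000),
    ("chr1", 12_000, "chr2", 7_500),
]
raw_map = build_raw_map(pairs, bin_size=10_000)
```

Storing only non-empty cells in a sparse mapping mirrors the fact that most bin pairs have no contact at fine resolutions.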
In addition, Hi-C data can contain several sources of biases which has to be
corrected.
The HiC-Pro workflow uses the [ìced](https://github.com/hiclib/iced) and
[Varoquaux and Servant, 2018](http://joss.theoj.org/papers/10.21105/joss.01286)
python package which proposes a fast implementation of the original ICE
normalization algorithm (Imakaev et al. 2012), making the assumption of equal
normalisation algorithm (Imakaev et al. 2012), making the assumption of equal
visibility of each fragment.
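
To make the "equal visibility" assumption concrete, here is a rough NumPy sketch of iterative correction on a small dense symmetric matrix. It is not the `iced` implementation (which works on sparse matrices and applies additional filtering); the function name `ice_balance` and its parameters are assumptions made for this example.

```python
import numpy as np


def ice_balance(matrix, max_iter=200, tol=1e-6):
    """Iteratively rescale a symmetric contact matrix until all non-empty
    rows have (approximately) the same total coverage.

    Returns the balanced matrix and the accumulated per-bin bias vector,
    such that balanced * outer(bias, bias) recovers the input.
    """
    m = np.asarray(matrix, dtype=float).copy()
    bias = np.ones(m.shape[0])
    for _ in range(max_iter):
        coverage = m.sum(axis=1)
        mask = coverage > 0  # empty bins are left untouched
        if not mask.any():
            break
        scale = np.ones_like(coverage)
        scale[mask] = coverage[mask] / coverage[mask].mean()
        m /= np.outer(scale, scale)
        bias *= scale
        if np.abs(scale[mask] - 1).max() < tol:
            break
    return m, bias


raw = np.array([
    [10.0, 4.0, 2.0],
    [4.0, 20.0, 6.0],
    [2.0, 6.0, 8.0],
])
balanced, bias = ice_balance(raw)
```

After convergence, every non-empty bin has the same summed contact count, which is what "equal visibility" means in practice.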
Importantly, the HiC-Pro maps are generated only if the `--hicpro_maps` option
@@ -221,16 +221,16 @@ downstream analysis.
## Hi-C contact maps
Contact maps are usually stored as simple txt (`HiC-Pro`), .hic (`Juicer/Juicebox`) and .(m)cool (`cooler/Higlass`) formats.
The .cool and .hic format are compressed and indexed and usually much more efficient that the txt format.
In the current workflow, we propose to use the `cooler` format as a standard to build the raw and normalized maps
after valid pairs detection as it is used by several downstream analysis and visualization tools.
The .cool and .hic format are compressed and indexed and usually much more efficient than the txt format.
In the current workflow, we propose to use the `cooler` format as a standard to build the raw and normalised maps
after valid pairs detection as it is used by several downstream analysis and visualisation tools.
Raw contact maps are therefore in **`results/contact_maps/raw`** which contains the different maps in `txt` and `cool` formats, at various resolutions.
Normalized contact maps are stored in **`results/contact_maps/norm`** which contains the different maps in `txt`, `cool`, and `mcool` format.
Normalised contact maps are stored in **`results/contact_maps/norm`** which contains the different maps in `txt`, `cool`, and `mcool` format.
The bin coordinates used for all resolutions are available in **`results/contact_maps/bins`**.
Note that `txt` contact maps generated with `cooler` are identical to those generated by `HiC-Pro`.
However, differences can be observed on the normalized contact maps as the balancing algorithm is not exactly the same.
However, differences can be observed on the normalised contact maps as the balancing algorithm is not exactly the same.
## Downstream analysis
@@ -246,23 +246,23 @@ The results generated with the `HiCExplorer hicPlotDistVsCounts` tool (plot and
### Compartments calling
Compartments calling is one of the most common analysis which aims at detecting A (open, active) / B (close, inactive) compartments.
In the first studies on the subject, the compartments were called at high/medium resolution (1000000 to 250000) which is enough to call A/B comparments.
In the first studies on the subject, the compartments were called at high/medium resolution (1000000 to 250000) which is enough to call A/B compartments.
Analysis at higher resolution has shown that these two main types of compartments can be further divided into compartments subtypes.
Although different methods have been proposed for compartment calling, the standard remains the eigen vector decomposition from the normalized correlation maps.
Although different methods have been proposed for compartment calling, the standard remains the eigen vector decomposition from the normalised correlation maps.
Here, we use the implementation available in the [`cooltools`](https://cooltools.readthedocs.io/en/lates) package.
Results are available in **`results/compartments/`** folder and includes :
Results are available in **`results/compartments/`** folder and include :
- `*cis.vecs.tsv`: eigenvectors decomposition along the genome
- `*cis.lam.txt`: eigenvalues associated with the eigenvectors
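
As an illustration of the eigenvector decomposition mentioned above, the following NumPy sketch computes an observed/expected matrix, its row-wise correlation matrix, and the sorted eigen-decomposition. It is a simplified sketch, not the `cooltools` code path; the function name `compartment_eigenvectors` is hypothetical, and the toy input is random data used only to show the calls.

```python
import numpy as np


def compartment_eigenvectors(balanced):
    """Return eigenvalues (descending) and matching eigenvectors of the
    correlation matrix of a distance-normalised (observed/expected) map."""
    m = np.asarray(balanced, dtype=float)
    n = m.shape[0]
    oe = np.zeros_like(m)
    # Divide each diagonal by its mean to remove the distance decay.
    for d in range(n):
        diag_mean = np.diagonal(m, d).mean()
        if diag_mean > 0:
            idx = np.arange(n - d)
            oe[idx, idx + d] = m[idx, idx + d] / diag_mean
            oe[idx + d, idx] = oe[idx, idx + d]
    corr = np.corrcoef(oe)  # Pearson correlation between rows
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigh: symmetric input
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]


# Toy symmetric matrix (random values, for demonstration only).
rng = np.random.default_rng(0)
a = rng.random((6, 6))
toy = a + a.T
eigvals, eigvecs = compartment_eigenvectors(toy)
first_component = eigvecs[:, 0]
```

The sign of an eigenvector is arbitrary, so in practice the first component is typically oriented with an external track (for example GC content or gene density) before labelling bins as A or B.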
### TADs calling
TADs has been described as functional units of the genome.
While contacts between genes and regulatority elements can occur within a single TADs, contacts between TADs are much less frequent, mainly due to the presence of insulation protein (such as CTCF) at their boundaries. Looking at Hi-C maps, TADs look like triangles around the diagonal. According to the contact map resolutions, TADs appear as hierarchical structures with a median size around 1Mb (in mammals), as well as smaller structures usually called sub-TADs of smaller size.
TADs have been described as functional units of the genome.
While contacts between genes and regulatority elements can occur within a single TAD, contacts between TADs are much less frequent, mainly due to the presence of an insulation protein (such as CTCF) at their boundaries. Looking at Hi-C maps, TADs look like triangles around the diagonal. According to the contact map resolutions, TADs appear as hierarchical structures with a median size around 1Mb (in mammals), as well as smaller structures usually called sub-TADs of smaller size.
TADs calling remains a challenging task, and even if many methods have been proposed in the last decade, little overlap have been found between their results.
TADs calling remains a challenging task, and even if many methods have been proposed in the last decade, little overlap has been found between their results.
Currently, the pipeline proposes two approaches :
@@ -283,7 +283,7 @@ Usually, TADs results are presented as simple BED files, or bigWig files, with t
</details>
[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
[MultiQC](http://multiqc.info) is a visualisation tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.