Commit 9283bddc authored by Bertache

Update file README.md

parent 50e7be12
@@ -8,44 +8,74 @@ SPDX-License-Identifier: CC0-1.0
This is a reuse of [Microsplit](https://pypi.org/project/microsplit/) structure. Hi-Classifier is a command-line tool designed to identify, classify, and manage paired reads in BAM files derived from Hi‑C experiments. It follows the logic and structure of the [Parasplit](https://pypi.org/project/parasplit/) tool but is tailored for Micro-C data. The tool reads alignment files (SAM, BAM, or CRAM) using `pysam` and identifies events of soft-clipping or hard-clipping.
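The clipping detection mentioned in this paragraph can be illustrated without `pysam` by parsing a CIGAR string directly. This is only a minimal sketch and not Hi-Classifier's actual code: the helper name `clipping_events` and its return shape are hypothetical. With `pysam`, the same information is available from `AlignedSegment.cigartuples`, where operation code 4 is a soft clip and 5 is a hard clip.

```python
import re

# One CIGAR operation: a length followed by an op code from the SAM spec.
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def clipping_events(cigar):
    """Return (kind, side, length) for each soft/hard clip in a CIGAR string.

    Hypothetical helper for illustration; not part of Hi-Classifier.
    """
    ops = _CIGAR_OP.findall(cigar)
    # Position of the first operation that is not a clip. Clips before it
    # are on the left end of the read, clips after it on the right end.
    first_aligned = next(
        (i for i, (_, op) in enumerate(ops) if op not in "SH"), len(ops)
    )
    events = []
    for i, (length, op) in enumerate(ops):
        if op in "SH":
            kind = "soft" if op == "S" else "hard"
            side = "left" if i < first_aligned else "right"
            events.append((kind, side, int(length)))
    return events
```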
+---
+## Why another Hi-C parser?
+The usual valid/invalid convention ignores more than 30% of reads.
+Some of these categories (dangling, self-circle, uncut sites) fluctuate with digestion and compaction.
+Hi-Classifier is a fast, **site-aware** classifier able to:
+* label every pair (cis/trans-up/down, dangling, self-circle, re-joined, other)
+* count categories **per restriction site**
+* split a BAM into category-specific BAMs for downstream QC
+* stream-process paired BAMs (R1/R2); this could use ≤8 GB RAM
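To make the labelling and per-site counting in this list concrete, here is a simplified sketch loosely based on the default rules listed under "Class logic". It is an illustration under stated assumptions, not the tool's implementation: the function names, the `sites` mapping (chromosome to sorted cut positions), and the choice to key counts on the first read's fragment are all hypothetical. The distance threshold and the up/down subdivision of cis and trans pairs are left out.

```python
from bisect import bisect_right
from collections import Counter


def fragment_index(sites, pos):
    """Index of the restriction fragment containing pos (sites sorted ascending)."""
    return bisect_right(sites, pos)


def classify_pair(chrom1, pos1, strand1, chrom2, pos2, strand2, sites):
    """Label one read pair; `sites` maps chromosome -> sorted cut positions."""
    if chrom1 != chrom2:
        return "trans"
    # Put the leftmost read first so orientation is read left to right.
    if pos2 < pos1:
        pos1, strand1, pos2, strand2 = pos2, strand2, pos1, strand1
    frag1 = fragment_index(sites[chrom1], pos1)
    frag2 = fragment_index(sites[chrom1], pos2)
    convergent = (strand1, strand2) == ("+", "-")  # reads point at each other
    divergent = (strand1, strand2) == ("-", "+")   # reads point away from each other
    if frag1 == frag2:
        if convergent:
            return "dangling"
        if divergent:
            return "self-circle"
        return "other"
    if convergent and frag2 - frag1 == 1:
        return "re-joined"
    # Assumption: the distance threshold and the up/down split are omitted.
    return "cis"


def count_by_fragment(pairs, sites):
    """Count labels keyed by (chromosome, fragment of the first read, label)."""
    counts = Counter()
    for pair in pairs:
        chrom1, pos1 = pair[0], pair[1]
        label = classify_pair(*pair, sites)
        counts[(chrom1, fragment_index(sites[chrom1], pos1), label)] += 1
    return counts
```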
+---
## Features
-- **Parallel Processing**: Microsplit utilizes parallel processing to enhance performance and efficiency.
-- **Error Margin Handling**: Adds a fixed number of base pairs to new fragments to account for potential over-mapping by Bowtie2, ensuring more accurate downstream analysis.
-- **Output Paired Reads**: Outputs both end-to-end aligned pairs and newly generated fragment pairs.
+- **Parallel Processing**: Hi-Classifier utilizes parallel processing to enhance performance and efficiency.
+- **Error Margin Handling**:
+- **Output Paired Reads**:
-## Installation
-Hi-Classifier is available on PyPI and can be installed using pip:
-```bash
-pip install hi-classifier
-```
+## Quick start
+```bash
+hi-classifier \
+  -1 sample_R1.bam -2 sample_R2.bam \
+  -f genome.fa --enzyme dpnII \
+  -o out/prefix --num_threads 8
+```
-## Usage
-Before using Hi-Classifier, you need to perform an initial alignment of reads using mapper to obtain explicit BAM files. Below is an example of how to use Microsplit from the command line:
-```bash
-microsplit --bam_for_file path/to/forward.bam \
-  --bam_rev_file path/to/reverse.bam \
-  --output_forward path/to/output_forward.fastq.gz \
-  --output_reverse path/to/output_reverse.fastq.gz \
-  --num_threads 8
-```
+Outputs
+```
+prefix_counts.tsv       # matrix [site, class]
+prefix_classified.bam   # BAM with CAT tag (optional split per class)
+```
+## CLI
+```text
+--bam_for_file   BAM of Read 1 (required)
+--bam_rev_file   BAM of Read 2 (required)
+--fasta_ref      Reference genome (required)
+--enzyme         dpnII | mbol | arima
+--len_max        discard pairs > L [1500]
+--mapq_score     minimum MAPQ [1]
+--tolerance      ±bp for dangling [1]
+--num_threads    default 6
+```
-### Command-Line Arguments
-- `--bam_for_file`: Path to the forward BAM file.
-- `--bam_rev_file`: Path to the reverse BAM file.
-- `--output_forward`: Path to the output forward FastQ file.
-- `--output_reverse`: Path to the output reverse FastQ file.
-- `--num_threads`: Total number of threads for parallel processing.
+## Performance
+* Less than 4 hours on 20 cores to compute the data from HiConfidence (https://doi.org/10.1093/bib/bbad044)
-## Methodology
-TO BE COMPLETED
+## Class logic (default rules)
+| Class        | Orientation     | Distance | Same frag? | Potential origin            |
+| ------------ | --------------- | -------- | ---------- | --------------------------- |
+| Valid        | any             | > len    | no         | bona fide contacts          |
+| Dangling     | ↘↙ (convergent) | 0        | yes        | failed ligation             |
+| Self-circle  | ↗↖ (divergent)  | ≤ 1 site | yes        | self-circularization        |
+| Re-joined    | ↘↙              | adjacent | no         | immediate re-ligation       |
+| Cis-Up/Down… | mixed           | > len    | no         | four intrachromosomal bins  |
+| Trans-…      | mixed           | N/A      | no         | inter-chromosomal           |
+| Other        | –               | –        | –          | everything else             |
## License
...