Update file README.md

9283bddc · Bertache · 50e7be12 · 9283bddc
Commit 9283bddc authored 2 months ago by Bertache
--- a/README.md
+++ b/README.md
@@ -8,44 +8,74 @@ SPDX-License-Identifier: CC0-1.0
 This is a reuse of [Microsplit](https://pypi.org/project/microsplit/)  structure. Hi-Classifier is a command-line tool designed to identify, classify, and manage paired reads in BAM files derived from Hi‑C experiments. It follows the logic and structure of the [Parasplit](https://pypi.org/project/parasplit/) tool but is tailored for Micro-C data. The tool reads alignment files (SAM, BAM, or CRAM) using `pysam` and identifies events of soft-clipping or hard-clipping. 
+---
+## Why another Hi-C parser?
+The usual valid/invalid convention discards more than 30% of reads.
+Some of the discarded categories (dangling ends, self-circles, uncut sites) fluctuate with digestion and compaction.
+Hi-Classifier is a fast, **site-aware** classifier able to:
+* label every pair (cis/trans-up/down, dangling, self-circle, re-joined, other)  
+* count categories **per restriction site**  
+* split a BAM into category-specific BAMs for downstream QC  
+* stream-process paired BAMs (R1/R2), which should keep memory use at ≤8 GB of RAM
+---
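The "site-aware" counting described above can be illustrated with a minimal sketch. It is not Hi-Classifier's implementation: the helper names (`find_sites`, `nearest_site`, `count_per_site`) are hypothetical, the only motif handled is GATC (the DpnII recognition site), and attributing a read to its nearest site is an assumption made for illustration.

```python
import bisect
import re
from collections import Counter

# DpnII recognizes GATC; other enzymes would need their own motif.
DPNII_MOTIF = "GATC"


def find_sites(sequence, motif=DPNII_MOTIF):
    """Return the sorted 0-based start positions of every motif occurrence."""
    # A lookahead is used so that overlapping occurrences are also reported.
    pattern = f"(?={re.escape(motif)})"
    return [m.start() for m in re.finditer(pattern, sequence.upper())]


def nearest_site(sites, position):
    """Return the index of the site closest to `position` (ties go left)."""
    if not sites:
        raise ValueError("no restriction site on this sequence")
    i = bisect.bisect_left(sites, position)
    if i == 0:
        return 0
    if i == len(sites):
        return len(sites) - 1
    left, right = sites[i - 1], sites[i]
    return i - 1 if position - left <= right - position else i


def count_per_site(sites, labelled_positions):
    """Count (site index, class label) pairs, i.e. a sparse [site, class] matrix."""
    counts = Counter()
    for position, label in labelled_positions:
        counts[(nearest_site(sites, position), label)] += 1
    return counts
```

For example, `find_sites("AAGATCAAAAAAGATCAA")` locates the two GATC sites, and `count_per_site` then tallies labelled read positions against them. A real run would take positions from the alignments and labels from the classifier rather than from a hand-written list.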
 ## Features
- **Parallel Processing**: Microsplit utilizes parallel processing to enhance performance and efficiency.
+- **Parallel Processing**: Hi-Classifier utilizes parallel processing to enhance performance and efficiency.
- **Error Margin Handling**: Adds a fixed number of base pairs to new fragments to account for potential over-mapping by Bowtie2, ensuring more accurate downstream analysis.
+- **Error Margin Handling**:
- **Output Paired Reads**: Outputs both end-to-end aligned pairs and newly generated fragment pairs.
+- **Output Paired Reads**: 
-## Installation
-Hi-Classifier is available on PyPI and can be installed using pip:
+## Quick start
 ```bash
-pip install hi-classifier
+hi-classifier \
+   -1 sample_R1.bam -2 sample_R2.bam \
+   -f genome.fa --enzyme dpnII \
+   -o out/prefix --num_threads 8
 ```
-## Usage
+Outputs
-Before using Hi-Classifier, you need to perform an initial alignment of reads using mapper to obtain explicit BAM files. Below is an example of how to use Microsplit from the command line:
+```
+prefix_counts.tsv      # matrix  [site, class]
+prefix_classified.bam  # BAM with CAT tag (optional split per class)
+```
-```bash
+## CLI 
-microsplit --bam_for_file path/to/forward.bam \
-           --bam_rev_file path/to/reverse.bam \
+```text
-           --output_forward path/to/output_forward.fastq.gz \
+--bam_for_file   BAM of Read 1      (required)
-           --output_reverse path/to/output_reverse.fastq.gz \
+--bam_rev_file   BAM of Read 2      (required)
-           --num_threads 8 
+--fasta_ref      Reference genome   (required)
+--enzyme         dpnII | mbol | arima
+--len_max        discard pairs > L  [1500]
+--mapq_score     minimum MAPQ       [1]
+--tolerance      ±bp for dangling   [1]
+--num_threads    number of threads  [6]
 ```
-### Command-Line Arguments
- `--bam_for_file`: Path to the forward BAM file.
+## Performance
- `--bam_rev_file`: Path to the reverse BAM file.
- `--output_forward`: Path to the output forward FastQ file.
+* Processes the HiConfidence dataset (https://doi.org/10.1093/bib/bbad044) in less than 4 hours on 20 cores
- `--output_reverse`: Path to the output reverse FastQ file.
- `--num_threads`: Total number of threads for parallel processing.
-## Methodology
+## Class logic (default rules)
-TO BE COMPLETED
+| Class        | Orientation     | Distance | Same frag? | Potential origin           |
+| ------------ | --------------- | -------- | ---------- | -------------------------- |
+| Valid        | any             | > len    | no         | bona fide contacts         |
+| Dangling     | ↘↙ (convergent) | 0        | yes        | failed ligation            |
+| Self-circle  | ↗↖ (divergent)  | ≤ 1 site | yes        | self-circularization       |
+| Re-joined    | ↘↙ (convergent) | adjacent | no         | immediate re-ligation      |
+| Cis-Up/Down… | mixed           | > len    | no         | four intrachromosomal bins |
+| Trans-…      | mixed           | N/A      | no         | inter-chromosomal          |
+| Other        | –               | –        | –          | everything else            |
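The table can be read as a small decision procedure. The sketch below is one possible reading, not the tool's actual code: `ReadEnd`, `orientation`, and `classify` are hypothetical names, `len_threshold` is assumed to correspond to `len` in the table (1500 is borrowed from the `--len_max` default purely for illustration), the `--tolerance` margin is ignored, "≤ 1 site" is simplified to "same fragment", and the Up/Down and Trans sub-bins are collapsed into plain `cis` and `trans` because their exact definitions are not given here.

```python
from collections import namedtuple

# One aligned end of a pair: chromosome, 0-based position, strand ("+" or "-"),
# and the index of the restriction fragment it falls in (hypothetical model).
ReadEnd = namedtuple("ReadEnd", "chrom pos strand frag")


def orientation(a, b):
    """Describe how two ends on the same chromosome face each other."""
    first, second = sorted((a, b), key=lambda end: end.pos)
    if first.strand == "+" and second.strand == "-":
        return "convergent"   # ends point toward each other
    if first.strand == "-" and second.strand == "+":
        return "divergent"    # ends point away from each other
    return "same-strand"


def classify(a, b, len_threshold=1500):
    """Assign a pair to a coarse class following the table's default rules."""
    if a.chrom != b.chrom:
        return "trans"
    orient = orientation(a, b)
    if a.frag == b.frag:
        if orient == "convergent":
            return "dangling"
        if orient == "divergent":
            return "self-circle"
        return "other"
    if abs(a.frag - b.frag) == 1 and orient == "convergent":
        return "re-joined"
    if abs(a.pos - b.pos) > len_threshold:
        return "cis"
    return "other"
```

With this reading, `classify(ReadEnd("chr1", 100, "+", 3), ReadEnd("chr1", 250, "-", 3))` falls in the dangling class, while the same two ends with swapped strands fall in the self-circle class.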
 ## License