Commit 6d3284af authored by mcariou's avatar mcariou
Browse files

modify make PSI BLAST

parent 24f5cba3
......@@ -45,6 +45,8 @@ cp src/phyml /home/adminmarie/bin/
## 2. BLAST database construction
### Current status
**For now we use the database already available from legionella CDS**
- get sequences
......@@ -53,8 +55,27 @@ cp src/phyml /home/adminmarie/bin/
- Blast db construction
### Making a database containing only a liste of db
In the doc/1\_reference\_legio\_phylo.pdf, I make a table containing the liste of genomes selected fore reference phylogeny. That is all the genomes from Burstein et al. 2016 and Saini or from Saini et al. 2021 and 5 legionella strains from legiolist.
The list of genomes is in the file doc/tabAss.txt.
The script script/2\_make\_db.sh make a new db with CDS from these genomes.
```
~/2021_legio/phylolegio/script/2_make_db.sh
```
The database to use for phylogeny is here:
```
~/2021_legio/blastdb/phyloref/phyloref
```
## 3. BLAST
### 3.1. Generic version
This script compile blast data bases if necessary, then blast a gene sequence.
Example commande line:
......@@ -71,6 +92,13 @@ USAGE: ./make_blast.sh $1=blast_bd $2=gene_query $3=out_path
*out_blastn* : path to the output repertory
### 3.2. PSI BLAST
Used to recover the reference phylogeny:
```
~/script/3_make_PSIblast.sh ~/2021_legio/blastdb/phyloref/phyloref \
~/2021_legio/genes/78genes ~/2021_legio/out_blastn
```
## 4. BLAST table parsing
......@@ -123,25 +151,19 @@ run prank, trimal and phyml on the fasta output from step 4.
## 6. To do
- Reference phylogeny for a group of reference Legionella and reference genes:
-- https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5050043/
Prendre les 78 gènes de cette publi, ces espèces
+ les espèces de gupta non incluses dans cette liste
https://pubmed.ncbi.nlm.nih.gov/33881638/
+ 5 souches de légionella incluses dans legiolist
- Adapt the pipeline to use more than one gene, make concatenates, and format presence/absence for all genes.
=> make script 3_run_PSIblast.sh
to blast all 78 genes retrieved from uniprot on the specific database designed for phylogeny.
=> parse this result to see what is missing and make concatenate.
- Integration dans Nextflow
=> make the pipeline for the phylogeny.
=> include possible branching
=> beautiful output
......
......@@ -9,6 +9,7 @@ echo "--------------------------------------------------"
### PSMN
#./make_blast.sh /Xnfs/ciridb/shared_data/Databases/legio_cds/blastdb_ncbi/cdslegio ~/2021_legio/genes/lp0952.fasta ~/2021_legio/out_blastn
#./make_blast.sh ~/2021_legio/blastdb/phyloref/phyloref ~/2021_legio/genes/78Lp_uniprot.fasta ~/2021_legio/out_blastn
##################################################################################################################
# variable
......@@ -40,6 +41,8 @@ echo "make db if not available:"
if [[ -s $REP_DB.nal ]] ; then
echo "$REP_DB.nal exists"
elif [[ -s $REP_DB.nhi ]] ; then
echo "$REP_DB.nhi exists"
else
echo "$REP_DB.nal doesn't exist"
origdb=`echo x{00..31}`
......@@ -49,10 +52,13 @@ fi ;
echo "--------------------------------------------------"
echo $REP_DB
echo ${gene}.fasta
blastn -db $REP_DB -query ${gene}.fasta -evalue 10 \
psiblast -db $REP_DB -query ${gene}.fasta -evalue 10 \
-outfmt "7 qseqid qlen sseqid slen length pident evalue nident mismatch gapopen qstart qend qseq bitscrore score sstart send sseq sstrand" \
-out $REP_OUT/${gene_name}.blastn -max_target_seqs 5000
-out $REP_OUT/${gene_name}.blastn -max_target_seqs 5000 \
-num_iterations 0 -out_pssm $REP_OUT/log
echo "program done!"
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment