Commit f90dc9e7 authored by mcariou's avatar mcariou
Browse files

advance scrip PSI

parent 6d3284af
......@@ -60,7 +60,7 @@ cp src/phyml /home/adminmarie/bin/
In the doc/1\_reference\_legio\_phylo.pdf, I make a table containing the liste of genomes selected fore reference phylogeny. That is all the genomes from Burstein et al. 2016 and Saini or from Saini et al. 2021 and 5 legionella strains from legiolist.
The list of genomes is in the file doc/tabAss.txt.
The script script/2\_make\_db.sh make a new db with CDS from these genomes.
The script script/2\_make\_db.sh make a new db with CDS from these genomes. We construct a nucleotide db and a proteic db (for PSIBLAST)
```
~/2021_legio/phylolegio/script/2_make_db.sh
......@@ -68,7 +68,8 @@ The script script/2\_make\_db.sh make a new db with CDS from these genomes.
The database to use for phylogeny is here:
```
~/2021_legio/blastdb/phyloref/phyloref
~/2021_legio/blastdb/phyloref/phyloref_nuc
~/2021_legio/blastdb/phyloref/phyloref_prot
```
......@@ -100,6 +101,9 @@ Used to recover the reference phylogeny:
~/2021_legio/genes/78genes ~/2021_legio/out_blastn
```
This is a proteic blast, thus the nucleotidic sequences are extracted using blastdbcmd (see script 4prime)
## 4. BLAST table parsing
Parser les résultats :
......@@ -134,11 +138,11 @@ USAGE: ./4_parse_blast.sh $1=blast_out $2=modified_blast
This script take as input a blastn output and filter criteria, get species names using efetch and run a R script to filter the blast output, keep a given number of sequences per unique species names. It return a report od the number of hit blast at each steps and a fasta output.
This script take as input a blastn output and filter criteria, get species names using efetch and run a R script to filter the blast output, keep a given number of sequences per unique species names. It returns a report of the number of hit blast at each steps and a fasta output.
*problem: I get a roganism name from efetch, wich is not necessarily a standardize species name: numerous names including strain names/number. I keep max 10 sequences for sequences that have the exact same name, but I keep small variations such as "Legionella pneumophila strain zorg" "Legionella_pneumophila_subsp._pneumophila_LPE509" etc. These are difficult to parse because no common feature. To be discussed with team. Results could be improved using the 2 first word of each name*
*Instead of keeping randomly **maxseqpertax** sequences chosen randomly, it would be better to keep non redondantsequences, for example with [treemmer](https://github.com/fmenardo/Treemmer)*
*Instead of keeping randomly **maxseqpertax** sequences chosen randomly, it would be better to keep non redondant sequences, for example with [treemmer](https://github.com/fmenardo/Treemmer)*
## 5. Align selected sequences and make phylogeny
......
......@@ -22,14 +22,13 @@ module purge
HOME="/home/mcariou/2021_legio/"
OUT=$HOME"blastdb/phyloref/"
CAT=$OUT"/cat_phyloref_cds.fasta"
CATPROT=$OUT"/cat_phyloref_pep.fasta"
Trans="/home/mcariou/2020_Attaiech/prot_db/Transdecoder/"
TAB=$HOME"/phylolegio/doc/tabAss.txt"
mkdir -p $OUT
##################################################################################################################UT
### Read tab genomes and cat cds files.
if [[ -s $CAT ]] ; then
......@@ -54,12 +53,12 @@ fi
### Make Blast db
makeblastdb -dbtype nucl -in $CAT -hash_index -out $OUT/phyloref -parse_seqids
#makeblastdb -dbtype nucl -in $CAT -hash_index -out $OUT/phyloref_nuc -parse_seqids
###
# need to translate
transeq -sequence $CAT -outseq $CATPROT
makeblastdb -dbtype prot -in $CATPROT -hash_index -out $OUT/phyloref_prot -parse_seqids
# fin
......@@ -41,8 +41,8 @@ echo "make db if not available:"
if [[ -s $REP_DB.nal ]] ; then
echo "$REP_DB.nal exists"
elif [[ -s $REP_DB.nhi ]] ; then
echo "$REP_DB.nhi exists"
elif [[ -s $REP_DB.phi ]] ; then
echo "$REP_DB.phi exists"
else
echo "$REP_DB.nal doesn't exist"
origdb=`echo x{00..31}`
......@@ -54,10 +54,11 @@ echo "--------------------------------------------------"
echo $REP_DB
echo ${gene}.fasta
echo $REP_OUT
psiblast -db $REP_DB -query ${gene}.fasta -evalue 10 \
-outfmt "7 qseqid qlen sseqid slen length pident evalue nident mismatch gapopen qstart qend qseq bitscrore score sstart send sseq sstrand" \
-out $REP_OUT/${gene_name}.blastn -max_target_seqs 5000 \
-out $REP_OUT/${gene_name}.psiblast -max_target_seqs 5000 \
-num_iterations 0 -out_pssm $REP_OUT/log
echo "program done!"
......
#!/bin/bash
echo "USAGE: ./4_parse_blast.sh \$1=blast_out_file
\$2=modified_blast \$3=fasta_out
\$4=seuil_evalue \$5=percid \$6=percoverlap
\$7=maxseqpertax"
echo "--------------------------------------------------"
##################################################################################################################
### En local
#./4_parse_blast.sh ~/Documents/CIRI_BIBS_projects/2021_04_Doublet/pipeline/out_blastn/lp0952.blastn \
#~/Documents/CIRI_BIBS_projects/2021_04_Doublet/pipeline/out_blastn/lp0952_sp.blastn \
#~/Documents/CIRI_BIBS_projects/2021_04_Doublet/pipeline/fasta/lp0952_ortho.fasta
#0.0001 0.3 0.5 10
#./4_parse_blast.sh ~/Documents/CIRI_BIBS_projects/2021_04_Doublet/pipeline/out_blastn/lp0952.blastn ~/Documents/CIRI_BIBS_projects/2021_04_Doublet/pipeline/out_blastn/lp0952_sp.blastn ~/Documents/CIRI_BIBS_projects/2021_04_Doublet/pipeline/fasta/lp0952_ortho.fasta 0.0001 0.3 0.5 10
### PSMN
#./4_parse_blast.sh ~/2021_legio/out_blastn/78Lp_uniprot.psiblast ~/2021_legio/phylolegio/doc/tabAss.txt ~/2021_legio/fasta/78Lp 0.0001 0.3 0.5 3
##################################################################################################################
# variable
OUT_BLAST=$1
TAB_SP=$2
BLAST_REP=`echo $1 | awk -F "/" '{$NF=""; print $0}' | sed 's/ /\//g'`
FASTA_REP=$3
mkdir -p $FASTA_REP
EVAL=$4
PERCID=$5
PERCOVERLAP=$6
NPERTAX=$7
#################################################################################################################
# Read tabAss, split blast output.
#for id in `seq $nrow`
#do
#line=`sed -n ${id}p $OUT_BLAST.tmp`
#idncbi=`sed -n ${id}p $OUT_BLAST.tmp | awk '{print $3}' | awk -F[\.] '{print $1}'` #### erreur sur cette ligne, il faut selectionne la ligne $id
#tax=`efetch -db nuccore -id $idncbi -format docsum | xtract -pattern DocumentSummary -element Organism | sed 's/ /_/g'`
#echo $tax
#echo $line " " $tax >> $MOD_BLAST
#done
## This is to get the path to the R script from the sh script
## This assume they are in the same repertory
echo $0
rscript=`ls $0 | sed 's/.sh/.R/g'`
echo $rscript
#Rscript --vanilla $rscript $MOD_BLAST $EVAL $PERCID $PERCOVERLAP $NPERTAX $FASTA> $BLAST_REP/paste_blast.log
# fin
2.6.0+
1
Q5ZXP3_lpg0688
/home/mcariou/2021_legio/blastdb/phyloref/phyloref_prot
query
hits
AE017354.1.p373_1
CR628337.1.p362_1
FQ958210.1.p381_1
CP001828.1.p378_1
CR628336.1.p382_1
CP000675.2.p381_1
JNCF01000046.1.p5_1
LNZB01000060.1.p91_1
LNYR01000034.1.p54_1
LNZC01000001.1.p18_1
LN614827.1.p493_1
LNYN01000042.1.p56_1
LNYW01000067.1.p7_1
LNXT01000048.1.p93_1
LNXX01000004.1.p12_1
LNYE01000009.1.p14_1
LNYV01000015.1.p23_1
LNYS01000001.1.p20_1
CAAAIM010000011.1.p14_1
FN650140.1.p448_1
LNYU01000032.1.p28_1
LN901321.1.p48_1
JNIA01000024.1.p18_1
LNYZ01000029.1.p9_1
CAAAID010000002.1.p45_1
LNYP01000029.1.p77_1
LNXW01000008.1.p28_1
CAAAIA010000014.1.p9_1
JH413817.1.p18_1
LNYY01000021.1.p116_1
CAAAIW010000005.1.p16_1
LNYH01000058.1.p13_1
LNXV01000005.1.p26_1
LNYL01000026.1.p4_1
LNXS01000029.1.p73_1
LNXU01000019.1.p78_1
LNYF01000001.1.p62_1
LNYI01000048.1.p5_1
LNYK01000014.1.p72_1
CP016397.1.p355_1
LNYQ01000013.1.p281_1
LNYX01000032.1.p21_1
LNYJ01000011.1.p312_1
LNZA01000001.1.p292_1
LNYG01000001.1.p42_1
JHYC01000004.1.p26_1
UGNV01000001.1.p343_1
LNYB01000004.1.p11_1
CALJ01000113.1.p1_1
CCVW01000002.1.p114_1
LNYO01000024.1.p100_1
LNYT01000007.1.p118_1
LNYA01000001.1.p35_1
QCXN01000001.1.p67_1
LNXY01000027.1.p96_1
LNKA01000001.1.p167_1
CP021497.1.p53_1
LNYC01000044.1.p19_1
CP013667.1.p163_1
LNYQ01000012.1.p129_1
LNYY01000019.1.p372_1
CCVW01000002.1.p2_1
LN614827.1.p572_1
LNYN01000020.1.p80_1
LNYO01000013.1.p333_1
2.6.0+
2
Q5ZXP3_lpg0688
/home/mcariou/2021_legio/blastdb/phyloref/phyloref_prot
query
hits
AE017354.1.p373_1
CR628337.1.p362_1
FQ958210.1.p381_1
CP001828.1.p378_1
CR628336.1.p382_1
JNCF01000046.1.p5_1
CP000675.2.p381_1
LNYR01000034.1.p54_1
LNZC01000001.1.p18_1
LNYN01000042.1.p56_1
LNZB01000060.1.p91_1
LN614827.1.p493_1
CAAAID010000002.1.p45_1
LNYP01000029.1.p77_1
LNXV01000005.1.p26_1
LNYK01000014.1.p72_1
LNYH01000058.1.p13_1
LNXT01000048.1.p93_1
LNYS01000001.1.p20_1
CP016397.1.p355_1
LNYL01000026.1.p4_1
LNYF01000001.1.p62_1
LNYG01000001.1.p42_1
JHYC01000004.1.p26_1
UGNV01000001.1.p343_1
LNYI01000048.1.p5_1
LNKA01000001.1.p167_1
CAAAIA010000014.1.p9_1
LNYJ01000011.1.p312_1
LNYX01000032.1.p21_1
CAAAIW010000005.1.p16_1
CAAAIM010000011.1.p14_1
LNXX01000004.1.p12_1
LNYZ01000029.1.p9_1
LNYB01000004.1.p11_1
CALJ01000113.1.p1_1
LNYW01000067.1.p7_1
JH413817.1.p18_1
LNYU01000032.1.p28_1
LNYV01000015.1.p23_1
FN650140.1.p448_1
LNYE01000009.1.p14_1
LN901321.1.p48_1
LNYY01000021.1.p116_1
LNXW01000008.1.p28_1
JNIA01000024.1.p18_1
LNXS01000029.1.p73_1
CCVW01000002.1.p114_1
LNYT01000007.1.p118_1
LNXU01000019.1.p78_1
LNZA01000001.1.p292_1
LNYA01000001.1.p35_1
LNYO01000024.1.p100_1
QCXN01000001.1.p67_1
LNXY01000027.1.p96_1
LNYQ01000013.1.p281_1
LNYC01000044.1.p19_1
CP021497.1.p53_1
CP013667.1.p163_1
This diff is collapsed.
Supports Markdown
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment