This script takes a blastn output and filter criteria as input, retrieves species names with efetch, and runs an R script that filters the BLAST output, keeping a given number of sequences per unique species name. It returns a report of the number of BLAST hits at each step and a FASTA output.
*Problem: efetch returns an organism name, which is not necessarily a standardized species name: many names include strain names or numbers. I keep at most 10 sequences among those sharing the exact same name, but small variations such as "Legionella pneumophila strain zorg" and "Legionella_pneumophila_subsp._pneumophila_LPE509" are counted as separate names. These are difficult to parse because they have no common feature. To be discussed with the team. Results could be improved by using the first two words of each name.*
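The "first two words" idea could be sketched as below. This is only an illustration, not what the current R script does: the function names `normalize_name` and `cap_per_species` are hypothetical, and the input is assumed to be tab-separated lines of `sequence_id<TAB>organism name`. It also assumes the first two words are always the genus and species, which will not hold for names such as "uncultured bacterium" or "Candidatus ...".

```shell
#!/usr/bin/env bash

# Hypothetical helper: reduce an organism name to its first two words.
# Underscores are treated as word separators.
normalize_name() {
    printf '%s\n' "$1" | tr '_' ' ' | awk '{ print $1, $2 }'
}

# Hypothetical helper: read "id<TAB>organism name" lines on stdin and keep
# at most N ids (default 10) per normalized name, in input order.
cap_per_species() {
    awk -F '\t' -v max="${1:-10}" '{
        name = $2
        gsub(/_/, " ", name)          # underscores become spaces
        split(name, w, " ")           # split on runs of whitespace
        key = w[1] " " w[2]           # assumed "Genus species"
        if (++seen[key] <= max) print $1 "\t" key
    }'
}
```

With this sketch, both "Legionella pneumophila strain zorg" and "Legionella_pneumophila_subsp._pneumophila_LPE509" map to "Legionella pneumophila", so they would count toward the same per-species limit. Whether subspecies should be merged this way is still something to decide with the team.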
## 5. Align selected sequences and make phylogeny
```
~/script/5_aln_phy.sh ~/fasta/ lp0952_ortho
```
Runs PRANK, trimAl and PhyML on the FASTA output from step 4.