Commit 631474e9 authored by Laurent Modolo

start of 9.

parent fe95f0cb
# Batch processing
[![cc_by_sa](./img/cc_by_sa.png)](http://creativecommons.org/licenses/by-sa/4.0/)
Objective: Learn the basics of batch processing in GNU/Linux
In the previous section, we saw how to handle streams and text. We can use this knowledge to generate lists of commands instead of text. This is called batch processing.
In everyday use, you may want to run commands sequentially without using pipes.
To run `CMD1` and then run `CMD2`, you can use the `;` operator:
```sh
CMD1 ; CMD2
```
To run `CMD1` and then run `CMD2` only if `CMD1` succeeded (returned an exit status of 0), you can use the `&&` operator, which is safer than the `;` operator:
```sh
CMD1 && CMD2
```
You can also use the `||` operator to handle errors: it runs `CMD2` only if `CMD1` failed.
```sh
CMD1 || CMD2
```
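The three operators can be compared side by side. The following is a minimal sketch, assuming the standard commands `true` (always succeeds) and `false` (always fails) as stand-ins for `CMD1`. The variable name `output` is only illustrative, and the lists run in a child shell (`sh -c`) so that a failing command cannot stop your current session or script.

```sh
# Run five small command lists in a child shell and collect what they print.
# `true` always succeeds and `false` always fails.
output=$(sh -c '
false ;  echo "A: runs after a failure"
false && echo "B: skipped because false failed"
true  && echo "C: runs because true succeeded"
false || echo "D: runs because false failed"
true  || echo "E: skipped because true succeeded"
')
echo "$output"
```

Only the lines `A`, `C` and `D` are printed: `;` never checks the result of the first command, `&&` continues on success and `||` continues on failure.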
## Executing a list of commands
The easiest way to execute a list of commands is to use `xargs`. `xargs` reads arguments from **stdin** and uses them as arguments for a command. On Unix systems, the command `echo` sends a string of characters to **stdout**. We are going to use this command to learn more about `xargs`.
```sh
echo "hello world"
```
In general, you distinguish a string of characters from a command by placing it between quotes.
The two following commands are equivalent. Why?
```sh
echo "file1 file2 file3" | xargs touch
touch file1 file2 file3
```
You can display the command executed by `xargs` with the switch `-t`.
By default, the number of arguments sent by `xargs` in one call is defined by the system. You can change it with the option `-n N`, where `N` is the maximum number of arguments sent per call. Use the options `-t` and `-n` to run the previous command as 3 separate `touch` commands.
<details><summary>Solution</summary>
<p>
```sh
echo "file1 file2 file3" | xargs -t -n 1 touch
```
</p>
</details>
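To see how `-n` groups the arguments without creating any file, you can replace `touch` with `echo`. This is a small sketch with five made-up arguments; the variable name `grouped` is only illustrative.

```sh
# Five words arrive on stdin; -n 2 passes at most two of them to each echo call,
# so echo is executed three times and prints one line per call.
grouped=$(echo "a b c d e" | xargs -n 2 echo)
echo "$grouped"
```

Each output line corresponds to one execution of `echo`, so the last line contains the single remaining argument.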
Sometimes, the arguments are not separated by spaces but by other characters. You can use the `-d` option to specify the separator. Execute `touch` once, starting from the following command:
```sh
echo "file1;file2;file3"
```
<details><summary>Solution</summary>
<p>
```sh
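# tr removes the newline added by echo, which would otherwise become part of the last file name
echo "file1;file2;file3" | tr -d '\n' | xargs -t -d \; touch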
```
</p>
</details>
To reuse the arguments sent to `xargs`, you can use the option `-I`, which defines a placeholder string that is replaced by the argument. Try the following command. What does the **man**ual say about the `-c` option of the command `cut`?
```sh
ls -l file* | cut -c 44- | xargs -t -I % ln -s % link_%
```
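The substitution done by `-I` is easier to follow with `echo`, which prints the command line instead of creating links. This is a minimal sketch with two made-up names; the variable name `pairs` is only illustrative.

```sh
# With -I, xargs runs the command once per input line and
# replaces every % in the arguments with the content of that line.
pairs=$(printf 'file1\nfile2\n' | xargs -I % echo % link_%)
echo "$pairs"
```

Each input line produces its own `echo` call, with the name used once as is and once with the `link_` prefix.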
Instead of `ls`, the command `xargs` is often used with the command `find`. The command `find` is a powerful command to search for files.
Start from the following command to make a non-hidden copy of all the files with a name starting with *.bash* in your home folder:
```sh
find . -name ".bash*" | sed 's|./.||g'
```
<details><summary>Solution</summary>
<p>
```sh
find . -name ".bash*" | sed 's|./.||g' | xargs -t -I % cp .% %
```
</p>
</details>
You can try to remove every file in the `/tmp` folder with the following command:
```sh
find /tmp/ -type f | xargs -t rm
```
Modify this command to remove every directory in the `/tmp` folder.
<details><summary>Solution</summary>
<p>
```sh
find /tmp/ -type d | xargs -t rm -R
```
</p>
</details>
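File names that contain spaces are split into several arguments by `xargs`. A common workaround, sketched below, is to separate the names with a NUL character using `find -print0` and `xargs -0`. Both options are assumed to be available, which is the case with the GNU and BSD tools. The example works in a throwaway directory created with `mktemp`, and the names `workdir` and `remaining` are only illustrative.

```sh
# Work in a throwaway directory so that nothing important is touched.
workdir=$(mktemp -d)
touch "$workdir/plain.txt" "$workdir/with space.txt"
# -print0 and -0 separate the file names with a NUL character instead of
# blanks, so "with space.txt" is passed to rm as a single argument.
find "$workdir" -type f -print0 | xargs -0 rm
# Count the files that are left in the directory.
remaining=$(find "$workdir" -type f | wc -l)
echo "files left: $remaining"
rmdir "$workdir"
```

Without `-print0` and `-0`, `rm` would receive `with` and `space.txt` as two separate names and fail to remove the file.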
## Writing `awk` commands
`xargs` is a simple solution for writing batch commands, but if you want to write more complex commands you are going to need to learn `awk`. `awk` is a programming language in its own right, but you don't need to know everything about `awk` to use it.
You can think of `awk` as an `xargs -I $N` command where `$1` corresponds to the first column, `$2` to the second column, etc.
There are also some predefined variables that you can use, like:
- `$0` corresponds to the whole line (all the columns)
- `FS` the field separator used
- `NF` the number of fields separated by `FS`
- `NR` the number of records already read
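These variables can be inspected on a small input. The following sketch uses two made-up records separated by `:`; the option `-F` sets `FS`, and the variable name `fields` is only illustrative.

```sh
# Two records whose fields are separated by ":" (-F ':' sets FS).
# For each record, print its number, its number of fields,
# its first field and the whole record.
fields=$(printf 'chr1:100:200\nchr2:300\n' |
  awk -F ':' '{ print NR, NF, $1, $0 }')
echo "$fields"
```

The first record has three fields and the second only two, which shows that `NF` is recomputed for every record while `NR` keeps increasing.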
An `awk` program is a chain of commands of the form `pattern { action }`
- the `pattern` defines when the `action` is executed
- the `action` is what you want to do
The `pattern` can be
- a regexp
- the keyword `BEGIN` or `END`
- a comparison using `<`, `<=`, `==`, `>=`, `>` or `!=`
- a combination of the three, using `&&` (AND), `||` (OR) and `!` (negation)
- a range of lines `pattern_1,pattern_2`
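The sketch below combines several of these forms in one program, applied to a made-up four-line fasta input. Each rule pairs a condition (the part in front of the braces) with an action (the part between the braces). The variable name `result` is only illustrative.

```sh
# Each rule is tested against every input line, in the order written.
result=$(printf '>seq1\nACGT\n>seq2\nAC\n' | awk '
  BEGIN           { print "start" }        # runs once, before the first line is read
  /^>/            { print "header " $0 }   # regexp: lines starting with ">"
  length($0) == 2 { print "short " $0 }    # comparison: lines of exactly 2 characters
  END             { print NR " lines" }    # runs once, after the last line
')
echo "$result"
```

The `BEGIN` and `END` rules are executed once each, while the two other rules are evaluated for each of the four lines.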
With `awk` you can:
Number the lines of a file (the last number printed is the line count)
```sh
awk '{ print NR " : " $0 }' file
```
Convert a tab-separated sequence file into fasta format
```sh
awk -v OFS='' '{print ">", $1, "\n", $2}' two_column_sample_tab.txt > sample1.fa
```
Convert a multiline fasta file into a single line fasta file
```sh
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample.fa > sample1_singleline.fa
```
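To check what the single-line conversion does, you can run the same `awk` program on a tiny input written with `printf`. The two sequences and the variable name `single_line` are made up for this illustration.

```sh
# The sequence of s1 is split over two lines (AC and GT) in the input.
# Sequence lines are printed without a newline, and the newline is only
# added when the next header (or the end of the input) is reached.
single_line=$(printf '>s1\nAC\nGT\n>s2\nTT\n' |
  awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }')
echo "$single_line"
```

In the output, the two sequence lines of `s1` are joined into `ACGT`, directly below their header.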
Convert fasta sequences to uppercase
```sh
awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' file.fasta > file_upper.fasta
```
Print the identifier and the length of each sequence of a fasta file
```sh
awk '/^>/ {if (sequence_id != "") {print(substr(sequence_id, 2)" "sequence_length)}; sequence_length = 0; sequence_id = $0}; /^[^>]/ {sequence_length += length($0)}; END {if (sequence_id != "") {print(substr(sequence_id, 2)" "sequence_length)}}' file.fasta
```
Count the number of bases in a fastq.gz file
```sh
gzip -dc file.fastq.gz | awk 'NR%4 == 2 {basenumber += length($0)} END {print basenumber}'
```
Keep only the reads of at least 20 bp from a fastq file
```sh
awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 20){print header, seq, qheader, qseq}}' < input.fastq > output.fastq
```
## Writing a bash script
When you start writing complicated commands, you may want to save them to use them later.
You can find everything that you type in your `bash` session in the `~/.bash_history` file, but working with this file can be tedious as it also contains all the commands that you mistyped. A good solution for reproducibility is to write `bash` scripts. A bash script is simply a text file that contains a sequence of `bash` commands.
To execute a `bash` script, you can use the following command:
```bash
source myscript.sh
```
It's usual to use the `.sh` extension for `shell` scripts.
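The following sketch writes a two-line script and runs it in two different ways. The script name `hello.sh`, the variable `greeting` and the temporary directory are made up for this example. The dot command `.` is the portable spelling of `source`.

```sh
scriptdir=$(mktemp -d)
# Save two commands in a script file with a here-document.
cat > "$scriptdir/hello.sh" <<'EOF'
greeting="hello from a script"
echo "$greeting"
EOF
unset greeting
# `bash FILE` runs the script in a new shell: greeting stays unset here.
bash "$scriptdir/hello.sh"
echo "after bash: ${greeting:-unset}"
# `. FILE` (or `source FILE` in bash) runs the script in the current shell,
# so the variable defined by the script is still available afterwards.
. "$scriptdir/hello.sh"
echo "after source: $greeting"
rm -r "$scriptdir"
```

Running the script with `bash` keeps your current session untouched, while `source` lets the script change your current environment.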
<details><summary>Solution</summary>
<p>
```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "transcript\s.*gene_id\s\"\S{16,}\";" | wc -l
```
</p>
</details>
###
@@ -6,7 +6,9 @@ all: html/index.html \
 html/5_users_and_rights.html \
 html/6_unix_processes.html \
 html/7_streams_and_pipes.html \
-html/8_text_manipulation.html
+html/8_text_manipulation.html \
+html/9_batch_processing.html
 html/index.html: index.md github-pandoc.css
 pandoc -s -c github-pandoc.css index.md -o html/index.html
@@ -34,3 +36,6 @@ html/7_streams_and_pipes.html: 7_streams_and_pipes.md github-pandoc.css
 html/8_text_manipulation.html: 8_text_manipulation.md github-pandoc.css
 pandoc -s --toc -c github-pandoc.css 8_text_manipulation.md -o html/8_text_manipulation.html
+html/9_batch_processing.html: 9_batch_processing.md github-pandoc.css
+pandoc -s --toc -c github-pandoc.css 9_batch_processing.md -o html/9_batch_processing.html
@@ -11,6 +11,7 @@ title: # Unix / command line training course
 5. [Users and rights](http://perso.ens-lyon.fr/laurent.modolo/unix/5_users_and_rights.html)
 6. [Unix processes](http://perso.ens-lyon.fr/laurent.modolo/unix/6_unix_processes.html)
 7. [Streams and pipes](http://perso.ens-lyon.fr/laurent.modolo/unix/7_streams_and_pipes.html)
-7. [Text manipulation](http://perso.ens-lyon.fr/laurent.modolo/unix/8_text_manipulation.html)
+8. [Text manipulation](http://perso.ens-lyon.fr/laurent.modolo/unix/8_text_manipulation.html)
+9. [Batch processing](http://perso.ens-lyon.fr/laurent.modolo/unix/9_batch_processing.html)