From 631474e91bd3ec53d94f345962457e321eb162dd Mon Sep 17 00:00:00 2001
From: Laurent Modolo <laurent@modolo.fr>
Date: Tue, 24 Nov 2020 12:19:18 +0100
Subject: [PATCH] start of 9.

---
 9_batch_processing.md | 206 ++++++++++++++++++++++++++++++++++++++++++
 Makefile              |   7 +-
 index.md              |   3 +-
 3 files changed, 214 insertions(+), 2 deletions(-)
 create mode 100644 9_batch_processing.md

diff --git a/9_batch_processing.md b/9_batch_processing.md
new file mode 100644
index 0000000..c0ec69f
--- /dev/null
+++ b/9_batch_processing.md
@@ -0,0 +1,206 @@
+# Batch processing
+
+[](http://creativecommons.org/licenses/by-sa/4.0/)
+
+Objective: Learn the basics of batch processing in GNU/Linux
+
+In the previous section, we saw how to handle streams and text. We can use this knowledge to generate lists of commands instead of text. This is called batch processing.
+
+In everyday life, you may want to run commands sequentially without using pipes.
+
+To run `CMD1` and then run `CMD2`, you can use the `;` operator:
+
+```sh
+CMD1 ; CMD2
+```
+
+To run `CMD1` and then run `CMD2` only if `CMD1` didn't return an error, you can use the `&&` operator, which is safer than the `;` operator.
+
+```sh
+CMD1 && CMD2
+```
+
+You can also use the `||` operator to handle errors and run `CMD2` only if `CMD1` failed.
+
+```sh
+CMD1 || CMD2
+```
+
+## Executing lists of commands
+
+The easiest way to execute a list of commands is to use `xargs`. `xargs` reads arguments from **stdin** and uses them as arguments for a command. On Unix systems, the command `echo` sends a string of characters to **stdout**. We are going to use this command to learn more about `xargs`.
+
+```sh
+echo "hello world"
+```
+
+In general, a string of characters is distinguished from a command by being placed between quotes.
+
+The two following commands are equivalent, why?
+
+```sh
+echo "file1 file2 file3" | xargs touch
+touch file1 file2 file3
+```
+
+You can display the command executed by `xargs` with the switch `-t`.
+
+By default, the number of arguments sent by `xargs` is defined by the system. You can change it with the option `-n N`, where `N` is the number of arguments sent. Use the options `-t` and `-n` to run the previous command as 3 separate `touch` commands.
+
+<details><summary>Solution</summary>
+<p>
+```sh
+echo "file1 file2 file3" | xargs -t -n 1 touch
+```
+</p>
+</details>
+
+Sometimes, the arguments are not separated by spaces but by other characters. You can use the `-d` option to specify the delimiter. Execute `touch` once from the output of the following command:
+
+```sh
+echo "file1;file2;file3"
+```
+
+<details><summary>Solution</summary>
+<p>
+```sh
+echo "file1;file2;file3" | xargs -t -d \; touch
+```
+</p>
+</details>
+
+To reuse the arguments sent to `xargs`, you can use the option `-I`, which defines a string that stands for the argument. Try the following command. What does the **man**ual say about the `-c` option of the command `cut`?
+
+```sh
+ls -l file* | cut -c 44- | xargs -t -I % ln -s % link_%
+```
+
+Instead of `ls`, the command `xargs` is often used with the command `find`. The command `find` is a powerful command to search for files.
+
+Start from the following command to make a non-hidden copy of all the files with a name starting with *.bash* in your home folder:
+
+```sh
+find . -name ".bash*" | sed 's|./.||g'
+```
+
+<details><summary>Solution</summary>
+<p>
+```sh
+find . -name ".bash*" | sed 's|./.||g' | xargs -t -I % cp .% %
+```
+</p>
+</details>
+
+You can try to remove every file in the `/tmp` folder with the following command:
+
+```sh
+find /tmp/ -type f | xargs -t rm
+```
+
+Modify this command to remove every directory in the `/tmp` folder.
+
+<details><summary>Solution</summary>
+<p>
+```sh
+find /tmp/ -type d | xargs -t rm -R
+```
+</p>
+</details>
+
+## Writing `awk` commands
+
+`xargs` is a simple solution for writing batch commands, but if you want to write more complex commands, you are going to need to learn `awk`. `awk` is a programming language in its own right, but you don't need to know everything about `awk` to use it.
+
+You can think of `awk` as an `xargs -I $N` command where `$1` corresponds to the first column, `$2` to the second column, etc.
+
+There are also some predefined variables that you can use, such as:
+
+- `$0` corresponds to all the columns
+- `FS` the field separator used
+- `NF` the number of fields separated by `FS`
+- `NR` the number of records already read
+
+An `awk` program is a chain of commands of the form `pattern { action }`
+
+- the `pattern` defines where the `action` is executed
+- the `action` is what you want to do
+
+The `pattern` can be
+
+- a regexp
+- the keyword `BEGIN` or `END`
+- a comparison like `<`, `<=`, `==`, `>=`, `>` or `!=`
+- a combination of the three, using `&&` (AND), `||` (OR) and `!` (negation)
+- a range of lines `pattern_1,pattern_2`
+
+With `awk` you can:
+
+Number the lines of a file
+
+```sh
+awk '{ print NR " : " $0 }' file
+```
+
+Convert a tabulated sequence file into fasta format
+
+```sh
+awk -vOFS='' '{print ">",$1,"\n",$2,"\n";}' two_column_sample_tab.txt > sample1.fa
+```
+
+Convert a multiline fasta file into a single-line fasta file
+
+```sh
+awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample.fa > sample1_singleline.fa
+```
+
+Convert fasta sequences to uppercase
+
+```sh
+awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' file.fasta > file_upper.fasta
+```
+
+Return a list of `sequence_id sequence_length` pairs from a fasta file
+
+```sh
+awk 'BEGIN {OFS = "\n"}; /^>/ {if (sequence_id != "") {print(substr(sequence_id, 2)" "sequence_length)}; sequence_length = 0; sequence_id = $0}; /^[^>]/ {sequence_length += length($0)}; END {print(substr(sequence_id, 2)" "sequence_length)}' file.fasta
+```
+
+Count the number of bases in a fastq.gz file
+
+```sh
+gzip -dc file.fastq.gz | awk 'NR%4 == 2 {basenumber += length($0)} END {print basenumber}'
+```
+
+Keep only the reads of at least 20 bp from a fastq file
+
+```sh
+awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 20){print header, seq, qheader, qseq}}' < input.fastq > output.fastq
+```
+
+
+
+## Writing a bash script
+
+When you start writing complicated commands, you may want to save them to use them later.
+
+You can find everything that you type in your `bash` session in the `~/.bash_history` file, but working with this file can be tedious as it also contains all the commands that you mistyped. A good solution for reproducibility is to write `bash` scripts. A bash script is simply a text file that contains a sequence of `bash` commands.
+
+To execute a `bash` script you can use the following command:
+
+```bash
+source myscript.sh
+```
+
+It's customary to use the `.sh` extension for `shell` scripts.
+
+<details><summary>Solution</summary>
+<p>
+```sh
+gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "transcript\s.*gene_id\s\"\S{16,}\";" | wc -l
+```
+</p>
+</details>
+
+
+###
+
diff --git a/Makefile b/Makefile
index cfe4fa2..dddaa68 100644
--- a/Makefile
+++ b/Makefile
@@ -6,7 +6,9 @@ all: html/index.html \
 	html/5_users_and_rights.html \
 	html/6_unix_processes.html \
 	html/7_streams_and_pipes.html \
-	html/8_text_manipulation.html
+	html/8_text_manipulation.html \
+	html/9_batch_processing.html
+
 
 html/index.html: index.md github-pandoc.css
 	pandoc -s -c github-pandoc.css index.md -o html/index.html
@@ -34,3 +36,6 @@ html/7_streams_and_pipes.html: 7_streams_and_pipes.md github-pandoc.css
 
 html/8_text_manipulation.html: 8_text_manipulation.md github-pandoc.css
 	pandoc -s --toc -c github-pandoc.css 8_text_manipulation.md -o html/8_text_manipulation.html
+
+html/9_batch_processing.html: 9_batch_processing.md github-pandoc.css
+	pandoc -s --toc -c github-pandoc.css 9_batch_processing.md -o html/9_batch_processing.html
diff --git a/index.md b/index.md
index 58b6fa2..46b4d1b 100644
--- a/index.md
+++ b/index.md
@@ -11,6 +11,7 @@ title: # Unix / command line training course
 5. [Users and rights](http://perso.ens-lyon.fr/laurent.modolo/unix/5_users_and_rights.html)
 6. [Unix processes](http://perso.ens-lyon.fr/laurent.modolo/unix/6_unix_processes.html)
 7. [Streams and pipes](http://perso.ens-lyon.fr/laurent.modolo/unix/7_streams_and_pipes.html)
-7. [Text manipulation](http://perso.ens-lyon.fr/laurent.modolo/unix/8_text_manipulation.html)
+8. [Text manipulation](http://perso.ens-lyon.fr/laurent.modolo/unix/8_text_manipulation.html)
+9. [Batch processing](http://perso.ens-lyon.fr/laurent.modolo/unix/9_batch_processing.html)
 
 
-- 
GitLab