start of 9.

631474e9 · Laurent Modolo · fe95f0cb · 631474e9 · 631474e9 · 631474e9
Unverified Commit 631474e9 authored 4 years ago by Laurent Modolo
--- a/9_batch_processing.md
+++ b/9_batch_processing.md
+# Batch processing
+
+[![cc_by_sa](./img/cc_by_sa.png)](http://creativecommons.org/licenses/by-sa/4.0/)
+
+Objective: Learn basics of batch processing in GNU/Linux
+
+In the previous section, we saw how to handle streams and text. We can use this knowledge to generate lists of commands instead of text. This is called batch processing.
+
+In everyday use, you may want to run commands sequentially without using pipes.
+
+To run `CMD1` and then run `CMD2`, you can use the `;` operator:
+
+```sh
+CMD1 ; CMD2
+```
+
+To run `CMD1` and then run `CMD2` only if `CMD1` succeeded (exited without an error), you can use the `&&` operator, which is safer than the `;` operator:
+
+```sh
+CMD1 && CMD2
+```
+
+You can also use the `||` operator to manage errors: it runs `CMD2` only if `CMD1` failed.
+
+```sh
+CMD1 || CMD2
+```
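The difference between the three operators can be checked with `true` (which always succeeds) and `false` (which always fails) standing in for `CMD1`. This is only a minimal sketch: the variable `log` and its labels are made up for the illustration.

```sh
set +e   # default behaviour: a failing command does not stop the shell
log=""
false ;  log="$log semicolon"     # `;`  : CMD2 runs even though CMD1 failed
false && log="$log and-fail"      # `&&` : CMD2 is skipped because CMD1 failed
true  && log="$log and-ok"        # `&&` : CMD2 runs because CMD1 succeeded
false || log="$log or-fail"       # `||` : CMD2 runs because CMD1 failed
true  || log="$log or-ok"         # `||` : CMD2 is skipped because CMD1 succeeded
echo "$log"
```

Only the labels `semicolon`, `and-ok` and `or-fail` should be printed.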
+
+## Executing list of commands
+
+The easiest way to execute a list of commands is to use `xargs`. `xargs` reads arguments from **stdin** and uses them as arguments for a command. On Unix systems, the command `echo` sends a string of characters to **stdout**. We are going to use this command to learn more about `xargs`.
+
+```sh
+echo "hello world"
+```
+
+In general, a string of characters placed between quotes is treated as text (a single argument) and not as a command.
+
+The following two commands are equivalent. Why?
+
+```sh
+echo "file1 file2 file3" | xargs touch
+touch file1 file2 file3
+```
+
+You can display the command executed by `xargs` with the switch `-t`.
+
+By default, the maximum number of arguments that `xargs` sends to one command is defined by the system. You can change it with the option `-n N`, where `N` is the number of arguments sent to each command. Use the options `-t` and `-n` to run the previous command as 3 separate `touch` commands.
+
+<details><summary>Solution</summary>
+<p>
+```sh
+echo "file1 file2 file3" | xargs -t -n 1 touch
+```
+</p>
+</details>
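To see how `-n` groups the arguments without creating any file, `touch` can be replaced by `echo`. A minimal sketch, where the five letters and the variable `groups` are arbitrary:

```sh
# xargs splits its input on blanks and newlines, then builds commands
# with at most 2 arguments each (-n 2): echo a b, echo c d, echo e
groups=$(echo "a b c d e" | xargs -n 2 echo)
echo "$groups"
```

Each line of the result corresponds to one `echo` command started by `xargs`.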
+
+Sometimes, the arguments are not separated by spaces but by other characters. You can use the `-d` option (available in GNU `xargs`) to specify the delimiter. Execute `touch` once, starting from the following command:
+
+```sh
+echo "file1;file2;file3"
+```
+
+<details><summary>Solution</summary>
+<p>
+```sh
+# -n removes the trailing newline of echo; with -d \; it would otherwise end up in the last file name
+echo -n "file1;file2;file3" | xargs -t -d \; touch
+```
+</p>
+</details>
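When `-d` is not available, one possible alternative is to convert the delimiter into newlines with `tr` before calling `xargs`. A small sketch under that assumption, using `echo` instead of `touch` so that nothing is created (the variable `names` is only there to hold the result):

```sh
# tr replaces every ";" with a newline, which xargs already treats
# as a separator, so no xargs option is needed.
names=$(echo "file1;file2;file3" | tr ';' '\n' | xargs echo)
echo "$names"
```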
+
+To reuse the arguments sent to `xargs`, you can use the option `-I`, which defines a placeholder string that is replaced by the argument. Try the following command. What does the **man**ual say about the `-c` option of the command `cut`?
+
+```sh
+ls -l file* | cut -c 44- | xargs -t -I % ln -s % link_%
+```
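A simpler use of `-I` is sketched below. It assumes a scratch directory created with `mktemp -d`, so that none of your own files are touched; the `.bak` suffix and the variable `workdir` are arbitrary choices.

```sh
workdir=$(mktemp -d)   # scratch directory for the demo
cd "$workdir"
touch file1 file2 file3
# -I % : run one command per input line, replacing every % with that line
# (here: cp file1 file1.bak, cp file2 file2.bak, cp file3 file3.bak)
ls file* | xargs -I % cp % %.bak
ls
```

With `-I`, `xargs` starts one command per input line, which is why the placeholder can appear several times in the command.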
+
+Instead of `ls`, the command `xargs` is often used with the command `find`. The command `find` is a powerful command to search for files.
+
+Start from the following command to make a non-hidden copy of all the files with a name starting with *.bash* in your home folder:
+
+```sh
+find . -name ".bash*" | sed 's|^\./\.||'
+```
+
+<details><summary>Solution</summary>
+<p>
+```sh
+find . -name ".bash*" | sed 's|^\./\.||' | xargs -t -I % cp .% %
+```
+</p>
+</details>
+
+You can try to remove every file in the `/tmp` folder with the following command:
+
+```sh
+find /tmp/ -type f | xargs -t rm
+```
+
+Modify this command to remove every directory in the `/tmp` folder.
+
+<details><summary>Solution</summary>
+<p>
+```sh
+find /tmp/ -type d | xargs -t rm -R
+```
+</p>
+</details>
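The `find ... | xargs rm` commands above split the file names on blanks, so a name containing a space is cut into several arguments. A common way around this, assuming your `find` and `xargs` support the `-print0` and `-0` options (the GNU versions do), is sketched here in a scratch directory with made-up file names:

```sh
workdir=$(mktemp -d)   # scratch directory for the demo
cd "$workdir"
touch "my file.txt" "other.txt"
# -print0 ends each name with a NUL byte and -0 makes xargs split on NUL,
# so "my file.txt" stays one argument instead of becoming "my" and "file.txt".
find . -type f -name "*.txt" -print0 | xargs -0 rm
remaining=$(find . -type f | wc -l)
echo "files left: $remaining"
```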
+
+## Writing `awk` commands
+
+`xargs` is a simple solution for writing batch commands, but if you want to write more complex commands you are going to need to learn `awk`. `awk` is a programming language in itself, but you don't need to know everything about `awk` to use it.
+
+You can think of `awk` as an `xargs -I $N` command where `$1` corresponds to the first column, `$2` to the second column, etc.
+
+There are also some predefined variables that you can use, like:
+
+- `$0` corresponds to the whole line (all the columns)
+- `FS` the field separator used
+- `NF` the number of fields separated by `FS`
+- `NR` the number of records already read
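These variables can be tried on a small input. The sketch below feeds two made-up lines to `awk`, with `:` as field separator (set through the `-F` option); `$NF` is the usual way to get the last field, since `NF` is the number of fields.

```sh
# Two lines with a different number of fields, separated by ":".
# For each line, print the record number, the field count, the first and the last field.
fields=$(printf 'a:b:c\nd:e\n' | awk -F ':' '{ print NR, NF, $1, $NF }')
echo "$fields"
```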
+
+An `awk` program is a sequence of rules of the form `pattern { action }`
+
+- the `pattern` defines for which lines the `action` is executed
+- the `action` is what you want to do
+
+The `pattern` can be
+
+- a regexp
+- the keyword `BEGIN` or `END`
+- a comparison using `<`, `<=`, `==`, `>=`, `>` or `!=`
+- a combination of the three using `&&` (AND), `||` (OR) and `!` (negation)
+- a range of lines `pattern_1,pattern_2`
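The kinds of rules listed above can be combined in one program. In the sketch below, each line of the `awk` program pairs one condition with one action; the two-column input (a name and a length) and the threshold of 20 are made up for the illustration.

```sh
report=$(printf 'seq1 12\nseq2 30\nseq3 25\n' | awk '
  BEGIN     { print "long sequences:" }   # runs once, before the first line
  $2 >= 20  { print $1; n++ }             # comparison: lines with column 2 >= 20
  /^seq1/   { print "skipped " $1 }       # regexp: lines starting with seq1
  END       { print n " kept" }           # runs once, after the last line
')
echo "$report"
```

Every input line is tested against every rule, in the order in which the rules are written.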
+
+With `awk` you can
+
+Number the lines of a file (the last number printed is the number of lines)
+
+```sh
+awk '{ print NR " : " $0 }' file
+```
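If only the total is needed, an `END` rule can print `NR` once the whole input has been read. A minimal sketch on three made-up lines:

```sh
# NR holds the number of records read so far; in END that is the total.
total=$(printf 'one\ntwo\nthree\n' | awk 'END { print NR }')
echo "$total"
```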
+
+Convert a tab-separated sequence file into fasta format
+
+```sh
+awk -vOFS='' '{print ">",$1,"\n",$2,"\n";}' two_column_sample_tab.txt > sample1.fa
+```
+
+Convert a multiline fasta file into a single line fasta file
+
+```sh
+awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' sample.fa > sample1_singleline.fa
+```
+
+Convert fasta sequences to uppercase
+
+```sh
+awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' file.fasta > file_upper.fasta
+```
+
+Print one `sequence_id sequence_length` pair for each sequence of a fasta file
+
+```sh
+awk '/^>/ {if (sequence_id != "") print(substr(sequence_id, 2)" "sequence_length); sequence_length = 0; sequence_id = $0}; /^[^>]/ {sequence_length += length($0)}; END {if (sequence_id != "") print(substr(sequence_id, 2)" "sequence_length)}' file.fasta
+```
+
+Count the number of bases in a fastq.gz file (here `file.fastq.gz`)
+
+```sh
+gzip -dc file.fastq.gz | awk 'NR%4 == 2 {basenumber += length($0)} END {print basenumber}'
+```
+
+Keep only the reads of at least 20 bp from a fastq file
+
+```sh
+awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 20){print header, seq, qheader, qseq}}' < input.fastq > output.fastq
+```
+
+
+
+## Writing a bash script
+
+When you start writing complicated commands, you may want to save them to use them later.
+
+Everything that you type in `bash` is saved in `~/.bash_history`, but working with this file can be tedious as it also contains all the commands that you mistyped. A good solution for reproducibility is to write `bash` scripts. A `bash` script is simply a text file that contains a sequence of `bash` commands.
+
+To execute a `bash` script in your current shell, you can use the following command:
+
+```bash
+source myscript.sh
+```
+
+It's usual to give `shell` scripts the `.sh` extension.
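As a minimal sketch, the commands below write a two-line script in a scratch directory and run it in two ways. The file name `myscript.sh` and the variable `greeting` are only examples; `.` is the portable spelling of `source`.

```sh
workdir=$(mktemp -d)   # scratch directory for the demo
cd "$workdir"
# Write a two-line script with a here-document.
cat > myscript.sh << 'EOF'
touch file1 file2
greeting="hello from the script"
EOF
bash myscript.sh       # runs in a new bash process: greeting is lost afterwards
echo "after bash:   [${greeting:-}]"
. ./myscript.sh        # same as source: runs in the current shell
echo "after source: [$greeting]"
```

Running the script with `bash myscript.sh` creates the files but leaves your current shell unchanged, while `source` also keeps the variables that the script defines.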
+
+<details><summary>Solution</summary>
+<p>
+```sh
+gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "transcript\s.*gene_id\s\"\S{16,}\";" | wc -l
+```
+</p>
+</details>
+
+
+### 
+
--- a/Makefile
+++ b/Makefile
@@ -6,7 +6,9 @@ all: html/index.html \
 	html/5_users_and_rights.html \
 	html/6_unix_processes.html \
 	html/7_streams_and_pipes.html \
-	html/8_text_manipulation.html
+	html/8_text_manipulation.html \
+	html/9_batch_processing.html 
+

 html/index.html: index.md github-pandoc.css
 	pandoc -s -c github-pandoc.css index.md -o html/index.html
@@ -34,3 +36,6 @@ html/7_streams_and_pipes.html: 7_streams_and_pipes.md github-pandoc.css

 html/8_text_manipulation.html: 8_text_manipulation.md github-pandoc.css
 	pandoc -s --toc -c github-pandoc.css 8_text_manipulation.md -o html/8_text_manipulation.html
+
+html/9_batch_processing.html: 9_batch_processing.md github-pandoc.css
+	pandoc -s --toc -c github-pandoc.css 9_batch_processing.md -o html/9_batch_processing.html
--- a/index.md
+++ b/index.md
@@ -11,6 +11,7 @@ title: #  Unix / command line training course
 5. [Users and rights](http://perso.ens-lyon.fr/laurent.modolo/unix/5_users_and_rights.html)
 6. [Unix processes](http://perso.ens-lyon.fr/laurent.modolo/unix/6_unix_processes.html)
 7. [Streams and pipes](http://perso.ens-lyon.fr/laurent.modolo/unix/7_streams_and_pipes.html)
-7. [Text manipulation](http://perso.ens-lyon.fr/laurent.modolo/unix/8_text_manipulation.html)
+8. [Text manipulation](http://perso.ens-lyon.fr/laurent.modolo/unix/8_text_manipulation.html)
+9. [Batch processing](http://perso.ens-lyon.fr/laurent.modolo/unix/9_batch_processing.html)