Verified Commit cc8a8d8e authored by Laurent Modolo's avatar Laurent Modolo

add 9 on batch processing

parent 631474e9
...@@ -273,7 +273,7 @@ If you want to learn more about `vim` you can start with the https://vim-adventu ...@@ -273,7 +273,7 @@ If you want to learn more about `vim` you can start with the https://vim-adventu
In the next session, we are going to apply the logic of pipes and text manipulation to [batch processing.](http://perso.ens-lyon.fr/laurent.modolo/unix/9_batch_processing.html) In the next session, we are going to apply the logic of pipes and text manipulation to [batch processing.](http://perso.ens-lyon.fr/laurent.modolo/unix/9_batch_processing.html)
> We have users the following commands: > We have used the following commands:
> >
> - `head` / `tail` to display head or tail of a file > - `head` / `tail` to display head or tail of a file
> - `wget` to download files > - `wget` to download files
......
...@@ -14,7 +14,7 @@ To run `CMD1` and then run `CMD2` you can use the `;` operator ...@@ -14,7 +14,7 @@ To run `CMD1` and then run `CMD2` you can use the `;` operator
CMD1 ; CMD2 CMD1 ; CMD2
``` ```
To run `CMD1` and then run `CMD2` if `CMD1` didn't throw an error, you can use the `&&` operator which is safer than the `;` operator. To run `CMD1` and then run `CMD2` if `CMD1` didnt throw an error, you can use the `&&` operator which is safer than the `;` operator.
```sh ```sh
CMD1 && CMD2 CMD1 && CMD2
...@@ -28,13 +28,13 @@ CMD1 || CMD2 ...@@ -28,13 +28,13 @@ CMD1 || CMD2
## Executing list of commands ## Executing list of commands
The easiest option to execute list of command is to use `xargs`. `xargs` reads arguments from **stdin** and use then as argument for a command. In Unix systems the command `echo` send string of character into **stdout**. We are going to use this command to learn more about `xargs`. The easiest option to execute list of command is to use `xargs`. `xargs` reads arguments from **stdin** and use them as argument for a command. In UNIX systems the command `echo` send string of character into **stdout**. We are going to use this command to learn more about `xargs`.
```sh ```sh
echo "hello world" echo "hello world"
``` ```
In general a string of character differs from a command when it's placed between quotes. In general a string of character differs from a command when its placed between quotes.
The two following commands are equivalent, why ? The two following commands are equivalent, why ?
...@@ -43,7 +43,7 @@ echo "file1 file2 file3" | xargs touch ...@@ -43,7 +43,7 @@ echo "file1 file2 file3" | xargs touch
touch file1 file2 file3 touch file1 file2 file3
``` ```
You can display the command executed by `xargs` with the switch `-t`. You can display the command executed by `xargs` with the switch `-t`.
By default the number of arguments sent by `xargs` is defined by the system. You can change it with the option `-n N`, where `N` is the number of arguments sent. Use the option `-t` and `-n` to run the previous command as 3 separate `touch` commands. By default the number of arguments sent by `xargs` is defined by the system. You can change it with the option `-n N`, where `N` is the number of arguments sent. Use the option `-t` and `-n` to run the previous command as 3 separate `touch` commands.
...@@ -55,7 +55,7 @@ echo "file1 file2 file3" | xargs -t -n 1 touch ...@@ -55,7 +55,7 @@ echo "file1 file2 file3" | xargs -t -n 1 touch
</p> </p>
</details> </details>
Sometime, the arguments are not separated by space but by other characters. You can use the `-d` option to specify them. Execute `touch`1 time from the following command: Sometime, the arguments are not separated by space but by other characters. You can use the `-d` option to specify them. Execute `touch`1 time from the following command:
```sh ```sh
echo "file1;file2;file3" echo "file1;file2;file3"
...@@ -77,7 +77,7 @@ ls -l file* | cut -c 44- | xargs -t -I % ln -s % link_% ...@@ -77,7 +77,7 @@ ls -l file* | cut -c 44- | xargs -t -I % ln -s % link_%
Instead of using `ls` the command `xargs` is often used with the command `find`. The command `find` is a powerful command to search for files. Instead of using `ls` the command `xargs` is often used with the command `find`. The command `find` is a powerful command to search for files.
Start from the following command to make a non-hidden copy of all the file with a name starting with *.bash* in your home folder Modify the following command to make a non-hidden copy of all the file with a name starting with *.bash* in your home folder
```sh ```sh
find . -name ".bash*" | sed 's|./.||g' find . -name ".bash*" | sed 's|./.||g'
...@@ -91,13 +91,13 @@ find . -name ".bash*" | sed 's|./.||g' | xargs -t -I % cp .% % ...@@ -91,13 +91,13 @@ find . -name ".bash*" | sed 's|./.||g' | xargs -t -I % cp .% %
</p> </p>
</details> </details>
You can try to remove every file in the `/tmp` folder with the following command: You can try to remove all the files in the `/tmp` folder with the following command:
```sh ```sh
find /tmp/ -type f | xargs -t rm find /tmp/ -type f | xargs -t rm
``` ```
Modify this command to remove every directly in the `/tmp` folder. Modify this command to remove every folder in the `/tmp` folder.
<details><summary>Solution</summary> <details><summary>Solution</summary>
<p> <p>
...@@ -109,44 +109,64 @@ find /tmp/ -type d | xargs -t rm -R ...@@ -109,44 +109,64 @@ find /tmp/ -type d | xargs -t rm -R
## Writing `awk` commands ## Writing `awk` commands
`xargs` Is a simple solution for writing batch commands, but if you want to write more complex command you are going to need to learn `awk`. `awk` is a programming language by itself, but you don't need to know everything about `awk` to use it. `xargs` It is a simple solution for writing batch commands, but if you want to write more complex command you are going to need to learn `awk`. `awk` is a programming language by itself, but you dont need to know everything about `awk` to use it.
You can to think of `awk` as a `xargs -I $N` command where `$1` correspond to the first column `$2` to the second column, etc... You can to think of `awk` as a `xargs -I $N` command where `$1` correspond to the first column `$2` to the second column, etc.
There are also some predefined variables that you can use like There are also some predefined variables that you can use like.
- `$0` Correspond to all the columns. - `$0` Correspond to all the columns.
- `FS` the field separator used - `FS` the field separator used
- `NF` the number of field separated by `FS` - `NF` the number of fields separated by `FS`
- `NR` the number for records already read - `NR` the number for records already read
A `awk` program is a chain of commands with the form `motif { action }` A `awk` program is a chain of commands with the form `motif { action }`
- the `motif` define where there `action` is executed - the `motif` define where there `action` is executed
- the `action` is what you want to do - there `action` is what you want to do
The `motif` can be They `motif` can be
- a regexp - a regexp
- The keyword `BEGIN`or `END` - The keyword `BEGIN`or `END` (before reading the first line, and after reading the last line)
- a comparison like `<`, `<=`, `==`, `>=`, `>` or `!=` - a comparison like `<`, `<=`, `==`, `>=`, `>` or `!=`
- a combination of the three separated by `&&` (AND), `||`(OR) and `!` (Negation) - a combination of the three separated by `&&` (AND), `||`(OR) and `!` (Negation)
- a range of line `motif_1,motif_2` - a range of line `motif_1,motif_2`
With `awk` you can With `awk` you can
Count the number of line in a file Count the number of lines in a file
```sh ```sh
awk '{ print NR " : " $0 }' file awk '{ print NR " : " $0 }' file
``` ```
Modify this command to only display the total number of lines with `awk` (like `wc -l`)
<details><summary>Solution</summary>
<p>
```sh
awk 'END{ print NR }' file
```
</p>
</details>
Convert a tabulated sequences file into fasta format Convert a tabulated sequences file into fasta format
```sh ```sh
awk -vOFS='' '{print ">",$1,"\n",$2,"\n";}' two_column_sample_tab.txt > sample1.fa awk -vOFS='' '{print ">",$1,"\n",$2,"\n";}' two_column_sample_tab.txt > sample1.fa
``` ```
Modify this command to only get a list of sequence names in a fasta file
<details><summary>Solution</summary>
<p>
```sh
awk '{print $1}' two_column_sample_tab.txt > seq_name.txt
```
</p>
</details>
Convert a multiline fasta file into a single line fasta file Convert a multiline fasta file into a single line fasta file
```sh ```sh
...@@ -159,6 +179,16 @@ Convert fasta sequences to uppercase ...@@ -159,6 +179,16 @@ Convert fasta sequences to uppercase
awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' file.fasta > file_upper.fasta awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' file.fasta > file_upper.fasta
``` ```
Modify this command to only get a list of sequence names in a fasta file in lowercase
<details><summary>Solution</summary>
<p>
```sh
awk '/^>/ {print(tolower($0))}' file.fasta > seq_name_lower.txt
```
</p>
</details>
Return a list of sequence_id sequence_length from a fasta file Return a list of sequence_id sequence_length from a fasta file
```sh ```sh
...@@ -171,7 +201,7 @@ Count the number of bases in a fastq.gz file ...@@ -171,7 +201,7 @@ Count the number of bases in a fastq.gz file
(gzip -dc $0) | awk 'NR%4 == 2 {basenumber += length($0)} END {print basenumber}' (gzip -dc $0) | awk 'NR%4 == 2 {basenumber += length($0)} END {print basenumber}'
``` ```
Only read with more than 20bp from a fastq Only read with more than 20bp from a fastq
```sh ```sh
awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 20){print header, seq, qheader, qseq}}' < input.fastq > output.fastq awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 20){print header, seq, qheader, qseq}}' < input.fastq > output.fastq
...@@ -181,26 +211,181 @@ awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline q ...@@ -181,26 +211,181 @@ awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline q
## Writing a bash script ## Writing a bash script
When you start writing complicated command, you may want to save them to use them later. When you start writing complicated command, you may want to save them to reuse them later.
You can find everything that you are typing in your `bash`in the `~/.bash_history` but working with this file can be tedious as it also contains all the command that you mistype. A good solution, for reproducibility is to write `bash` scripts. A bash script is simply a text file that contains a sequence of `bash`commands. You can find everything that you are typing in your `bash`in the `~/.bash_history` file, but working with this file can be tedious as it also contains all the command that you mistype. A good solution, for reproducibility is to write `bash` scripts. A bash script is simply a text file that contains a sequence of `bash`commands.
To execute a `bash` script you can use the following command: As you use `bash` in your terminal, you can execute a `bash` script with the following command:
```bash ```bash
source myscrip.sh source myscrip.sh
``` ```
It's usual to write the `.sh` extension for `shell`scripts. It’s usual to write the `.sh` extension for `shell`scripts.
Write a bash script named `download_hg38.sh` that downloads the [hg38.ncbiRefSeq.gtf.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz) file, extracts it, and then prints a message saying that it is done.
As in regexps, the `\` character cancels the meaning of the character that follows it. At the end of a line it cancels the line break, so you can use it to split a one-liner chained with the `&&` operator over several lines.
<details><summary>Solution</summary>
<p>
```sh
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz && \
gzip -d hg38.ncbiRefSeq.gtf.gz && \
echo "download and extraction complete"
```
</p>
</details>
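To see how the `&&` chain and the trailing `\` behave without downloading anything, here is a minimal sketch on a throwaway folder. The folder and file names are assumptions chosen for the example.

```sh
work_dir=$(mktemp -d)
# Each line ends with `&& \`: the next command only runs if the previous
# one succeeded, and the `\` cancels the line break.
mkdir "${work_dir}/data" && \
  echo "some text" > "${work_dir}/data/note.txt" && \
  echo "all steps done"
# Here the first command fails because the folder already exists, so the
# second command is skipped and the `||` branch reports the failure.
mkdir "${work_dir}/data" 2> /dev/null && \
  echo "never printed" || \
  echo "a step failed"
```

The first chain prints `all steps done`, the second one prints `a step failed`.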
### shebang
In your first bash script, the only thing saying that your script is a bash script is its extension. But most of the time, UNIX systems don’t care about file extensions: a text file is a text file.
To tell the system that your text file is a bash script you need to add a **shebang**. A **shebang** is a special first line that starts with a `#!` followed by the path of the interpreter for your script.
For example, for a bash script in a system where `bash` is installed in `/bin/bash` the **shebang** is:
```bash
#!/bin/bash
```
When you are not sure of the path of the interpreter for your script (the `which` command can show it), you can use the following shebang:
```bash
#!/usr/bin/env bash
```
Add a **shebang** to your script and give it the e**x**ecutable right.
<details><summary>Solution</summary> <details><summary>Solution</summary>
<p> <p>
```sh ```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "transcript\s.*gene_id\s\"\S{16,}\";" | wc -l chmod u+x download_hg38.sh
``` ```
</p> </p>
</details> </details>
Now you can execute your script with the command:
```bash
./download_hg38.sh
```
Congratulations, you wrote your first program!
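The same three steps (shebang, executable right, execution by path) can be sketched on a throwaway script. The script name `hello.sh` and its content are hypothetical and only serve the example.

```sh
demo_dir=$(mktemp -d)
# Write a two-line script: the shebang, then one command.
printf '%s\n' '#!/usr/bin/env bash' 'echo "hello from a script"' > "${demo_dir}/hello.sh"
# Give the owner of the file the right to execute it.
chmod u+x "${demo_dir}/hello.sh"
# Run the script by giving its path: the system reads the shebang
# to know which interpreter to start.
hello_output=$("${demo_dir}/hello.sh")
echo "${hello_output}"
```

Without the `chmod` step, the last command would fail with a permission error.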
### PATH
Where did `/usr/bin/env` find the information about your `bash`? Why did we have to write `./` before our script when we are in the same folder?
This is all linked to the **PATH** bash variable. Like many programming languages, `bash` has what we call *variables*. *Variables* are named storage for temporary information. You can print a list of all your environment variables (variables loaded in your `bash` memory) with the command `printenv`.
To create a new variable you can use the following syntax:
```sh
VAR_NAME="text"
VAR_NAME2=2
```
Create an `IDENTITY` variable with your first and last names.
<details><summary>Solution</summary>
<p>
```sh
IDENTITY="First name Last Name"
```
</p>
</details>
It’s good practice to write your `bash` variable names in uppercase, with `_` in place of spaces.
You can access the value of an existing `bash` variable with the `$VAR_NAME` syntax.
To display the value of your `IDENTITY` variable with `echo` you can write:
```sh
echo $IDENTITY
```
When you want to mix a variable value and text, you can use either of the two following syntaxes:
```sh
echo "my name is "$IDENTITY
echo "my name is ${IDENTITY}"
```
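The two syntaxes differ as soon as text is glued directly to the variable name. The sketch below is an illustration with made-up values: the name stored in `IDENTITY` and the second variable `IDENTITY_FILE` are assumptions for the example.

```sh
IDENTITY="Ada Lovelace"          # example value, replace it with your own name
IDENTITY_FILE="another value"    # a second, unrelated variable
# Without braces, bash reads the longest possible name: IDENTITY_FILE.
without_braces="$IDENTITY_FILE"
# With braces, the name stops at the closing brace and _FILE is plain text.
with_braces="${IDENTITY}_FILE"
echo "without braces: ${without_braces}"
echo "with braces: ${with_braces}"
```

This is why the `${VAR_NAME}` form is the safer habit when a variable is followed by letters, digits or `_`.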
Going back to the output of `printenv`, you can see a **PWD** variable that stores your current path, a **SHELL** variable that stores your current shell, and a **PATH** variable that stores a lot of file paths separated by `:`.
The **PATH** variable lists every folder in which the system looks for executable programs. Executable programs can be binary files or text files with a **shebang**.
Display the content of `PATH` with `echo`
<details><summary>Solution</summary>
<p>
```sh
echo $PATH
```
</p>
</details>
You can create a `scripts` folder and move your `download_hg38.sh` script into it. Then you can modify the `PATH` variable to include the `scripts` folder.
> Don’t erase your `PATH` variable !
<details><summary>Solution</summary>
<p>
```sh
mkdir ~/scripts
mv download_hg38.sh ~/scripts/
PATH=$PATH:~/scripts/
```
</p>
</details>
You can check the result of your command with `echo $PATH`
Try to call your `download_hg38.sh` from anywhere on the file tree. Congratulations, you installed your first UNIX program!
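The effect of `PATH` can also be checked on a throwaway folder, without touching your `~/scripts` folder. In this sketch the folder and the script `say_found.sh` are hypothetical names made up for the example.

```sh
demo_bin=$(mktemp -d)
# Create a small executable script in the new folder.
printf '%s\n' '#!/usr/bin/env bash' 'echo "found through PATH"' > "${demo_bin}/say_found.sh"
chmod u+x "${demo_bin}/say_found.sh"
# Append the folder to PATH: the old value is kept and the new folder
# is added after a `:`.
PATH="${PATH}:${demo_bin}"
# `command -v` prints the path that the shell resolves for a command name.
command -v say_found.sh
# The script can now be called by its name alone, without `./` or a folder.
path_output=$(say_found.sh)
echo "${path_output}"
```

A `PATH` modified this way only lasts for the current shell session.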
### Arguments
You can pass arguments to your bash scripts. Writing the following command:
```sh
my_script.sh arg1 arg2 arg3
```
Means that from within the script:
- `$0` will give you the name of the script (`my_script.sh`)
- `$1`, `$2`, `$3`, `$n` will give you the value of the arguments (`arg1`, `arg2`, `arg3`, `argn`)
- `$$` is the process id of the current shell
- `$#` is the total number of arguments passed to the script
- `$@` is the value of all the arguments passed to the script
- `$?` is the exit status of the last executed command
- `$!` is the process id of the last command started in the background
You can write the following `variables.sh` script in your `scripts` folder:
```sh
#!/bin/bash
echo "Name of the script: $0"
echo "Total number of arguments: $#"
echo "Values of all the arguments: $@"
```
And you can try to call it with some arguments !
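As a complement to `variables.sh`, the sketch below loops over the arguments one by one. The script name `show_args.sh` and the arguments are assumptions chosen for the example.

```sh
demo_dir=$(mktemp -d)
# Hypothetical script that prints each argument on its own line.
cat > "${demo_dir}/show_args.sh" << 'EOF'
#!/usr/bin/env bash
echo "Total number of arguments: $#"
# "$@" between quotes expands to one word per argument,
# even when an argument contains spaces.
for arg in "$@"; do
  echo "argument: ${arg}"
done
EOF
chmod u+x "${demo_dir}/show_args.sh"
# The second argument is quoted, so it counts as a single argument.
args_output=$("${demo_dir}/show_args.sh" first "second one" third)
echo "${args_output}"
```

Quoting `"$@"` in the loop is what keeps `second one` together as one argument.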
In the next session, we are going to learn how to execute commands on other computers with [ssh.](http://perso.ens-lyon.fr/laurent.modolo/unix/10_ssh.html)
> We have used the following commands:
>
> - `echo` to display text
> - `xargs` to execute a chain of commands
> - `awk` to execute complex chains of commands
> - `;` `&&` and `||` to chain commands
> - `source` to load a script
> - `shebang` to specify the language of a script
> - `PATH` to install scripts
...@@ -7,7 +7,8 @@ all: html/index.html \ ...@@ -7,7 +7,8 @@ all: html/index.html \
html/6_unix_processes.html \ html/6_unix_processes.html \
html/7_streams_and_pipes.html \ html/7_streams_and_pipes.html \
html/8_text_manipulation.html \ html/8_text_manipulation.html \
html/9_batch_processing.html html/9_batch_processing.html \
html/10_network_and_ssh.html
html/index.html: index.md github-pandoc.css html/index.html: index.md github-pandoc.css
...@@ -39,3 +40,6 @@ html/8_text_manipulation.html: 8_text_manipulation.md github-pandoc.css ...@@ -39,3 +40,6 @@ html/8_text_manipulation.html: 8_text_manipulation.md github-pandoc.css
html/9_batch_processing.html: 9_batch_processing.md github-pandoc.css html/9_batch_processing.html: 9_batch_processing.md github-pandoc.css
pandoc -s --toc -c github-pandoc.css 9_batch_processing.md -o html/9_batch_processing.html pandoc -s --toc -c github-pandoc.css 9_batch_processing.md -o html/9_batch_processing.html
html/10_network_and_ssh.html: 10_network_and_ssh.md github-pandoc.css
pandoc -s --toc -c github-pandoc.css 10_network_and_ssh.md -o html/10_network_and_ssh.html
...@@ -13,5 +13,6 @@ title: # Unix / command line training course ...@@ -13,5 +13,6 @@ title: # Unix / command line training course
7. [Streams and pipes](http://perso.ens-lyon.fr/laurent.modolo/unix/7_streams_and_pipes.html) 7. [Streams and pipes](http://perso.ens-lyon.fr/laurent.modolo/unix/7_streams_and_pipes.html)
8. [Text manipulation](http://perso.ens-lyon.fr/laurent.modolo/unix/8_text_manipulation.html) 8. [Text manipulation](http://perso.ens-lyon.fr/laurent.modolo/unix/8_text_manipulation.html)
9. [Batch processing](http://perso.ens-lyon.fr/laurent.modolo/unix/9_batch_processing.html) 9. [Batch processing](http://perso.ens-lyon.fr/laurent.modolo/unix/9_batch_processing.html)
10. [Network and ssh](http://perso.ens-lyon.fr/laurent.modolo/unix/10_network_and_ssh.html)