diff --git a/8_text_manipulation.md b/8_text_manipulation.md
index c3f621d9792d9f3bfd3ddf46a6dda426bc0dab5f..c1d48da7b46ca0b3716ee2966cab2e3f42eddc3e 100644
--- a/8_text_manipulation.md
+++ b/8_text_manipulation.md
@@ -273,7 +273,7 @@ If you want to learn more about `vim` you can start with the https://vim-adventu

 In the next session, we are going to apply the logic of pipes and text manipulation to [batch processing.](http://perso.ens-lyon.fr/laurent.modolo/unix/9_batch_processing.html)

-> We have users the following commands:
+> We have used the following commands:
 >
 > - `head` / `tail` to display head or tail of a file
 > - `wget` to download files
diff --git a/9_batch_processing.md b/9_batch_processing.md
index c0ec69f702fdbbd6a4068059115bdd30f8d0721c..77097fb51a5dfda062b6b8ea3ac775d9cb2f5ea5 100644
--- a/9_batch_processing.md
+++ b/9_batch_processing.md
@@ -14,7 +14,7 @@ To run `CMD1` and then run `CMD2` you can use the `;` operator
 CMD1 ; CMD2
 ```

-To run `CMD1` and then run `CMD2` if `CMD1` didn't throw an error, you can use the `&&` operator which is safer than the `;` operator.
+To run `CMD1` and then run `CMD2` if `CMD1` didn’t throw an error, you can use the `&&` operator which is safer than the `;` operator.

 ```sh
 CMD1 && CMD2
@@ -28,13 +28,13 @@ CMD1 || CMD2

 ## Executing list of commands

-The easiest option to execute list of command is to use `xargs`. `xargs` reads arguments from **stdin** and use then as argument for a command. In Unix systems the command `echo` send string of character into **stdout**. We are going to use this command to learn more about `xargs`.
+The easiest option to execute a list of commands is to use `xargs`. `xargs` reads arguments from **stdin** and uses them as arguments for a command. In UNIX systems the command `echo` sends a string of characters to **stdout**. We are going to use this command to learn more about `xargs`.

 ```sh
 echo "hello world"
 ```

-In general a string of character differs from a command when it's placed between quotes.
+In general a string of characters differs from a command when it’s placed between quotes.

 The two following commands are equivalent, why ?

@@ -43,7 +43,7 @@ echo "file1 file2 file3" | xargs touch
 touch file1 file2 file3
 ```

-You can display the command executed by `xargs` with the switch `-t`.
+You can display the command executed by `xargs` with the switch `-t`. 

 By default the number of arguments sent by `xargs` is defined by the system. You can change it with the option `-n N`, where `N` is the number of arguments sent. Use the option `-t` and `-n` to run the previous command as 3 separate `touch` commands.

@@ -55,7 +55,7 @@ echo "file1 file2 file3" | xargs -t -n 1 touch
 </p>
 </details>

-Sometime, the arguments are not separated by space but by other characters. You can use the `-d` option to specify them. Execute `touch`1 time from the following command:
+Sometimes, the arguments are not separated by spaces but by other characters. You can use the `-d` option to specify them. Execute `touch` one time from the following command:

 ```sh
 echo "file1;file2;file3"
@@ -77,7 +77,7 @@ ls -l file* | cut -c 44- | xargs -t -I % ln -s % link_%

 Instead of using `ls` the command `xargs` is often used with the command `find`. The command `find` is a powerful command to search for files.

-Start from the following command to make a non-hidden copy of all the file with a name starting with *.bash* in your home folder
+Modify the following command to make a non-hidden copy of all the files with a name starting with *.bash* in your home folder

 ```sh
 find . -name ".bash*" | sed 's|./.||g'
@@ -91,13 +91,13 @@ find . -name ".bash*" | sed 's|./.||g' | xargs -t -I % cp .% %
 </p>
 </details>

-You can try to remove every file in the `/tmp` folder with the following command:
+You can try to remove all the files in the `/tmp` folder with the following command:

 ```sh
 find /tmp/ -type f | xargs -t rm
 ```

-Modify this command to remove every directly in the `/tmp` folder.
+Modify this command to remove every folder in the `/tmp` folder.

 <details><summary>Solution</summary>
 <p>
@@ -109,44 +109,64 @@ find /tmp/ -type d | xargs -t rm -R

 ## Writing `awk` commands

-`xargs` Is a simple solution for writing batch commands, but if you want to write more complex command you are going to need to learn `awk`. `awk` is a programming language by itself, but you don't need to know everything about `awk` to use it.
+`xargs` is a simple solution for writing batch commands, but if you want to write more complex commands you are going to need to learn `awk`. `awk` is a programming language by itself, but you don’t need to know everything about `awk` to use it.

-You can to think of `awk` as a `xargs -I $N` command where `$1` correspond to the first column `$2` to the second column, etc...
+You can think of `awk` as a `xargs -I $N` command where `$1` corresponds to the first column, `$2` to the second column, etc.

-There are also some predefined variables that you can use like
+There are also some predefined variables that you can use:

 - `$0` Correspond to all the columns.
 - `FS` the field separator used
-- `NF` the number of field separated by `FS`
+- `NF` the number of fields separated by `FS`
 - `NR` the number for records already read

 A `awk` program is a chain of commands with the form `motif { action }`

 - the `motif` define where there `action` is executed
-- the `action` is what you want to do
+- the `action` is what you want to do

-The `motif` can be
+The `motif` can be

 - a regexp
-- The keyword `BEGIN`or `END`
+- the keyword `BEGIN` or `END` (before reading the first line, and after reading the last line)
 - a comparison like `<`, `<=`, `==`, `>=`, `>` or `!=`
-- a combination of the three separated by `&&` (AND), `||`(OR) and `!` (Negation)
+- a combination of the three separated by `&&` (AND), `||`(OR) and `!` (Negation) 
 - a range of line `motif_1,motif_2`

 With `awk` you can

-Count the number of line in a file
+Count the number of lines in a file

 ```sh
 awk '{ print NR " : " $0 }' file
 ```

+Modify this command to only display the total number of lines with `awk` (like `wc -l`)
+
+<details><summary>Solution</summary>
+<p>
+```sh
+awk 'END{ print NR }' file
+```
+</p>
+</details>
+
 Convert a tabulated sequences file into fasta format

 ```sh
 awk -vOFS='' '{print ">",$1,"\n",$2,"\n";}' two_column_sample_tab.txt > sample1.fa
 ```

+Modify this command to only get the list of sequence names
+
+<details><summary>Solution</summary>
+<p>
+```sh
+awk '{print $1}' two_column_sample_tab.txt > seq_name.txt
+```
+</p>
+</details>
+
 Convert a multiline fasta file into a single line fasta file

 ```sh
@@ -159,6 +179,16 @@ Convert fasta sequences to uppercase
 awk '/^>/ {print($0)}; /^[^>]/ {print(toupper($0))}' file.fasta > file_upper.fasta
 ```

+Modify this command to only get a list of the sequence names in a fasta file, in lowercase
+
+<details><summary>Solution</summary>
+<p>
+```sh
+awk '/^>/ {print(tolower($0))}' file.fasta > seq_name_lower.txt
+```
+</p>
+</details>
+
 Return a list of sequence_id sequence_length from a fasta file

 ```sh
@@ -171,7 +201,7 @@ Count the number of bases in a fastq.gz file
 (gzip -dc $0) | awk 'NR%4 == 2 {basenumber += length($0)} END {print basenumber}'
 ```

-Only read with more than 20bp from a fastq
+Keep only the reads of 20bp or more from a fastq

 ```sh
 awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline qseq ; if (length(seq) >= 20){print header, seq, qheader, qseq}}' < input.fastq > output.fastq
@@ -181,26 +211,181 @@ awk 'BEGIN {OFS = "\n"} {header = $0 ; getline seq ; getline qheader ; getline q

 ## Writing a bash script

-When you start writing complicated command, you may want to save them to use them later.
+When you start writing complicated commands, you may want to save them to reuse them later.

-You can find everything that you are typing in your `bash`in the `~/.bash_history` but working with this file can be tedious as it also contains all the command that you mistype. A good solution, for reproducibility is to write `bash` scripts. A bash script is simply a text file that contains a sequence of `bash`commands.
+You can find everything that you type in your `bash` in the `~/.bash_history` file, but working with this file can be tedious as it also contains all the commands that you mistyped. A good solution for reproducibility is to write `bash` scripts. A bash script is simply a text file that contains a sequence of `bash` commands.

-To execute a `bash` script you can use the following command:
+As you use `bash` in your terminal, you can execute a `bash` script with the following command:

 ```bash
 source myscrip.sh
 ```

-It's usual to write the `.sh` extension for `shell`scripts.
+It’s usual to write the `.sh` extension for `shell` scripts.
+
+Write a bash script named `download_hg38.sh` that downloads the [hg38.ncbiRefSeq.gtf.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz) file, then extracts it, and then prints a message saying that it is done.
+
+The `\` character, like in regexps, cancels the meaning of what follows. At the end of a line it cancels the line break, so you can use it to split a one-liner over several lines chained with the `&&` operator.
+
+<details><summary>Solution</summary>
+<p>
+```sh
+wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz && \
+gzip -d hg38.ncbiRefSeq.gtf.gz && \
+echo "download and extraction complete"
+```
+</p>
+</details>
+
+
+### shebang
+
+In your first bash script, the only thing saying that your script is a bash script is its extension. But most of the time UNIX systems don’t care about file extensions, a text file is a text file.
+
+To tell the system that your text file is a bash script you need to add a **shebang**. A **shebang** is a special first line that starts with `#!` followed by the path of the interpreter for your script.
+
+For example, for a bash script in a system where `bash` is installed in `/bin/bash` the **shebang** is:
+
+```bash
+#!/bin/bash
+```
+
+When you are not sure `which` is the path of the tool available to interpret your script, you can use the following shebang:
+
+```bash
+#!/usr/bin/env bash
+```
+
+Add a **shebang** to your script and give it the e**x**ecutable right.

 <details><summary>Solution</summary>
 <p>
 ```sh
-gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "transcript\s.*gene_id\s\"\S{16,}\";" | wc -l
+chmod u+x download_hg38.sh
 ```
 </p>
 </details>

-### 
+Now you can execute your script with the command:
+
+```bash
+./download_hg38.sh
+```
+
+Congratulations, you wrote your first program!
+
+### PATH
+
+Where did `/usr/bin/env` find the information about your `bash`? Why did we have to write `./` before our script if we are in the same folder?
+
+This is all linked to the **PATH** bash variable. Like many programming languages, `bash` has what we call *variables*. *Variables* are named storage for temporary information. You can print a list of all your environment variables (variables loaded in your `bash` memory) with the command `printenv`.
+
+To create a new variable you can use the following syntax:
+
+```sh
+VAR_NAME="text"
+VAR_NAME2=2
+```
+
+Create an `IDENTITY` variable with your first and last names.
+
+<details><summary>Solution</summary>
+<p>
+```sh
+IDENTITY="First name Last Name"
+```
+</p>
+</details>
+
+It’s good practice to write your `bash` variables in uppercase with `_` in place of spaces.
+
+You can access the value of an existing `bash` variable with `$VAR_NAME`.
+
+To display the value of your `IDENTITY` variable with `echo` you can write:
+
+```sh
+echo $IDENTITY
+```
+
+When you want to mix variable values and text you can use either of the two following syntaxes:
+
+```sh
+echo "my name is "$IDENTITY
+echo "my name is ${IDENTITY}"
+```
+
+Going back to the output of `printenv`, you can see a **PWD** variable that stores your current path, a **SHELL** variable that stores your current shell, and a **PATH** variable that stores a lot of file paths separated by `:`.
+
+The **PATH** variable lists every folder in which the shell looks for executable programs. Executable programs can be binary files or text files with a **shebang**.
+
+Display the content of `PATH` with `echo`
+
+<details><summary>Solution</summary>
+<p>
+```sh
+echo $PATH
+```
+</p>
+</details>
+
+You can create a `scripts` folder and move your `download_hg38.sh` script into it. Then you can modify the `PATH` variable to include the `scripts` folder.
+
+> Don’t erase your `PATH` variable!
+
+<details><summary>Solution</summary>
+<p>
+```sh
+mkdir ~/scripts
+mv download_hg38.sh ~/scripts/
+PATH=$PATH:~/scripts/
+```
+</p>
+</details>
+
+You can check the result of your command with `echo $PATH`
+
+Try to call your `download_hg38.sh` from anywhere in the file tree. Congratulations, you installed your first UNIX program!
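
The PATH lookup described in this section can be tried without touching your real `scripts` folder. The following is only a minimal sketch: `DEMO_DIR`, `RESULT` and `hello_path.sh` are hypothetical names made up for the illustration, and it assumes that `mktemp -d` is available on your system.

```sh
#!/usr/bin/env bash
# Sketch: how the shell finds a script through PATH.
# DEMO_DIR, RESULT and hello_path.sh are hypothetical names for this demo only.
DEMO_DIR="$(mktemp -d)"

# Write a tiny script with a shebang, then give it the executable right.
printf '#!/usr/bin/env bash\necho "found through PATH"\n' > "${DEMO_DIR}/hello_path.sh"
chmod u+x "${DEMO_DIR}/hello_path.sh"

# Append the folder to PATH (only for the current shell session).
PATH="${PATH}:${DEMO_DIR}"

# `command -v` prints the file that the shell would run for this name.
command -v hello_path.sh

# The script can now be called without ./ or a full path.
RESULT="$(hello_path.sh)"
echo "${RESULT}"

# Clean up the temporary folder.
rm -r "${DEMO_DIR}"
```

Because the assignment to `PATH` is not written in a startup file, the change should disappear when you close the terminal.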
+
+### Arguments
+
+You can pass arguments to your bash scripts. Writing the following command:
+
+```sh
+my_script.sh arg1 arg2 arg3
+```
+
+means that from within the script:
+
+- `$0` will give you the name of the script (`my_script.sh`)
+- `$1`, `$2`, `$3`, `$n` will give you the value of the arguments (`arg1`, `arg2`, `arg3`, `argn`)
+- `$$` the process id of the current shell
+- `$#` the total number of arguments passed to the script
+- `$@` the value of all the arguments passed to the script
+- `$?` the exit status of the last executed command
+- `$!` the process id of the last command run in the background
+
+You can write the following `variables.sh` script in your `scripts` folder:
+
+```sh
+#!/bin/bash
+
+echo "Name of the script: $0"
+echo "Total number of arguments: $#"
+echo "Values of all the arguments: $@"
+```
+
+And you can try to call it with some arguments!
+
+In the next session, we are going to learn how to execute commands on other computers with [ssh.](http://perso.ens-lyon.fr/laurent.modolo/unix/10_network_and_ssh.html)
+
+> We have used the following commands:
+>
+> - `echo` to display text
+> - `xargs` to execute a chain of commands
+> - `awk` to execute complex chain of commands
+> - `;` `&&` and `||` to chain commands
+> - `source` to load a script
+> - `shebang` to specify the language of a script
+> - `PATH` to install scripts
diff --git a/Makefile b/Makefile
index dddaa68ba2455b7c65f562594b1659cc19e3bb53..a94cfb51f02b20543fed896e0f7af7ad482b58b8 100644
--- a/Makefile
+++ b/Makefile
@@ -7,7 +7,8 @@ all: html/index.html \
 	html/6_unix_processes.html \
 	html/7_streams_and_pipes.html \
 	html/8_text_manipulation.html \
-	html/9_batch_processing.html
+	html/9_batch_processing.html \
+	html/10_network_and_ssh.html

 html/index.html: index.md github-pandoc.css

@@ -39,3 +40,6 @@ html/8_text_manipulation.html: 8_text_manipulation.md github-pandoc.css

 html/9_batch_processing.html: 9_batch_processing.md github-pandoc.css
 	pandoc -s --toc -c github-pandoc.css 9_batch_processing.md -o html/9_batch_processing.html
+
+html/10_network_and_ssh.html: 10_network_and_ssh.md github-pandoc.css
+	pandoc -s --toc -c github-pandoc.css 10_network_and_ssh.md -o html/10_network_and_ssh.html
diff --git a/index.md b/index.md
index 46b4d1b1cf19c2a18e7f0ee029d4bd125639a3b9..622c138ba0fbc8279dfa5b7eaa0a4ca6e767cc3c 100644
--- a/index.md
+++ b/index.md
@@ -13,5 +13,6 @@ title: # Unix / command line training course
 7. [Streams and pipes](http://perso.ens-lyon.fr/laurent.modolo/unix/7_streams_and_pipes.html)
 8. [Text manipulation](http://perso.ens-lyon.fr/laurent.modolo/unix/8_text_manipulation.html)
 9. [Batch processing](http://perso.ens-lyon.fr/laurent.modolo/unix/9_batch_processing.html)
+10. [Network and ssh](http://perso.ens-lyon.fr/laurent.modolo/unix/10_network_and_ssh.html)