
Batch processing

cc_by_sa

Objective: Learn basics of batch processing in GNU/Linux

In the previous section, we saw how to handle streams and text. We can use this knowledge to generate lists of commands instead of text. This is called batch processing.

In everyday use, you may want to run commands sequentially without using pipes.

To run CMD1 and then run CMD2, you can use the ; operator:

CMD1 ; CMD2

To run CMD1 and then run CMD2 only if CMD1 succeeded (returned an exit status of 0), you can use the && operator, which is safer than the ; operator.

CMD1 && CMD2

You can also use the || operator to handle errors: CMD2 runs only if CMD1 failed.

CMD1 || CMD2
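As a minimal sketch of the three operators, the standard commands true (always succeeds) and false (always fails) can stand in for CMD1. The messages are only illustrative:

```sh
# true returns exit status 0 (success); false returns a non-zero status (failure).
false ;  echo "after ;  : printed, the exit status of the first command is ignored"
true  && echo "after && : printed, the first command succeeded"
false && echo "after && : never printed, the first command failed"
false || echo "after || : printed, the first command failed"
true  || echo "after || : never printed, the first command succeeded"
```

Only three of the five messages appear: the one after ;, the one after true &&, and the one after false ||.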

Executing a list of commands

The easiest way to execute a list of commands is to use xargs. xargs reads arguments from stdin and uses them as arguments for a command. On UNIX systems, the echo command sends a string of characters to stdout. We are going to use this command to learn more about xargs.

echo "hello world"

In general, a string of characters is distinguished from a command by being placed between quotes.

The two following commands are equivalent. Why?

echo "file1 file2 file3" | xargs touch
touch file1 file2 file3

You can display the command executed by xargs with the switch -t.

By default, xargs passes as many arguments as fit on one command line, a limit defined by the system. You can change this with the option -n N, where N is the maximum number of arguments sent to each command. Use the options -t and -n to run the previous command as 3 separate touch commands.

Solution

```sh
echo "file1 file2 file3" | xargs -t -n 1 touch
```
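To see how -n groups arguments without creating any files, echo can stand in for touch. A small sketch, where the letters are arbitrary sample arguments:

```sh
# Pass at most 2 arguments to each echo invocation;
# xargs runs echo as many times as needed to consume its input.
echo "a b c d e" | xargs -n 2 echo
# prints:
# a b
# c d
# e
```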

Sometimes the arguments are separated not by spaces but by other characters. You can use the -d option to specify the delimiter. Starting from the following command, execute touch only once:

echo "file1;file2;file3"
Solution

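```sh
# echo appends a newline, which -d \; would keep as part of the last file name;
# tr -d '\n' removes it. The -d option is specific to GNU xargs.
echo "file1;file2;file3" | tr -d '\n' | xargs -t -d \; touch
```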

To reuse the arguments sent to xargs, you can use the option -I, which defines a placeholder string that is replaced by the argument. Try the following command. What does the manual say about the -c option of the cut command?

ls -l file* | cut -c 44- | xargs -t -I % ln -s % link_%
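To see what -I does without creating any links, echo can again replace the real command. In this sketch the names report_a and report_b are made-up sample inputs. With -I, xargs runs the command once per input line and substitutes each % with that line:

```sh
# Each input line replaces every occurrence of the placeholder %.
printf '%s\n' report_a report_b | xargs -I % echo "cp % %.bak"
# prints:
# cp report_a report_a.bak
# cp report_b report_b.bak
```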

Rather than with ls, xargs is often used with the find command, a powerful tool to search for files.

Modify the following command to make a non-hidden copy of all the files with a name starting with .bash in your home folder:

find . -name ".bash*" | sed 's|./.||g'
Solution

```sh
find . -name ".bash*" | sed 's|./.||g' | xargs -t -I % cp .% %
```

You can try to remove all the files in the /tmp folder with the following command:

find /tmp/ -type f | xargs -t rm

Modify this command to remove every folder in the /tmp folder.

Solution

```sh
find /tmp/ -type d | xargs -t rm -R
```
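File names containing spaces break pipelines like these, because xargs splits its input on whitespace by default. A common remedy is to let find end each name with a NUL character (-print0) and tell xargs to split on NUL (-0); both options exist in the GNU and BSD versions of the tools. The sketch below works in a throwaway directory created with mktemp, so nothing else in /tmp is touched. The variable name scratch and the file names are only illustrative.

```sh
# Work in a scratch directory so the example is safe to run.
scratch=$(mktemp -d)
touch "$scratch/plain.txt" "$scratch/name with spaces.txt"

# -print0 ends each file name with a NUL byte instead of a newline;
# -0 makes xargs split on NUL, so spaces inside names are preserved.
find "$scratch" -type f -print0 | xargs -0 -t rm

# Count the regular files left in the scratch directory, then remove it.
find "$scratch" -type f | wc -l
rmdir "$scratch"
```

Without -print0 and -0, rm would receive name, with and spaces.txt as three separate arguments and fail to delete the second file.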

Writing awk commands

xargs is a simple solution for writing batch commands, but if you want to write more complex commands you will need to learn awk. awk is a programming language in its own right, but you don’t need to know everything about awk to use it.

You can think of awk as an xargs -I $N command, where $1 corresponds to the first column, $2 to the second column, and so on.

There are also some predefined variables that you can use, such as:

  • $0, which corresponds to the whole line (all the columns)
  • FS, the field separator used
  • NF, the number of fields separated by FS
  • NR, the number of records already read
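A short sketch of these variables in action. The two input lines are made-up sample data, and the option -F ':' sets FS to a colon:

```sh
# NR is the current record number, NF the number of fields on that record.
printf 'alice:30:Lyon\nbob:25\n' |
    awk -F ':' '{ print "record " NR " has " NF " fields, first is " $1 }'
# prints:
# record 1 has 3 fields, first is alice
# record 2 has 2 fields, first is bob
```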

An awk program is a sequence of rules of the form pattern { action }

  • the pattern defines when the action is executed
  • the action is what you want to do

The pattern can be

  • a regexp
  • the keyword BEGIN or END (before reading the first line, and after reading the last line)
  • a comparison using <, <=, ==, >=, > or !=
  • a combination of the three using && (AND), || (OR) and ! (negation)
  • a range of lines pattern_1,pattern_2
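The sketch below pairs several of the kinds of conditions listed above with a print action. The four input lines (a name and a score) are made-up sample data:

```sh
printf 'ann 12\nbob 7\ncid 15\ndan 3\n' | awk '
    BEGIN       { print "name score" }                  # before the first line is read
    /^b/        { print "starts with b: " $1 }          # regexp
    $2 >= 10    { print $1 " passed" }                  # comparison
    NR==2,NR==3 { print "line " NR " is in the range" } # range of lines
    END         { print NR " lines read" }              # after the last line
'
# prints:
# name score
# ann passed
# starts with b: bob
# line 2 is in the range
# cid passed
# line 3 is in the range
# 4 lines read
```

Every rule is tested against every line, in the order the rules are written, which is why the output for line 2 and line 3 interleaves messages from different rules.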

With awk you can

Count the number of lines in a file

awk '{ print NR " : " $0 }' file

Modify this command to display only the total number of lines with awk (like wc -l)

Solution

```sh
awk 'END{ print NR }' file
```

Convert a tabulated sequences file into fasta format

awk -vOFS='' '{print ">",$1,"\n",$2,"\n";}' two_column_sample_tab.txt > sample1.fa

Modify this command to only get a list of sequence names in a fasta file