Skip to content
Snippets Groups Projects
8_text_manipulation.Rmd 10.3 KiB
Newer Older
Laurent Modolo's avatar
Laurent Modolo committed
---
title: Text manipulation
author: "Laurent Modolo"
---
Laurent Modolo's avatar
Laurent Modolo committed

```{r include = FALSE}
if (!require("fontawesome")) {
  install.packages("fontawesome")
}
library(fontawesome)
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
```
Laurent Modolo's avatar
Laurent Modolo committed

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">
<img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" />
</a>
Laurent Modolo's avatar
Laurent Modolo committed

Objective: Learn simple ways to work with text file in Unix
Laurent Modolo's avatar
Laurent Modolo committed

One of the great things with command line tools is that they are simple and fast. Which means that they are great for handling large files. And as bioinformaticians you have to handle large files, so you need to use command line tools for that.
Laurent Modolo's avatar
Laurent Modolo committed

Ghislain Durif's avatar
Ghislain Durif committed
## Text search
Laurent Modolo's avatar
Laurent Modolo committed

The file [hg38.ncbiRefSeq.gtf.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz) contains the RefSeq annotation for hg38 in [GFT format](http://www.genome.ucsc.edu/FAQ/FAQformat.html#format4)

We can download files with the `wget` command. Here the annotation is in **gz** format which is a compressed format, you can use the `gzip` tool to handle **gz** files.
Laurent Modolo's avatar
Laurent Modolo committed

On useful command to check large text file is the `head `command.

```sh
wget -q -O - http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz | gzip -dc | head
```

You can change the number of lines displayed with the option `-n number_of_line`. The command `tail` has the same function as `head` but starting from the end of the file.

Try the `tail` for the same number of lines displayed, does the computation take the same time ?

Download the `hg38.ncbiRefSeq.gtf.gz` file in your `~/`.

The program `grep string` allows you to search for *string* through a file or stream line by line.

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | grep "chr2" | head
```

What is the last annotation on the chromosome 1 (to write a tabulation character you can type `\t`) ?

You can count things in text file with the command `wc` read the `wc` **man**ual to see how you can count lines in a file.
Laurent Modolo's avatar
Laurent Modolo committed

Does the number of *3UTR* match the number of *5UTR* ?

How many transcripts does the gene *CCR7* have ?

Ghislain Durif's avatar
Ghislain Durif committed
## Regular expression
Laurent Modolo's avatar
Laurent Modolo committed

When you do a loot text search, you will encounter regular expressions (regexp), which allow you to perform fuzzy search. To run `grep` in regexp mode you can use the switch. `-E`
Laurent Modolo's avatar
Laurent Modolo committed

The most basic form fo regexp si the exact match:

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "gene_id"
```

You can use the `.` wildcard character to match anything
Laurent Modolo's avatar
Laurent Modolo committed

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "...._id"
```

There are different special characters in regexp, but you can use `\` to escape them. For example, if you search for **.** you can use the following regexp

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "\."
```

Ghislain Durif's avatar
Ghislain Durif committed
### Character classes and alternatives
Laurent Modolo's avatar
Laurent Modolo committed

There are a number of special patterns that match more than one character. You’ve already seen `.`, which matches any character apart from a newline. There are four other useful tools:

- `\d`: matches any digit.
- `\s`: matches any whitespace (e.g. space, tab, newline).
- `\S`: matches any non-whitespace.
- `[abc]`: matches a, b, or c.
- `[^abc]`: matches anything except a, b, or c.
- `[a-z]`: Match any letter of the alphabet

Search for two digits followed by an uppercase letter and one digit.

<details><summary>Solution</summary>
<p>
```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | perl -E "\d\d[A-Z]\d"
```
</p>
</details>

Ghislain Durif's avatar
Ghislain Durif committed
### Anchors
Laurent Modolo's avatar
Laurent Modolo committed

By default, regular expressions will match any part of a string. It’s often useful to *anchor* the regular expression so that it matches from the start or end of the string. You can use

- `^` to match the start of the string.
Laurent Modolo's avatar
Laurent Modolo committed
- `$` to match the end of the string.

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E ";"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E ";$"
```

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "c"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "^c"
```

Ghislain Durif's avatar
Ghislain Durif committed
### Repetition
Laurent Modolo's avatar
Laurent Modolo committed

The next step up in power involves controlling how many times a pattern matches

- `?`: 0 or 1
- `+`: 1 or more
- `*`: 0 or more

What is the following regexp going to match ?

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "[a-z]*_[a-z]*\s\"[1-3]\""
```

You can also specify the number of matches precisely:

- `{n}`: exactly n
- `{n,}`: n or more
- `{,m}`: at most m
- `{n,m}`: between n and m

What is the following regexp going to match ?

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "^[a-z]{3}[2-3]\s.*exon\s\d{4,5}\s\d{4,5}.*"
```

How many gene names of more than 16 characters does the annotation contain ?

<details><summary>Solution</summary>
<p>
```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "transcript\s.*gene_id\s\"\S{16,}\";" | wc -l
```
</p>
</details>

Ghislain Durif's avatar
Ghislain Durif committed
### Grouping and back references
Laurent Modolo's avatar
Laurent Modolo committed

You can group match using `()`, for example the following regexp match doublet of *12* .

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "(12){2}"
```

Grouping is also used for back references in the case of text replacement. You can use the command `sed` for text replacement. The syntax of `sed` for replacement is the following:  `sed -E 's|regexp|\n|g` where `n` is the grouping number. `s` stand for substitute and `g` stand for global (which means that is they are different substitutions per line `sed` won't stop at the first one).

Try the following replacement regexp

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | sed -E 's|(transcript_).{2}|\1number|g'
```

Try to write a `sed` command to replace *ncbiRefSeq* with *transcript_id* .

<details><summary>Solution</summary>
<p>
```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | sed -E 's|ncbiRefSeq(.*)(transcript_id "([A-Z_0-9.]*))|\3\1\2|g'
```
</p>
</details>
Regexp can be very complex see for example [a regex to validate an email on starckoverflow](https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression/201378#201378). When you start you can always use for a given regexp to a more experienced used (just give him the kind of text you want to match and not match). You can test your regex easily with the [regex101 website](https://regex101.com/).
Laurent Modolo's avatar
Laurent Modolo committed

Ghislain Durif's avatar
Ghislain Durif committed
## Sorting
Laurent Modolo's avatar
Laurent Modolo committed

GTF files should be sorted by chromosome, starting position and end position. But you can change that with the command `sort` to select the column to sort on you can use the option `-k n,n` where `n` is the column number.

You need to specify where sort keys start *and where they end*, otherwise (as in when you use `-k 3` instead of `-k 3,3`) they end at the end of the line.

For example, the following command `sort` on the 4th column and then on the 5th when the values of the 4th column are equal.

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 4,4 -k 5,5 | head
```

> Sorting operations are complex and can take a long time

You can add more option to the sorting of each column, for example `r` for reverse order `d` for dictionary order or `n` for numeric order.

What will the following command do ?

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 3,3d -k 4,4n | head
```

Use the `-u` option to count the number of different annotation type based on the 3rd column

<details><summary>Solution</summary>
<p>
```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 3,3d -u | wc -l
```
</p>
</details>
You can check if a file is already sorted with the `-c` switch. Check if the gtf file is sorted by chromosome, start and end position.

<details><summary>Solution</summary>
<p>
```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 1,1 -k 4,4n -k 5,5n -c
```
</p>
</details>




Ghislain Durif's avatar
Ghislain Durif committed
## Field extractor
Laurent Modolo's avatar
Laurent Modolo committed

Sometime rather than using complex regexp, we want to extract a particular column from a file. You can use the command `cut` to do that.

The following command extracts the 3rd column of the annotation

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | cut -f 3
```

You can change the field separator with the option `-d`, set it to `";"` to extract the *transcript_id* and *gene_name* from the information column.

<details><summary>Solution</summary>
<p>
```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | cut -f 2 -f 5 -d ";"
```
</p>
</details>
Ghislain Durif's avatar
Ghislain Durif committed
## Concatenation
Laurent Modolo's avatar
Laurent Modolo committed

There are different tools to concatenate files from the command line `cat` for vertical concatenation and `paste` for horizontal concatenation.

```sh
cat .bashrc .bashrc | wc -l
```

What will be the results of the following command ?

```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head |  paste - -
```



Ghislain Durif's avatar
Ghislain Durif committed
## Text editor
Laurent Modolo's avatar
Laurent Modolo committed

You often have access to different text editors from the common line, two of the most popular ones are `vim` and `nano`.

`nano` is more friendly to use than `vim` but also very limited.

To open a text file you can type `editor file_path`.

In `nano` everything is written at the bottom, so you only have to remember that `^`is the symbol for the key `Ctrl`.

Open you `.bashrc` file and delete any comment line (starting with the `#` character).

`vim` is a child of the project `vi` (which should also be available on your system), and which bring him more functionality. The workings of `vim` can be a little strange at first, but you have to understand that on a US keyboard the distance that your finger have to travel while using `vim` is minimal.

You have 3 modes in `vim`:

- The **normal** mode, where you can navigate the file and enter command with the `:` key. You can come back to this mode by pressing `Esc`
- The **insert** mode, where you can write things. You enter this mode with the `i` key or any other key insertion key (for example `a` to insert after the cursor or `A` to insert at the end of the line)
- The **visual** mode where you can select text for copy/paste action. You can enter this mode with the `v` key

Ghislain Durif's avatar
Ghislain Durif committed
If you want to learn more about `vim`, you can start with the <https://vim-adventures.com/> website. Once you master `vim` everything is faster but you will have to practice a lot. 
Laurent Modolo's avatar
Laurent Modolo committed

> We have used the following commands:
Laurent Modolo's avatar
Laurent Modolo committed
>
> - `head` / `tail` to display head or tail of a file
> - `wget` to download files
> - `gzip` to extract `tar.gz` files
> - `grep` to search text files
> - `wc` to count things
> - `sed` to search and replace string of text
> - `sort` to sort files on specific field
> - `cut` to extract a specific field
> - `cat` / `paste` for concatenation
> - `nano` / `vim` for text edition

Laurent Modolo's avatar
Laurent Modolo committed
In the next session, we are going to apply the logic of pipes and text manipulation to [batch processing.](./9_batch_processing.html)