Skip to content
Snippets Groups Projects
8_text_manipulation.Rmd 10.60 KiB
title: Text manipulation
author: "Laurent Modolo"
output:
  rmdformats::downcute:
    self_contain: true
    use_bookdown: true
    default_style: "light"
    lightbox: true
    css: "./www/style_Rmd.css"

if (!require("fontawesome")) {
  install.packages("fontawesome")
}
if (!require("klippy")) {
  install.packages("remotes")
  remotes::install_github("rlesur/klippy")
}
library(fontawesome)
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
klippy::klippy(
  position = c("top", "right"),
  color = "white",
  tooltip_message = "Click to copy",
  tooltip_success = "Copied !"
)

cc_by_sa

Objective: Learn simple ways to work with text file in Unix

One of the great things with command line tools is that they are simple and fast. Which means that they are great for handling large files. And as bioinformaticians you have to handle large files, so you need to use command line tools for that.

Text search

The file hg38.ncbiRefSeq.gtf.gz contains the RefSeq annotation for hg38 in GFT format

We can download files with the wget command. Here the annotation is in gz format which is a compressed format, you can use the gzip tool to handle gz files.

On useful command to check large text file is the head command.

wget -q -O - http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz | gzip -dc | head

You can change the number of lines displayed with the option -n number_of_line. The command tail has the same function as head but starting from the end of the file.

Try the tail for the same number of lines displayed, does the computation take the same time ?

Download the hg38.ncbiRefSeq.gtf.gz file in your ~/.

The program grep string allows you to search for string through a file or stream line by line.

gzip -dc hg38.ncbiRefSeq.gtf.gz | grep "chr2" | head

What is the last annotation on the chromosome 1 (to write a tabulation character you can type \t) ?

You can count things in text file with the command wc read the wc manual to see how you can count lines in a file.

Does the number of 3UTR match the number of 5UTR ?

How many transcripts does the gene CCR7 have ?

Regular expression

When you do a loot text search, you will encounter regular expressions (regexp), which allow you to perform fuzzy search. To run grep in regexp mode you can use the switch. -E

The most basic form fo regexp si the exact match:

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "gene_id"

You can use the . wildcard character to match anything

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "...._id"

There are different special characters in regexp, but you can use \ to escape them. For example, if you search for . you can use the following regexp

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "\."

Character classes and alternatives

There are a number of special patterns that match more than one character. You’ve already seen ., which matches any character apart from a newline. There are four other useful tools:

  • \d: matches any digit.
  • \s: matches any whitespace (e.g. space, tab, newline).
  • \S: matches any non-whitespace.
  • [abc]: matches a, b, or c.
  • [^abc]: matches anything except a, b, or c.
  • [a-z]: Match any letter of the alphabet

Search for two digits followed by an uppercase letter and one digit.

Solution

```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | head | perl -E "\d\d[A-Z]\d" ```

Anchors

By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string. You can use

  • ^ to match the start of the string.
  • $ to match the end of the string.
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E ";"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E ";$"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "c"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "^c"

Repetition

The next step up in power involves controlling how many times a pattern matches

  • ?: 0 or 1
  • +: 1 or more
  • *: 0 or more

What is the following regexp going to match ?

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "[a-z]*_[a-z]*\s\"[1-3]\""

You can also specify the number of matches precisely:

  • {n}: exactly n
  • {n,}: n or more
  • {,m}: at most m
  • {n,m}: between n and m

What is the following regexp going to match ?

gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "^[a-z]{3}[2-3]\s.*exon\s\d{4,5}\s\d{4,5}.*"

How many gene names of more than 16 characters does the annotation contain ?

Solution

```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "transcript\s.*gene_id\s\"\S{16,}\";" | wc -l ```

Grouping and back references

You can group match using (), for example the following regexp match doublet of 12 .

gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "(12){2}"

Grouping is also used for back references in the case of text replacement. You can use the command sed for text replacement. The syntax of sed for replacement is the following: sed -E 's|regexp|\n|g where n is the grouping number. s stand for substitute and g stand for global (which means that is they are different substitutions per line sed won't stop at the first one).

Try the following replacement regexp

gzip -dc hg38.ncbiRefSeq.gtf.gz | head | sed -E 's|(transcript_).{2}|\1number|g'

Try to write a sed command to replace ncbiRefSeq with transcript_id .

Solution

```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | head | sed -E 's|ncbiRefSeq(.*)(transcript_id "([A-Z_0-9.]*))|\3\1\2|g' ```

Regexp can be very complex see for example a regex to validate an email on starckoverflow. When you start you can always use for a given regexp to a more experienced used (just give him the kind of text you want to match and not match). You can test your regex easily with the regex101 website.

Sorting

GTF files should be sorted by chromosome, starting position and end position. But you can change that with the command sort to select the column to sort on you can use the option -k n,n where n is the column number.

You need to specify where sort keys start and where they end, otherwise (as in when you use -k 3 instead of -k 3,3) they end at the end of the line.

For example, the following command sort on the 4th column and then on the 5th when the values of the 4th column are equal.

gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 4,4 -k 5,5 | head

Sorting operations are complex and can take a long time

You can add more option to the sorting of each column, for example r for reverse order d for dictionary order or n for numeric order.

What will the following command do ?

gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 3,3d -k 4,4n | head

Use the -u option to count the number of different annotation type based on the 3rd column

Solution

```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 3,3d -u | wc -l ```

You can check if a file is already sorted with the `-c` switch. Check if the gtf file is sorted by chromosome, start and end position.
Solution

```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 1,1 -k 4,4n -k 5,5n -c ```