-
Laurent Modolo authoredLaurent Modolo authored
title: Text manipulation
Text manipulation
Objective: Learn basics way to work with text file in Unix
One of the great thing with command line tools is that they are simple and fast. Which means that they are great for handle large files. And as bioinformaticians you have to handle large file, so you need to use command line tools for that.
Text search
The file hg38.ncbiRefSeq.gtf.gz contains the RefSeq annotation for hg38 in GFT format
We can download files with the wget
command. Here the annotation is in gz format which is a compressed format, you can use the gzip
tool to hande gz files.
On useful command to check large text file is the head
command.
wget -q -O - http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz | gzip -dc | head
You can change the number of lines displayed with the option -n number_of_line
. The command tail
has the same function as head
but starting from the end of the file.
Try the tail
for the same number of lines displayed, does the computation take the same time ?
Download the hg38.ncbiRefSeq.gtf.gz
file in your ~/
.
The program grep string
allows you to search for string through a file or stream line by line.
gzip -dc hg38.ncbiRefSeq.gtf.gz | grep "chr2" | head
What is the last annotation on the chromosome 1 (to write a tabulation character you can type \t
) ?
You can count things in text file with the command wc
read the wc
manual to see how you can count line in a file.
Does the number of 3UTR match the number of 5UTR ?
How many transcripts does the gene CCR7 have ?
Regular expression
When you do a loot text search, you will encounter regular expression (regexp), which allow you to perform fuzzy search. To run grep
in regexp mode you can use the switch -E
The most basic form fo regexp si the exact match:
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "gene_id"
You can use the .
wildcard character to match any thing
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "...._id"
There are different special characters in regexp, but you can use \
to escape them. For example, if you search for . you can use the following regexp
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "\."
Character classes and alternatives
There are a number of special patterns that match more than one character. You’ve already seen .
, which matches any character apart from a newline. There are four other useful tools:
-
\d
: matches any digit. -
\s
: matches any whitespace (e.g. space, tab, newline). -
\S
: matches any non-whitespace. -
[abc]
: matches a, b, or c. -
[^abc]
: matches anything except a, b, or c. -
[a-z]
: Match any letter of the alphabet
Search for two digits followed by an uppercase letter and one digit.
Solution
```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | head | perl -E "\d\d[A-Z]\d" ```
Anchors
By default, regular expressions will match any part of a string. It’s often useful to anchor the regular expression so that it matches from the start or end of the string. You can use
- ^` to match the start of the string.
-
$
to match the end of the string.
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E ";"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E ";$"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "c"
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "^c"
Repetition
The next step up in power involves controlling how many times a pattern matches
-
?
: 0 or 1 -
+
: 1 or more -
*
: 0 or more
What is the following regexp going to match ?
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "[a-z]*_[a-z]*\s\"[1-3]\""
You can also specify the number of matches precisely:
-
{n}
: exactly n -
{n,}
: n or more -
{,m}
: at most m -
{n,m}
: between n and m
What is the following regexp going to match ?
gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "^[a-z]{3}[2-3]\s.*exon\s\d{4,5}\s\d{4,5}.*"
How many gene names of more than 16 characters does the annotation contain ?
Solution
```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "transcript\s.*gene_id\s\"\S{16,}\";" | wc -l ```
Grouping and back references
You can group match using ()
, for example the following regexp match doublet of 12 .
gzip -dc hg38.ncbiRefSeq.gtf.gz | grep -E "(12){2}"
Grouping is also used for back references in the case of text replacement. You can use the command sed
for text replacement. The syntax of sed
for replacement is the following: sed -E 's|regexp|\n|g
where n
is the grouping number. s
stand for substitute and g
stand for global (which means that is they are different substitutions per line sed
won't stop at the first one).
Try the following replacement regexp
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | sed -E 's|(transcript_).{2}|\1number|g'
Try to write a sed
command to replace ncbiRefSeq with transcript_id .
Solution
```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | head | sed -E 's|ncbiRefSeq(.*)(transcript_id "([A-Z_0-9.]*))|\3\1\2|g' ```
Sorting
GTF files should be sorted by chromosome, starting position and end position. But you can change that with the command sort
to select the column to sort on you can use the option -k n,n
where n
is the column number.
You need to specify where sort keys start and where they end, otherwise (as in when you use -k 3
instead of -k 3,3
) they end at the end of the line.
For example, the following command sort
on the 4th column and then on the 5th when the values of the 4th column are equal.
gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 4,4 -k 5,5 | head
Sorting operations are complex and can take a long time
You can add more option to the sorting of each column, for example r
for reverse order d
for dictionary order or n
for numeric order.
What will the following command do ?
gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 3,3d -k 4,4n | head
Use the -u
option to count the number of different annotation type based on the 3rd column
Solution
```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 3,3d -u | wc -l ```
Solution
```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | head -n 10000 | sort -k 1,1 -k 4,4n -k 5,5n -c ```
Field extractor
Sometime rather than using complex regexp, we want to extract a particular column from a file. You can use the command cut
to do that.
The following command extracts the 3rd column of the annotation
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | cut -f 3
You can change the field separator with the option -d
, set it to ";"
to extract the transcript_id and gene_name from the information column.
Solution
```sh gzip -dc hg38.ncbiRefSeq.gtf.gz | head | cut -f 2 -f 5 -d ";" ```
There are different tools to concatenate files from the command line cat
for vertical concatenation and paste
for horizontal concatenation.
cat .bashrc .bashrc | wc -l
What will be the results of the following command ?
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | paste - -
Text editor
You often have access to different text editors from the common line, two of the most popular ones are vim
and nano
.
nano
is more friendly to use than vim
but also very limited.
To open a text file you can type editor file_path
.
In nano
everything is written at the bottom, so you only have to remember that ^
is the symbol for the key Ctrl
.
Open you .bashrc
file and delete any comment line (starting with the #
character).
vim
is a child of the project vi
(which should also be available on your system), and which bring him more functionality. The workings of vim
can be a little strange at first, but you have to understand that on a US keyboard the distance that your finger have to travel while using vim
is minimal.
You have 3 modes in vim
:
- The normal mode, where you can navigate the file and enter command with the
:
key. You can come back to this mode by pressingEsc
- The insert mode, where you can write things. You enter this mode with the
i
key or any other key insertion key (for examplea
to insert after the cursor orA
to insert at the end of the line) - The visual mode where you can select text for copy/paste action. You can enter this mode with the
v
key
If you want to learn more about vim
you can start with the https://vim-adventures.com/ website. Once you master vim
everything is faster but you will have to practice a loot.
In the next session, we are going to apply the logic of pipes and text manipulation to batch processing.
We have used the following commands:
head
/tail
to display head or tail of a filewget
to download filesgzip
to extracttar.gz
filesgrep
to search text fileswc
to count thingssed
to search and replace string of textsort
to sort files on specific fieldcut
to extract a specific fieldcat
/paste
for concatenationnano
/vim
for text edition