Commit 305b74be authored by Laurent Modolo

8_text_manipulation.Rmd: fix typo

parent e0ee461f
@@ -23,24 +23,24 @@ library(fontawesome)
knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(comment = NA)
klippy::klippy(
position = c('top', 'right'),
position = c("top", "right"),
color = "white",
tooltip_message = 'Click to copy',
tooltip_success = 'Copied !')
tooltip_message = "Click to copy",
tooltip_success = "Copied !"
)
```
[![cc_by_sa](./img/cc_by_sa.png)](http://creativecommons.org/licenses/by-sa/4.0/)
Objective: Learn basics way to work with text file in Unix
Objective: Learn simple ways to work with text file in Unix
One of the great thing with command line tools is that they are simple and fast. Which means that they are great for handle large files. And as bioinformaticians you have to handle large file, so you need to use command line tools for that.
One of the great things with command line tools is that they are simple and fast. Which means that they are great for handling large files. And as bioinformaticians you have to handle large files, so you need to use command line tools for that.
# Text search
The file [hg38.ncbiRefSeq.gtf.gz](http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz) contains the RefSeq annotation for hg38 in [GTF format](http://www.genome.ucsc.edu/FAQ/FAQformat.html#format4).
We can download files with the `wget` command. Here the annotation is in **gz** format which is a compressed format, you can use the `gzip` tool to hande **gz** files.
We can download files with the `wget` command. Here the annotation is in **gz** format which is a compressed format, you can use the `gzip` tool to handle **gz** files.
One useful command to check large text files is the `head` command.
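As a minimal sketch of the decompress-and-peek workflow described above, the block below uses a small made-up file (`demo.txt.gz`, a hypothetical stand-in) so that it runs without a download. The `wget` line is left commented out and uses the URL given in the text.

```sh
# To fetch the real annotation, uncomment the next line (URL from the text above):
# wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/genes/hg38.ncbiRefSeq.gtf.gz

# Stand-in for the annotation: four made-up lines, compressed with gzip
printf 'line1\nline2\nline3\nline4\n' | gzip -c > demo.txt.gz

# -d decompresses, -c writes to standard output instead of replacing the file;
# head -n 2 keeps only the first two lines
gzip -dc demo.txt.gz | head -n 2
```

Because `gzip -dc` streams to standard output, the compressed file stays untouched on disk, which is convenient for large annotations.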
@@ -62,7 +62,7 @@ gzip -dc hg38.ncbiRefSeq.gtf.gz | grep "chr2" | head
What is the last annotation on the chromosome 1 (to write a tabulation character you can type `\t`) ?
You can count things in text file with the command `wc` read the `wc` **man**ual to see how you can count line in a file.
You can count things in text file with the command `wc` read the `wc` **man**ual to see how you can count lines in a file.
Does the number of *3UTR* match the number of *5UTR* ?
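To illustrate how `wc` can count things, here is a small sketch on a made-up compressed file (`sample.txt.gz` and its contents are hypothetical, not taken from the annotation). It only shows the counting options; the questions above are left for you to answer on the real file.

```sh
# Three made-up lines of two words each, compressed like the annotation file
printf 'alpha beta\ngamma delta\nepsilon zeta\n' | gzip -c > sample.txt.gz

# -l counts lines
gzip -dc sample.txt.gz | wc -l

# -w counts words
gzip -dc sample.txt.gz | wc -w
```

Piping the output of `grep` into `wc -l` is one possible way to count only the lines that match a pattern.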
@@ -70,7 +70,7 @@ How many transcripts does the gene *CCR7* have ?
# Regular expression
When you do a loot text search, you will encounter regular expression (regexp), which allow you to perform fuzzy search. To run `grep` in regexp mode you can use the switch `-E`
When you do a loot text search, you will encounter regular expressions (regexp), which allow you to perform fuzzy search. To run `grep` in regexp mode you can use the switch. `-E`
The most basic form fo regexp si the exact match:
@@ -78,7 +78,7 @@ The most basic form fo regexp si the exact match:
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "gene_id"
```
You can use the `.` wildcard character to match any thing
You can use the `.` wildcard character to match anything
```sh
gzip -dc hg38.ncbiRefSeq.gtf.gz | head | grep -E "...._id"
@@ -115,7 +115,7 @@ gzip -dc hg38.ncbiRefSeq.gtf.gz | head | perl -E "\d\d[A-Z]\d"
By default, regular expressions will match any part of a string. It’s often useful to *anchor* the regular expression so that it matches from the start or end of the string. You can use
- ^` to match the start of the string.
- `^` to match the start of the string.
- `$` to match the end of the string.
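A short sketch of the two anchors, run on a made-up three-line file (`anchors.txt` and its contents are hypothetical) rather than on the annotation itself:

```sh
# Made-up test lines: two start with "chr1", one ends with "chr1"
printf 'chr1\tgene_id "A";\nchr10\tgene_id "B";\nxchr1\tend chr1\n' > anchors.txt

# ^ anchors the pattern at the start of the line:
# matches the "chr1" and "chr10" lines, but not "xchr1"
grep -E "^chr1" anchors.txt

# $ anchors the pattern at the end of the line:
# matches only the last line
grep -E "chr1$" anchors.txt
```

Note that `^chr1` also matches `chr10`, since the anchor only constrains where the match starts, not where the chromosome name ends.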
```sh
@@ -190,7 +190,7 @@ gzip -dc hg38.ncbiRefSeq.gtf.gz | head | sed -E 's|ncbiRefSeq(.*)(transcript_id
```
</p>
</details>
Regexp can be very complexe see for example [a regex to validate an email on starckoverflow](https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression/201378#201378). When you start you can always use for a given regexp to a more experienced used (just give him the kind of text you want to match and not match). You can test your regex easily with the [regex101 website](https://regex101.com/).
Regexp can be very complex see for example [a regex to validate an email on starckoverflow](https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression/201378#201378). When you start you can always use for a given regexp to a more experienced used (just give him the kind of text you want to match and not match). You can test your regex easily with the [regex101 website](https://regex101.com/).
# Sorting
@@ -291,7 +291,7 @@ You have 3 modes in `vim`:
- The **insert** mode, where you can write things. You enter this mode with the `i` key or any other key insertion key (for example `a` to insert after the cursor or `A` to insert at the end of the line)
- The **visual** mode where you can select text for copy/paste action. You can enter this mode with the `v` key
If you want to learn more about `vim` you can start with the https://vim-adventures.com/ website. Once you master `vim` everything is faster but you will have to practice a loot.
If you want to learn more about `vim`, you can start with the https://vim-adventures.com/ website. Once you master `vim` everything is faster but you will have to practice a loot.
> We have used the following commands:
>