Commits on Source (5)
Showing with 6487 additions and 0 deletions
# Evolutionary history of SARS-CoV-2 interactome in bats and primates identifies key virus-host interfaces and conflicts
## Introduction
The current COVID-19 pandemic is caused by a novel coronavirus strain, SARS-CoV-2. It originated from the cross-species transmission of a coronavirus from the bat reservoir, directly or through an intermediate host to humans. This catastrophic spillover underlines the necessity to better understand how viruses and hosts have shaped one another over evolutionary time.
Pathogenic viruses exert selective pressure on the host proteins they interact with. Identifying which host genes bear signatures of such evolutionary conflict (e.g. positive selection) can lead to the identification of the proteins that have been the most relevant in the response to a virus family. Here, we have used this evolutionary framework to decipher which interactions between SARS-CoV-2-like viruses and our cells have been important in vivo. In addition, identifying traces of positive selection in different host phylogenetic lineages also sheds light on ancient epidemics and on how virus-host determinants may be species-specific. This may help to understand differences between hosts in susceptibility to SARS-CoV-like viruses and in their pathogenicity.
To achieve this, we characterized the evolutionary history of the SARS-CoV-2 interactome identified in in vitro studies: 332 host proteins identified by mass-spectrometry by Gordon and collaborators [1], as well as two essential SARS-CoV-2 entry factors, the angiotensin converting enzyme 2 (ACE2) and the transmembrane serine protease 2 (TMPRSS2) genes. We characterized their evolution in primates (tracing the human history) and in bats (the natural viral reservoir). To do so, we used DGINN [2], a novel computational pipeline to Detect Genetic INNovations in protein-coding genes, which embeds gold-standard methods to perform phylogenetic and positive selection analyses in a high-throughput manner.
## Data formatting
Required R packages: formatR, tinytex
Script to merge DGINN outputs from different batches of analysis and to add or correct the rows corresponding to genes rerun on corrected alignments.
```
rnw_scripts/
```
Input tables in **data/**.
Output tables in **out_tab/**.
The tables output from this script will be used for the following analysis steps.
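The row replacement described above (rows from an earlier batch superseded by a rerun on corrected alignments) can be sketched as follows. This is a minimal illustration on toy data, not the script itself: the data frames `old_batch` and `rerun_batch` and all their values are hypothetical.

```r
# Toy example (hypothetical values): replace rows of an earlier DGINN batch
# with rows from a rerun on corrected alignments, matching on the Gene column.
old_batch <- data.frame(
  Gene = c("ATE1", "DDX21", "RDX"),
  BUSTED = c("N", "N", "N"),
  stringsAsFactors = FALSE
)
rerun_batch <- data.frame(
  Gene = c("DDX21", "RDX"),
  BUSTED = c("Y", "Y"),
  stringsAsFactors = FALSE
)

# Drop the rows of the old batch whose gene has been rerun ...
kept <- old_batch[!(old_batch$Gene %in% rerun_batch$Gene), ]
# ... then append the rerun rows, so that each gene appears exactly once.
merged <- rbind(kept, rerun_batch)
merged
```

Both tables must have the same columns in the same order before `rbind`, which is why the script aligns column names first.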
## Primates and bats
## Dataset comparison
File Name Gene GeneSize NbSpecies omegaM0Bpp omegaM0codeml BUSTED BUSTED_p-value MEME_NbSites MEME_PSS BppM1M2 BppM1M2_p-value BppM1M2_NbSites BppM1M2_PSS BppM7M8 BppM7M8_p-value BppM7M8_NbSites BppM7M8_PSS codemlM1M2 codemlM1M2_p-value codemlM1M2_NbSites codemlM1M2_PSS codemlM7M8 codemlM7M8_p-value codemlM7M8_NbSites codemlM7M8_PSS
ATE1_bat_select_cut_mafft_prank ATE1 ATE1 571 11 0.15571103991607290751 0.147 N 0.0691 22 11, 17, 18, 27, 28, 29, 30, 108, 176, 223, 242, 243, 248, 251, 330, 346, 349, 350, 357, 361, 364, 571 N 0.9999998628509369 0 na N 0.9999862060854904 0 na N 1.0 0 na N 1.0 0 na
CEP43_bat_realign_mafft_prank CEP43 FGFR1OP 744 9 0.22520577281291204175 na Y 0.0000 13 8, 33, 158, 183, 184, 216, 226, 392, 527, 547, 606, 705, 722 N 0.9999995178687894 0 na N 0.9999984890175757 0 na na na 0 na na na 0 na
COL6A1_bat_cut_hipArm_mafft_prank COL6A1 COL6A1 1078 9 0.082800084329500089897 0.077 Y 0.0023 46 15, 26, 99, 115, 180, 184, 202, 221, 226, 227, 283, 306, 337, 375, 387, 390, 399, 472, 603, 632, 678, 685, 725, 726, 738, 740, 761, 768, 770, 808, 834, 856, 895, 902, 911, 916, 919, 920, 956, 958, 967, 985, 996, 1009, 1060, 1078 N 0.9999996594925176 0 na Y 1.4694972038785021e-06 8 64, 108, 227, 364, 387, 472, 727, 986 N 0.13931746906383644 0 na Y 0.017650344100739557 1 472
COQ8B_bat_select-2_mafft_prank COQ8B COQ8B 549 11 0.18249719548855317108 0.152 N 0.8349 3 20, 56, 542 N 0.9999990844975953 0 na N 0.2543943107209184 0 na N 1.0 0 na Y 0.041794104084912305 0
CYB5B_bat_select-2_mafft_prank CYB5B CYB5B 278 12 0.26049862063255097011 0.265 Y 0.0000 4 34, 116, 117, 179 N 0.9999997690552203 0 na N 0.9987889831935154 0 na N 0.9831436846348578 0 na N 0.6187833918061296 0 na
DDX21_bat_select_mafft_prank DDX21 DDX21 875 12 0.20291300096326300717 0.191 Y 0.0096 5 119, 163, 220, 852, 853 N 0.9999982397800553 0 na N 0.4659703818153812 0 na N 1.0 0 na N 0.5106861833664071 0 na
ELOB_bat_select_mafft_prank ELOB ELOB 124 7 0.022399397377356227573 0.023 Y 0.0017 0 na N 0.999999863385606 0 na N 0.8892346112479187 0 na N 1.0 0 na N 0.9990004998333987 0 na
ERO1B_bat_select_mafft_prank ERO1B ERO1B 777 11 0.036193322757663994038 0.039 N 1.0000 1 386 N 0.9999999786846276 0 na N 0.18314924606569427 0 na N 1.0 0 na N 1.0 0 na
ETFA_bat_select_mafft_prank ETFA ETFA 463 9 0.14661583111874651464 0.130 N 0.0544 4 95, 149, 252, 281 N 0.9999990190681537 0 na N 0.08502800413099425 0 na N 1.0 0 na N 0.997004495503217 0 na
GNB1_bat_select_add-2_mafft_prank GNB1 GNB1 340 12 0.0018342406212685069111 0.002 N 0.9997 0 na N 0.9999999737615329 0 na N 0.9999998460152829 0 na N 0.997004495503217 0 na N 0.997004495503217 0 na
GOLGA7_bat_select_mafft_prank GOLGA7 GOLGA7 149 12 0.22057358315093311685 0.207 Y 0.0000 0 na Y 2.7070592544122577e-07 1 123 Y 1.7510477519039514e-09 1 123 Y 9.285332670143659e-11 1 123 Y 2.0005777681870837e-11 1 123
HSBP1_bat-select_mafft_prank HSBP1 HSBP1 110 12 0.15412257221850680922 0.152 N 1.0000 0 na N 0.9999999941107944 0 na N 0.9999805391217563 0 na N 0.9990004998333987 0 na N 1.0 0 na
INTS4_bat_select_mafft_prank INTS4 INTS4 1016 10 0.076028837169011515007 0.079 N 0.9908 3 218, 219, 731 N 0.9999989182894236 0 na Y 0.004664170724173206 4 25, 521, 988, 989 N 1.0 0 na N 0.5804219151408183 0 na
MOV10_bat_cut_mafft_prank MOV10 MOV10 1173 12 0.19324965973420044074 0.196 Y 0.0052 14 7, 34, 41, 263, 391, 455, 485, 527, 575, 657, 661, 1066, 1067, 1068 N 0.9999989373906107 0 na Y 0.0032754616051401913 1 391 N 1.0 0 na Y 0.012727376189877581 0
MTARC1_bat_select_mafft_prank MTARC1 MARC1 386 9 0.17111475068673179245 0.188 N 0.0986 32 7, 30, 31, 38, 39, 41, 46, 48, 62, 64, 68, 74, 82, 85, 91, 94, 113, 120, 162, 179, 183, 211, 218, 220, 245, 270, 273, 300, 306, 331, 336, 371 N 0.9999997707509728 0 na N 0.9999808885270796 0 na N 1.0 0 na N 1.0 0 na
PABPC1_bat_select_mafft_prank PABPC1 PABPC1 735 12 0.093895137051223293012 0.093 N 0.0509 2 471, 671 N 0.9720024792000442 0 na N 0.609585807664959 0 na N 0.7124824490892332 0 na N 0.32497728220613026 0 na
PCSK6_bat_select_mafft_prank PCSK6 PCSK6 1654 8 0.098698549445369543331 0.089 Y 0.0000 16 28, 499, 522, 690, 749, 889, 948, 953, 966, 1010, 1303, 1310, 1324, 1325, 1350, 1398 N 0.9999997156783542 0 na N 0.10593739617924755 0 na N 1.0 0 na Y 3.761624567607716e-10 0
PDE4DIP_bat_select_check_mafft_prank PDE4DIP PDE4DIP 2894 9 0.24383690569934904357 0.244 N 0.0597 49 179, 292, 315, 375, 443, 541, 546, 732, 888, 931, 991, 1201, 1208, 1397, 1437, 1559, 1564, 1583, 1640, 1697, 1783, 1891, 1927, 2005, 2008, 2016, 2018, 2021, 2023, 2027, 2028, 2041, 2135, 2187, 2193, 2194, 2276, 2325, 2364, 2388, 2392, 2450, 2516, 2517, 2526, 2531, 2580, 2855, 2869 N 0.9999979304287578 0 na N 0.7675900915830035 0 na N 1.0 0 na N 0.20028778934513805 0 na
RDX_bat_select_mafft_prank RDX RDX 677 11 0.070830287871767122487 0.065 Y 0.0012 3 2, 348, 608 N 0.9999984528438435 0 na Y 0.00870082277450124 2 348, 601 N 1.0 0 na N 0.9464851479532084 0 na
REEP6_LA_bat_select_mafft_prank REEP6 REEP6-A 382 10 0.29517622624592770864 0.237 N 0.6966 1 378 N 0.6758702216545132 0 na N 0.1986200915147093 0 na N 0.6616622828278411 0 na N 0.31505753690349364 0 na
REEP6_LB_bat_select_mafft_prank REEP6 REEP6-B 810 6 0.93827969065999239362 0.952 Y 0.0000 11 377, 380, 492, 522, 526, 576, 589, 590, 599, 618, 793 N 0.11712587948916774 0 na Y 0.04708383426090607 0 Y 0.01900601064703638 0 Y 0.014610716685120028 0
RIPK1_sequences_filtered_longestORFs_quickclean_mafft_prank RIPK1 RIPK1 804 24 0.43700247549171972183 0.412 N 0.0863 20 55, 95, 103, 277, 348, 377, 404, 565, 574, 581, 584, 586, 613, 617, 667, 690, 695, 707, 725, 797 Y 0.046009519791645484 0 Y 0.001203426338327248 1 797 Y 0.024551066100545648 1 797 Y 0.0002102948425097665 1 797
SDF2_bat_select_mafft_prank SDF2 SDF2 246 12 0.082797829312180992734 0.080 N 0.2145 0 na N 0.999999439606811 0 na N 0.9999999930209924 0 na N 1.0 0 na N 1.0 0 na
SELENOS_bat_select_mafft_prank SELENOS SELENOS 218 8 0.16939756613179995925 0.172 N 0.8176 5 71, 72, 85, 128, 173 N 0.9999995791182277 0 na N 0.389714156459387 0 na N 0.9970044955034437 0 na N 0.2133118712228997 0 na
TARS2_bat_same_mafft_prank TARS2 TARS2 1893 11 0.13453722174632359865 0.164 N 1.0000 6 121, 571, 1151, 1732, 1832, 1833 N 0.9999988295192026 0 na N 0.9999999986184775 0 na N 1.0 0 na N 0.4479829971847956 0 na
TBCA_bat_same_mafft_prank TBCA TBCA 190 12 0.13752129483741223903 0.242 Y 0.0000 1 112 N 0.9999997004843402 0 na N 0.9935970583143424 0 na Y 5.8900579536612135e-09 3 2, 3, 4 na na 0 na
TMEM39B_bat_same_mafft_prank TMEM39B TMEM39B 880 12 0.027235007039035353388 0.041 Y 0.0000 3 217, 331, 332 N 0.9999999952856342 0 na N 0.9997082226309755 0 na Y 1.5181774326767543e-06 2 169, 216 na na 0 na
TMPRSS2_bat_same_mafft_prank TMPRSS2 TMPRSS2 1174 12 0.14029058400872645995 0.145 N 0.9333 12 630, 644, 649, 688, 775, 888, 921, 1003, 1051, 1055, 1066, 1173 N 0.9999990104220509 0 na N 0.6218822946709852 0 na N 1.0 0 na N 0.788991288016829 0 na
TMPRSS2_bat_select_cut_mafft_prank TMPRSS2 TMPRSS2 574 12 0.12948903836486919117 0.127 N 0.9358 19 59, 73, 78, 108, 115, 117, 121, 133, 144, 241, 259, 288, 321, 403, 421, 451, 455, 466, 573 N 0.9999999060492017 0 na N 0.3348934269948105 0 na N 1.0 0 na N 0.4210515526274131 0 na
TYSND1_bat_select_mafft_prank TYSND1 TYSND1 676 8 0.21586721789783006042 0.184 N 0.2118 5 54, 93, 95, 124, 205 N 0.9999989615094754 0 na Y 0.0016413879993395736 1 95 N 1.0 0 na Y 1.0838402897292556e-06 2 95, 317
File Name Gene GeneSize NbSpecies omegaM0Bpp omegaM0codeml BUSTED BUSTED_p-value MEME_NbSites MEME_PSS BppM1M2 BppM1M2_PSS BppM1M2_NbSites BppM1M2_p-value BppM7M8_PSS BppM7M8_p-value BppM7M8_NbSites BppM7M8 BppDFP07_0DFP07 BppDFP07_0DFP07_PSS BppDFP07_0DFP07_NbSites BppDFP07_0DFP07_p-value codemlM1M2 codemlM1M2_p-value codemlM1M2_NbSites codemlM1M2_PSS codemlM7M8 codemlM7M8_NbSites codemlM7M8_PSS codemlM7M8_p-value
SELENOS_sequences_filtered_longestORFs_mafft_mincov_prank SELENOS_all SELENOS 367 24 0.20728717475362265499 0.196 Y 0.0000 1 90 N na 0 0.999999228305 na 0.986116909613 0 N na na 0 na N 1.0 0 na N 0 na 0.692117181689
Us;Else
COQ8B;ADCK4
ELOC;TCEB1
ERO1B;ERO1LB
MARC1;MTARC1
NSD2;WHSC1
NUP58;NUPL1
PCSK5;
RETREG3;FAM134C
SPART;SPG20
TIMM29;C19orf52
TOMM70;TOMM70A
WASHC4;KIAA1033
\documentclass[11pt, oneside]{article} % use "amsart" instead of "article" for AMSLaTeX format
%\usepackage{geometry} % See geometry.pdf to learn the layout options. There are lots.
%\geometry{letterpaper} % ... or a4paper or a5paper or ...
%\geometry{landscape} % Activate for rotated page geometry
%\usepackage[parfill]{parskip} % Activate to begin paragraphs with an empty line rather than an indent
%\usepackage{graphicx} % Use pdf, png, jpg, or eps with pdflatex; use eps in DVI mode
% TeX will automatically convert eps --> pdf in pdflatex
%\usepackage{amssymb}
\usepackage[utf8]{inputenc}
%\usepackage[cyr]{aeguill}
%\usepackage[francais]{babel}
%\usepackage{hyperref}
\title{Positive selection on genes interacting with SARS-CoV-2, Data formatting}
\author{Marie Cariou}
\date{January 2021} % Activate to display a given date or no date
\begin{document}
\maketitle
\tableofcontents
\newpage
\section{1st table}
Table containing the DGINN results for both primates and bats. All genes are kept.
\subsection{Primates}
The working directory (\texttt{workdir}) must be adapted to the local environment.
<<>>=
workdir<-"/home/adminmarie/Documents/CIRI_BIBS_projects/2020_05_Etienne_covid/2020_dginn_covid19/"
#workdir<-getwd()
@
<<>>=
dginnT<-read.delim(paste0(workdir,
"data/DGINN_202005281649summary_cleaned.csv"),
fill=T, h=T, sep=",")
dim(dginnT)
#names(dginnT)
# Rename the columns to include primate
names(dginnT)<-c("File", "Name", "Gene.name", "GeneSize",
"dginn-primate_NbSpecies", "dginn-primate_omegaM0Bpp",
"dginn-primate_omegaM0codeml", "dginn-primate_BUSTED",
"dginn-primate_BUSTED.p.value", "dginn-primate_MEME.NbSites",
"dginn-primate_MEME.PSS", "dginn-primate_BppM1M2",
"dginn-primate_BppM1M2.p.value", "dginn-primate_BppM1M2.NbSites",
"dginn-primate_BppM1M2.PSS", "dginn-primate_BppM7M8",
"dginn-primate_BppM7M8.p.value", "dginn-primate_BppM7M8.NbSites",
"dginn-primate_BppM7M8.PSS", "dginn-primate_codemlM1M2",
"dginn-primate_codemlM1M2.p.value", "dginn-primate_codemlM1M2.NbSites",
"dginn-primate_codemlM1M2.PSS", "dginn-primate_codemlM7M8",
"dginn-primate_codemlM7M8.p.value", "dginn-primate_codemlM7M8.NbSites",
"dginn-primate_codemlM7M8.PSS")
@
Add SELENOS
<<selenos>>=
selenos<-read.delim(paste0(workdir,
"data/resSELENOS.tab"))
# list of columns
colonnes<-c("File", "Name", "Gene", "GeneSize",
"NbSpecies", "omegaM0Bpp", "omegaM0codeml", "BUSTED",
"BUSTED_p.value", "MEME_NbSites", "MEME_PSS", "BppM1M2",
"BppM1M2_p.value", "BppM1M2_NbSites", "BppM1M2_PSS", "BppM7M8",
"BppM7M8_p.value", "BppM7M8_NbSites", "BppM7M8_PSS","codemlM1M2",
"codemlM1M2_p.value", "codemlM1M2_NbSites", "codemlM1M2_PSS",
"codemlM7M8", "codemlM7M8_p.value", "codemlM7M8_NbSites",
"codemlM7M8_PSS")
selenos<-selenos[,colonnes]
@
<<>>=
names(selenos)<-names(dginnT)
selenos[,6]<-as.factor(selenos[,6])
selenos[,9]<-as.factor(selenos[,9])
selenos[,11]<-as.factor(selenos[,11])
selenos[,13]<-as.factor(selenos[,13])
selenos[,17]<-as.factor(selenos[,17])
selenos[,21]<-as.factor(selenos[,21])
selenos[,25]<-as.factor(selenos[,25])
## convert the p-values
dginnT<-rbind(dginnT, selenos)
@
\subsection{Bats}
<<>>=
# original table
dginnbats<-read.delim(paste0(workdir,
"data/DGINN_202005281339summary_cleaned-LE201108.txt"),
fill=T, h=T)
# rerun on corrected alignment
dginnbatsnew<-read.delim(paste0(workdir,
"data/DGINN_202011262248_hyphybpp-202012192053_codeml-summary.txt"),
fill=T, h=T)
@
<<>>=
# Add both columns
dginnbatsnew$Lucie.s.comments<-""
dginnbatsnew$Action.taken<-""
# Homogenize column names
dginnbats$BUSTED_p.value<-dginnbats$BUSTED.p.value
dginnbats$MEME_NbSites<-dginnbats$MEME.NbSites
dginnbats$MEME_PSS<-dginnbats$MEME.PSS
dginnbats$BppM1M2_p.value<-dginnbats$BppM1M2.p.value
dginnbats$BppM1M2_NbSites<-dginnbats$BppM1M2.NbSites
dginnbats$BppM1M2_PSS<-dginnbats$BppM1M2.PSS
dginnbats$BppM7M8_p.value<-dginnbats$BppM7M8.p.value
dginnbats$BppM7M8_NbSites<-dginnbats$BppM7M8.NbSites
dginnbats$BppM7M8_PSS<-dginnbats$BppM7M8.PSS
dginnbats$codemlM1M2_p.value<-dginnbats$codemlM1M2.p.value
dginnbats$codemlM1M2_NbSites<-dginnbats$codemlM1M2.NbSites
dginnbats$codemlM1M2_PSS<-dginnbats$codemlM1M2.PSS
dginnbats$codemlM7M8_p.value<-dginnbats$codemlM7M8.p.value
dginnbats$codemlM7M8_NbSites<-dginnbats$codemlM7M8.NbSites
dginnbats$codemlM7M8_PSS<-dginnbats$codemlM7M8.PSS
@
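The chunk above copies each dotted column into an underscore-named column, one at a time. As a sketch only (not the code used to build the tables), the same naming convention can be obtained by renaming in place with a single regular expression. The data frame \texttt{toy}, its values and the vector \texttt{prefixes} are hypothetical and only serve the illustration.
<<sketch-rename>>=
# Toy table (hypothetical values) with dotted column names
toy <- data.frame(Gene = "ATE1", BUSTED.p.value = 0.0691,
                  MEME.NbSites = 22, Lucie.s.comments = "",
                  stringsAsFactors = FALSE)
# Only rename columns that start with one of the method prefixes,
# so that other dotted names (e.g. Lucie.s.comments) are left untouched
prefixes <- c("BUSTED", "MEME", "BppM1M2", "BppM7M8",
              "codemlM1M2", "codemlM7M8")
pattern <- paste0("^(", paste(prefixes, collapse = "|"), ")\\.")
# Replace the first dot after the prefix by an underscore
names(toy) <- sub(pattern, "\\1_", names(toy))
names(toy)
@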
<<>>=
# Order columns in the same order in both tables
dginnbats<-dginnbats[,names(dginnbatsnew)]
names(dginnbatsnew) %in% names(dginnbats)
names(dginnbats)==names(dginnbatsnew)
# Put RIPK1 aside
ripk1<-dginnbatsnew[dginnbatsnew$Gene=="RIPK1",1:27]
# Add it to primate table
names(ripk1)<-names(dginnT)
ripk1$`dginn-primate_omegaM0Bpp`<-as.factor(
ripk1$`dginn-primate_omegaM0Bpp`)
ripk1$`dginn-primate_BUSTED.p.value`<-as.factor(
ripk1$`dginn-primate_BUSTED.p.value`)
ripk1$`dginn-primate_BppM1M2.p.value`<-as.factor(
ripk1$`dginn-primate_BppM1M2.p.value`)
ripk1$`dginn-primate_BppM7M8.p.value`<-as.factor(
ripk1$`dginn-primate_BppM7M8.p.value`)
dginnT<-rbind(dginnT, ripk1)
## Remove RIPK1 from bats
dginnbatsnew<-dginnbatsnew[dginnbatsnew$Gene!="RIPK1",]
## remove redundant rows (genes that were rerun)
dginnbats<-dginnbats[(dginnbats$Gene %in% dginnbatsnew$Gene)==FALSE,]
names(dginnbatsnew)<-names(dginnbats)
## replace by new data
dginnbatsnew$omegaM0Bpp<-as.factor(dginnbatsnew$omegaM0Bpp)
dginnbatsnew$BppM1M2_p.value<-as.factor(dginnbatsnew$BppM1M2_p.value)
dginnbatsnew$BppM7M8_p.value<-as.factor(dginnbatsnew$BppM7M8_p.value)
dginnbats<-rbind(dginnbats, dginnbatsnew)
names(dginnbats)<-c("bats_File", "bats_Name", "Gene.name", paste0("bats_",
names(dginnbats)[-(1:3)]))
names(dginnbats)
@
\subsection{Merged table}
<<setup, include=FALSE, cache=FALSE, tidy=TRUE>>=
options(tidy=TRUE, width=70)
@
<<>>=
#tidy.opts = list(width.cutoff = 60)
dim(dginnT)
#dginnT$Gene.name
dim(dginnbats)
#dginnbats$Gene.name
@
Manual corrections:
TMPRSS2 in bats
<<>>=
dginnbats[dginnbats$Gene.name=="TMPRSS2",]
# keeping the uncut one
# renaming the other one TMPRSS2_cut
dginnbats$Gene.name<-as.character(dginnbats$Gene.name)
dginnbats[dginnbats$bats_File==
"TMPRSS2_bat_select_cut_mafft_prank","Gene.name"]<-
"TMPRSS2_cut"
@
RIPK1: the older version is still in the table; remove it:
"RIPK1\_sequences\_filtered\_longestORFs\_mafft\_mincov\_prank"
<<>>=
dginnT<-dginnT[dginnT$File!=
"RIPK1_sequences_filtered_longestORFs_mafft_mincov_prank",]
@
REEP6 A and B
<<>>=
dginnbats$Gene.name<-as.character(dginnbats$Gene.name)
dginnbats[dginnbats$bats_File==
"REEP6_sequences_filtered_longestORFs_D210gp1_prank", "Gene.name"]<-
"REEP6_old"
dginnbats[dginnbats$bats_File==
"REEP6_LA_bat_select_mafft_prank", "Gene.name"]<-"REEP6"
dginnbats[dginnbats$bats_File==
"REEP6_LB_bat_select_mafft_prank", "Gene.name"]<-"REEP6_like"
@
GNG5
<<>>=
dginnT$Gene.name<-as.character(dginnT$Gene.name)
dginnT[dginnT$File==
"GNG5_sequences_filtered_longestORFs_D189gp2_prank", "Gene.name"]<-
"GNG5_like"
@
<<>>=
dim(dginnbats)
dim(dginnT)
# genes in common
common<-dginnT$Gene.name[dginnT$Gene.name %in% dginnbats$Gene.name]
common
length(dginnT$Gene.name[dginnT$Gene.name %in% dginnbats$Gene.name])
# genes only in primates
onlyprimates<-
dginnT$Gene.name[(dginnT$Gene.name %in% dginnbats$Gene.name)==FALSE]
onlyprimates
length(dginnT$Gene.name[(dginnT$Gene.name %in% dginnbats$Gene.name)==FALSE])
# genes only in bats
onlybats<-
dginnbats$Gene.name[(dginnbats$Gene.name %in% dginnT$Gene.name)==FALSE]
onlybats
length(dginnbats$Gene.name[(dginnbats$Gene.name %in% dginnT$Gene.name)==FALSE])
@
<<>>=
tab<-merge(dginnT, dginnbats, by="Gene.name", all.x=T, all.y=T)
dim(tab)
# add column "shared"/"only bats"/"only primates"
tab$status<-""
tab$status[tab$Gene.name %in% common]<-"shared"
tab$status[tab$Gene.name %in% onlyprimates]<-"onlyprimates"
tab$status[tab$Gene.name %in% onlybats]<-"onlybats"
table(tab$status)
write.table(tab, paste0(
workdir, "out_tab/covid_comp_alldginn.txt"), sep="\t")
@
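As an illustration of the step above, the following sketch reproduces the full outer merge and the \texttt{status} column on two toy tables. The data frames \texttt{prim} and \texttt{bat}, their columns and their contents are hypothetical.
<<sketch-status>>=
# Toy primate and bat tables (hypothetical values)
prim <- data.frame(Gene.name = c("ACE2", "GNG5", "RDX"),
                   primBUSTED = c("Y", "N", "N"),
                   stringsAsFactors = FALSE)
bat <- data.frame(Gene.name = c("ACE2", "RDX", "REEP6"),
                  batBUSTED = c("Y", "Y", "N"),
                  stringsAsFactors = FALSE)
# all = TRUE keeps genes found in only one of the two tables
toytab <- merge(prim, bat, by = "Gene.name", all = TRUE)
# Classify each gene according to the table(s) it comes from
inprim <- toytab$Gene.name %in% prim$Gene.name
inbat <- toytab$Gene.name %in% bat$Gene.name
toytab$status <- ifelse(inprim & inbat, "shared",
                        ifelse(inprim, "onlyprimates", "onlybats"))
table(toytab$status)
@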
\section{Complete data}
Merge the previous table with J. Young's original table.
\subsection{Read the original Young table}
<<>>=
young<-read.delim(paste0(workdir,
"data/COVID_PAMLresults_332hits_plusBatScreens_2020_Apr14.csv"),
fill=T, h=T, dec=",")
dim(young)
young$PreyGene<-as.character(young$PreyGene)
young$PreyGene[young$PreyGene=="MTARC1"]<-"MARC1"
@
\subsection{Read the gene names conversion table}
<<>>=
usthem<-read.delim(paste0(workdir,
"data/table_gene_name_correspondence.csv"),
h=T, sep=";")
young[young$PreyGene %in% usthem$Us, c("PreyGene", "Gene.name")]
usthem[order(usthem$Us),]
@
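The correspondence table can be used to harmonize gene names before merging. The following sketch assumes that a name found in the \texttt{Else} column should be replaced by the matching name in the \texttt{Us} column; this direction is an assumption, and the two-row table \texttt{conv} and the vector \texttt{genes} are toy data.
<<sketch-convert>>=
# Toy conversion table (two rows taken from the correspondence table)
conv <- data.frame(Us = c("COQ8B", "ELOC"),
                   Else = c("ADCK4", "TCEB1"),
                   stringsAsFactors = FALSE)
# Toy vector of gene names to harmonize
genes <- c("ADCK4", "ACE2", "TCEB1")
# Position of each name in the Else column (NA when absent)
idx <- match(genes, conv$Else)
# Keep the original name when there is no correspondence
converted <- ifelse(is.na(idx), genes, conv$Us[idx])
converted
@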
\subsection{Merge Young and DGINN table}
\textbf{Based on which column?}
How many genes in the Young table are missing from the DGINN table, and which ones are they?
<<>>=
table(young$PreyGene %in% tab$Gene.name)
young[(young$PreyGene %in% tab$Gene.name)==FALSE, "PreyGene"]
tab[(tab$Gene.name %in% young$PreyGene)==FALSE, "Gene.name"]
@
Merge them and keep only the Krogan genes.
<<>>=
# creation of a dedicated column
young$merge.Gene<-young$PreyGene
tab$merge.Gene<-tab$Gene.name
tablo<-merge(young, tab, by="merge.Gene", all.x=TRUE)
write.table(tablo, paste0(
workdir, "out_tab/covid_comp_complete.txt"), row.names=FALSE, quote=TRUE, sep="\t")
@
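The effect of \texttt{all.x=TRUE} in the merge above can be checked on a small example: every row of the first table is kept, and rows present only in the second table are dropped. The data frames \texttt{youngtoy} and \texttt{dginntoy} and their values are hypothetical.
<<sketch-leftmerge>>=
# Toy versions of the two tables (hypothetical values)
youngtoy <- data.frame(merge.Gene = c("ACE2", "RDX", "NSD2"),
                       pval = c(0.01, 0.5, 0.2),
                       stringsAsFactors = FALSE)
dginntoy <- data.frame(merge.Gene = c("ACE2", "RDX", "TMPRSS2"),
                       status = c("shared", "shared", "onlybats"),
                       stringsAsFactors = FALSE)
# Left join: NSD2 is kept with NA in status, TMPRSS2 is dropped
left <- merge(youngtoy, dginntoy, by = "merge.Gene", all.x = TRUE)
left
@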
\end{document}