生物研究時常需要基因資訊,但分析工具與資料庫之間的註解未必能直接相通,如某些網站僅提供RefSeq ID, 有些只提供Ensembl ID等,雖然都是描述同一基因、但需要做進一步轉換才能串連資訊;另一常見狀況則是在GO enrichment analysis時,需要利用entrezid(gene id)來進行KEGG, GO enrichment分析。今天要介紹三種覺得好用的註解轉換套件:biomaRt, bitr, mygene,文章在R語言環境進行操作,若想使用python可以查詢是否支援並用 conda 安裝。
biomaRT
biomaRT
套件主要協助取得 Ensembl 網站的資訊。
Installation
bioconductor website
1
2
3
4
|
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomaRt")
|
Select desired BioMart database and dataset
以 useEnsembl()
來指定連線到哪個 biomaRT 資料庫與資料集
1
2
|
library(biomaRt)
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl", version = 113)
|
函式有兩個必要參數 biomart
和 dataset
(層級 biomart > dataset),首先用 listEnsembl()
或 listMarts()
查詢 biomart
選項,雖名稱不同但內容相等(如: genes
等價於 ENSEMBL_MART_ENSEMBL
)
1
2
3
4
5
6
7
8
9
10
11
12
13
|
> listEnsembl()
biomart version
1 genes Ensembl Genes 113
2 mouse_strains Mouse strains 113
3 snps Ensembl Variation 113
4 regulation Ensembl Regulation 113
> listMarts()
biomart version
1 ENSEMBL_MART_ENSEMBL Ensembl Genes 113
2 ENSEMBL_MART_MOUSE Mouse strains 113
3 ENSEMBL_MART_SNP Ensembl Variation 113
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 113
|
以 useMart()
指定biomart資料庫 + listDatasets()
查詢可用的資料集,可搭配字串篩選:
1
2
|
mart <- useMart("ENSEMBL_MART_ENSEMBL")
ds <- as.data.frame(listDatasets(mart=mart))
|
1
2
3
4
5
6
|
## find rows in dataset which contains 'human' string
ds %>% filter(grepl("human",tolower(description)))
## output:
dataset description version
1 hsapiens_gene_ensembl Human genes (GRCh38.p14) GRCh38.p14
|
最後如果有指定ensembl版本需求,可用 listEnsemblArchives()
函數,並用 useEnsembl(version = xxx)
指定版本 (default: latest release)
1
2
3
4
5
|
name date url version current_release
1 Ensembl GRCh37 Feb 2014 https://grch37.ensembl.org GRCh37
2 Ensembl 113 Oct 2024 https://oct2024.archive.ensembl.org 113 *
3 Ensembl 112 May 2024 https://may2024.archive.ensembl.org 112
4 Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org 111
|
Run biomaRT query
指定ensembl參考資料後,接著以 getBM()
連線至 ensembl server 並針對query返回dataframe:
1
2
3
|
getBM(attributes, filters = "", values = "", mart, curl = NULL,
checkFilters = TRUE, verbose = FALSE, uniqueRows = TRUE, bmHeader = FALSE,
quote = "\"", useCache = TRUE)
|
參數包含:
atribute
輸出ensembl的欄位,可用 listAttributes(${useEnsembl()_varname})
查看選項
filters
輸入內容的資訊類型,如輸入 ENSGxxx
則 filters = "ensembl_gene_id"
可用 listFilters(${useEnsembl()_varname})
查看選項
values
輸入內容 input query list
mart
指定 useEnsembl()
物件
Check caches
若先前有跑過相同資料庫,透過 biomartCacheInfo()
查詢是否有 cache file
1
2
3
4
|
biomaRt cache
- Location: C:\xxxx/biomaRt/Cache
- No. of files: 12
- Total size: 983.1 Kb
|
Example R code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
|
library(biomaRt)
## check mart dbs
listMarts()
mart <- useMart("ENSEMBL_MART_ENSEMBL")
## check mart datasets and find rows where column 'description' contains human
ds <- as.data.frame(listDatasets(mart=mart))
ds %>% filter(grepl("human",tolower(description)))
# build this object which specify older version 103
ensembl <- useEnsembl(biomart = 'ensembl', dataset = 'ggallus_gene_ensembl', version = 103)
## run biomart convert
data <- read.table('query_gene.txt', header = T)
inputs <- as.vector(data$ensembl_gene_id)
# send it to the Ensembl BioMart server
genes <- getBM(attributes=c("external_gene_name","ensembl_gene_id","entrezgene_id"),
filters = "ensembl_gene_id",
values = inputs,
mart = ensembl)
|
bitr
bitr()
為 clusterprofiler
套件內的函式,利用 AnnotationDbi::select
來篩選特定物種的 OrgDB
內的資訊,最後輸出dataframe
**所有的.db
結尾的library其實就是使用AnnotationDbi來的物件
Installation
bioconductor website
1
2
3
4
|
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("clusterProfiler")
|
Run bitr() function
1
|
bitr(geneID, fromType, toType, OrgDb, drop = TRUE)
|
geneID
為 input query list
fromType
和 toType
須為 idType(orgDB)
返回的類別(key):
1
|
fromType' should be one of ACCNUM, ALIAS, ENSEMBL, ENSEMBLPROT, ENSEMBLTRANS, ENTREZID, ENZYME, EVIDENCE, EVIDENCEALL, GENENAME, GENETYPE, GO, GOALL, IPI, MAP, OMIM, ONTOLOGY, ONTOLOGYALL, PATH, PFAM, PMID, PROSITE, REFSEQ, SYMBOL, UCSCKG, UNIPROT.
|
OrgDb
為指定物種的註解包,需要先安裝才能使用,物種匹配的orgDB 可以參考這個網站
Example R code
1
2
3
4
5
6
7
8
9
10
11
12
13
14
|
library(clusterProfiler)
library(org.Hs.eg.db)
query_gene <- c('TP53','BRAF','PIK3CA','KRAS')
gene_out <- bitr(query_gene, fromType="SYMBOL", toType=c("GENENAME", "ENTREZID","ENSEMBL"), OrgDb=org.Hs.eg.db)
gene_out <- as.data.frame(gene_out)
> print(gene_out)
SYMBOL GENENAME ENTREZID ENSEMBL
1 TP53 tumor protein p53 7157 ENSG00000141510
2 BRAF B-Raf proto-oncogene, serine/threonine kinase 673 ENSG00000157764
3 PIK3CA phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha 5290 ENSG00000121879
4 KRAS KRAS proto-oncogene, GTPase 3845 ENSG00000133703
|
轉換完成會顯示有多少比例轉換失敗
1
2
3
4
5
6
|
50% of input gene IDs are fail to map...
> print(gene_out)
SYMBOL GENENAME ENTREZID
1 TP53 tumor protein p53 7157
4 KRAS KRAS proto-oncogene, GTPase 3845
|
mygene
mygene.info
以API形式獲取公開資料庫的基因資訊,官方文件多以 python code 做示範使用,R 套件函式屬於 API qurey 的 wrapper
queryMany
is a wrapper for POST query of “/query” service
installation
bioconductor website
1
2
3
4
|
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mygene")
|
Run mygene query
主要使用 queryMany
函式
1
|
queryMany(qterms, scopes=NULL, ..., return.as=c("DataFrame", "records", "text"), mygene)
|
qterm
為 input query list
scopes
提供 query list 類型,可用 field請參考官方文件,如: query 給 ENSGxx
則 scopes = ensembl.gene
- 其他常用參數
species
給予物種資訊如 taxid;
fields
指定輸出內容,可填類型跟 scopes
field 相同
return.as
指定輸出類型: 目前支援"DataFrame" (default), “records” (list), “text” (JSON) 三種
Example R code
1
2
3
4
5
6
7
8
9
10
11
12
13
|
library(mygene)
geneList <- c('TP53','BRAF','PIK3CA','KRAS')
query_output <- queryMany(geneList, scopes = 'symbol', species = 9606, fields = 'ensembl.gene,name,entrezgene')
> print(query_output)
DataFrame with 4 rows and 6 columns
query _id X_score entrezgene name ensembl.gene
<character> <character> <numeric> <character> <character> <character>
1 TP53 7157 18.0079 7157 tumor protein p53 ENSG00000141510
2 BRAF 673 17.6012 673 B-Raf proto-oncogene.. ENSG00000157764
3 PIK3CA 5290 17.8778 5290 phosphatidylinositol.. ENSG00000121879
4 KRAS 3845 17.6623 3845 KRAS proto-oncogene,.. ENSG00000133703
|
bitr()
和 mygene queryMany()
相當方便,常見或具有orgDB的物種基本上沒有太大問題,但有時會遇到以下情況導致轉換失敗:
- gene name alias 或 synonyms 無法直接聯繫到正確的HGNC資訊,這時可先用
AnnotationDbi::select(org.${spec}.eg.db, keys = alias_list, columns = c('SYMBOL'), keytype = "ALIAS")
轉換基因名稱,或是找尋好用套件來轉換…
1
2
3
4
5
6
7
8
9
10
|
alias_list <- c('MEK','PI3K','TP53')
> AnnotationDbi::select(org.Hs.eg.db, keys = alias_list, columns = c('SYMBOL'), keytype = "ALIAS")
> 'select()' returned 1:many mapping between keys and columns
ALIAS SYMBOL
1 MEK MAP2K7
2 PI3K PIK3CA
3 PI3K PIK3CB
4 PI3K PIK3CD
5 PI3K PIK3CG
6 TP53 TP53
|
- 若資料庫以舊版 ensembl 註解,有些transcript ID 可能已經被移除,導致HGNC轉換失敗,這時
biomaRT useEnsembl()
指定資料集版本的功能相當方便,能多少救一些轉換,但也要注意這樣的轉換是否有意義,畢竟最新版沒有收錄……
References
https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html
https://mygene.info/
https://github.com/gege-circle/home/issues/702