生資系列:基因註解轉換套件 biomaRt, bitr, mygene

生物研究時常需要基因資訊,但分析工具與資料庫之間的註解未必能直接相通,如某些網站僅提供RefSeq ID, 有些只提供Ensembl ID等,雖然都是描述同一基因、但需要做進一步轉換才能串連資訊;另一常見狀況則是在GO enrichment analysis時,需要利用entrezid(gene id)來進行KEGG, GO enrichment分析。今天要介紹三種覺得好用的註解轉換套件:biomaRt, bitr, mygene,文章在R語言環境進行操作,若想使用python可以查詢是否支援並用 conda 安裝。

biomaRT

biomaRT 套件主要協助取得 Ensembl 網站的資訊。

Installation

bioconductor website

1
2
3
4
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("biomaRt")

Select desired BioMart database and dataset

useEnsembl() 來指定連線到哪個 biomaRT 資料庫與資料集

1
2
library(biomaRt)
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl", version = 113)

函式有兩個必要參數 biomartdataset(層級 biomart > dataset),首先用 listEnsembl()listMarts() 查詢 biomart選項,雖名稱不同但內容相等(如: genes 等價於 ENSEMBL_MART_ENSEMBL)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
> listEnsembl()
        biomart                version
1         genes      Ensembl Genes 113
2 mouse_strains      Mouse strains 113
3          snps  Ensembl Variation 113
4    regulation Ensembl Regulation 113

> listMarts() 
               biomart                version
1 ENSEMBL_MART_ENSEMBL      Ensembl Genes 113
2   ENSEMBL_MART_MOUSE      Mouse strains 113
3     ENSEMBL_MART_SNP  Ensembl Variation 113
4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 113

useMart() 指定biomart資料庫 + listDatasets() 查詢可用的資料集,可搭配字串篩選:

1
2
mart <- useMart("ENSEMBL_MART_ENSEMBL")
ds <- as.data.frame(listDatasets(mart=mart))
1
2
3
4
5
6
## find rows in dataset which contains 'human' string
ds %>% filter(grepl("human",tolower(description))) 

## output:
                dataset              description    version
1 hsapiens_gene_ensembl Human genes (GRCh38.p14) GRCh38.p14

最後如果有指定ensembl版本需求,可用 listEnsemblArchives() 函數,並用 useEnsembl(version = xxx) 指定版本 (default: latest release)

1
2
3
4
5
             name     date                                 url version current_release
1  Ensembl GRCh37 Feb 2014          https://grch37.ensembl.org  GRCh37                
2     Ensembl 113 Oct 2024 https://oct2024.archive.ensembl.org     113               *
3     Ensembl 112 May 2024 https://may2024.archive.ensembl.org     112                
4     Ensembl 111 Jan 2024 https://jan2024.archive.ensembl.org     111                

Run biomaRT query

指定ensembl參考資料後,接著以 getBM()連線至 ensembl server 並針對query返回dataframe:

1
2
3
getBM(attributes, filters = "", values = "", mart, curl = NULL, 
checkFilters = TRUE, verbose = FALSE, uniqueRows = TRUE, bmHeader = FALSE,
quote = "\"", useCache = TRUE)

參數包含:

  • atribute 輸出ensembl的欄位,可用 listAttributes(${useEnsembl()_varname}) 查看選項
  • filters 輸入內容的資訊類型,如輸入 ENSGxxxfilters = "ensembl_gene_id" 可用 listFilters(${useEnsembl()_varname}) 查看選項
  • values 輸入內容 input query list
  • mart 指定 useEnsembl() 物件

Check caches

若先前有跑過相同資料庫,透過 biomartCacheInfo() 查詢是否有 cache file

1
2
3
4
biomaRt cache
- Location: C:\xxxx/biomaRt/Cache
- No. of files: 12
- Total size: 983.1 Kb

Example R code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
library(biomaRt)

## check mart dbs
listMarts() 
mart <- useMart("ENSEMBL_MART_ENSEMBL") 


## check mart datasets and find rows where column 'description' contains human
ds <- as.data.frame(listDatasets(mart=mart))
ds %>% filter(grepl("human",tolower(description))) 

# build this object which specify older version 103
ensembl <- useEnsembl(biomart = 'ensembl', dataset = 'ggallus_gene_ensembl', version = 103)


## run biomart convert
data <- read.table('query_gene.txt', header = T)
inputs <- as.vector(data$ensembl_gene_id)

# send it to the Ensembl BioMart server
genes <- getBM(attributes=c("external_gene_name","ensembl_gene_id","entrezgene_id"),
               filters = "ensembl_gene_id",
               values = inputs, 
               mart = ensembl)

bitr

bitr()clusterprofiler套件內的函式,利用 AnnotationDbi::select 來篩選特定物種的 OrgDB 內的資訊,最後輸出dataframe **所有的.db結尾的library其實就是使用AnnotationDbi來的物件

Installation

bioconductor website

1
2
3
4
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("clusterProfiler")

Run bitr() function

1
bitr(geneID, fromType, toType, OrgDb, drop = TRUE)
  • geneID 為 input query list
  • fromTypetoType 須為 idType(orgDB) 返回的類別(key):
1
fromType' should be one of ACCNUM, ALIAS, ENSEMBL, ENSEMBLPROT, ENSEMBLTRANS, ENTREZID, ENZYME, EVIDENCE, EVIDENCEALL, GENENAME, GENETYPE, GO, GOALL, IPI, MAP, OMIM, ONTOLOGY, ONTOLOGYALL, PATH, PFAM, PMID, PROSITE, REFSEQ, SYMBOL, UCSCKG, UNIPROT.
  • OrgDb 為指定物種的註解包,需要先安裝才能使用,物種匹配的orgDB 可以參考這個網站

Example R code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
library(clusterProfiler)
library(org.Hs.eg.db)

query_gene <- c('TP53','BRAF','PIK3CA','KRAS')
gene_out <- bitr(query_gene, fromType="SYMBOL", toType=c("GENENAME", "ENTREZID","ENSEMBL"), OrgDb=org.Hs.eg.db)
gene_out <- as.data.frame(gene_out)


> print(gene_out)
  SYMBOL                                                               GENENAME ENTREZID    ENSEMBL
1   TP53                                                      tumor protein p53     7157   ENSG00000141510
2   BRAF                          B-Raf proto-oncogene, serine/threonine kinase      673   ENSG00000157764
3 PIK3CA phosphatidylinositol-4,5-bisphosphate 3-kinase catalytic subunit alpha     5290   ENSG00000121879
4   KRAS                                            KRAS proto-oncogene, GTPase     3845   ENSG00000133703

轉換完成會顯示有多少比例轉換失敗

1
2
3
4
5
6
50% of input gene IDs are fail to map...
> print(gene_out)

    SYMBOL                    GENENAME ENTREZID
1   TP53           tumor protein p53     7157
4   KRAS KRAS proto-oncogene, GTPase     3845

mygene

mygene.info 以API形式獲取公開資料庫的基因資訊,官方文件多以 python code 做示範使用,R 套件函式屬於 API qurey 的 wrapper

queryMany is a wrapper for POST query of “/query” service

installation

bioconductor website

1
2
3
4
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("mygene")

Run mygene query

主要使用 queryMany 函式

1
queryMany(qterms, scopes=NULL, ..., return.as=c("DataFrame", "records", "text"), mygene)
  • qterm 為 input query list
  • scopes 提供 query list 類型,可用 field請參考官方文件,如: query 給 ENSGxxscopes = ensembl.gene
  • 其他常用參數
    • species 給予物種資訊如 taxid;
    • fields 指定輸出內容,可填類型跟 scopes field 相同
    • return.as 指定輸出類型: 目前支援"DataFrame" (default), “records” (list), “text” (JSON) 三種

Example R code

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
library(mygene)
geneList <- c('TP53','BRAF','PIK3CA','KRAS')
query_output <- queryMany(geneList, scopes = 'symbol', species = 9606, fields = 'ensembl.gene,name,entrezgene')


> print(query_output)
DataFrame with 4 rows and 6 columns
        query         _id   X_score  entrezgene                   name    ensembl.gene
  <character> <character> <numeric> <character>            <character>     <character>
1        TP53        7157   18.0079        7157      tumor protein p53 ENSG00000141510
2        BRAF         673   17.6012         673 B-Raf proto-oncogene.. ENSG00000157764
3      PIK3CA        5290   17.8778        5290 phosphatidylinositol.. ENSG00000121879
4        KRAS        3845   17.6623        3845 KRAS proto-oncogene,.. ENSG00000133703

Comments

bitr()mygene queryMany() 相當方便,常見或具有orgDB的物種基本上沒有太大問題,但有時會遇到以下情況導致轉換失敗:

  • gene name alias 或 synonyms 無法直接聯繫到正確的HGNC資訊,這時可先用 AnnotationDbi::select(org.${spec}.eg.db, keys = alias_list, columns = c('SYMBOL'), keytype = "ALIAS") 轉換基因名稱,或是找尋好用套件來轉換…
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    
    alias_list <- c('MEK','PI3K','TP53')
    > AnnotationDbi::select(org.Hs.eg.db, keys = alias_list, columns = c('SYMBOL'), keytype = "ALIAS")
    > 'select()' returned 1:many mapping between keys and columns
    ALIAS SYMBOL
    1   MEK MAP2K7
    2  PI3K PIK3CA
    3  PI3K PIK3CB
    4  PI3K PIK3CD
    5  PI3K PIK3CG
    6  TP53   TP53
    
  • 若資料庫以舊版 ensembl 註解,有些transcript ID 可能已經被移除,導致HGNC轉換失敗,這時 biomaRT useEnsembl() 指定資料集版本的功能相當方便,能多少救一些轉換,但也要注意這樣的轉換是否有意義,畢竟最新版沒有收錄……

References

https://bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/accessing_ensembl.html https://mygene.info/ https://github.com/gege-circle/home/issues/702

comments powered by Disqus