[Original] Gene annotation of microarray probe IDs used to be a hassle

健明 2021-07-14

A classic question recently came up in our Q&A group:
Dear teachers, the GPL570 array should include some genes that are lncRNAs. Can the meaningful lncRNAs be picked out by re-annotating the probes? Can this be done in R?

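The student was diligent: they had already searched our 13,000 existing tutorials and found the workflow for re-annotating probe sequences. But as I said only yesterday in the post "You no longer need to annotate microarray probe sequences yourself", that work is no longer necessary. The student evidently had not kept up with the latest tutorials on our official account, which is not their fault. It is a weakness of the official-account format: too much redundant information distracts readers, which runs against our original goal of sharing knowledge.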

So just use our package. Installation is covered in "You no longer need to annotate microarray probe sequences yourself", and usage is very simple:

library(AnnoProbe)
ids=idmap('GPL570',type = 'soft')
head(ids)

That single call is enough to retrieve the probe annotation for this platform.

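Still, it is worth exploring a little further. This annotation comes from the soft file downloaded for the GPL, so some probes turn out to map to several genes. That happens because the genomic coordinates of those genes overlap, and it makes the exploration code slightly more involved.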

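# Drop probes without a usable gene symbol
ids <- ids[nchar(ids[, 2]) > 1, ]
# Separate probes annotated with several genes (joined by ' /// ') from single-gene probes
ids1 <- ids[grepl('///', ids[, 2]), ]
ids2 <- ids[!grepl('///', ids[, 2]), ]
# I think the function below is poorly written; it runs too slowly
tmp <- do.call(rbind, apply(ids1, 1, function(x) {
  # One row per (probe, symbol) pair
  data.frame(ID = x[1], symbol = strsplit(x[2], ' /// ')[[1]])
}))
ids <- rbind(ids2, tmp)
# Look up the biotype of each symbol and attach it to the probe table
anno <- annoGene(ids$symbol, "SYMBOL")
ids <- merge(ids, anno, by.x = 'symbol', by.y = 'SYMBOL', all.x = TRUE)
sort(table(ids$biotypes))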
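
The comment in the code above flags the row-wise apply() call as slow. A possible vectorized alternative in base R is sketched below. It assumes the multi-gene table has the columns ID and symbol; the probe IDs and symbols are made-up toy values, not real GPL570 entries.

```r
# Toy stand-in for ids1 (hypothetical probe IDs and gene symbols)
ids1 <- data.frame(
  ID     = c("p1", "p2"),
  symbol = c("A /// B", "C /// D /// E"),
  stringsAsFactors = FALSE
)

# Split every symbol string once, instead of once per row inside apply()
parts <- strsplit(ids1$symbol, " /// ", fixed = TRUE)

# Repeat each probe ID as many times as it has symbols, then flatten the list
tmp <- data.frame(
  ID     = rep(ids1$ID, times = lengths(parts)),
  symbol = unlist(parts),
  stringsAsFactors = FALSE
)
```

This builds the long table in one step rather than creating and row-binding one small data frame per probe, which is usually where the time goes.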

As you can see, of the 50,000-plus probes, only about 40,000 target genuine protein-coding genes; the remaining 10,000 or so are all open to exploration.

This is not the best option, though: we did not align the platform's probe sequences to the reference genome and re-annotate them ourselves, so we are still relying on the information in the GPL soft file.

Let's look at how other microarray papers annotate GPL570 probe IDs

For example, this paper, published 12 March 2019: Identification of Key Long Non-Coding RNAs in the Pathology of Alzheimer’s Disease and their Functions Based on Genome-Wide Associations Study, Microarray, and RNA-seq Data

Briefly, we first downloaded the reference sequences of these potentially AD-related lncRNAs in FASTA format from NONCODE database.
Second, probe sets of the microarrays were aligned to the lncRNA sequences using SeqMap tool, and the lncRNA-specific probe sets were obtained which contain at least four probes uniquely mapped to the lncRNA sequences without mismatch.

Or:

Briefly, probe sets of HG-U133_Plus_2.0 array were aligned to the human genome (GRCh38) and lncRNA gene sequence from GENCODE (release 23) using SeqMap tool with no mismatch [49]. 
Then lncRNA-specific probes were obtained by mapping the genomic locations of probes to the genomic locations of lncRNAs. 
Finally, expression data of 2332 lncRNA were obtained for further analysis.
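
The second step of that excerpt, mapping probe locations onto lncRNA locations, can be illustrated with a small base R sketch. Everything here is an assumption for illustration: the coordinates, column names and gene names are invented, strand is ignored, and a probe counts as a hit only when it lies entirely inside a gene. For a full array, an interval-overlap tool such as findOverlaps() from the Bioconductor package GenomicRanges would be the more usual choice.

```r
# Hypothetical probe alignments (for example, parsed from an aligner's output)
probes <- data.frame(
  probe = c("p1", "p2", "p3"),
  chr   = c("chr1", "chr1", "chr2"),
  start = c(100, 500, 100),
  end   = c(124, 524, 124),
  stringsAsFactors = FALSE
)

# Hypothetical lncRNA gene coordinates (for example, from a GENCODE annotation)
lncrna <- data.frame(
  gene    = c("lncA", "lncB"),
  chr     = c("chr1", "chr2"),
  g_start = c(50, 300),
  g_end   = c(200, 400),
  stringsAsFactors = FALSE
)

# Pair every probe with every lncRNA on the same chromosome
pairs <- merge(probes, lncrna, by = "chr")

# Keep only the pairs where the probe lies entirely inside the gene
inside <- pairs$start >= pairs$g_start & pairs$end <= pairs$g_end
hits <- pairs[inside, c("probe", "gene")]
```

The merge creates every same-chromosome pair, so this only scales to small inputs; it is meant to show the logic, not to replace a proper interval join.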

Or again:

we obtained 3215 probes (probe sets) covering 2330 lncRNAs for Affymetrix HG-U133_Plus_2.0 array and 855 probes (probe sets) covering 663 lncRNAs for Affymetrix HG-U133A array, respectively. The expression data of multiple probes (probe sets) mapping to the same lncRNA were integrated by using the arithmetic mean to represent the expression level of single lncRNA.

total of 598 probes corresponding to 452 lncRNAs were obtained for the HG-U133A microarray, while 5,654 probes were matching with 3,793 lncRNAs in the HG-U133 Plus 2.0 microarray.

Or again:

Briefly, the probe sets of Affymetrix HG‐U133 Plus 2.0 were retrieved from the Affymetrix website (http://www.affymetrix.com). We then re‐mapped those probes to the chromosomal positions of the ncRNAs derived from GENCODE (release 24, GRCh38) with no mismatch 14. A total of 2380 probes and 2118 corresponding ncRNA genes were obtained. When multiple probes mapped to the same ncRNA, we used the arithmetic mean of the probe intensities.
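
Several of the excerpts above collapse multiple probes for the same lncRNA by taking the arithmetic mean. A minimal base R sketch of that step follows; the expression values, probe names and gene names are invented, and the probe-to-gene vector is assumed to be in the same order as the matrix rows.

```r
# Hypothetical expression matrix: rows are probes, columns are samples
expr <- matrix(
  c(2, 4,
    4, 8,
    1, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("probe1", "probe2", "probe3"), c("s1", "s2"))
)

# Hypothetical probe-to-lncRNA assignment, one entry per row of expr
gene <- c("lncA", "lncA", "lncB")

# rowsum() adds up the rows that share a gene label
sums <- rowsum(expr, group = gene)

# Number of probes per gene, in the same order as the rows of sums
counts <- as.vector(table(gene)[rownames(sums)])

# Dividing each row by its probe count gives the per-gene arithmetic mean
gene_expr <- sums / counts
```

Here lncA ends up as the mean of probe1 and probe2 in each sample, while lncB keeps the values of its single probe.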