GEO 数据挖掘-数据获得
1. 概述
NCBI Gene Expression Omnibus(GEO)是各种高通量实验数据的公共存储库,这些数据包括测量mRNA、基因组DNA和蛋白质丰度的单通道和双通道微阵列实验,以及非阵列技术,如基因表达序列分析(SAGE)、质谱蛋白质组数据和高通量测序数据。相比较TCGA数据库,因为数据是用户上传,所以更新较快
需要知道四个单词 1. GEO Platform (GPL) 芯片平台
- GEO Sample (GSM) 样本ID号
- GEO Series (GSE) study的ID号
- GEO Dataset (GDS) 数据集的ID号
2. 配置包
代码语言:javascript复制# 设置镜像,用于下载
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
# 安装
# install.packages("BiocManager")
# BiocManager::install("GEOquery")
3. 下载数据
代码语言:javascript复制# 加载
library(GEOquery)
#使用getGEO函数获得基因信息
gds <- getGEO("GDS507")# 下载
# 同时支持从本地获得
# gds <- getGEO(filename=system.file("路径",package="GEOquery"))
# 下载gsm数据
gsm <- getGEO("GSM11805")
4. GEOquery数据结构
GEOquery数据结构实际上有两种形式。第一种,包括GDS、GPL和GSM,第二种是GSE是,由GSM和GPL对象组合而成的复合数据类型。
4.1 GDS, GSM, 和 GPL
代码语言:javascript复制# 通过meta查看gsm文件,显示样本的信息
head(Meta(gsm))
代码语言:javascript复制## $channel_count
## [1] "1"
##
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "mlenburg@bu.edu"
代码语言:javascript复制# 查看前5行数值
Table(gsm)[1:5,]
代码语言:javascript复制## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
代码语言:javascript复制# 查看列名
Columns(gsm)
代码语言:javascript复制## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
对于gpl的格式和gsm是一样的,另外GDS的信息要多于上述两种
代码语言:javascript复制Columns(gds)[1:5,1:3]
代码语言:javascript复制## sample disease.state individual
## 1 GSM11815 RCC 035
## 2 GSM11832 RCC 023
## 3 GSM12069 RCC 001
## 4 GSM12083 RCC 005
## 5 GSM12101 RCC 011
4.2 GSE格式
GSE格式较为混乱,它可以表示任意平台的样本,gse有一个元数据部分,它没有DataTable。而是包含两个列表,可以使用GPLList和GSMList方法访问。
代码语言:javascript复制gse <- getGEO("GSE781",GSEMatrix=FALSE)
# 查看meta
head(Meta(gse))
代码语言:javascript复制## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "mlenburg@bu.edu"
##
## $contact_fax
## [1] "617-414-1646"
查看样本id
代码语言:javascript复制names(GSMList(gse))
代码语言:javascript复制## [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
## [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"
获取第一个gsm样本
代码语言:javascript复制GSMList(gse)[[1]]
代码语言:javascript复制## An object of class "GSM"
## channel_count
## [1] "1"
## contact_address
## [1] "715 Albany Street, E613B"
## contact_city
## [1] "Boston"
## contact_country
## [1] "USA"
## contact_department
## [1] "Genetics and Genomics"
## contact_email
## [1] "mlenburg@bu.edu"
## contact_fax
## [1] "617-414-1646"
## contact_institute
## [1] "Boston University School of Medicine"
## contact_name
## [1] "Marc,E.,Lenburg"
## contact_phone
## [1] "617-414-1375"
## contact_state
## [1] "MA"
## contact_web_link
## [1] "http://gg.bu.edu"
## contact_zip/postal_code
## [1] "02130"
## data_row_count
## [1] "22283"
## description
## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrman Grade = 3; Adjacent Tumor Capsule Penetration = true; Adjacent Tumor Perinephric Fat Invasion = true; Adjacent Tumor Renal Sinus Invasion = false; Adjacent Tumor Renal Vein Invasion = true; Scaling Target = 500; Scaling Factor = 7.09; Raw Q = 2.39; Noise = 2.60; Background = 55.24."
## [2] "Keywords = kidney"
## [3] "Keywords = renal"
## [4] "Keywords = RCC"
## [5] "Keywords = carcinoma"
## [6] "Keywords = cancer"
## [7] "Lot batch = 2004638"
## geo_accession
## [1] "GSM11805"
## last_update_date
## [1] "May 28 2005"
## molecule_ch1
## [1] "total RNA"
## organism_ch1
## [1] "Homo sapiens"
## platform_id
## [1] "GPL96"
## series_id
## [1] "GSE781"
## source_name_ch1
## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
## status
## [1] "Public on Nov 25 2003"
## submission_date
## [1] "Oct 20 2003"
## supplementary_file
## [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM11nnn/GSM11805/suppl/GSM11805.CEL.gz"
## taxid_ch1
## [1] "9606"
## title
## [1] "N035 Normal Human Kidney U133A"
## type
## [1] "RNA"
## An object of class "GEODataTable"
## ****** Column Descriptions ******
## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
## ****** Data Table ******
## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
## 22278 more rows ...
查看平台信息
代码语言:javascript复制names(GPLList(gse))
代码语言:javascript复制## [1] "GPL96" "GPL97"
5. 转换为bioconductor ExpressionSets 和 limma MALists
5.1 获得GSE Series Matrix数据
代码语言:javascript复制# 获得数据集,格式为gse
gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
# 显示具体信息
show(gse2553)
代码语言:javascript复制## $GSE2553_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12600 features, 181 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
## varLabels: title geo_accession ... data_row_count (30 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1 2 ... 12600 (12600 total)
## fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 16230383
## Annotation: GPL1977
5.2 将GDS转换为ExpressionSet
代码语言:javascript复制eset <- GDS2eSet(gds,do.log2=TRUE)
eset
代码语言:javascript复制## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22645 features, 17 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
## varLabels: sample disease.state individual description
## varMetadata: labelDescription
## featureData
## featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645 total)
## fvarLabels: ID Gene title ... GO:Component ID (21 total)
## fvarMetadata: Column labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 14641932
## Annotation:
此时eset就是一个表达矩阵
5.3 转换 GDS 为 MAList
代码语言:javascript复制Meta(gds)$platform
代码语言:javascript复制## [1] "GPL97"
代码语言:javascript复制# 选择
gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))
MA <- GDS2MA(gds,GPL=gpl)
class(MA)
代码语言:javascript复制## [1] "MAList"
## attr(,"package")
## [1] "limma"
5.4 GSE转换为 ExpressionSet
查看平台
代码语言:javascript复制gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform_id})
head(gsmplatforms)
代码语言:javascript复制## $GSM11805
## [1] "GPL96"
##
## $GSM11810
## [1] "GPL97"
##
## $GSM11814
## [1] "GPL96"
##
## $GSM11815
## [1] "GPL97"
##
## $GSM11823
## [1] "GPL96"
##
## $GSM11827
## [1] "GPL97"
通过fileter,筛选GPL96平台的数据
代码语言:javascript复制gsmlist = Filter(function(gsm) {Meta(gsm)$platform_id=='GPL96'},GSMList(gse))
length(gsmlist)
代码语言:javascript复制## [1] 17
查看测序数据
代码语言:javascript复制Table(gsmlist[[1]])[1:5,]
代码语言:javascript复制## ID_REF VALUE ABS_CALL
## 1 AFFX-BioB-5_at 953.9 P
## 2 AFFX-BioB-M_at 2982.8 P
## 3 AFFX-BioB-3_at 1657.9 P
## 4 AFFX-BioC-5_at 2652.7 P
## 5 AFFX-BioC-3_at 2019.5 P
查看描述
代码语言:javascript复制Columns(gsmlist[[1]])[1:5,]
代码语言:javascript复制## Column
## 1 ID_REF
## 2 VALUE
## 3 ABS_CALL
## NA <NA>
## NA.1 <NA>
## Description
## 1
## 2 MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call with Alpha1 = 0.05, Alpha2 = 0.065
## NA <NA>
## NA.1 <NA>
代码语言:javascript复制# 探针
probesets <- Table(GPLList(gse)[[1]])$ID
# 提取每个样本的value
# wo ye kan bu dong 大致是通过lapply函数对每一个
# gsm进行匹配,然后返回value,最后cbind
data.matrix <- do.call('cbind',lapply(gsmlist,function(x)
{tab <- Table(x)
mymatch <- match(probesets,tab$ID_REF)
return(tab$VALUE[mymatch])
}))
# 转换为数字
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,1:5]
代码语言:javascript复制## GSM11805 GSM11814 GSM11823 GSM11830 GSM12067
## [1,] 10.926963 11.105254 11.275019 11.438636 11.424376
## [2,] 5.749534 7.908092 7.093814 7.514122 7.901470
## [3,] 7.066089 7.750205 7.244126 7.962896 7.337176
## [4,] 12.660353 12.479755 12.215897 11.458355 11.397568
## [5,] 6.195741 6.061776 6.565293 6.583459 6.877744
对data进一步处理
代码语言:javascript复制require(Biobase)
# 行列命名
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(gsmlist)
# 转换为data.frame
pdata <- data.frame(samples=names(gsmlist))
rownames(pdata) <- names(gsmlist)
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2
代码语言:javascript复制## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 17 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM11805 GSM11814 ... GSM12444 (17 total)
## varLabels: samples
## varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:
获取平台的信息
代码语言:javascript复制# gpl97 <- getGEO('GPL97')
# Meta(gpl97)$title
通过上述代码可以获得这个平台的所有信息,并通过id等查看,下载比较费时,不做展示
结束语
关于GEO数据挖掘的第一步,获得数据GEOquery的学习已经完成,没有学过关于测序的知识,这些信息获得之后还是懵逼的,2020-7-10更新 love&peace