GEO 数据挖掘-数据获得

1. 概述

NCBI Gene Expression Omnibus（GEO）是各种高通量实验数据的公共存储库，这些数据包括测量mRNA、基因组DNA和蛋白质丰度的单通道和双通道微阵列实验，以及非阵列技术，如基因表达序列分析(SAGE)、质谱蛋白质组数据和高通量测序数据。相比较TCGA数据库，因为数据是用户上传，所以更新较快

需要知道四个单词 1. GEO Platform (GPL) 芯片平台

GEO Sample (GSM) 样本ID号
GEO Series (GSE) study的ID号
GEO Dataset (GDS) 数据集的ID号

2. 配置包

代码语言：javascript复制

# 设置镜像，用于下载
options(BioC_mirror="https://mirrors.tuna.tsinghua.edu.cn/bioconductor")
options("repos" = c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))

# 安装
# install.packages("BiocManager")
# BiocManager::install("GEOquery")

3. 下载数据

代码语言：javascript复制

# 加载
library(GEOquery)
#使用getGEO函数获得基因信息
gds <- getGEO("GDS507")# 下载
# 同时支持从本地获得
# gds <- getGEO(filename=system.file("路径",package="GEOquery"))
# 下载gsm数据
gsm <- getGEO("GSM11805")

4. GEOquery数据结构

GEOquery数据结构实际上有两种形式。第一种，包括GDS、GPL和GSM，第二种是GSE是，由GSM和GPL对象组合而成的复合数据类型。

4.1 GDS, GSM, 和 GPL

代码语言：javascript复制

# 通过meta查看gsm文件，显示样本的信息
head(Meta(gsm))

代码语言：javascript复制

## $channel_count
## [1] "1"
##
## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "mlenburg@bu.edu"

代码语言：javascript复制

# 查看前5行数值
Table(gsm)[1:5,]

代码语言：javascript复制

##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P

代码语言：javascript复制

# 查看列名
Columns(gsm)

代码语言：javascript复制

##     Column
## 1   ID_REF
## 2    VALUE
## 3 ABS_CALL
##                                                                  Description
## 1
## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065

对于gpl的格式和gsm是一样的，另外GDS的信息要多于上述两种

代码语言：javascript复制

Columns(gds)[1:5,1:3]

代码语言：javascript复制

##     sample disease.state individual
## 1 GSM11815           RCC        035
## 2 GSM11832           RCC        023
## 3 GSM12069           RCC        001
## 4 GSM12083           RCC        005
## 5 GSM12101           RCC        011

4.2 GSE格式

GSE格式较为混乱，它可以表示任意平台的样本，gse有一个元数据部分，它没有DataTable。而是包含两个列表，可以使用GPLList和GSMList方法访问。

代码语言：javascript复制

gse <- getGEO("GSE781",GSEMatrix=FALSE)
# 查看meta
head(Meta(gse))

代码语言：javascript复制

## $contact_address
## [1] "715 Albany Street, E613B"
##
## $contact_city
## [1] "Boston"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Genetics and Genomics"
##
## $contact_email
## [1] "mlenburg@bu.edu"
##
## $contact_fax
## [1] "617-414-1646"

查看样本id

代码语言：javascript复制

names(GSMList(gse))

代码语言：javascript复制

##  [1] "GSM11805" "GSM11810" "GSM11814" "GSM11815" "GSM11823" "GSM11827"
##  [7] "GSM11830" "GSM11832" "GSM12067" "GSM12069" "GSM12075" "GSM12078"
## [13] "GSM12079" "GSM12083" "GSM12098" "GSM12099" "GSM12100" "GSM12101"
## [19] "GSM12105" "GSM12106" "GSM12268" "GSM12269" "GSM12270" "GSM12274"
## [25] "GSM12283" "GSM12287" "GSM12298" "GSM12299" "GSM12300" "GSM12301"
## [31] "GSM12399" "GSM12412" "GSM12444" "GSM12448"

获取第一个gsm样本

代码语言：javascript复制

GSMList(gse)[[1]]

代码语言：javascript复制

## An object of class "GSM"
## channel_count
## [1] "1"
## contact_address
## [1] "715 Albany Street, E613B"
## contact_city
## [1] "Boston"
## contact_country
## [1] "USA"
## contact_department
## [1] "Genetics and Genomics"
## contact_email
## [1] "mlenburg@bu.edu"
## contact_fax
## [1] "617-414-1646"
## contact_institute
## [1] "Boston University School of Medicine"
## contact_name
## [1] "Marc,E.,Lenburg"
## contact_phone
## [1] "617-414-1375"
## contact_state
## [1] "MA"
## contact_web_link
## [1] "http://gg.bu.edu"
## contact_zip/postal_code
## [1] "02130"
## data_row_count
## [1] "22283"
## description
## [1] "Age = 70; Gender = Female; Right Kidney; Adjacent Tumor Type = clear cell; Adjacent Tumor Fuhrman Grade = 3; Adjacent Tumor Capsule Penetration = true; Adjacent Tumor Perinephric Fat Invasion = true; Adjacent Tumor Renal Sinus Invasion = false; Adjacent Tumor Renal Vein Invasion = true; Scaling Target = 500; Scaling Factor = 7.09; Raw Q = 2.39; Noise = 2.60; Background = 55.24."
## [2] "Keywords = kidney"
## [3] "Keywords = renal"
## [4] "Keywords = RCC"
## [5] "Keywords = carcinoma"
## [6] "Keywords = cancer"
## [7] "Lot batch = 2004638"
## geo_accession
## [1] "GSM11805"
## last_update_date
## [1] "May 28 2005"
## molecule_ch1
## [1] "total RNA"
## organism_ch1
## [1] "Homo sapiens"
## platform_id
## [1] "GPL96"
## series_id
## [1] "GSE781"
## source_name_ch1
## [1] "Trizol isolation of total RNA from normal tissue adjacent to Renal Cell Carcinoma"
## status
## [1] "Public on Nov 25 2003"
## submission_date
## [1] "Oct 20 2003"
## supplementary_file
## [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM11nnn/GSM11805/suppl/GSM11805.CEL.gz"
## taxid_ch1
## [1] "9606"
## title
## [1] "N035 Normal Human Kidney U133A"
## type
## [1] "RNA"
## An object of class "GEODataTable"
## ****** Column Descriptions ******
##     Column
## 1   ID_REF
## 2    VALUE
## 3 ABS_CALL
##                                                                  Description
## 1
## 2                         MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3 MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065
## ****** Data Table ******
##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P
## 22278 more rows ...

查看平台信息

代码语言：javascript复制

names(GPLList(gse))

代码语言：javascript复制

## [1] "GPL96" "GPL97"

5. 转换为bioconductor ExpressionSets 和 limma MALists

5.1 获得GSE Series Matrix数据

代码语言：javascript复制

# 获得数据集，格式为gse
gse2553 <- getGEO('GSE2553',GSEMatrix=TRUE)
# 显示具体信息
show(gse2553)

代码语言：javascript复制

## $GSE2553_series_matrix.txt.gz
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12600 features, 181 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: GSM48681 GSM48682 ... GSM48861 (181 total)
##   varLabels: title geo_accession ... data_row_count (30 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1 2 ... 12600 (12600 total)
##   fvarLabels: ID PenAt ... Chimeric_Cluster_IDs (13 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 16230383
## Annotation: GPL1977

5.2 将GDS转换为ExpressionSet

代码语言：javascript复制

eset <- GDS2eSet(gds,do.log2=TRUE)
eset

代码语言：javascript复制

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22645 features, 17 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: GSM11815 GSM11832 ... GSM12448 (17 total)
##   varLabels: sample disease.state individual description
##   varMetadata: labelDescription
## featureData
##   featureNames: 200000_s_at 200001_at ... AFFX-TrpnX-M_at (22645 total)
##   fvarLabels: ID Gene title ... GO:Component ID (21 total)
##   fvarMetadata: Column labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 14641932
## Annotation:

此时eset就是一个表达矩阵

5.3 转换 GDS 为 MAList

代码语言：javascript复制

Meta(gds)$platform

代码语言：javascript复制

## [1] "GPL97"

代码语言：javascript复制

# 选择
gpl <- getGEO(filename=system.file("extdata/GPL97.annot.gz",package="GEOquery"))
MA <- GDS2MA(gds,GPL=gpl)
class(MA)

代码语言：javascript复制

## [1] "MAList"
## attr(,"package")
## [1] "limma"

5.4 GSE转换为 ExpressionSet

查看平台

代码语言：javascript复制

gsmplatforms <- lapply(GSMList(gse),function(x) {Meta(x)$platform_id})
head(gsmplatforms)

代码语言：javascript复制

## $GSM11805
## [1] "GPL96"
##
## $GSM11810
## [1] "GPL97"
##
## $GSM11814
## [1] "GPL96"
##
## $GSM11815
## [1] "GPL97"
##
## $GSM11823
## [1] "GPL96"
##
## $GSM11827
## [1] "GPL97"

通过fileter，筛选GPL96平台的数据

代码语言：javascript复制

gsmlist = Filter(function(gsm) {Meta(gsm)$platform_id=='GPL96'},GSMList(gse))
length(gsmlist)

代码语言：javascript复制

## [1] 17

查看测序数据

代码语言：javascript复制

Table(gsmlist[[1]])[1:5,]

代码语言：javascript复制

##           ID_REF  VALUE ABS_CALL
## 1 AFFX-BioB-5_at  953.9        P
## 2 AFFX-BioB-M_at 2982.8        P
## 3 AFFX-BioB-3_at 1657.9        P
## 4 AFFX-BioC-5_at 2652.7        P
## 5 AFFX-BioC-3_at 2019.5        P

查看描述

代码语言：javascript复制

Columns(gsmlist[[1]])[1:5,]

代码语言：javascript复制

##        Column
## 1      ID_REF
## 2       VALUE
## 3    ABS_CALL
## NA       <NA>
## NA.1     <NA>
##                                                                     Description
## 1
## 2                            MAS 5.0 Statistical Algorithm (mean scaled to 500)
## 3    MAS 5.0 Absent, Marginal, Present call  with Alpha1 = 0.05, Alpha2 = 0.065
## NA                                                                         <NA>
## NA.1                                                                       <NA>

代码语言：javascript复制

# 探针
probesets <- Table(GPLList(gse)[[1]])$ID
# 提取每个样本的value
# wo ye kan bu dong 大致是通过lapply函数对每一个
# gsm进行匹配，然后返回value，最后cbind
data.matrix <- do.call('cbind',lapply(gsmlist,function(x)
                                      {tab <- Table(x)
                                       mymatch <- match(probesets,tab$ID_REF)
                                       return(tab$VALUE[mymatch])
                                     }))
# 转换为数字
data.matrix <- apply(data.matrix,2,function(x) {as.numeric(as.character(x))})
data.matrix <- log2(data.matrix)
data.matrix[1:5,1:5]

代码语言：javascript复制

##       GSM11805  GSM11814  GSM11823  GSM11830  GSM12067
## [1,] 10.926963 11.105254 11.275019 11.438636 11.424376
## [2,]  5.749534  7.908092  7.093814  7.514122  7.901470
## [3,]  7.066089  7.750205  7.244126  7.962896  7.337176
## [4,] 12.660353 12.479755 12.215897 11.458355 11.397568
## [5,]  6.195741  6.061776  6.565293  6.583459  6.877744

对data进一步处理

代码语言：javascript复制

require(Biobase)
# 行列命名
rownames(data.matrix) <- probesets
colnames(data.matrix) <- names(gsmlist)
# 转换为data.frame
pdata <- data.frame(samples=names(gsmlist))
rownames(pdata) <- names(gsmlist)
pheno <- as(pdata,"AnnotatedDataFrame")
eset2 <- new('ExpressionSet',exprs=data.matrix,phenoData=pheno)
eset2

代码语言：javascript复制

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 22283 features, 17 samples
##   element names: exprs
## protocolData: none
## phenoData
##   sampleNames: GSM11805 GSM11814 ... GSM12444 (17 total)
##   varLabels: samples
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
## Annotation:

获取平台的信息

代码语言：javascript复制

# gpl97 <- getGEO('GPL97')
# Meta(gpl97)$title

通过上述代码可以获得这个平台的所有信息，并通过id等查看，下载比较费时，不做展示

结束语

关于GEO数据挖掘的第一步，获得数据GEOquery的学习已经完成，没有学过关于测序的知识，这些信息获得之后还是懵逼的，2020-7-10更新 love&peace

express 数据挖掘数据结构

0 人点赞