TCGA数据挖掘(四):表达差异分析(4)

2019-09-18 11:03:33 浏览数 (1)

在之前我们的文章:TCGA数据挖掘(三):表达差异分析中,我们利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数进行差异表达分析,我们也提到可以选择基于limma或edgeR包进行分析,TCGA数据挖掘(三):表达差异分析这一讲中我们利用的是edgeR包,之后我们在文章:TCGA数据挖掘(四):表达差异分析(2)和TCGA数据挖掘(四):表达差异分析(3)中分别也介绍了其他方法的差异分析,包括edgeR和DESeq包,今天这一讲,我们就利用TCGAbiolinks包中的TCGAanalyze_DEA函数基于limma包进行差异分析。

数据下载

基因表达数据的下载

数据下载代码和之前的一样,这里再提供一次。避免出错不知道原因。

代码语言:javascript复制
setwd("E:\BioInfoCloud\TCGABiolinks\case_study1")
# 加载相应的包,可能会需要其他包,提示错误就安装缺少的包。
# 因为每个人已经安装的包不一样。
library(TCGAbiolinks)
library(plyr)
library(limma)
library(biomaRt)
library(SummarizedExperiment)
# 请求数据。
query <- GDCquery(project = "TCGA-BRCA",
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification", 
                  workflow.type = "HTSeq - Counts")
# 从query中获取结果表,它可以选择带有cols参数的列,并使用rows参数返回若干行。
# 1222个barcode                 
samplesDown <- getResults(query,cols=c("cases"))
# samplesDown中筛选出TP(主要实体瘤)样本的barcode
# 1102个barcode
dataSmTP <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "TP")
# 113个barcode
dataSmNT <- TCGAquery_SampleTypes(barcode = samplesDown,
                                  typesample = "NT")

# 选择100个正常组织和100个肿瘤组织样本作为研究对象
dataSmTP_short <- dataSmTP[1:100]
dataSmNT_short <- dataSmNT[1:100]
# 根据前面的筛选,再次请求数据
queryDown <- GDCquery(project = "TCGA-BRCA", 
                      data.category = "Transcriptome Profiling",
                      data.type = "Gene Expression Quantification", 
                      workflow.type = "HTSeq - Counts", 
                      barcode = c(dataSmTP_short, dataSmNT_short))
#读取下载的数据并将其准备到R对象中,在工作目录生成brca_case1.rda文件,
#同时还产生Human_genes__GRCh38_p12_.rda文件
GDCdownload(query = queryDown)
dataPrep1 <- GDCprepare(query = queryDown, save = TRUE, save.filename = "brca_case1.rda")

数据处理

代码语言:javascript复制
# 去除dataPrep1中的异常值,dataPrep数据中含有肿瘤组织和正常组织的数据
dataPrep <- TCGAanalyze_Preprocessing(object = dataPrep1, 
                                      cor.cut = 0.6,
                                      datatype = "HTSeq - Counts")
#使用EDASeq进行标准化
dataNorm <- TCGAanalyze_Normalization(tabDF = dataPrep,
                                      geneInfo = geneInfoHT,
                                      method = "gcContent")
#根据分位数一处count较低的基因。
dataFilt <- TCGAanalyze_Filtering(tabDF = dataNorm,
                                  method = "quantile", 
                                    qnt.cut =  0.25)
# 将计数数据转换为log2计数/百万(LogCPM),估计均值-方差关系并使用此关系计算适当的观测级别权重。
# 然后,数据即可用于线性建模。
v.dataFilt<-voom(dataFilt)
#in order to have 2 batches with multiple samples and avoid batches with one sample
#the user need to call first the get_IDs function on the top of this script if this has not been done already
c1<-which(get_IDs(dataPrep)$tss=="E9")
c2<-which(get_IDs(dataPrep)$tss=="E2")
#taking log transformed data for exploration of batch effects
TCGAbatch_Correction(tabDF = v.dataFilt$E[,c(c1,c2)], batch.factor="TSS", adjustment=NULL)
#if you get error in plot.new():figure margins too large - please use:
#par(mar=c(1,1,1,1))
#and then rerun the command above
代码语言:javascript复制
#### After exploration of the batches we can continue to work on the original gene matrix for DEA ###
###############Tumor purity filtering###########
###vector containing all TCGA barcodes that hhave 60% tumor purity or more
Purity.BRCA<-TCGAtumor_purity(colnames(dataPrep1), 0, 0, 0, 0, 0.6)$pure_barcodes
################DEA with Molecular subtypes############
####Molecular subtypes:
####Filtering data so all samples have a pam50 subtype for BRCA:

#diff contains TCGA samples that have an available molecular subtype
###Also Applying Tumor purity filtering by making a set difference
#documentation for TCGA_MolecularSubtype is available
##########Important########
####in this example no change happens because all the TCGA barcodes
####have available subtype info
###setdiff() is here used to remove samples with no subtype info from the TCGA BRCA samples that satisfy
###tumor purity criteria
diff<-setdiff(Purity.BRCA, TCGA_MolecularSubtype(colnames(dataPrep[,dataSmTP_short]))$filtered)
dataFilt.brca.cancer<-dataPrep[,diff]
dataFilt.brca.normal<-dataPrep[,dataSmNT_short]
dataFilt.brca<-cbind(dataFilt.brca.normal, dataFilt.brca.cancer)
mol_subtypes<-c(rep("Normal", 100), TCGA_MolecularSubtype(colnames(dataFilt.brca.cancer))$subtypes$subtype)
# All barcodes have available molecular subtype info
mol_subtypes<-make.names(mol_subtypes)
# to convert ENSEMBL id to gene name
rownames(dataFilt.brca)<-rowData(dataPrep1)$external_gene_name
# we redo here the normalization and filtering steps 
dataNorm.brca <- TCGAanalyze_Normalization(tabDF = dataFilt.brca,
                                           geneInfo = geneInfo,
                                           method = "gcContent")
# 将标准化后的数据再过滤,得到最终的数据
dataFilt.brca.final <- TCGAanalyze_Filtering(tabDF = dataNorm.brca,
                                       method = "quantile", 
                                       qnt.cut =  0.25)

表达差异分析

这里利用的是TCGAbiolinks包中的TCGAanalyze_DEA函数,是基于limma包的差异分析。

代码语言:javascript复制
DEG.brca.TSS <- TCGAanalyze_DEA(MAT=dataFilt.brca.final,
                            pipeline="limma",
                            batch.factors = c("TSS"),
                            Cond1type = "Normal",
                            Cond2type = "Tumor",
                            fdr.cut = 0.01 ,
                            logFC.cut = 1,
                            voom = TRUE,
                            Condtypes = mol_subtypes,
                            contrast.formula = "Mycontrast = (BRCA.LumA BRCA.LumB)/2 -Normal")
#in this DEA we use Normal as a reference, thus genes with LogFC > 1 are up regulated in the subtypes
#with respect to the normal samples and viceversa for the LogFC < -1.

write.csv(DEG.brca.TSS, "DEGsBRCA_limma_091018.csv", quote = FALSE)
#3241 genes identified

#to check how many with logFC > 5 (for plotting purposes)
DEG.brca.filt.TSS<-DEG.brca.TSS$Mycontrast[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6), ]
#47 genes identified

=-6)],-----------------------highlight.color-=-"orange")">可视化 利用火山图进行可视化。 TCGAVisualize_volcano(DEG.brca.TSS$Mycontrast$logFC, DEG.brca.TSS$Mycontrast$adj.P.Val, filename = "LuminalABvsNormal_FC6.TSS.pdf", xlab = "logFC", names = rownames(DEG.brca.TSS$Mycontrast), show.names = "highlighted", x.cut = 1, y.cut = 0.01, highlight = rownames(DEG.brca.TSS$Mycontrast)[which(abs(DEG.brca.TSS$Mycontrast$logFC) >= 6)], highlight.color = "orange")

在之前利用edgeR包获得的结果图如下,我们简单比较一下2中方法的差异。

很明显看到2中方法得到的结果是有差异的,这有时候我们会纠结那种方法好,这个就看你自己的研究啦,什么结果符合自己的用什么,当然这有点投机取巧的感觉,最好是2中方法得到后取交集。 不过下面我们对这2种方法得到的结果进行一个相关性评估。 limma和edgeR结果的相关性比较library(ggplot2) DEGs_limma <- read.csv("DEGsBRCA_limma_091018.csv", row.names = 1) DEGs_edgeR <- read.csv("DEGsBRCA_edgeR_091018.csv", row.names = 1) genes_toTest <- rownames(DEGs_limma)[1:1000] genes_common <- intersect(genes_toTest, rownames(DEGs_edgeR)) dataset <- subset(DEGs_limma, rownames(DEGs_limma) %in% genes_common, select = "Mycontrast.logFC") dataset <- cbind(DEGs_edgeR$Mycontrast.logFC[match(rownames(dataset),rownames(DEGs_edgeR))], dataset) colnames(dataset) <- c("logFC_edgeR", "logFC_limma") pdf("scatterplot_logFC_limma_edgeR_top1000.pdf", width=5, height=3) ggplot(dataset, aes(x=logFC_limma,y=logFC_edgeR)) geom_point() xlab("logFC_limma") ylab("logFC_edgeR") theme(legend.title=element_blank(),axis.title=element_text(size=10),axis.text=element_text(size=10), legend.text=element_text(size=20)) dev.off() cor(dataset$logFC_edgeR, dataset$logFC_limma) #0.9936 这2种方法得到的结果虽然有差异,但很小得到的大多数差异基因是一样的。

0 人点赞