单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析1:https://cloud.tencent.com/developer/article/2055573
单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析2:https://cloud.tencent.com/developer/article/2072069
单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析3:https://cloud.tencent.com/developer/article/2078159
单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析4:https://cloud.tencent.com/developer/article/2078348
单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析5:https://cloud.tencent.com/developer/article/2084580
单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析6:https://cloud.tencent.com/developer/article/2085385
单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析7:https://cloud.tencent.com/developer/article/2085705
单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析8:https://cloud.tencent.com/developer/article/2086805
单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析9:https://cloud.tencent.com/developer/article/2087563
代码解析
Part I: Peak-to-gene correlation analysis with empirically-derived false discovery rate (eFDR)
这部分的代码来源:https://github.com/RegnerM2015/scENDO_scOVAR_2020/wiki/Figure-2
为了将染色质可及性的变化与基因表达的差异联系起来,我们进行了大规模的峰基因连锁分析,可在 ArchR 程序套件中使用,并引入了一个复杂的经验错误发现率 (eFDR) 程序确定单细胞数据中具有统计学意义的峰基因关联(Granja 等人,2021;Storey 和 Tibshirani,2003)。虽然这种方法提供了与标准 Benjamini-Hochberg 程序相似的结果,但我们能够更清楚地证明峰值可及性是确定推断基因表达的重要因素,如下面的相关性和 p 值直方图所示:
此工作流程的起始输入是保存为 rds 对象的多样本 ArchR 项目(包括来自所有 11 名患者的细胞),其中包含 1)500 bp 基因组平铺矩阵,2)估计的基因活动矩阵,3)推断的基因表达矩阵,和 4) 峰值矩阵。我们使用图 1 页面上显示的 /scATAC-seq Processing Scripts/Full_Cohort/Patients1-11_scATAC-seq.R 生成了这个多样本 ArchR 项目。该脚本的输出是一个新的 ArchR 项目,其中包含所有峰到基因的关联,包括那些可能在统计上不显着的关联以及显示远端峰可访问性与推断的基因表达密切相关的热图。我们还编写了一个峰到基因元数据表,其中列出了峰坐标、峰类型(远端、启动子、内含子、外显子)、基因名称、相关值、p 值、方差测量等。
这部分的代码基本就是不停的调换,主体还是一个部分,我把后面类似的都删了,只对最前面的解析。
代码语言:javascript复制########R包的加载
library(ArchR)
library(ChIPpeakAnno)
library(stats)
ArchR::addArchRThreads(threads = 64)
source("./Archr_Peak_Null_Permute.R")
source("./Archr_Peak_RawPval.R")
dir.create("./Significant_P2G_Outputs")
SAMPLE.ID <- "All"
key.word.1 <- "pithelia"
key.word.2 <- "-Ciliated"
# Add p2g links (no restrictions on FDR, Correlation, Variance cutoff) with raw pvalue
##############################################################################################################
proj <- readRDS("./final_archr_proj_archrGS.rds")
##将 Peak2GeneLinks 添加到 ArchRProject
##函数具体解析地址:[https://www.archrproject.com/reference/addPeak2GeneLinks.html](https://www.archrproject.com/reference/addPeak2GeneLinks.html)
proj <- addPeak2GeneLinks(
ArchRProj = proj ,
reducedDims = "IterativeLSI",
useMatrix = "GeneIntegrationMatrix_ArchR",
dimsToUse = 1:50,
scaleDims = NULL,
corCutOff = 0.75,
k = 100,
knnIteration = 500,
overlapCutoff = 0.8,
maxDist = 250000,
scaleTo = 10^4,
log2Norm = TRUE,
predictionCutoff = 0.5,
seed = 1,
threads = max(floor(getArchRThreads()/2), 1),
verbose = TRUE,
logFile = createLogFile("addPeak2GeneLinks")
)
## 提取metadata信息
p2geneDF <- metadata(proj@peakSet)$Peak2GeneLinks
#组合IRanges对象并维护mcols,r
p2geneDF$geneName <- mcols(metadata(p2geneDF)$geneSet)$name[p2geneDF$idxRNA]
p2geneDF$peakName <- (metadata(p2geneDF)$peakSet %>% {paste0(seqnames(.), ":", start(.), "-", end(.))})[p2geneDF$idxATAC]
annot <- readRDS(metadata(p2geneDF)$seATAC)
p2geneDF$peakType <- annot@rowRanges$peakType[p2geneDF$idxATAC]
p2g.df.obs <- as.data.frame(p2geneDF)
p2g.df.obs <- p2g.df.obs[complete.cases(p2g.df.obs),]
##summary () :获取描述性统计量
first.quart <- summary(p2g.df.obs$RawPVal)[2]
# Permuated p2g links
#################################################################################################################
proj <- readRDS("./final_archr_proj_archrGS.rds")
store <- as.numeric(0)
for (i in 1:100){
proj.null <- addPermPeak2GeneLinks(
ArchRProj = proj ,
reducedDims = "IterativeLSI",
useMatrix = "GeneIntegrationMatrix_ArchR",
dimsToUse = 1:50,
scaleDims = NULL,
corCutOff = 0.75,
k = 100,
knnIteration = 500,
overlapCutoff = 0.8,
maxDist = 250000,
scaleTo = 10^4,
log2Norm = TRUE,
predictionCutoff = 0.5,
seed = 1,
threads = max(floor(getArchRThreads()/2), 1),
verbose = TRUE,
logFile = createLogFile("addPeak2GeneLinks")
)
p2geneDF <- metadata(proj.null@peakSet)$Peak2GeneLinks
p2geneDF$geneName <- mcols(metadata(p2geneDF)$geneSet)$name[p2geneDF$idxRNA]
p2geneDF$peakName <- (metadata(p2geneDF)$peakSet %>% {paste0(seqnames(.), "_", start(.), "_", end(.))})[p2geneDF$idxATAC]
p2g.df.null <- as.data.frame(p2geneDF)
p2g.df.null <- p2g.df.null[complete.cases(p2g.df.null),]
p2g.null.sub <- dplyr::filter(p2g.df.null,RawPVal <=1e-12)
store[i] <- nrow(p2g.null.sub)
}
#Example histograms
proj.null <- addPermPeak2GeneLinks(
ArchRProj = proj ,
reducedDims = "IterativeLSI",
useMatrix = "GeneIntegrationMatrix_ArchR",
dimsToUse = 1:50,
scaleDims = NULL,
corCutOff = 0.75,
k = 100,
knnIteration = 500,
overlapCutoff = 0.8,
maxDist = 250000,
scaleTo = 10^4,
log2Norm = TRUE,
predictionCutoff = 0.5,
seed = 1,
threads = max(floor(getArchRThreads()/2), 1),
verbose = TRUE,
logFile = createLogFile("addPeak2GeneLinks")
)
store.prop <- numeric(0)
test <- readRDS(paste0("./",SAMPLE.ID,"/Peak2GeneLinks/seATAC-Group-KNN.rds"))
test <- test@metadata$KNNList@listData
for ( i in 1:length(test)){
test[[i]] <- gsub("\#.*","",test[[i]])
num <- max(table(test[[i]]))
store.prop[i] <- num/100
}
saveRDS(store.prop,"./Significant_P2G_Outputs/store_null_KNN_proportions.rds")
pdf("./Significant_P2G_Outputs/PatientPurityPerAgg-null.pdf",width = 5,height = 3.5)
hist(store.prop,main="Distribtion of patient purity per cell aggregate")
dev.off()
p2geneDF <- metadata(proj.null@peakSet)$Peak2GeneLinks
p2geneDF$geneName <- mcols(metadata(p2geneDF)$geneSet)$name[p2geneDF$idxRNA]
p2geneDF$peakName <- (metadata(p2geneDF)$peakSet %>% {paste0(seqnames(.), "_", start(.), "_", end(.))})[p2geneDF$idxATAC]
p2g.df.null <- as.data.frame(p2geneDF)
p2g.df.null <- p2g.df.null[complete.cases(p2g.df.null),]
saveRDS(p2g.df.null,"./Significant_P2G_Outputs/All_P2G_Null_example.rds")
pdf("./Significant_P2G_Outputs/P2G_Correlation-null.pdf",width = 5,height = 3.5)
hist(p2g.df.null$Correlation,col = "lightblue",main = "Histogram of null peak-to-gene correlations",xlab = "Correlation")
dev.off()
pdf("./Significant_P2G_Outputs/P2g_Pval-null.pdf",width = 5,height = 3.5)
hist(p2g.df.null$RawPVal,col="lightblue",main = "Histogram null peak-to-gene p-values",xlab = "p-value")
abline(v=0.01,col = "red")
dev.off()
# Compute eFDR for alpha 1e-12
print(median(store)/nrow(p2g.df.obs[p2g.df.obs$RawPVal <=1e-12,]))
saveRDS(store,"./Significant_P2G_Outputs/store_null_tests.rds")
####去计算相关性
p2g.df.hist <- dplyr::filter(p2g.df.obs,RawPVal <=1e-12 & Correlation >= 0.45)
pdf("./Significant_P2G_Outputs/GenesPerPeak_histogram.pdf",width = 5,height = 3.5)
hist(table(p2g.df.hist$idxRNA),main="Distribution of genes per peaks")
dev.off()
pdf("./Significant_P2G_Outputs/PeaksPerGene_histogram.pdf",width = 5,height = 3.5)
hist(table(p2g.df.hist$idxATAC),main="Distribution of peaks per gene")
dev.off()
#Subset to postive correlation P2Gs
p2g.df.sub <- dplyr::filter(p2g.df.obs,RawPVal <=1e-12 & Correlation >= 0.45& peakType == "Distal")
# Plot peak2 gene heatmap
p2g.df.sub$idx <- paste0(p2g.df.sub$idxATAC,"-",p2g.df.sub$idxRNA)
p2g.df.sub.plot <- p2g.df.sub
# Make color scheme for heatmap based on original UMAP colors:
proj$cluster.num <- factor(gsub("-.*","",proj$predictedGroup_ArchR))
my_levels <- c("11","20","21","22","31",
"19","34",
"3",
"9","10",
"16","17",
"0","27",
"6","8","12","14","15","18","24","25","26","29",
"7","23",
"1","33",
"2","4","30",
"5","13",
"32",
"28","35")
# Relevel object@ident
proj@cellColData$cluster.new <- factor(x=proj$cluster.num, levels = my_levels)
# Make order of colors:
epithelial.cols <- colorRampPalette(c("#A0E989", "#245719"))
epithelial.cols <- epithelial.cols(14)
##colorRampPalette:生成颜色的渐变梯度
fibro.cols <-colorRampPalette(c("#FABFD2", "#B1339E"))
fibro.cols <- fibro.cols(10)
smooth.cols <- c("#b47fe5","#d8b7e8")
endo.cols <- c("#93CEFF","#4A99FF")
t.cols <- c("gray60","gray20","gray40")
macro.cols <- c("#ff6600","#ff9d5c")
mast.cols <- "gold3"
b.cols <- c("#B22222","#CD5C5C")
cols <- c(epithelial.cols,fibro.cols,smooth.cols,endo.cols,t.cols,macro.cols,mast.cols,b.cols)
names(cols) <- levels(factor(proj@cellColData$cluster.new))
source("P2G_Heatmap_Distal.R")
test <- plotPeak2GeneHeatmap.distal(proj,peaks=p2g.df.sub.plot,groupBy = "cluster.new",k=length(levels(factor(proj$cluster.new))),
corCutOff = .45,varCutOffATAC = 0,varCutOffRNA = 0,FDRCutOff = 1,nPlot =100000,palGroup=cols,
palATAC = paletteContinuous("solarExtra"),
palRNA = paletteContinuous("solarExtra"))
pdf("./Significant_P2G_Outputs/Peak2Gene_Heatmap_Legend.pdf",width = 6,height = 12)
draw(test, heatmap_legend_side = "bottom")
dev.off()
p2g.heat.df <- plotPeak2GeneHeatmap.distal(proj,peaks=p2g.df.sub.plot,groupBy = "predictedGroup_ArchR",k=length(levels(factor(proj$predictedGroup_ArchR))),
corCutOff = .45,varCutOffATAC = 0,varCutOffRNA = 0,FDRCutOff = 1,returnMatrices = T,nPlot = 100000)
# Save P2G peaknames and Kmeans cluster for the genes of interest (specific to cancer cells)
p2g.df.sub.plot$kmeans <- p2g.heat.df$RNA$kmeansId
saveRDS(p2g.df.sub.plot,"./Significant_P2G_Outputs/Distal_P2Gs_Kmeans.rds")
pdf("./Significant_P2G_Outputs/GenesPerPeak_histogram-distal.pdf",width = 5,height = 3.5)
hist(table(p2g.df.sub.plot$idxRNA),main="Distribution of genes per distal peaks")
dev.off()
pdf("./Significant_P2G_Outputs/PeaksPerGene_histogram-distal.pdf",width = 5,height = 3.5)
hist(table(p2g.df.sub.plot$idxATAC),main="Distribution of distal peaks per gene")
dev.off()
store.kmeans <- c("5","10","11","12","13","14","15","16","17","18","19","20","21","22","23","24","25","26","27","28")
# Extract P2Gs for kmeans clusters of interest
p2g.df.sub.plot.cancer.kmeans <- p2g.df.sub.plot[p2g.df.sub.plot$kmeans %in% store.kmeans,]
saveRDS(p2g.df.sub.plot.cancer.kmeans,"./Significant_P2G_Outputs/Cancer_enriched_P2G_table.rds")
p2g <- GRanges(p2g.df.sub.plot.cancer.kmeans$peakName)
encode.all <- read.delim("./GRCh38-ccREs.bed",header =F)
colnames(encode.all)[1:3] <- c("seqnames","start","end")
widths <- encode.all$end - encode.all$start
encode.all <- GRanges(encode.all)
ft.peaks <- readRDS("Fallopian_Tube_Cell_line_Peaks.rds")
widths.2 <- ft.peaks$end - ft.peaks$start
ft.peaks <- GRanges(ft.peaks)
oe.peaks <- readRDS("Ovarian_Epithelial_Cell_line_Peaks.rds")
widths.3 <- oe.peaks$end - oe.peaks$start
oe.peaks <- GRanges(oe.peaks)
widths <- c(widths,widths.2,widths.3)
tot <- 3.3e 9*0.98/mean(widths)
##
ol <- findOverlapsOfPeaks(unique(p2g), encode.all,unique(oe.peaks),unique(ft.peaks), minoverlap = 1,connectedPeaks = "min")
saveRDS(ol,"./Significant_P2G_Outputs/Ol.rds")
overlappingPeaks <- ol$overlappingPeaks
peaklist <- ol$peaklist
saveRDS(overlappingPeaks,"./Significant_P2G_Outputs/OverlappingPeaks.rds")
names <- names(overlappingPeaks)[4:6]
print(names)
total <- c(overlappingPeaks[[names[1]]]$overlapFeature,
overlappingPeaks[[names[2]]]$overlapFeature,
overlappingPeaks[[names[3]]]$overlapFeature)
pdf("./Significant_P2G_Outputs/ChipPeakAnno_Pie_Overlaps.pdf")
pie1(table(total))
dev.off()
ft.overlap <- overlappingPeaks[[names[1]]]
levels(factor(ft.overlap$overlapFeature))
colnames(ft.overlap)[8:12] <- c("seqnames2","start2","end2","width2","strand2")
ft.overlap$overlap <- ifelse(ft.overlap$overlapFeature == "includeFeature",ft.overlap$width2,"fill")
ft.overlap$overlap <- ifelse(ft.overlap$overlapFeature == "inside",ft.overlap$width,ft.overlap$overlap)
ft.overlap$overlap <- ifelse(ft.overlap$overlapFeature == "overlapEnd",ft.overlap$end2 - ft.overlap$start,ft.overlap$overlap)
ft.overlap$overlap <- ifelse(ft.overlap$overlapFeature == "overlapStart",ft.overlap$end - ft.overlap$start2,ft.overlap$overlap)
levels(factor(ft.overlap$overlap))
ft.overlap$overlap <- as.numeric(ft.overlap$overlap)
hist(ft.overlap$overlap)
oe.overlap <- overlappingPeaks[[names[2]]]
levels(factor(oe.overlap$overlapFeature))
colnames(oe.overlap)[8:12] <- c("seqnames2","start2","end2","width2","strand2")
oe.overlap$overlap <- ifelse(oe.overlap$overlapFeature == "includeFeature",oe.overlap$width2,"fill")
oe.overlap$overlap <- ifelse(oe.overlap$overlapFeature == "inside",oe.overlap$width,oe.overlap$overlap)
oe.overlap$overlap <- ifelse(oe.overlap$overlapFeature == "overlapEnd",oe.overlap$end2 - oe.overlap$start,oe.overlap$overlap)
oe.overlap$overlap <- ifelse(oe.overlap$overlapFeature == "overlapStart",oe.overlap$end - oe.overlap$start2,oe.overlap$overlap)
levels(factor(oe.overlap$overlap))
oe.overlap$overlap <- as.numeric(oe.overlap$overlap)
hist(oe.overlap$overlap)
encode.overlap <- overlappingPeaks[[names[3]]]
levels(factor(encode.overlap$overlapFeature))
colnames(encode.overlap)[8:12] <- c("seqnames2","start2","end2","width2","strand2")
encode.overlap$overlap <- ifelse(encode.overlap$overlapFeature == "includeFeature",encode.overlap$width2,"fill")
encode.overlap$overlap <- ifelse(encode.overlap$overlapFeature == "inside",encode.overlap$width,encode.overlap$overlap)
encode.overlap$overlap <- ifelse(encode.overlap$overlapFeature == "overlapEnd",encode.overlap$end2 - encode.overlap$start,encode.overlap$overlap)
encode.overlap$overlap <- ifelse(encode.overlap$overlapFeature == "overlapStart",encode.overlap$end - encode.overlap$start2,encode.overlap$overlap)
levels(factor(encode.overlap$overlap))
encode.overlap$overlap <- as.numeric(encode.overlap$overlap)
hist(encode.overlap$overlap)
head(ft.overlap[1:3,])
head(oe.overlap[1:3,])
head(encode.overlap[1:3,])
ft.overlap <- ft.overlap[,c(1:12,ncol(ft.overlap),grep("overlapFeature",colnames(ft.overlap)))]
oe.overlap <- oe.overlap[,c(1:12,ncol(oe.overlap),grep("overlapFeature",colnames(oe.overlap)))]
encode.overlap <- encode.overlap[,c(1:12,ncol(encode.overlap),grep("overlapFeature",colnames(encode.overlap)))]
all.overlap <- rbind(ft.overlap,oe.overlap,encode.overlap)
pdf("./Significant_P2G_Outputs/Distribtion_of_All_Overlaps.pdf")
hist(all.overlap$overlap)
dev.off()
types <- c("overlapEnd","overlapStart")
all.overlap <- all.overlap[all.overlap$overlapFeature %in% types,]
pdf("./Significant_P2G_Outputs/Distribution_of_Partial_Overlaps.pdf")
hist(all.overlap$overlap)
dev.off()
pdf("./Significant_P2G_Outputs/ChipPeakAnno_Venn_Overlaps.pdf")
makeVennDiagram(ol,totalTest = tot,connectedPeaks = "min")
dev.off()
venn <- makeVennDiagram(ol,totalTest = tot,connectedPeaks = "min")
saveRDS(venn,"./Significant_P2G_Outputs/venn.rds")
peak.names <- paste0(peaklist$unique.p2g.@seqnames,":",peaklist$unique.p2g.@ranges)
# Find cancer specific peak to gene links
p2g.df.sub.plot.cancer.kmeans <- p2g.df.sub.plot.cancer.kmeans[p2g.df.sub.plot.cancer.kmeans$peakName %in% peak.names,]
saveRDS(p2g.df.sub.plot.cancer.kmeans,"./Significant_P2G_Outputs/Cancer_specific_P2G_table.rds")
# Plot P2G heatmap for cancer specific distal elements
##############################################################################################################
source("P2G_Heatmap_Distal.R")
test <- plotPeak2GeneHeatmap.distal(proj,peaks=p2g.df.sub.plot.cancer.kmeans,groupBy = "cluster.new",k=length(levels(factor(proj$cluster.new))),
corCutOff = .45,varCutOffATAC = 0,varCutOffRNA = 0,FDRCutOff = 1,nPlot =100000,palGroup=cols,
palATAC = paletteContinuous("solarExtra"),
palRNA = paletteContinuous("solarExtra"))
pdf("./Significant_P2G_Outputs/Peak2Gene_Heatmap_Legend-cancerspecific.pdf",width = 6,height = 12)
draw(test, heatmap_legend_side = "bottom")
dev.off()
p2g.heat.df <- plotPeak2GeneHeatmap.distal(proj,peaks=p2g.df.sub.plot.cancer.kmeans,groupBy = "predictedGroup_ArchR",k=length(levels(factor(proj$predictedGroup_ArchR))),
corCutOff = .45,varCutOffATAC = 0,varCutOffRNA = 0,FDRCutOff = 1,returnMatrices = T,nPlot = 100000)
p2g.df.sub.plot.cancer.kmeans$kmeans <- p2g.heat.df$RNA$kmeansId
saveRDS(p2g.df.sub.plot.cancer.kmeans,"./Significant_P2G_Outputs/Distal_P2Gs_Kmeans-cancerspecific.rds")
pdf("./Significant_P2G_Outputs/GenesPerPeak_histogram-distal-cancerspecific.pdf",width = 5,height = 3.5)
hist(table(p2g.df.sub.plot.cancer.kmeans$idxRNA),main="Distribution of genes per distal peaks")
dev.off()
pdf("./Significant_P2G_Outputs/PeaksPerGene_histogram-distal-cancerspecific.pdf",width = 5,height = 3.5)
hist(table(p2g.df.sub.plot.cancer.kmeans$idxATAC),main="Distribution of distal peaks per gene")
dev.off()
##############################################################################################
总结
作者对这部分代码主要长的原因是前面做了一些参数调整的尝试,主要是为了将这ATAC的数据映射到RNA数据集上,去找每个bin之间的锚定的关系,其实这部分在Seuratv3中有专门的代码,感觉比作者的这部分简洁,参考学习来源:https://www.jianshu.com/p/587be3b24463。然后将这两部分进行整合后,一般都要去分析这两个数据之间的相关性。
随后作者通过判断那些远端调控元件,然后对结果进行可视化。因此上述的代码可以简化成首先用seurat进行两个组学之间的锚连接,然后去计算两个组学之间的相关性,然后通过相关的注释结果进行不同的染色质类型进行可视化,去判断什么类型的调控元件在研究中占据比较重要的价值。