单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析14

2022-09-05 10:33:03 浏览数 (3)

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析1:https://cloud.tencent.com/developer/article/2055573

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析2:https://cloud.tencent.com/developer/article/2072069

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析3:https://cloud.tencent.com/developer/article/2078159

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析4:https://cloud.tencent.com/developer/article/2078348

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析5:https://cloud.tencent.com/developer/article/2084580

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析6:https://cloud.tencent.com/developer/article/2085385

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析7:https://cloud.tencent.com/developer/article/2085705

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析8:https://cloud.tencent.com/developer/article/2086805

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析9:https://cloud.tencent.com/developer/article/2087563

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析10:https://cloud.tencent.com/developer/article/2090290

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析11:https://cloud.tencent.com/developer/article/2093123

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析12:https://cloud.tencent.com/developer/article/2093208

单细胞代码解析-妇科癌症单细胞转录组及染色质可及性分析13:https://cloud.tencent.com/developer/article/2093567

image.pngimage.png

通过上述的分析流程获得图片中的研究内容,这部分还是fig3的分析内容。

image.pngimage.png

代码解析

代码语言:javascript复制
library(ArchR)
library(ChIPpeakAnno)
##stats:Wilcoxon 符号秩统计的分布。
library(stats)
ArchR::addArchRThreads(threads = 64)
source("./Archr_Peak_Null_Permute.R")
source("./Archr_Peak_RawPval.R")
dir.create("./Significant_P2G_Outputs")
SAMPLE.ID <- "EEC"
key.word.1 <- "pithelia"
key.word.2 <- "-Ciliated"

# Observed data final run:
proj <- readRDS("./final_archr_proj_archrGS.rds")
# Add p2g links (no restrictions on FDR, Correlation, Variance cutoff) with raw pvalue
##############################################################################################################
proj <- addPeak2GeneLinks(
  ArchRProj = proj , ##一个 ArchRProject 对象
  reducedDims = "IterativeLSI",  ##从指定的 ArchRProject 中检索的 reduceDims 对象的名称(即“IterativeLSI”)。
  useMatrix = "GeneIntegrationMatrix_ArchR",  ##包含用于确定峰到基因链接的基因表达信息的矩阵的名称
  dimsToUse = 1:50, ##一个向量,包含来自reduceDims 对象的维度以用于聚类。
  scaleDims = NULL,  ##一个布尔值,表示是否对每个单元的缩小尺寸进行Z-评分。如果设置为NULL,这将根据在降维过程中最初创建的缩减的Dims的值来扩展维度
  corCutOff = 0.75,  ##每个维度与测序深度的相关性的数值截止值。如果维度与测序深度的相关性大于 corCutOff,则将其排除在分析之外。
  k = 100,  ##用于创建单细胞组进行相关分析的k-近邻的数量。
  knnIteration = 500,  ##测试通过提供的重叠截止的 k-最近邻分组的数量。
  overlapCutoff = 0.8, ##当前组与所有先前组之间允许的最大重叠,以允许在 k-最近邻计算期间将当前组添加到组列表中
  maxDist = 250000,  ##两个峰之间碱基对中允许的最大距离,以考虑共同可及性
  scaleTo = 10^4,  ##来自指定的单细胞组的总插入计数在ArchRProject的peakSet中的所有相关峰值区域中被加总,并归一化为scaleTo提供的总深度。
  log2Norm = TRUE,  ##一个布尔值,指示是否在计算可访问性相关性之前对单细胞组进行 log2 变换
  predictionCutoff = 0.5, ##一个数字,描述了在选择细胞进行分组时使用的 RNA 整合的截止值。
  seed = 1, ##一个数字,用作 knn 确定所需的随机数生成的种子
  threads = max(floor(getArchRThreads()/2), 1),  ##用于并行计算的线程数。
  verbose = TRUE,  ##一个布尔值,用于确定是否应打印标准输出。
  logFile = createLogFile("addPeak2GeneLinks")  ##用于记录 ArchR 输出的文件的路径。
)
saveRDS(proj,"./final_archr_proj_archrGS-P2Gs.rds")
## numeric:数值型
store.prop <- numeric(0)
test <- readRDS(paste0("./",SAMPLE.ID,"/Peak2GeneLinks/seATAC-Group-KNN.rds"))
test <- test@metadata$KNNList@listData
for ( i in 1:length(test)){
  
  test[[i]] <- gsub("\#.*","",test[[i]])
  num <- max(table(test[[i]]))
  store.prop[i] <- num/100
}
saveRDS(store.prop,"./Significant_P2G_Outputs/store_KNN_proportions.rds")

pdf("./Significant_P2G_Outputs/PatientPurityPerAgg.pdf",width = 5,height = 3.5)
hist(store.prop,main="Distribtion of patient purity per cell aggregate")
dev.off()


p2geneDF <- metadata(proj@peakSet)$Peak2GeneLinks
p2geneDF$geneName <- mcols(metadata(p2geneDF)$geneSet)$name[p2geneDF$idxRNA]
p2geneDF$peakName <- (metadata(p2geneDF)$peakSet %>% {paste0(seqnames(.), ":", start(.), "-", end(.))})[p2geneDF$idxATAC]
annot <- readRDS(metadata(p2geneDF)$seATAC)
p2geneDF$peakType <- annot@rowRanges$peakType[p2geneDF$idxATAC]
p2g.df.obs <- as.data.frame(p2geneDF)
p2g.df.obs <- p2g.df.obs[complete.cases(p2g.df.obs),]

pdf("./Significant_P2G_Outputs/P2G_Correlation.pdf",width = 5,height = 3.5)
hist(p2g.df.obs$Correlation,col = "lightblue",
##paste0 以空字符串连接字符
     main = paste0("Histogram ofn",nrow(p2g.df.obs)," P2G correlations"),xlab = "Correlation")
dev.off()

pdf("./Significant_P2G_Outputs/P2g_Pval.pdf",width = 5,height = 3.5)
hist(p2g.df.obs$RawPVal,col="lightblue",
     main = paste0("Histogram ofn",nrow(p2g.df.obs)," P2G p-values"),xlab = "p-value")
abline(v=0.01,col = "red")
dev.off()

saveRDS(p2g.df.obs,"./Significant_P2G_Outputs/All_P2G_Observed.rds")

################################################################################################################

p2g.df.hist <- dplyr::filter(p2g.df.obs,RawPVal <=1e-12 & Correlation >= 0.45)

pdf("./Significant_P2G_Outputs/GenesPerPeak_histogram.pdf",width = 5,height = 3.5)
hist(table(p2g.df.hist$idxRNA),main="Distribution of genes per peaks")
dev.off()

pdf("./Significant_P2G_Outputs/PeaksPerGene_histogram.pdf",width = 5,height = 3.5)
hist(table(p2g.df.hist$idxATAC),main="Distribution of peaks per gene")
dev.off()


#Subset to postive correlation P2Gs
p2g.df.sub <- dplyr::filter(p2g.df.obs,RawPVal <=1e-12 & Correlation >= 0.45& peakType == "Distal")

# Plot peak2 gene heatmap

p2g.df.sub$idx <- paste0(p2g.df.sub$idxATAC,"-",p2g.df.sub$idxRNA)
p2g.df.sub.plot <- p2g.df.sub

# Make color scheme for heatmap based on original UMAP colors:
proj$cluster.num <- factor(gsub("-.*","",proj$predictedGroup_ArchR))
my_levels <- c("6","8","10","14", "15","19",
                            "21",
                            "22",
                            "1",
                            "2",
                            "9",
                            "13",
                            "16",
                            "17",
                            "23",# Remove cluster 20 b/c it only had 10 cells
                            "4" ,"5","12",
                            "3","11","26",
                            "0","18",
                            "24","7",
                            "25",
                            "27","28")
# Relevel object@ident
proj@cellColData$cluster.new <- factor(x=proj$cluster.num, levels = my_levels)
# Make order of colors:
epithelial.cols <- colorRampPalette(c("#A0E989", "#337B24"))
epithelial.cols <- epithelial.cols(8)
fibro.cols <- c("#f593f1","#f272ed","#ef52e9","#e415dd","#c412bd","#a30f9d","#62095e")
smooth.cols <- c("#c2abd6","#b47fe5","#8948a1")
endo.cols <- c("#93CEFF","#4A99FF","#286ab5")
t.cols <- c("gray40","gray60")
macro.cols <- c("#ff6600","#ff9d5c")
mast.cols <- "gold3"
b.cols <- c("#B22222","#CD5C5C")
cols <- c(epithelial.cols,fibro.cols,smooth.cols,endo.cols,t.cols,macro.cols,mast.cols,b.cols)
names(cols) <- levels(factor(proj@cellColData$cluster.new))

source("P2G_Heatmap_Distal.R")
##一个整合描述要在热图中绘制的峰到基因链接的最大数量。 cellColData 中用于标记 KNN 分组的列的名称。使用出现在 KNN 分组中的最大组。描述 groupBy 中颜色的调色板。
test <- plotPeak2GeneHeatmap.distal(proj,peaks=p2g.df.sub.plot,groupBy = "cluster.new",k=length(levels(factor(proj$cluster.new))),
                                    corCutOff = .45,varCutOffATAC = 0,varCutOffRNA = 0,FDRCutOff = 1,nPlot =100000,palGroup=cols,
                                    palATAC = paletteContinuous("solarExtra"),
                                    palRNA = paletteContinuous("solarExtra"))

pdf("./Significant_P2G_Outputs/Peak2Gene_Heatmap_Legend.pdf",width = 8,height = 12)
draw(test, heatmap_legend_side = "bottom")
dev.off()

p2g.heat.df <- plotPeak2GeneHeatmap.distal(proj,peaks=p2g.df.sub.plot,groupBy = "predictedGroup_ArchR",k=length(levels(factor(proj$predictedGroup_ArchR))),
                                           corCutOff = .45,varCutOffATAC = 0,varCutOffRNA = 0,FDRCutOff = 1,returnMatrices = T,nPlot = 100000)


# Save P2G peaknames and Kmeans cluster for the genes of interest (specific to cancer cells)

p2g.df.sub.plot$kmeans <- p2g.heat.df$RNA$kmeansId
saveRDS(p2g.df.sub.plot,"./Significant_P2G_Outputs/Distal_P2Gs_Kmeans.rds")

pdf("./Significant_P2G_Outputs/GenesPerPeak_histogram-distal.pdf",width = 5,height = 3.5)
hist(table(p2g.df.sub.plot$idxRNA),main="Distribution of genes per distal peaks")
dev.off()

pdf("./Significant_P2G_Outputs/PeaksPerGene_histogram-distal.pdf",width = 5,height = 3.5)
hist(table(p2g.df.sub.plot$idxATAC),main="Distribution of distal peaks per gene")
dev.off()

store.kmeans <- c("7","14","15","16","17","18","19","20")
# Extract P2Gs for kmeans clusters of interest
p2g.df.sub.plot.cancer.kmeans <- p2g.df.sub.plot[p2g.df.sub.plot$kmeans %in% store.kmeans,]
saveRDS(p2g.df.sub.plot.cancer.kmeans,"./Significant_P2G_Outputs/Cancer_enriched_P2G_table.rds")
##GRanges 类储存的是一系列基因组区间,每个区间都有一个起始位点和终止位点,可用来存储基因组特征的位置 (比如转录本,外显子等
p2g <- GRanges(p2g.df.sub.plot.cancer.kmeans$peakName)

encode.all <- read.delim("./GRCh38-ccREs.bed",header =F)
colnames(encode.all)[1:3] <- c("seqnames","start","end")
widths <- encode.all$end - encode.all$start
encode.all <- GRanges(encode.all)

ft.peaks <- readRDS("Fallopian_Tube_Cell_line_Peaks.rds")
widths.2 <- ft.peaks$end - ft.peaks$start
ft.peaks <- GRanges(ft.peaks)

oe.peaks <- readRDS("Ovarian_Epithelial_Cell_line_Peaks.rds")
widths.3 <- oe.peaks$end - oe.peaks$start
oe.peaks <- GRanges(oe.peaks)

widths <- c(widths,widths.2,widths.3)
tot <- 3.3e 9*0.98/mean(widths)
##
ol <- findOverlapsOfPeaks(unique(p2g), encode.all,unique(oe.peaks),unique(ft.peaks), minoverlap = 1,connectedPeaks = "min")
saveRDS(ol,"./Significant_P2G_Outputs/Ol.rds")
overlappingPeaks <- ol$overlappingPeaks
peaklist <- ol$peaklist
saveRDS(overlappingPeaks,"./Significant_P2G_Outputs/OverlappingPeaks.rds")
names <- names(overlappingPeaks)[4:6]
print(names)
total <- c(overlappingPeaks[[names[1]]]$overlapFeature,
           overlappingPeaks[[names[2]]]$overlapFeature,
           overlappingPeaks[[names[3]]]$overlapFeature)

pdf("./Significant_P2G_Outputs/ChipPeakAnno_Pie_Overlaps.pdf")
pie1(table(total))
dev.off()

ft.overlap <- overlappingPeaks[[names[1]]]
levels(factor(ft.overlap$overlapFeature))
colnames(ft.overlap)[8:12] <- c("seqnames2","start2","end2","width2","strand2")
ft.overlap$overlap <- ifelse(ft.overlap$overlapFeature == "includeFeature",ft.overlap$width2,"fill")
ft.overlap$overlap <- ifelse(ft.overlap$overlapFeature == "inside",ft.overlap$width,ft.overlap$overlap)
ft.overlap$overlap <- ifelse(ft.overlap$overlapFeature == "overlapEnd",ft.overlap$end2 - ft.overlap$start,ft.overlap$overlap)
ft.overlap$overlap <- ifelse(ft.overlap$overlapFeature == "overlapStart",ft.overlap$end - ft.overlap$start2,ft.overlap$overlap)
levels(factor(ft.overlap$overlap))
ft.overlap$overlap <- as.numeric(ft.overlap$overlap)
hist(ft.overlap$overlap)

oe.overlap <- overlappingPeaks[[names[2]]]
levels(factor(oe.overlap$overlapFeature))
colnames(oe.overlap)[8:12] <- c("seqnames2","start2","end2","width2","strand2")
oe.overlap$overlap <- ifelse(oe.overlap$overlapFeature == "includeFeature",oe.overlap$width2,"fill")
oe.overlap$overlap <- ifelse(oe.overlap$overlapFeature == "inside",oe.overlap$width,oe.overlap$overlap)
oe.overlap$overlap <- ifelse(oe.overlap$overlapFeature == "overlapEnd",oe.overlap$end2 - oe.overlap$start,oe.overlap$overlap)
oe.overlap$overlap <- ifelse(oe.overlap$overlapFeature == "overlapStart",oe.overlap$end - oe.overlap$start2,oe.overlap$overlap)
levels(factor(oe.overlap$overlap))
oe.overlap$overlap <- as.numeric(oe.overlap$overlap)
hist(oe.overlap$overlap)

encode.overlap <- overlappingPeaks[[names[3]]]
levels(factor(encode.overlap$overlapFeature))
colnames(encode.overlap)[8:12] <- c("seqnames2","start2","end2","width2","strand2")
encode.overlap$overlap <- ifelse(encode.overlap$overlapFeature == "includeFeature",encode.overlap$width2,"fill")
encode.overlap$overlap <- ifelse(encode.overlap$overlapFeature == "inside",encode.overlap$width,encode.overlap$overlap)
encode.overlap$overlap <- ifelse(encode.overlap$overlapFeature == "overlapEnd",encode.overlap$end2 - encode.overlap$start,encode.overlap$overlap)
encode.overlap$overlap <- ifelse(encode.overlap$overlapFeature == "overlapStart",encode.overlap$end - encode.overlap$start2,encode.overlap$overlap)
levels(factor(encode.overlap$overlap))
encode.overlap$overlap <- as.numeric(encode.overlap$overlap)
hist(encode.overlap$overlap)

head(ft.overlap[1:3,])
head(oe.overlap[1:3,])
head(encode.overlap[1:3,])

ft.overlap <- ft.overlap[,c(1:12,ncol(ft.overlap),grep("overlapFeature",colnames(ft.overlap)))]
oe.overlap <- oe.overlap[,c(1:12,ncol(oe.overlap),grep("overlapFeature",colnames(oe.overlap)))]
encode.overlap <- encode.overlap[,c(1:12,ncol(encode.overlap),grep("overlapFeature",colnames(encode.overlap)))]

all.overlap <- rbind(ft.overlap,oe.overlap,encode.overlap)

pdf("./Significant_P2G_Outputs/Distribtion_of_All_Overlaps.pdf")
hist(all.overlap$overlap)
dev.off()
types <- c("overlapEnd","overlapStart")
all.overlap <- all.overlap[all.overlap$overlapFeature %in% types,]
pdf("./Significant_P2G_Outputs/Distribution_of_Partial_Overlaps.pdf")
hist(all.overlap$overlap)
dev.off()

pdf("./Significant_P2G_Outputs/ChipPeakAnno_Venn_Overlaps.pdf")
makeVennDiagram(ol,totalTest = tot,connectedPeaks = "min")
dev.off()
venn <- makeVennDiagram(ol,totalTest = tot,connectedPeaks = "min")
saveRDS(venn,"./Significant_P2G_Outputs/venn.rds")
peak.names <- paste0(peaklist$unique.p2g.@seqnames,":",peaklist$unique.p2g.@ranges)

# Find cancer specific peak to gene links 
p2g.df.sub.plot.cancer.kmeans <- p2g.df.sub.plot.cancer.kmeans[p2g.df.sub.plot.cancer.kmeans$peakName %in% peak.names,]
saveRDS(p2g.df.sub.plot.cancer.kmeans,"./Significant_P2G_Outputs/Cancer_specific_P2G_table.rds")

# Plot P2G heatmap for cancer specific distal elements
##############################################################################################################

source("P2G_Heatmap_Distal.R")
test <- plotPeak2GeneHeatmap.distal(proj,peaks=p2g.df.sub.plot.cancer.kmeans,groupBy = "cluster.new",k=length(levels(factor(proj$cluster.new))),
                                    corCutOff = .45,varCutOffATAC = 0,varCutOffRNA = 0,FDRCutOff = 1,nPlot =100000,palGroup=cols,
                                    palATAC = paletteContinuous("solarExtra"),
                                    palRNA = paletteContinuous("solarExtra"))

pdf("./Significant_P2G_Outputs/Peak2Gene_Heatmap_Legend-cancerspecific.pdf",width = 8,height = 12)
draw(test, heatmap_legend_side = "bottom")
dev.off()

p2g.heat.df <- plotPeak2GeneHeatmap.distal(proj,peaks=p2g.df.sub.plot.cancer.kmeans,groupBy = "predictedGroup_ArchR",k=length(levels(factor(proj$predictedGroup_ArchR))),
                                           corCutOff = .45,varCutOffATAC = 0,varCutOffRNA = 0,FDRCutOff = 1,returnMatrices = T,nPlot = 100000)

p2g.df.sub.plot.cancer.kmeans$kmeans <- p2g.heat.df$RNA$kmeansId
saveRDS(p2g.df.sub.plot.cancer.kmeans,"./Significant_P2G_Outputs/Distal_P2Gs_Kmeans-cancerspecific.rds")

pdf("./Significant_P2G_Outputs/GenesPerPeak_histogram-distal-cancerspecific.pdf",width = 5,height = 3.5)
hist(table(p2g.df.sub.plot.cancer.kmeans$idxRNA),main="Distribution of genes per distal peaks")
dev.off()

pdf("./Significant_P2G_Outputs/PeaksPerGene_histogram-distal-cancerspecific.pdf",width = 5,height = 3.5)
hist(table(p2g.df.sub.plot.cancer.kmeans$idxATAC),main="Distribution of distal peaks per gene")
dev.off()
##############################################################################################

总结

作者提供的有关于染色质可及性远端调控元件的热图的可视化的脚本画出来的图还是挺好看的。由于我一直不太懂ArchR的原理,这个博主写的相对不错https://www.jianshu.com/p/a2bcdfb77961,博主写了一个系列的内容,可以有效的对其学习一下。

里面发现寻找几个样本之间重复的峰的方法跟普通的组学用到的是一样的,也是我最近这两天在从新对cuttag结果进行可视化后,又重新学习了一下。

发现在对代码和报错解决后,一定在整理,如何定期去更新代码,因为有很多代码在随着作者对软件的维护后,有可能升级了一些函数,导致前面的函数用法不能用,学习是一个无止境的内容呐。

0 人点赞