在单细胞转录组冒尖的时候,单细胞其它组学却一直不温不火,比如单细胞ATAC以及单细胞免疫组库。所以虽然很早之前我们就推荐了github代码:https://github.com/GreenleafLab/10x-scATAC-2019
- 01_Filter_Cells_v2.R
- 02_Get_Peak_Set_hg19_v2.R
- 03_Run_chromVAR_v2.R
- 04_Run_Cicero_v2.R
- 05_Cluster_Unique_Peaks_v2.R
- 06_Analyze_UMAP_Trajectory.R
- 07_ChromVAR_For_GWAS_w_CoAccessbility_v2.R
- 08_Run_scCNV_v2.R
但是一直没有深入解读它,在2021的时候《单细胞天地》的一个小伙伴写了个开头:
- scATAC-seq1:由转录组到表观组
- scATAC-seq2: scATAC-seq技术原理
- scATAC-seq3:常用工具—SnapATAC简介
但是没有能坚持下来,其实文章给的配套github代码非常齐全了,就是需要花时间钻研和解读。
恰好2022我们又看到了一个带有全套github代码以及配套数据的文章,见:10x的单细胞ATAC上游流程之cellranger-atac,在等待的期间,交流群的学徒就先拿 atac_v1_pbmc_10k 这样的示例数据开始练习了。所以转发学徒的系列教程给大家。
- scATAC-seq| 数据导入
- scATAC-seq| 数据质控
- scATAC-seq| 降维聚类
- 整合scRNA和scATAC数据
- scATAC-seq| 细胞注释(结合scRNA)
现在继续:
上回求出差异的peaks之后,还能挖掘更多的数据信息:
例如用热图来展示结果,并且求peak区域附近的基因,看其分布位置
↓↓↓
To investigate the differences in chromatin accessibility among all cell types, we first identified a total of 212,326 peaks in our scATAC-seq data by using MACS223 and found that ~10.6% (22,682 unique peaks) of them exhibited significant differences among cell types; these sites were defined as differentially accessible chromatin regions (DARs) (adjusted P < 0.05 and log2(fold change (FC)) > 0.25)(Fig. 2a)
The majority of DARs were located in the promoter and intronic regions of the genome, and the distribution of DARs was relatively conserved across cell types (Fig. 2b and Supplementary Fig. S2b).
文章用MACS223求出scATAC-seq中的peak,随后用过设置阈值求出有显著差异的染色质开放区域(differentially accessible chromatin regions (DARs)即有显著差异的peak)及其基因的分布位置
文章:Single-cell multiomics analysis reveals regulatory programs in clear cell renal cell carcinoma
流程:
- 寻找差异的peaks:
FindAllMarkers
orFindMarkers
- peaks附近的基因:
ClosestFeature
- 设置阈值:求出DAR
- 求平均表达量:
AverageExpression
- 可视化:热图,饼图,直方图,条形图等
1. 显著差异的peaks
代码语言:javascript复制# 确保已经分好细胞群,不再是序号
table(Idents(pbmc))
# CD14 Monocytes NK dim CD4 Memory CD4 Naive
# 3122 360 820 698
# Double negative T cell CD8 effector CD8 Naive CD16 Monocytes
# 391 563 317 217
# pre-B cell B cell progenitor Dendritic cell pDC
# 280 170 68 49
# NK bright
# 5
levels(pbmc)
# [1] "CD14 Monocytes" "NK dim" "CD4 Memory"
# [4] "CD4 Naive" "Double negative T cell" "CD8 effector"
# [7] "CD8 Naive" "CD16 Monocytes" "pre-B cell"
# [10] "B cell progenitor" "Dendritic cell" "pDC"
# [13] "NK bright"
############################# identify differentially accessible chromatin regions between celltypes
DefaultAssay(pbmc) <- "peaks"
# 寻找显著差异的peaks
# cellType.DARs <- FindAllMarkers(scATAC.data,
# test.use = 'LR',
# logfc.threshold=0,
# min.pct = 0.05, # often necessary to lower the min.pct threshold
# latent.vars = "peak_region_fragments")
# 运行比较慢,这里只比较两个细胞群的差异peaks
cellType.DARs <- Seurat::FindMarkers(
object = pbmc,
ident.1 = 'CD4 Naive',
ident.2 = 'CD8 Naive',
logfc.threshold=0, # 默认是0.25
only.pos = TRUE,
test.use = 'LR', # 推荐用logistic regression
min.pct = 0.05, # 该基因表达数目占该类细胞总数的比例
latent.vars = 'nCount_peaks'
)
head(cellType.DARs)
# p_val avg_log2FC pct.1 pct.2 p_val_adj
# chr1-234544345-234545516 8.454563e-16 0.5295198 0.191 0.022 7.402900e-11
# chr16-79632101-79636717 1.431972e-13 0.2235968 0.268 0.073 1.253849e-08
# chr6-167362837-167366950 1.648728e-13 0.2943871 0.285 0.085 1.443643e-08
# chr12-6895592-6896474 3.370162e-13 0.4385442 0.123 0.006 2.950948e-08
# chr12-6898094-6899126 8.623449e-13 0.2885348 0.106 0.003 7.550778e-08
# chr9-90340551-90341713 1.854913e-12 0.2595704 0.142 0.016 1.624180e-07
dim(cellType.DARs) #12642 5
有些教程在FindMarkers
用的变量是latent.vars = "peak_region_fragments"
,其实出来的结果差不多,因其相关性很高
cor(pbmc$peak_region_fragments, pbmc$nCount_peaks) # 0.999884
2. peaks附近的基因
peaks附近的基因: ClosestFeature
# 寻找附近的基因
cf <- ClosestFeature(pbmc, regions = rownames(cellType.DARs))# Find the closest feature to a given set of genomic regions
head(cf)
table(rownames(cellType.DARs) %in% cf$query_region) # 对应peaks
# TRUE
# 12642
table(rownames(cellType.DARs) %in% cf$closest_region) # 对应基因的region
# FALSE
# 12642
这里就把求出的regions = rownames(cellType.DARs)
当成所有细胞群差异的peak,如果着重研究某两个细胞群的差异,regions
就需要改变一下
open_CD4_Naive <- rownames(cellType.DARs[cellType.DARs$avg_log2FC > 0.5, ])
open_CD8_Naive <- rownames(cellType.DARs[cellType.DARs$avg_log2FC < -0.5, ])
closest_genes_CD4_Naive <- ClosestFeature(pbmc, regions = open_CD4_Naive)
closest_genes_CD8_Naive <- ClosestFeature(pbmc, regions = open_CD8_Naive)
结果:
cf$query_region:
输入的peakscf$closest_region:
基因的region
> head(cf)
tx_id gene_name gene_id gene_biotype type
ENST00000481183 ENST00000481183 TARBP1 ENSG00000059588 protein_coding gap
ENST00000569649 ENST00000569649 MAF ENSG00000178573 protein_coding cds
ENST00000508775 ENST00000508775 RNASET2 ENSG00000026297 protein_coding cds
ENST00000535707 ENST00000535707 CD4 ENSG00000010610 protein_coding gap
ENST00000541982 ENST00000541982 CD4 ENSG00000010610 protein_coding utr
ENST00000343150 ENST00000343150 CTSL ENSG00000135047 protein_coding utr
closest_region query_region distance
ENST00000481183 chr1-234541846-234546190 chr1-234544345-234545516 0
ENST00000569649 chr16-79632682-79633799 chr16-79632101-79636717 0
ENST00000508775 chr6-167365976-167366036 chr6-167362837-167366950 0
ENST00000535707 chr12-6896172-6898654 chr12-6895592-6896474 0
ENST00000541982 chr12-6898702-6898828 chr12-6898094-6899126 0
ENST00000343150 chr9-90340434-90341313 chr9-90340551-90341713 0
> table(cf$gene_biotype)
lincRNA processed_transcript protein_coding rRNA
1017 114 11490 21
> table(cf$type)
cds exon gap utr
5465 1781 2724 2672
这里出来的基因类型和位置都可以用饼图来展示
3. 求平均表达量
AverageExpression
# 求表达量,画热图
cellType.DARs <- cbind(cellType.DARs,
gene=cf$gene_name,
query_region = cf$query_region,
type = cf$type,
distance=cf$distance)
# 设置阈值
#require logfc.threshold >= 0.25 & p_val_adj < 0.05
cellType.sig.pos.DARs <- cellType.DARs %>% dplyr::filter(avg_log2FC >=0.25 & p_val_adj < 0.05) %>% arrange(desc(avg_log2FC))
dim(cellType.sig.pos.DARs) #19 9
#plot--- differentially accessible chromatin regions heatmap
sig.region <- cellType.sig.pos.DARs %>% dplyr::select(query_region) %>% distinct()
# 求平均表达量
#average fragment of each peak in each cell type
sig.region.mean <- AverageExpression(pbmc,
features = sig.region$query_region,
assays = "peaks")
sig.region.mean.scale <- scale(t(sig.region.mean$peaks))
代码语言:javascript复制> sig.region.mean.scale[1:3,]
chr1-234544345-234545516 chr12-6895592-6896474 chr1-213843536-213844175
CD14 Monocytes -0.4226576 -0.2815046 -0.3905448
NK dim -0.4918933 -0.5234233 -0.4341829
CD4 Memory 0.9332080 0.6218469 0.4443438
chr2-112938926-112941640 chr14-59103807-59105663 chr21-46890735-46891216
CD14 Monocytes -0.5949016 -0.5212354 -0.2926606
NK dim -0.5968046 -0.5656007 -0.3216966
CD4 Memory 1.1153124 -0.6531111 0.7722765
chr15-57883649-57884489 chr3-118937835-118938422 chr3-32279336-32281981
CD14 Monocytes -0.4343374 -0.66028682 -0.4855104
NK dim -0.9293215 -0.15753047 -0.4648244
CD4 Memory 0.1321672 0.09150125 1.4333515
chr21-37660946-37662184 chr8-18870780-18872100 chr6-11728007-11729822
CD14 Monocytes -0.7490132 0.3656120 0.35311667
NK dim -0.7044903 -0.6868159 -0.72428267
CD4 Memory 1.5498568 -0.3555641 0.05747466
chr10-35300735-35301082 chr6-167362837-167366950 chr12-6898094-6899126
CD14 Monocytes -0.23194611 -0.8103845 1.1220201
NK dim -0.32961628 -0.7481198 -0.7828576
CD4 Memory 0.09374038 0.5397166 0.8174112
chr19-52191949-52194762 chr12-6899696-6902624 chr10-62491054-62493599 chr9-90340551-90341713
CD14 Monocytes -0.3312373 1.2080742 -1.2251931 1.9848987
NK dim -0.7100521 -0.8918025 -0.6249917 -0.5224442
CD4 Memory 1.4545402 0.3754083 0.7368930 -0.2492964
4. 画热图
代码语言:javascript复制# ?Heatmap
Heatmap(sig.region.mean.scale,
name = "z-score",
show_column_dend = F,
show_row_dend = F,
row_names_side = "left",
show_column_names = F,
row_names_gp = gpar(fontsize = 10),
width = unit(10, "cm"),
height = unit(8, "cm"),
# heatmap_legend_param = list(direction = "horizontal")
show_heatmap_legend=F)
library(circlize)
# 添加注释
col_fun = colorRamp2(c(-4,0,4), c("blue", "white", "red"))
lgd = Legend(col_fun = col_fun,
title = "z-score",
direction = "horizontal", # 横向注释
title_position='lefttop')
lgd
draw(lgd, x = unit(20.7, "cm"), y = unit(7.4, "cm")) # 调整注释位置
参考:https://satijalab.org/signac/reference/closestfeature