This chapter introduces the integration workflow for single-cell data in SCP, which is suitable for multi-sample or multi-batch data with pronounced batch effects.
- Main function: Integration_SCP
- SCP version: 0.5.6; Seurat version: v4.4.0
The Integration_SCP function
Integration_SCP is the integration workflow for single-cell data with batch effects. Its general preprocessing steps (e.g. normalization and highly variable feature detection) are the same as in Standard_SCP; the remaining steps have been standardized and adjusted according to the principles and official tutorials of each batch-integration algorithm. The workflow comprises one uncorrected (Uncorrected) pipeline plus integration pipelines for 11 algorithms (Seurat, scVI, MNN, fastMNN, Harmony, Scanorama, BBKNN, CSS, LIGER, Conos, ComBat).
The parameters of Integration_SCP are largely the same as those of Standard_SCP; see the Integration_SCP function documentation[1] for details. A few options deserve special attention:
- The input can be a single Seurat object that contains batch information (srtMerge), or a list of Seurat objects (srtList); in both cases the batch parameter must be specified. Note that when the input is a list, every Seurat object in the list must contain the specified batch column.
- HVF_source: specifies where the highly variable features (HVF) come from; the options are global or separate. Defaults: HVF_source = "separate", HVF_min_intersection = 1. HVF_source = "global" pools the cells of all batches for HVF detection, i.e. Seurat::FindVariableFeatures fits its loess model on gene means and variances computed across all cells of all batches; HVF obtained this way usually lead to a pronounced batch effect downstream. HVF_source = "separate" detects HVF within each batch separately and then uses Seurat::SelectIntegrationFeatures to select the HVF used for integration; an additional parameter, HVF_min_intersection, further requires a gene to be detected as an HVF in at least N batches (the larger N, the fewer HVF), which usually reduces the batch effect (see below).
- Each integration algorithm generally outputs at least one of a corrected embedding, a corrected graph, or a corrected feature-cell matrix; when several are produced, they are used with priority in that order. In particular:
  - If an algorithm only outputs a corrected feature-cell matrix (e.g. Seurat, MNN, ComBat), no linear dimension reduction has been performed during preprocessing, so the corrected matrix still needs to be linearly reduced; in this case linear_reduction_dims_use controls which dimensions of the resulting space are used, and by default the automatically estimated intrinsic dimensionality is used.
  - A corrected embedding should in principle be used with all of its dims (the default, which usually gives the best result); when the correction is unsatisfactory, the dims can also be specified manually, e.g. via scVI_dims_use, fastMNN_dims_use, Harmony_dims_use, Scanorama_dims_use, CSS_dims_use, LIGER_dims_use, etc.
- In addition to passing extra arguments for the general steps, the parameters of each integration algorithm's main function can also be passed through Integration_SCP (a minimal usage sketch follows this list):

| Integration algorithm | Parameter name | Corresponding function |
| --- | --- | --- |
| Seurat | FindIntegrationAnchors_params | Seurat::FindIntegrationAnchors |
| Seurat | IntegrateData_params | Seurat::IntegrateData |
| Seurat | IntegrateEmbeddings_params | Seurat::IntegrateEmbeddings |
| scVI | SCVI_params | scvi.model.SCVI* |
| scVI | PEAKVI_params | scvi.model.PEAKVI* |
| MNN | mnnCorrect_params | batchelor::mnnCorrect |
| fastMNN | fastMNN_params | batchelor::fastMNN |
| Harmony | RunHarmony_params | harmony::RunHarmony |
| Scanorama | Scanorama_params | scanorama.correct* |
| BBKNN | bbknn_params | bbknn.matrix.bbknn* |
| CSS | CSS_params | simspec::cluster_sim_spectrum |
| LIGER | optimizeALS_params | rliger::optimizeALS |
| LIGER | quantilenorm_params | rliger::quantile_norm |
| Conos | buildGraph_params | conos::Conos |
| ComBat | ComBat_params | sva::ComBat |
* indicates a Python function.
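Below is a minimal usage sketch (illustrative choices, not the tutorial's own example) combining the options above: list input via srtList, separate HVF detection, and pass-through of algorithm-specific parameters. Harmony and the theta argument are used purely for demonstration.
# A minimal sketch: list input, separate HVF detection, and pass-through of
# algorithm-specific parameters (here forwarded to harmony::RunHarmony).
library(SCP)
library(Seurat)
data("panc8_sub")
srtList <- SplitObject(panc8_sub, split.by = "tech")  # every element keeps the "tech" column
panc8_demo <- Integration_SCP(
  srtList = srtList, batch = "tech",   # list input instead of srtMerge
  integration_method = "Harmony",
  HVF_source = "separate",             # detect HVF within each batch separately
  HVF_min_intersection = 2,            # keep HVF detected in at least 2 batches
  RunHarmony_params = list(theta = 2)  # extra arguments for harmony::RunHarmony
)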
Function examples
We use a down-sampled collection of 8 adult human pancreas single-cell datasets as the example. It contains 8 samples generated on 5 single-cell library-preparation platforms; run ?panc8_sub in R to view details of this example data.
library(SCP)
library(Seurat)
data("panc8_sub")
panc8_sub
##> An object of class Seurat
##> 13716 features across 1600 samples within 1 assay
##> Active assay: RNA (13716 features, 0 variable features)
head(panc8_sub[[]])
##> orig.ident nCount_RNA nFeature_RNA tech replicate assigned_cluster
##> D71_58 D71 9915.908 3669 celseq celseq <NA>
##> D102_7 D102 8003.868 3300 celseq celseq <NA>
##> D3en4_73 D3en4 6643.200 3024 celseq celseq <NA>
##> D73_92 D73 6503.113 2901 celseq celseq <NA>
##> D1713_56 D1713 11674.000 4156 celseq celseq <NA>
##> D1713_19 D1713 5161.300 2150 celseq celseq <NA>
##> celltype dataset
##> D71_58 alpha celseq
##> D102_7 beta celseq
##> D3en4_73 ductal celseq
##> D73_92 beta celseq
##> D1713_56 ductal celseq
##> D1713_19 beta celseq
table(panc8_sub$dataset, panc8_sub$tech)
##>
##> celseq celseq2 fluidigmc1 indrop smartseq2
##> celseq 200 0 0 0 0
##> celseq2 0 200 0 0 0
##> fluidigmc1 0 0 200 0 0
##> indrop1 0 0 0 200 0
##> indrop2 0 0 0 200 0
##> indrop3 0 0 0 200 0
##> indrop4 0 0 0 200 0
##> smartseq2 0 0 0 0 200
Examining the batch effect
We can use the Uncorrected pipeline of Integration_SCP to examine the batch effect; here the batch variable to be corrected is set to tech:
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Uncorrected"
)
#> [2023-10-30 09:17:39] Start Uncorrected_integrate
#> [2023-10-30 09:17:39] Spliting srtMerge into srtList by column tech... ...
#> [2023-10-30 09:17:40] Checking srtList... ...
#> Data 1/5 of the srtList is raw_normalized_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 1/5 of the srtList...
#> Data 2/5 of the srtList is raw_normalized_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 2/5 of the srtList...
#> Data 3/5 of the srtList is raw_normalized_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 3/5 of the srtList...
#> Data 4/5 of the srtList is raw_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 4/5 of the srtList...
#> Data 5/5 of the srtList is raw_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 5/5 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-10-30 09:17:44] Finished checking.
#> [2023-10-30 09:17:45] Perform integration(Uncorrected) on the data...
#> [2023-10-30 09:17:45] Start Standard_SCP
#> [2023-10-30 09:17:45] Checking srtList... ...
#> Data 1/1 of the srtList has been log-normalized.
#> [2023-10-30 09:17:46] Finished checking.
#> [2023-10-30 09:17:46] Perform ScaleData on the data...
#> [2023-10-30 09:17:47] Perform linear dimension reduction (pca) on the data...
#> [2023-10-30 09:17:48] Perform FindClusters (louvain) on the data...
#> [2023-10-30 09:17:48] Reorder clusters...
#> [2023-10-30 09:17:48] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-10-30 09:17:57] Standard_SCP done
#> Elapsed time: 11.43 secs
#> [2023-10-30 09:17:57] Uncorrected_integrate done
#> Elapsed time: 17.83 secs
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
Note the messages printed while the pipeline runs: the batches in panc8_sub have been processed to different degrees. Two datasets are raw counts (raw_counts), and three have been normalized but not log-transformed (raw_normalized_counts, e.g. FPKM or TPM values), so NormalizeData(LogNormalize) was applied to all of them during preprocessing; in the following examples this step will not be repeated.
The batch effect in the result is obvious: every cell type is split apart by tech. An even larger batch effect can be observed with the "global" HVF or with Standard_SCP, whose results are in fact identical:
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Uncorrected",
HVF_source = "global"
)
p1 <- CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
panc8_sub <- Standard_SCP(panc8_sub)
p2 <- CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
p1 / p2
Removing the batch effect with integration algorithms
To invoke an integration algorithm, simply specify the integration_method parameter of Integration_SCP:
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Seurat"
)
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
The batch effect has now been corrected: cells of the same type form clusters, and the clusters are well separated. The names of the graphs, reductions, clusters, etc. newly added to the returned Seurat object are all prefixed with the integration algorithm that was used:
Graphs(panc8_sub)
##> [1] "Uncorrectedpca_KNN" "Uncorrectedpca_SNN" "Standardpca_KNN"
##> [4] "Standardpca_SNN" "Seurat_KNN" "Seurat_SNN"
Reductions(panc8_sub)
##> [1] "Uncorrectedpca" "UncorrectedUMAP2D" "UncorrectedUMAP3D"
##> [4] "Standardpca" "StandardpcaUMAP2D" "StandardpcaUMAP3D"
##> [7] "StandardUMAP2D" "StandardUMAP3D" "Seuratpca"
##> [10] "SeuratUMAP2D" "SeuratUMAP3D"
colnames(panc8_sub@meta.data)
##> [1] "orig.ident" "tech"
##> [3] "replicate" "assigned_cluster"
##> [5] "celltype" "dataset"
##> [7] "Uncorrectedpca_SNN_res.0.6" "Uncorrectedclusters"
##> [9] "Standardpca_SNN_res.0.6" "ident"
##> [11] "Standardpcaclusters" "Standardclusters"
##> [13] "nCount_RNA" "nFeature_RNA"
##> [15] "Seurat_SNN_res.0.6" "Seuratclusters"
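As a small illustration (not part of the original tutorial), these prefixed names can be used to pull out a specific corrected result for downstream work:
# A minimal sketch, assuming the Seurat integration above has been run:
# retrieve the corrected embedding and the integration-based clustering by
# their prefixed names.
seurat_pca <- Embeddings(panc8_sub, reduction = "Seuratpca")  # corrected PCA space
head(seurat_pca[, 1:5])
table(panc8_sub$Seuratclusters)  # clusters obtained from the Seurat integration workflow
CellDimPlot(panc8_sub, group.by = "Seuratclusters", reduction = "SeuratUMAP2D")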
For some integration pipelines (e.g. Seurat), you can try specifying different numbers of HVF and dims:
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Seurat",
nHVF = 3000, linear_reduction_dims_use = 1:50
)
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
The integration process can also be controlled by passing extra arguments, for example using Seurat's rpca integration workflow[2]:
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Seurat",
FindIntegrationAnchors_params = list(reduction = "rpca")
)
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
We can compare the performance of the different integration algorithms. To keep them comparable, for every algorithm that uses linear dimension reduction before or after integration, the first 50 dimensions are uniformly used for the subsequent nonlinear dimension reduction and clustering:
integration_methods <- c(
"Uncorrected", "Seurat", "scVI", "MNN", "fastMNN", "Harmony",
"Scanorama", "BBKNN", "CSS", "LIGER", "Conos", "ComBat"
)
plist1 <- list()
for (method in integration_methods) {
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = method,
linear_reduction_dims_use = 1:50,
nonlinear_reduction = "umap"
)
plist1[[method]] <- CellDimPlot(panc8_sub,
group.by = "celltype",
reduction = paste0(method, "UMAP2D"),
xlab = "", ylab = "", title = method,
legend.position = "none", theme_use = "theme_blank"
)
}
patchwork::wrap_plots(plotlist = plist1)
Slightly different from Standard_SCP, a single run of Integration_SCP can use only one linear dimension reduction method, but multiple nonlinear dimension reduction methods. Taking the Seurat integration pipeline as an example:
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Seurat",
linear_reduction_dims_use = 1:50,
nonlinear_reduction = c("umap", "tsne", "dm", "phate", "pacmap", "trimap", "largevis", "fr")
)
plist2 <- lapply(c("umap", "tsne", "dm", "phate", "pacmap", "trimap", "largevis", "fr"), function(nr) {
CellDimPlot(panc8_sub,
group.by = "celltype",
reduction = paste0("Seurat", nr, "2D"),
xlab = "", ylab = "", title = nr,
legend.position = "none",
theme_use = "theme_blank"
)
})
patchwork::wrap_plots(plotlist = plist2)
Also note that the integration output of the BBKNN and Conos pipelines is a graph, so the only usable nonlinear dimension reduction methods are umap and fr, for example:
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "BBKNN",
linear_reduction_dims_use = 1:50,
nonlinear_reduction = c("umap", "fr")
)
plist3 <- lapply(c("umap", "fr"), function(nr) {
CellDimPlot(panc8_sub,
group.by = "celltype",
reduction = paste0("BBKNN", nr, "2D"),
xlab = "", ylab = "", title = nr,
legend.position = "none",
theme_use = "theme_blank"
)
})
patchwork::wrap_plots(plotlist = plist3)
A sensible choice of HVF can reduce the batch effect
As mentioned earlier, detecting HVF with HVF_source = "separate" combined with HVF_min_intersection reduces the batch effect in downstream analyses. Here we use the Uncorrected pipeline to illustrate how the HVF_min_intersection parameter affects the result. Since the data contain 5 library-preparation platforms, HVF_min_intersection ranges from a minimum of 1 to a maximum of 5:
plist4 <- list()
for (i in 1:5) {
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Uncorrected",
HVF_min_intersection = i
)
plist4[[i]] <- CellDimPlot(panc8_sub,
group.by = "celltype",
reduction = "UncorrectedUMAP2D",
xlab = "", ylab = "",
title = paste0("HVF_min_intersection=", i),
legend.position = "none",
theme_use = "theme_blank"
)
}
patchwork::wrap_plots(plotlist = plist4)
We can further combine HVF_min_intersection = 5 with the within-batch scaling strategy (scale_within_batch = TRUE):
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Uncorrected",
HVF_min_intersection = 5,
scale_within_batch = TRUE
)
#> [2023-11-21 16:45:52] Start Uncorrected_integrate
#> [2023-11-21 16:45:53] Spliting srtMerge into srtList by column tech... ...
#> [2023-11-21 16:46:37] Checking srtList... ...
#> Data 1/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 1/5 of the srtList...
#> Data 2/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 2/5 of the srtList...
#> Data 3/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 3/5 of the srtList...
#> Data 4/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 4/5 of the srtList...
#> Data 5/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 5/5 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-11-21 16:46:43] Finished checking.
#> [2023-11-21 16:46:49] Perform integration(Uncorrected) on the data...
#> [2023-11-21 16:46:49] Perform ScaleData on the data...
#> [2023-11-21 16:46:50] Perform linear dimension reduction (pca) on the data...
#> [2023-11-21 16:46:51] Perform FindClusters (louvain) on the data...
#> [2023-11-21 16:46:51] Reorder clusters...
#> [2023-11-21 16:46:52] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-11-21 16:47:04] Uncorrected_integrate done
#> Elapsed time: 1.19 mins
CellDimPlot(panc8_sub,
group.by = c("tech", "celltype"),
title = "scale_within_batch"
)
The plots show that as HVF_min_intersection increases, the batch effect gradually decreases; at HVF_min_intersection = 5 it is almost eliminated, and when combined with the within-batch scaling strategy, the result is already comparable to that of the integration algorithms.
Why can this effect be achieved merely by selecting a subset of HVF, without any integration algorithm?
In fact, the HVF of a dataset reflect the genes that vary strongly between cells across the whole dataset (genes whose variance, after variance stabilization, is still higher than expected); these HVF can be used to characterize, measure and represent the dataset.
Which genes vary strongly between cells?
- The main markers of cell types, or markers of cell states (e.g. genes related to differentiation status or the cell cycle), whose expression is relatively stable and reproducible;
- Random, or highly variable but irrelevant, noise genes (expression fluctuations that may be caused by biological factors, or by library preparation, sequencing technology or other batch factors), and so on.
The set of HVF detected differs from dataset to dataset, especially between datasets with a large batch effect. It can be expected that, in most cases, the noise genes differ between batches, whereas the cell states or cell type compositions overlap or are even identical.
Therefore, when the intersection of the HVF of all datasets is used, most of the retained genes are markers of the individual cell types, i.e. of the main features of the cells, and the expression specificity of these markers is conserved across batches (for example, GCG is most highly expressed in alpha cells in every batch); relying on them thus removes the noise and, with it, the batch effect.
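As a quick check of this claim (an illustrative sketch, not part of the original tutorial), the expression of GCG across batches and cell types can be inspected with a standard Seurat plot:
# A minimal sketch: GCG should be most highly expressed in alpha cells in every
# batch (tech), regardless of the platform.
VlnPlot(panc8_sub, features = "GCG", group.by = "celltype", split.by = "tech", pt.size = 0)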
Going one step further: if the cell types in each dataset are first analyzed separately to identify cell type markers that are conserved across batches, integration based on these genes should also remove the batch effect. Compared with global unsupervised integration methods, this approach relies more on the annotation within each dataset and is therefore a kind of semi-supervised integration.
So let's run a simple test.
Since panc8_sub already contains cell type information, we skip the separate per-dataset analysis and use RunDEtest directly to detect cell type markers that are conserved across batches (markers_type = "conserved"), then keep the strongly variable markers as candidate HVF (p_val_adj < 0.05 & avg_log2FC > 3). RunDEtest can systematically identify differentially expressed, conserved or perturbed genes; the function will be described in detail in a later chapter.
panc8_sub <- RunDEtest(panc8_sub, group_by = "celltype", grouping.var = "tech", markers_type = "conserved")
conserved <- dplyr::filter(panc8_sub@tools$DEtest_celltype$ConservedMarkers_wilcox, p_val_adj < 0.05 & avg_log2FC > 3)
ht <- GroupHeatmap(panc8_sub,
slot = "data",
features = conserved$gene, feature_split = conserved$group1,
group.by = "tech", split.by = "celltype", within_groups = TRUE
)
ht$plot
Use these cell type markers conserved across batches as HVF and run the Uncorrected integration:
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Uncorrected",
HVF = conserved$gene
)
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
As you can see, using these 974 cell type markers conserved across batches also removes the batch effect, with cells of the same type clustering together.
What if we push the simplification to the extreme?
Choosing even fewer, more specifically expressed markers as HVF should also achieve the integration effect. Here we pick the top 3 markers for each group; in practice you could also hand-pick specific markers for each cell type (see the sketch at the end of this section):
top <- conserved %>%
dplyr::group_by(group1) %>%
dplyr::top_n(3, avg_log2FC)
table(top$group1)
#>
#> alpha beta ductal acinar
#> 3 3 3 3
#> activated_stellate gamma delta epsilon
#> 3 3 3 0
#> mast endothelial macrophage quiescent_stellate
#> 0 3 3 3
#> schwann
#> 3
panc8_sub <- Integration_SCP(
srtMerge = panc8_sub, batch = "tech",
integration_method = "Uncorrected",
HVF = top$gene,
scale_within_batch = TRUE
)
#> [2023-11-21 16:48:36] Start Uncorrected_integrate
#> [2023-11-21 16:48:36] Spliting srtMerge into srtList by column tech... ...
#> [2023-11-21 16:49:19] Checking srtList... ...
#> Data 1/5 of the srtList has been log-normalized.
#> Data 2/5 of the srtList has been log-normalized.
#> Data 3/5 of the srtList has been log-normalized.
#> Data 4/5 of the srtList has been log-normalized.
#> Data 5/5 of the srtList has been log-normalized.
#> [2023-11-21 16:49:20] Finished checking.
#> [2023-11-21 16:49:25] Perform integration(Uncorrected) on the data...
#> [2023-11-21 16:49:25] Perform ScaleData on the data...
#> [2023-11-21 16:49:25] Perform linear dimension reduction (pca) on the data...
#> [2023-11-21 16:49:29] Perform FindClusters (louvain) on the data...
#> [2023-11-21 16:49:29] Reorder clusters...
#> [2023-11-21 16:49:30] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-11-21 16:49:42] Uncorrected_integrate done
#> Elapsed time: 1.1 mins
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))
In fact, because so few HVF are used, the linear dimension reduction step is no longer necessary; UMAP can be run directly on these HVF:
panc8_sub %>%
RunUMAP(features = top$gene) %>%
CellDimPlot(group.by = c("tech", "celltype"), reduction = "umap")
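As mentioned above, hand-picked markers can serve the same purpose. The sketch below is an illustration with a hypothetical, manually curated list of well-known pancreas markers (not the tutorial's own code), passed as HVF in the same way:
# A hypothetical sketch: a manually curated marker list used as HVF.
manual_markers <- c(
  "GCG",   # alpha
  "INS",   # beta
  "SST",   # delta
  "PPY",   # gamma
  "PRSS1", # acinar
  "KRT19"  # ductal
)
panc8_manual <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Uncorrected",
  HVF = manual_markers,
  scale_within_batch = TRUE
)
CellDimPlot(panc8_manual, group.by = c("tech", "celltype"))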
Summary
The batch effect present in single-cell data reflects the expression fluctuations of certain features (genes) between batches. These fluctuations mainly stem from various non-biological factors and interfere with the analysis in several ways:
- On the one hand, the fluctuations of these genes directly affect comparisons between datasets, and integration algorithms are needed to remove their interference with the neighborhood relationships between cells;
- On the other hand, conclusions drawn from such genes in an individual dataset cannot easily be extrapolated, because they are very likely irreproducible results produced by some batch factor.
Therefore, genes related to the batch effect can be regarded as noise; their interference should be excluded when integrating, comparing or mapping datasets, and they should be treated with caution in other downstream analyses.
The goal of single-cell data integration is to establish neighborhood relationships between cells of the same type or state; for datasets covering lineage differentiation, such as embryogenesis or spermatogenesis, cells in different differentiation states also need to be connected, so that cell clustering and differentiation paths remain consistent across datasets.
To achieve this consistency, i.e. to remove the influence of the batch effect, besides using integration algorithms, the simplest and most straightforward idea is to integrate using a few key features (genes) that are conserved across batches, just as we routinely use markers after integration to assign cell types to clusters. As the last example of this chapter shows, a few dozen genes are enough to achieve cross-batch integration with an ordinary single-cell workflow.
References
[1] Integration_SCP function documentation: https://zhanghao-njmu.github.io/SCP/reference/Integration_SCP.html
[2] rpca integration workflow: https://satijalab.org/seurat/articles/integration_rpca