分享是一种态度
千淘万漉虽辛苦,吹尽狂沙始到金
本章开始介绍SCP各模块的使用教程,这里建议将SCP升级至v0.5.3之后的版本,包含了更完善的函数文档和示例。注意,目前SCP仅支持SeuratV4,后续版本计划与SeuratV5兼容。
- 主要函数:RunCellQC;RunDoubletCalling;
- SCP版本:0.5.3;Seurat版本:v4.4.0;
数据准备
这里我们使用10X Genomics官方的数据进行分析,该数据为健康成人的PBMC,捕获细胞数大约5k,详细的数据信息可以在官网查询到(5k[1]Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor (Next GEM))。
使用R下载原始数据(Feature[2] / cell matrix (filtered)),读取并创建Seurat对象:
代码语言:javascript复制library(SCP)
library(Seurat)
temp <- tempdir()
file <- paste0(temp, "/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz")
download.file(
url = "https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_v3_nextgem/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz",
destfile = file
)
untar(tarfile = file, exdir = temp)
counts <- Read10X(paste0(temp, "/filtered_feature_bc_matrix"))
srt <- CreateSeuratObject(counts = counts)
srt
#> An object of class Seurat
#> 33538 features across 5155 samples within 1 assay
#> Active assay: RNA (33538 features, 0 variable features)
质控前处理
在对手头数据进行质控前,建议进行初步的分析,避免盲目的”看指标质控”。因此,建议首先对进行单细胞的标准流程处理:
代码语言:javascript复制srt <- Standard_SCP(srt)
#> [2023-10-26 15:57:07] Start Standard_SCP
#> [2023-10-26 15:57:07] Checking srtList... ...
#> Data 1/1 of the srtList is raw_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 1/1 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-10-26 15:57:11] Finished checking.
#> [2023-10-26 15:57:11] Perform ScaleData on the data...
#> [2023-10-26 15:57:12] Perform linear dimension reduction (pca) on the data...
#> [2023-10-26 15:57:16] Perform FindClusters (louvain) on the data...
#> [2023-10-26 15:57:16] Reorder clusters...
#> [2023-10-26 15:57:17] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-10-26 15:57:31] Standard_SCP done
#> Elapsed time: 23.83 secs
Key(srt[["StandardUMAP2D"]]) <- "UMAP_"
CellDimPlot(srt, group.by = "Standardclusters", label = TRUE)
创建Seurat对象会自动计算各细胞的总UMI(nCount_RNA)和总基因数(nFeature_RNA),部分低质量的细胞已经可以通过这两种指标观察到:
代码语言:javascript复制FeatureDimPlot(srt,
features = c("nCount_RNA", "nFeature_RNA"),
cells.highlight = colnames(srt)[srt$nCount_RNA < 3000]
)
如果手头已经有一些细胞类型的marker列表,我们还可观察下这些已知的marker分布,以确保细胞群/细胞类型是真实存在的:
代码语言:javascript复制markers <- list(
"Naive T" = c("LEF1", "CCR7"),
"CD4 T" = c("CD4", "CCR6"),
"CD8 T" = c("CD8A", "CD8B"),
"NK" = c("GNLY", "NKG7"),
"B" = c("IL4R", "MS4A1"),
"Plasma" = c("CD27", "IGHG1"),
"CD14 Mono" = c("CD14", "LYZ"),
"CD16 Mono" = c("FCGR3A", "MS4A7"),
"DC" = c("FCER1A", "CST3"),
"Platelet" = c("PPBP", "PF4")
)
# 在新版的SCP中支持list输入
FeatureDimPlot(srt,
features = markers,
theme_use = "theme_blank", legend.position = "none"
)
为了演示方便,这里先进行了细胞注释。实际中可以先不做细胞群的定义,只需要观察细胞群的marker分布即可。
代码语言:javascript复制annotation <- list(
"Naive T" = c(1, 2),
"CD4 T" = 3,
"CD8 T" = 8,
"NK" = 10,
"B" = 6,
"Plasma" = 5,
"CD14 Mono" = 13,
"CD16 Mono" = 14,
"DC" = 11,
"Platelet" = 15,
"Unknown" = c(4, 7, 9, 12)
)
srt <- RenameClusters(srt, group.by = "Standardclusters", nameslist = annotation, name = "Annotation")
CellDimPlot(srt, group.by = "Annotation")
也可以借助各物种细胞类型参考数据集来进行快速注释,例如SCP内置的scHCL[3],scMCA[4],scZCL[5],来进行初步注释(注意注释结果不一定准确,仅供参考):
代码语言:javascript复制data("ref_scHCL")
srt <- RunKNNPredict(srt, query_group = "Standardclusters", bulk_ref = ref_scHCL)
CellDimPlot(srt, group.by = "KNNPredict_classification")
细胞质控
整体而言,质控(QC)目标是过滤掉两类细胞:
- 非正常细胞:不含细胞仅含环境RNA污染的空滴、含大量环境RNA污染的细胞、二(多)细胞、细胞碎片等;
- 各细胞群中的离群细胞:测序低质量细胞、凋亡状态的细胞等;
在SCP中,单细胞数据的相关细胞质控指标计算和过滤主要使用RunCellQC函数来完成。
RunCellQC根据以下内容进行细胞质控:
- doublets: 双(多)细胞过滤,可用的识别算法有scDblFinder, Scrublet, DoubletDetection, scds等。不同的单细胞建库平台doublets比例不同,一般上机细胞数越多,doublets比例越高,比如10X 每1000个细胞增加1%(doublets比例系数为0.01),也就是10000个细胞的doublets比例约为10%。默认使用scDblFinder算法,doublets比例系数为0.01。要获得不同算法的结果,可以直接使用RunDoubletCalling[6]函数进行doublets识别。
- outlier: 离群点过滤,这里主要使用绝对中位差(MAD,Median Absolute Deviation)来计算一些指标的离散程度并判断离群细胞,包括了log10_nCount, log10_nFeature, featurecount_dist等,若细胞在这些指标上高于或低于N倍的MAD,将视为该在指标上离群。其中,featurecount_dist是基于log10_nFeature和log10_nCount拟合的局部加权回归模型(LOESS模型)得到的log10_nFeature预测值与观察值的差值。另外还可以设定离群点判定的指标数量,默认为1,意味着任意一个指标离群则该细胞被判定为离群。
- umi, gene: 分别为细胞总UMI(nCount)和总基因数(nFeature)的硬阈值,根据
srt[[assay]]@counts
进行计算,默认分别为3000和1000。 - mito, ribo, ribo_mito_ratio: 线粒体(mito)基因的含量可作为提示细胞凋亡状态的指标,默认阈值为20(百分比)。核糖体(ribo)基因是细胞内表达最丰富的基因,其含量取决于细胞类型,有些细胞类型甚至超过50%。在细胞受挤压、破碎、凋亡时,胞质内的核糖体基因首先流出,而线粒体保留,所以过低的核糖体基因含量(常见凋亡细胞)或者过高的核糖体基因含量(常见受污染细胞、空滴)也可以作为筛选标准,默认ribo阈值为50(百分比)。ribo和mito比例过高均会压缩其余基因的相对表达量,导致可比性下降,必要时可考虑剔除这部分基因。另外ribo和mito的比值也可用作质控,默认大于1。
- species: 当使用物种混合样本建库测序时(常用于衡量二细胞比例),可以根据细胞内含有的兴趣物种基因比例进行过滤。默认阈值为95,细胞要求包含有兴趣物种基因含量高于95%。
另外,如果输入的Seurat对象包含有多个实验数据,可以设定split.by
参数来进行数据拆分后分别质控。其他参数请查阅RunCellQC函数文档[7]。
我们使用RunCellQC默认参数进行质控,但不做过滤:
代码语言:javascript复制srt <- RunCellQC(srt, return_filtered = FALSE)
#> >>> Total cells: 5155
#> >>> Cells which are filtered out: 1063
#> ... 256 potential doublets
#> ... 678 outlier cells
#> ... 646 low-UMI cells
#> ... 547 low-gene cells
#> ... 525 high-mito cells
#> ... 0 high-ribo cells
#> ... 728 ribo_mito_ratio outlier cells
#> ... 0 species-contaminated cells
#> >>> Remained cells after filtering: 4092
table(srt$CellQC)
#>
#> Pass Fail
#> 4092 1063
RunCellQC会计算生成各类QC指标和QC结果:
QC指标 | 中间结果 | QC结果 | 最终结果 |
---|---|---|---|
db.xxx_score | db.xxx_class | db_qc | CellQC |
nCount、log10_nCount | log10_nCount.lower.2.5 | umi_qc | |
nFeature、log10_nFeature | log10_nCount.higher.5 | gene_qc | |
featurecount_dist | log10_nFeature.lower.2.5 | outlier_qc | |
percent.mito | log10_nFeature.higher.5 | mito_qc | |
percent.ribo | featurecount_dist.lower.2.5 | ribo_qc | |
ribo.mito.ratio | ribo_mito_ratio_qc | ||
percent.genome | species_qc |
这里默认参数下共1063个细胞没有通过质控,其中有256个潜在的双(多)细胞、678个相关指标离群的细胞、646个低UMI的细胞、547个低基因数的细胞,525个高线粒体的细胞、0个高核糖体的细胞、728个核糖体/线粒体比例异常的细胞。
首先可以先看下计算的QC指标分布:
代码语言:javascript复制qc_metrics <- c("db.scDblFinder_score", "percent.mito", "percent.ribo", "ribo.mito.ratio", "log10_nFeature_RNA", "log10_nCount_RNA", "featurecount_dist")
FeatureDimPlot(srt, features = qc_metrics, theme_use = "theme_blank")
使用CellStatPlot
函数可以对QC结果进行统计和可视化:
CellStatPlot(
srt = srt,
stat.by = c(
"db_qc", "outlier_qc", "umi_qc", "gene_qc",
"mito_qc", "ribo_qc", "ribo_mito_ratio_qc"
),
plot_type = "upset", stat_level = "Fail"
)
QC结果为Fail
的细胞基本上分成两个部分,一部分未能通过umi_qc
,gene_qc
, mito_qc
, ribo_qc
, ribo_mito_ratio_qc
任意一项QC,另一部分仅仅未能通过db_qc
。
我们可以使用CellDimPlot
函数来在UMAP上显示这些未通过QC的细胞分布:
CellDimPlot(srt,
group.by = c(
"CellQC", "db_qc", "outlier_qc", "umi_qc", "gene_qc",
"mito_qc", "ribo_qc", "ribo_mito_ratio_qc"
),
theme_use = "theme_blank"
)
小心db_qc
的结果,如果单细胞数据集是一个连续分化/发育状态下的细胞类型(例如ESC/iPSC分化中的细胞谱系),一些doublets鉴定算法容易将那些分化过程中的细胞(两个细胞群之间的过度状态)判定为doublets,此时需要手动调整质控标准(例如db_qc
联合另外一种指标共同为Fail
最终才被过滤)。
如果数据集已经有了初步注释,也可以统计下各细胞群中的QC指标和QC结果:
代码语言:javascript复制FeatureStatPlot(srt,
stat.by = c("nCount_RNA", "nFeature_RNA"),
group.by = "Annotation",
add_box = TRUE
)
代码语言:javascript复制CellStatPlot(srt, stat.by = "CellQC", group.by = "Annotation", label = TRUE)
这里注意,所有Platelet细胞由于UMI/基因数低导致QC未通过。如果想要保留,可以手动划分它们为Pass
:
srt$CellQC[colnames(srt)[srt$Annotation == "Platelet"]] <- "Pass"
确认好所有要被过滤的细胞后,进行最终过滤:
代码语言:javascript复制srt_filtered <- subset(srt, CellQC == "Pass")
srt_filtered <- Standard_SCP(srt_filtered)
#> [2023-10-26 15:58:33] Start Standard_SCP
#> [2023-10-26 15:58:33] Checking srtList... ...
#> Data 1/1 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 1/1 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-10-26 15:58:39] Finished checking.
#> [2023-10-26 15:58:39] Perform ScaleData on the data...
#> [2023-10-26 15:58:39] Perform linear dimension reduction (pca) on the data...
#> [2023-10-26 15:58:42] Perform FindClusters (louvain) on the data...
#> [2023-10-26 15:58:43] Reorder clusters...
#> [2023-10-26 15:58:43] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-10-26 15:58:57] Standard_SCP done
#> Elapsed time: 23.34 secs
CellDimPlot(srt_filtered, "Annotation")
文中链接
[1]
5k: https://www.10xgenomics.com/resources/datasets/5-k-peripheral-blood-mononuclear-cells-pbm-cs-from-a-healthy-donor-next-gem-3-1-standard-3-0-2
[2]
Feature: https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_v3_nextgem/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz
[3]
scHCL: https://www.nature.com/articles/s41586-020-2157-4
[4]
scMCA: https://www.cell.com/cell/fulltext/S0092-8674(18)30116-8
[5]
scZCL: https://www.frontiersin.org/articles/10.3389/fcell.2021.743421/full
[6]
RunDoubletCalling: https://zhanghao-njmu.github.io/SCP/reference/RunDoubletCalling.html
[7]
RunCellQC函数文档: https://zhanghao-njmu.github.io/SCP/reference/RunCellQC.html