端到端的单细胞管道SCP-细胞质控

分享是一种态度

千淘万漉虽辛苦，吹尽狂沙始到金

本章开始介绍SCP各模块的使用教程，这里建议将SCP升级至v0.5.3之后的版本，包含了更完善的函数文档和示例。注意，目前SCP仅支持SeuratV4，后续版本计划与SeuratV5兼容。

主要函数：RunCellQC；RunDoubletCalling；
SCP版本：0.5.3；Seurat版本：v4.4.0；

数据准备

这里我们使用10X Genomics官方的数据进行分析，该数据为健康成人的PBMC，捕获细胞数大约5k，详细的数据信息可以在官网查询到（5k[1]Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor (Next GEM)）。

使用R下载原始数据（Feature[2] / cell matrix (filtered)），读取并创建Seurat对象:

代码语言：javascript复制

library(SCP)
library(Seurat)

temp <- tempdir()
file <- paste0(temp, "/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz")
download.file(
  url = "https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_v3_nextgem/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz",
  destfile = file
)
untar(tarfile = file, exdir = temp)

counts <- Read10X(paste0(temp, "/filtered_feature_bc_matrix"))
srt <- CreateSeuratObject(counts = counts)
srt
#> An object of class Seurat 
#> 33538 features across 5155 samples within 1 assay 
#> Active assay: RNA (33538 features, 0 variable features)

质控前处理

在对手头数据进行质控前，建议进行初步的分析，避免盲目的”看指标质控”。因此，建议首先对进行单细胞的标准流程处理：

代码语言：javascript复制

srt <- Standard_SCP(srt)
#> [2023-10-26 15:57:07] Start Standard_SCP
#> [2023-10-26 15:57:07] Checking srtList... ...
#> Data 1/1 of the srtList is raw_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 1/1 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-10-26 15:57:11] Finished checking.
#> [2023-10-26 15:57:11] Perform ScaleData on the data...
#> [2023-10-26 15:57:12] Perform linear dimension reduction (pca) on the data...
#> [2023-10-26 15:57:16] Perform FindClusters (louvain) on the data...
#> [2023-10-26 15:57:16] Reorder clusters...
#> [2023-10-26 15:57:17] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-10-26 15:57:31] Standard_SCP done
#> Elapsed time: 23.83 secs
Key(srt[["StandardUMAP2D"]]) <- "UMAP_"
CellDimPlot(srt, group.by = "Standardclusters", label = TRUE)

创建Seurat对象会自动计算各细胞的总UMI（nCount_RNA）和总基因数（nFeature_RNA），部分低质量的细胞已经可以通过这两种指标观察到：

代码语言：javascript复制

FeatureDimPlot(srt,
  features = c("nCount_RNA", "nFeature_RNA"),
  cells.highlight = colnames(srt)[srt$nCount_RNA < 3000]
)

如果手头已经有一些细胞类型的marker列表，我们还可观察下这些已知的marker分布，以确保细胞群/细胞类型是真实存在的：

代码语言：javascript复制

markers <- list(
  "Naive T" = c("LEF1", "CCR7"),
  "CD4  T" = c("CD4", "CCR6"),
  "CD8  T" = c("CD8A", "CD8B"),
  "NK" = c("GNLY", "NKG7"),
  "B" = c("IL4R", "MS4A1"),
  "Plasma" = c("CD27", "IGHG1"),
  "CD14  Mono" = c("CD14", "LYZ"),
  "CD16  Mono" = c("FCGR3A", "MS4A7"),
  "DC" = c("FCER1A", "CST3"),
  "Platelet" = c("PPBP", "PF4")
)
# 在新版的SCP中支持list输入
FeatureDimPlot(srt,
  features = markers,
  theme_use = "theme_blank", legend.position = "none"
)

为了演示方便，这里先进行了细胞注释。实际中可以先不做细胞群的定义，只需要观察细胞群的marker分布即可。

代码语言：javascript复制

annotation <- list(
  "Naive T" = c(1, 2),
  "CD4  T" = 3,
  "CD8  T" = 8,
  "NK" = 10,
  "B" = 6,
  "Plasma" = 5,
  "CD14  Mono" = 13,
  "CD16  Mono" = 14,
  "DC" = 11,
  "Platelet" = 15,
  "Unknown" = c(4, 7, 9, 12)
)
srt <- RenameClusters(srt, group.by = "Standardclusters", nameslist = annotation, name = "Annotation")
CellDimPlot(srt, group.by = "Annotation")

也可以借助各物种细胞类型参考数据集来进行快速注释，例如SCP内置的scHCL[3]，scMCA[4]，scZCL[5]，来进行初步注释（注意注释结果不一定准确，仅供参考）：

代码语言：javascript复制

data("ref_scHCL")
srt <- RunKNNPredict(srt, query_group = "Standardclusters", bulk_ref = ref_scHCL)
CellDimPlot(srt, group.by = "KNNPredict_classification")

细胞质控

整体而言，质控(QC)目标是过滤掉两类细胞：

非正常细胞：不含细胞仅含环境RNA污染的空滴、含大量环境RNA污染的细胞、二（多）细胞、细胞碎片等;
各细胞群中的离群细胞：测序低质量细胞、凋亡状态的细胞等;

在SCP中，单细胞数据的相关细胞质控指标计算和过滤主要使用RunCellQC函数来完成。

RunCellQC根据以下内容进行细胞质控：

doublets: 双（多）细胞过滤，可用的识别算法有scDblFinder, Scrublet, DoubletDetection, scds等。不同的单细胞建库平台doublets比例不同，一般上机细胞数越多，doublets比例越高，比如10X 每1000个细胞增加1%（doublets比例系数为0.01），也就是10000个细胞的doublets比例约为10%。默认使用scDblFinder算法，doublets比例系数为0.01。要获得不同算法的结果，可以直接使用RunDoubletCalling[6]函数进行doublets识别。
outlier: 离群点过滤，这里主要使用绝对中位差（MAD，Median Absolute Deviation）来计算一些指标的离散程度并判断离群细胞，包括了log10_nCount, log10_nFeature, featurecount_dist等，若细胞在这些指标上高于或低于N倍的MAD，将视为该在指标上离群。其中，featurecount_dist是基于log10_nFeature和log10_nCount拟合的局部加权回归模型（LOESS模型）得到的log10_nFeature预测值与观察值的差值。另外还可以设定离群点判定的指标数量，默认为1，意味着任意一个指标离群则该细胞被判定为离群。
umi, gene: 分别为细胞总UMI（nCount）和总基因数（nFeature）的硬阈值，根据srt[[assay]]@counts进行计算，默认分别为3000和1000。
mito, ribo, ribo_mito_ratio: 线粒体（mito）基因的含量可作为提示细胞凋亡状态的指标，默认阈值为20（百分比）。核糖体（ribo）基因是细胞内表达最丰富的基因，其含量取决于细胞类型，有些细胞类型甚至超过50%。在细胞受挤压、破碎、凋亡时，胞质内的核糖体基因首先流出，而线粒体保留，所以过低的核糖体基因含量（常见凋亡细胞）或者过高的核糖体基因含量（常见受污染细胞、空滴）也可以作为筛选标准，默认ribo阈值为50（百分比）。ribo和mito比例过高均会压缩其余基因的相对表达量，导致可比性下降，必要时可考虑剔除这部分基因。另外ribo和mito的比值也可用作质控，默认大于1。
species: 当使用物种混合样本建库测序时（常用于衡量二细胞比例），可以根据细胞内含有的兴趣物种基因比例进行过滤。默认阈值为95，细胞要求包含有兴趣物种基因含量高于95%。

另外，如果输入的Seurat对象包含有多个实验数据，可以设定split.by参数来进行数据拆分后分别质控。其他参数请查阅RunCellQC函数文档[7]。

我们使用RunCellQC默认参数进行质控，但不做过滤：

代码语言：javascript复制

srt <- RunCellQC(srt, return_filtered = FALSE)
#> >>> Total cells: 5155 
#> >>> Cells which are filtered out: 1063 
#> ... 256 potential doublets 
#> ... 678 outlier cells 
#> ... 646 low-UMI cells 
#> ... 547 low-gene cells 
#> ... 525 high-mito cells 
#> ... 0 high-ribo cells 
#> ... 728 ribo_mito_ratio outlier cells 
#> ... 0 species-contaminated cells 
#> >>> Remained cells after filtering: 4092
table(srt$CellQC)
#> 
#> Pass Fail 
#> 4092 1063

RunCellQC会计算生成各类QC指标和QC结果:

QC指标	中间结果	QC结果	最终结果
db.xxx_score	db.xxx_class	db_qc	CellQC
nCount、log10_nCount	log10_nCount.lower.2.5	umi_qc
nFeature、log10_nFeature	log10_nCount.higher.5	gene_qc
featurecount_dist	log10_nFeature.lower.2.5	outlier_qc
percent.mito	log10_nFeature.higher.5	mito_qc
percent.ribo	featurecount_dist.lower.2.5	ribo_qc
ribo.mito.ratio		ribo_mito_ratio_qc
percent.genome		species_qc

这里默认参数下共1063个细胞没有通过质控，其中有256个潜在的双（多）细胞、678个相关指标离群的细胞、646个低UMI的细胞、547个低基因数的细胞，525个高线粒体的细胞、0个高核糖体的细胞、728个核糖体/线粒体比例异常的细胞。

首先可以先看下计算的QC指标分布：

代码语言：javascript复制

qc_metrics <- c("db.scDblFinder_score", "percent.mito", "percent.ribo", "ribo.mito.ratio", "log10_nFeature_RNA", "log10_nCount_RNA", "featurecount_dist")
FeatureDimPlot(srt, features = qc_metrics, theme_use = "theme_blank")

使用CellStatPlot函数可以对QC结果进行统计和可视化：

代码语言：javascript复制

CellStatPlot(
  srt = srt,
  stat.by = c(
    "db_qc", "outlier_qc", "umi_qc", "gene_qc",
    "mito_qc", "ribo_qc", "ribo_mito_ratio_qc"
  ),
  plot_type = "upset", stat_level = "Fail"
)

QC结果为Fail的细胞基本上分成两个部分，一部分未能通过umi_qc,gene_qc, mito_qc, ribo_qc, ribo_mito_ratio_qc任意一项QC，另一部分仅仅未能通过db_qc。

我们可以使用CellDimPlot函数来在UMAP上显示这些未通过QC的细胞分布：

代码语言：javascript复制

CellDimPlot(srt,
  group.by = c(
    "CellQC", "db_qc", "outlier_qc", "umi_qc", "gene_qc",
    "mito_qc", "ribo_qc", "ribo_mito_ratio_qc"
  ),
  theme_use = "theme_blank"
)

小心db_qc的结果，如果单细胞数据集是一个连续分化/发育状态下的细胞类型（例如ESC/iPSC分化中的细胞谱系），一些doublets鉴定算法容易将那些分化过程中的细胞（两个细胞群之间的过度状态）判定为doublets，此时需要手动调整质控标准（例如db_qc联合另外一种指标共同为Fail最终才被过滤）。

如果数据集已经有了初步注释，也可以统计下各细胞群中的QC指标和QC结果：

代码语言：javascript复制

FeatureStatPlot(srt,
  stat.by = c("nCount_RNA", "nFeature_RNA"),
  group.by = "Annotation",
  add_box = TRUE
)

代码语言：javascript复制

CellStatPlot(srt, stat.by = "CellQC", group.by = "Annotation", label = TRUE)

这里注意，所有Platelet细胞由于UMI/基因数低导致QC未通过。如果想要保留，可以手动划分它们为Pass：

代码语言：javascript复制

srt$CellQC[colnames(srt)[srt$Annotation == "Platelet"]] <- "Pass"

确认好所有要被过滤的细胞后，进行最终过滤：

代码语言：javascript复制

srt_filtered <- subset(srt, CellQC == "Pass")
srt_filtered <- Standard_SCP(srt_filtered)
#> [2023-10-26 15:58:33] Start Standard_SCP
#> [2023-10-26 15:58:33] Checking srtList... ...
#> Data 1/1 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 1/1 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-10-26 15:58:39] Finished checking.
#> [2023-10-26 15:58:39] Perform ScaleData on the data...
#> [2023-10-26 15:58:39] Perform linear dimension reduction (pca) on the data...
#> [2023-10-26 15:58:42] Perform FindClusters (louvain) on the data...
#> [2023-10-26 15:58:43] Reorder clusters...
#> [2023-10-26 15:58:43] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-10-26 15:58:57] Standard_SCP done
#> Elapsed time: 23.34 secs
CellDimPlot(srt_filtered, "Annotation")