端到端的单细胞管道SCP-细胞质控

2023-10-31 14:46:32 浏览数 (2)

分享是一种态度

千淘万漉虽辛苦,吹尽狂沙始到金

本章开始介绍SCP各模块的使用教程,这里建议将SCP升级至v0.5.3之后的版本,包含了更完善的函数文档和示例。注意,目前SCP仅支持SeuratV4,后续版本计划与SeuratV5兼容。

  • 主要函数:RunCellQC;RunDoubletCalling;
  • SCP版本:0.5.3;Seurat版本:v4.4.0;

数据准备

这里我们使用10X Genomics官方的数据进行分析,该数据为健康成人的PBMC,捕获细胞数大约5k,详细的数据信息可以在官网查询到(5k[1]Peripheral Blood Mononuclear Cells (PBMCs) from a Healthy Donor (Next GEM))。

使用R下载原始数据(Feature[2] / cell matrix (filtered)),读取并创建Seurat对象:

代码语言:javascript复制
library(SCP)
library(Seurat)

temp <- tempdir()
file <- paste0(temp, "/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz")
download.file(
  url = "https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_v3_nextgem/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz",
  destfile = file
)
untar(tarfile = file, exdir = temp)

counts <- Read10X(paste0(temp, "/filtered_feature_bc_matrix"))
srt <- CreateSeuratObject(counts = counts)
srt
#> An object of class Seurat 
#> 33538 features across 5155 samples within 1 assay 
#> Active assay: RNA (33538 features, 0 variable features)

质控前处理

在对手头数据进行质控前,建议进行初步的分析,避免盲目的”看指标质控”。因此,建议首先对进行单细胞的标准流程处理:

代码语言:javascript复制
srt <- Standard_SCP(srt)
#> [2023-10-26 15:57:07] Start Standard_SCP
#> [2023-10-26 15:57:07] Checking srtList... ...
#> Data 1/1 of the srtList is raw_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 1/1 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-10-26 15:57:11] Finished checking.
#> [2023-10-26 15:57:11] Perform ScaleData on the data...
#> [2023-10-26 15:57:12] Perform linear dimension reduction (pca) on the data...
#> [2023-10-26 15:57:16] Perform FindClusters (louvain) on the data...
#> [2023-10-26 15:57:16] Reorder clusters...
#> [2023-10-26 15:57:17] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-10-26 15:57:31] Standard_SCP done
#> Elapsed time: 23.83 secs
Key(srt[["StandardUMAP2D"]]) <- "UMAP_"
CellDimPlot(srt, group.by = "Standardclusters", label = TRUE)

创建Seurat对象会自动计算各细胞的总UMI(nCount_RNA)和总基因数(nFeature_RNA),部分低质量的细胞已经可以通过这两种指标观察到:

代码语言:javascript复制
FeatureDimPlot(srt,
  features = c("nCount_RNA", "nFeature_RNA"),
  cells.highlight = colnames(srt)[srt$nCount_RNA < 3000]
)

如果手头已经有一些细胞类型的marker列表,我们还可观察下这些已知的marker分布,以确保细胞群/细胞类型是真实存在的:

代码语言:javascript复制
markers <- list(
  "Naive T" = c("LEF1", "CCR7"),
  "CD4  T" = c("CD4", "CCR6"),
  "CD8  T" = c("CD8A", "CD8B"),
  "NK" = c("GNLY", "NKG7"),
  "B" = c("IL4R", "MS4A1"),
  "Plasma" = c("CD27", "IGHG1"),
  "CD14  Mono" = c("CD14", "LYZ"),
  "CD16  Mono" = c("FCGR3A", "MS4A7"),
  "DC" = c("FCER1A", "CST3"),
  "Platelet" = c("PPBP", "PF4")
)
# 在新版的SCP中支持list输入
FeatureDimPlot(srt,
  features = markers,
  theme_use = "theme_blank", legend.position = "none"
)

为了演示方便,这里先进行了细胞注释。实际中可以先不做细胞群的定义,只需要观察细胞群的marker分布即可。

代码语言:javascript复制
annotation <- list(
  "Naive T" = c(1, 2),
  "CD4  T" = 3,
  "CD8  T" = 8,
  "NK" = 10,
  "B" = 6,
  "Plasma" = 5,
  "CD14  Mono" = 13,
  "CD16  Mono" = 14,
  "DC" = 11,
  "Platelet" = 15,
  "Unknown" = c(4, 7, 9, 12)
)
srt <- RenameClusters(srt, group.by = "Standardclusters", nameslist = annotation, name = "Annotation")
CellDimPlot(srt, group.by = "Annotation")

也可以借助各物种细胞类型参考数据集来进行快速注释,例如SCP内置的scHCL[3]scMCA[4]scZCL[5],来进行初步注释(注意注释结果不一定准确,仅供参考):

代码语言:javascript复制
data("ref_scHCL")
srt <- RunKNNPredict(srt, query_group = "Standardclusters", bulk_ref = ref_scHCL)
CellDimPlot(srt, group.by = "KNNPredict_classification")

细胞质控

整体而言,质控(QC)目标是过滤掉两类细胞:

  1. 非正常细胞:不含细胞仅含环境RNA污染的空滴、含大量环境RNA污染的细胞、二(多)细胞、细胞碎片等;
  2. 各细胞群中的离群细胞:测序低质量细胞、凋亡状态的细胞等;

在SCP中,单细胞数据的相关细胞质控指标计算和过滤主要使用RunCellQC函数来完成。

RunCellQC根据以下内容进行细胞质控:

  • doublets: 双(多)细胞过滤,可用的识别算法有scDblFinder, Scrublet, DoubletDetection, scds等。不同的单细胞建库平台doublets比例不同,一般上机细胞数越多,doublets比例越高,比如10X 每1000个细胞增加1%(doublets比例系数为0.01),也就是10000个细胞的doublets比例约为10%。默认使用scDblFinder算法,doublets比例系数为0.01。要获得不同算法的结果,可以直接使用RunDoubletCalling[6]函数进行doublets识别。
  • outlier: 离群点过滤,这里主要使用绝对中位差(MAD,Median Absolute Deviation)来计算一些指标的离散程度并判断离群细胞,包括了log10_nCount, log10_nFeature, featurecount_dist等,若细胞在这些指标上高于或低于N倍的MAD,将视为该在指标上离群。其中,featurecount_dist是基于log10_nFeature和log10_nCount拟合的局部加权回归模型(LOESS模型)得到的log10_nFeature预测值与观察值的差值。另外还可以设定离群点判定的指标数量,默认为1,意味着任意一个指标离群则该细胞被判定为离群。
  • umi, gene: 分别为细胞总UMI(nCount)和总基因数(nFeature)的硬阈值,根据srt[[assay]]@counts进行计算,默认分别为3000和1000。
  • mito, ribo, ribo_mito_ratio: 线粒体(mito)基因的含量可作为提示细胞凋亡状态的指标,默认阈值为20(百分比)。核糖体(ribo)基因是细胞内表达最丰富的基因,其含量取决于细胞类型,有些细胞类型甚至超过50%。在细胞受挤压、破碎、凋亡时,胞质内的核糖体基因首先流出,而线粒体保留,所以过低的核糖体基因含量(常见凋亡细胞)或者过高的核糖体基因含量(常见受污染细胞、空滴)也可以作为筛选标准,默认ribo阈值为50(百分比)。ribo和mito比例过高均会压缩其余基因的相对表达量,导致可比性下降,必要时可考虑剔除这部分基因。另外ribo和mito的比值也可用作质控,默认大于1。
  • species: 当使用物种混合样本建库测序时(常用于衡量二细胞比例),可以根据细胞内含有的兴趣物种基因比例进行过滤。默认阈值为95,细胞要求包含有兴趣物种基因含量高于95%。

另外,如果输入的Seurat对象包含有多个实验数据,可以设定split.by参数来进行数据拆分后分别质控。其他参数请查阅RunCellQC函数文档[7]

我们使用RunCellQC默认参数进行质控,但不做过滤:

代码语言:javascript复制
srt <- RunCellQC(srt, return_filtered = FALSE)
#> >>> Total cells: 5155 
#> >>> Cells which are filtered out: 1063 
#> ... 256 potential doublets 
#> ... 678 outlier cells 
#> ... 646 low-UMI cells 
#> ... 547 low-gene cells 
#> ... 525 high-mito cells 
#> ... 0 high-ribo cells 
#> ... 728 ribo_mito_ratio outlier cells 
#> ... 0 species-contaminated cells 
#> >>> Remained cells after filtering: 4092
table(srt$CellQC)
#> 
#> Pass Fail 
#> 4092 1063

RunCellQC会计算生成各类QC指标和QC结果:

QC指标

中间结果

QC结果

最终结果

db.xxx_score

db.xxx_class

db_qc

CellQC

nCount、log10_nCount

log10_nCount.lower.2.5

umi_qc

nFeature、log10_nFeature

log10_nCount.higher.5

gene_qc

featurecount_dist

log10_nFeature.lower.2.5

outlier_qc

percent.mito

log10_nFeature.higher.5

mito_qc

percent.ribo

featurecount_dist.lower.2.5

ribo_qc

ribo.mito.ratio

ribo_mito_ratio_qc

percent.genome

species_qc

这里默认参数下共1063个细胞没有通过质控,其中有256个潜在的双(多)细胞、678个相关指标离群的细胞、646个低UMI的细胞、547个低基因数的细胞,525个高线粒体的细胞、0个高核糖体的细胞、728个核糖体/线粒体比例异常的细胞。

首先可以先看下计算的QC指标分布:

代码语言:javascript复制
qc_metrics <- c("db.scDblFinder_score", "percent.mito", "percent.ribo", "ribo.mito.ratio", "log10_nFeature_RNA", "log10_nCount_RNA", "featurecount_dist")
FeatureDimPlot(srt, features = qc_metrics, theme_use = "theme_blank")

使用CellStatPlot函数可以对QC结果进行统计和可视化:

代码语言:javascript复制
CellStatPlot(
  srt = srt,
  stat.by = c(
    "db_qc", "outlier_qc", "umi_qc", "gene_qc",
    "mito_qc", "ribo_qc", "ribo_mito_ratio_qc"
  ),
  plot_type = "upset", stat_level = "Fail"
)

QC结果为Fail的细胞基本上分成两个部分,一部分未能通过umi_qc,gene_qc, mito_qc, ribo_qc, ribo_mito_ratio_qc任意一项QC,另一部分仅仅未能通过db_qc

我们可以使用CellDimPlot函数来在UMAP上显示这些未通过QC的细胞分布:

代码语言:javascript复制
CellDimPlot(srt,
  group.by = c(
    "CellQC", "db_qc", "outlier_qc", "umi_qc", "gene_qc",
    "mito_qc", "ribo_qc", "ribo_mito_ratio_qc"
  ),
  theme_use = "theme_blank"
)

小心db_qc的结果,如果单细胞数据集是一个连续分化/发育状态下的细胞类型(例如ESC/iPSC分化中的细胞谱系),一些doublets鉴定算法容易将那些分化过程中的细胞(两个细胞群之间的过度状态)判定为doublets,此时需要手动调整质控标准(例如db_qc联合另外一种指标共同为Fail最终才被过滤)。

如果数据集已经有了初步注释,也可以统计下各细胞群中的QC指标和QC结果:

代码语言:javascript复制
FeatureStatPlot(srt,
  stat.by = c("nCount_RNA", "nFeature_RNA"),
  group.by = "Annotation",
  add_box = TRUE
)
代码语言:javascript复制
CellStatPlot(srt, stat.by = "CellQC", group.by = "Annotation", label = TRUE)

这里注意,所有Platelet细胞由于UMI/基因数低导致QC未通过。如果想要保留,可以手动划分它们为Pass

代码语言:javascript复制
srt$CellQC[colnames(srt)[srt$Annotation == "Platelet"]] <- "Pass"

确认好所有要被过滤的细胞后,进行最终过滤:

代码语言:javascript复制
srt_filtered <- subset(srt, CellQC == "Pass")
srt_filtered <- Standard_SCP(srt_filtered)
#> [2023-10-26 15:58:33] Start Standard_SCP
#> [2023-10-26 15:58:33] Checking srtList... ...
#> Data 1/1 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 1/1 of the srtList...
#> Use the separate HVF from srtList...
#> [2023-10-26 15:58:39] Finished checking.
#> [2023-10-26 15:58:39] Perform ScaleData on the data...
#> [2023-10-26 15:58:39] Perform linear dimension reduction (pca) on the data...
#> [2023-10-26 15:58:42] Perform FindClusters (louvain) on the data...
#> [2023-10-26 15:58:43] Reorder clusters...
#> [2023-10-26 15:58:43] Perform nonlinear dimension reduction (umap) on the data...
#> [2023-10-26 15:58:57] Standard_SCP done
#> Elapsed time: 23.34 secs
CellDimPlot(srt_filtered, "Annotation")

文中链接

[1]

5k: https://www.10xgenomics.com/resources/datasets/5-k-peripheral-blood-mononuclear-cells-pbm-cs-from-a-healthy-donor-next-gem-3-1-standard-3-0-2

[2]

Feature: https://cf.10xgenomics.com/samples/cell-exp/3.0.2/5k_pbmc_v3_nextgem/5k_pbmc_v3_nextgem_filtered_feature_bc_matrix.tar.gz

[3]

scHCL: https://www.nature.com/articles/s41586-020-2157-4

[4]

scMCA: https://www.cell.com/cell/fulltext/S0092-8674(18)30116-8

[5]

scZCL: https://www.frontiersin.org/articles/10.3389/fcell.2021.743421/full

[6]

RunDoubletCalling: https://zhanghao-njmu.github.io/SCP/reference/RunDoubletCalling.html

[7]

RunCellQC函数文档: https://zhanghao-njmu.github.io/SCP/reference/RunCellQC.html

0 人点赞