分享是一种态度
很开心能在2020年加入“单细胞天地”这个团队,作为单细胞新手,我希望未来能学会更多的单细胞测序知识,写更多的笔记,撸起袖子干呗!
最近看到了2018年Cell的一篇乳腺癌单细胞文章,文章题目是“Unravelling subclonal heterogeneity and aggressive disease states in TNBC through single-cell RNA-seq”,相关数据集是GSE118390,代码存放在Github上(https://github.com/Michorlab/tnbc_scrnaseq)。
刚开始看了一下,我觉得代码很有意思,作者从QC、Normalized等一系列下游分析全程用R分析,并且在补充材料也详细说明了这个流程。虽然这篇文献的代码都已经上传到Guitub上,但是如果自己去跑代码的话,势必出现奇奇怪怪的error, 于是我结合文献的methods,对文献进行解读并在作者的代码基础上,进行翻译和修改,最后重现里面的图片。
接下来我会分为几篇笔记来记录学习的过程,我都亲自跑过并且修改出现error的代码,基本不会出现error,如果有问题的话,可以留言反馈。
一.质控
scRNA-seq数据相比RNA-seq数据,具有更多的噪声,这主要是因为要可靠地转换和扩增单个细胞中的少量RNA(也称为RNA捕获)。因此,在scRNA-seq数据中,经常会出现许多细胞中一个基因中没有表达或是低表达水平的情况,也称dropout events,这不一定反映真实的生物信号。如果不进行质控,不去除低质量的细胞或表达过低的基因,都会使下游分析产生偏差,使分离生物信号或生物噪声变得更加困难。
1.去除低质量的细胞
代码语言:javascript复制# 读入数据,包括进行TPM的数据,原始counts数据,原始qc数据,还有基因注释信息。
tpm.rsem <- read.table(here::here("data", "tpm_original.txt"), sep = "t")
counts.rsem <- read.table(here::here("data", "counts_original.txt"), sep = "t")
qc <- read.table(here::here("data", "qc_original.txt"), sep = "t", stringsAsFactors = FALSE)
mappings <- readRDS(here::here("data", "mappings.RDS"))
length(tpm.rsem)
> length(tpm.rsem)
[1] 1536
#总共1536个细胞
在这一步骤,又细分为3个具体流程来去除低质量的细胞:
- 1. 针对Total reads(Library size)进行过滤
- 2. 针对Number of expressed genes进行过滤
- 3. 针对Total amount of mRNA过滤
#创建SingSingleCellExperiment对象
#min_thresh_log_tpm <- 0.1
#定义基因表达的标准:如果一个基因的log2(TPM 1)》0.1,则认为该基因表达
sceset_all<- SingleCellExperiment(assays = list(counts = as.matrix(counts.rsem),
tpm = as.matrix(tpm.rsem),
exprs = log2(as.matrix(tpm.rsem) 1),
expressed = log2(tpm.rsem 1) > min_thresh_log_tpm),
colData = qc,
rowData = mappings)
#最开始拿到的数据是1536个细胞,21785个基因
> sceset_all
class: SingleCellExperiment
dim: 21785 1536
metadata(0):
assays(4): counts tpm exprs expressed
rownames(21785): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(7): ensembl_gene_id hgnc_symbol ... gc gene_short_name
colnames(1536): PT089_P1_A01 PT089_P1_A02 ... PT039_P10_H11_S287 PT039_P10_H12_S288
colData names(50): Sample uniquely_mapped_percent ... number_batch depletion_batch
reducedDimNames(0):
altExpNames(0):
#看看有没有线粒体基因,可以看见,没有线粒体基因
location <- rowRanges(sceset_all)
is.mito <- any(seqnames(location)=="MT")
> table(is.mito)
is.mito
FALSE
21785
接下来,作者是使用scater的calculateQCMetrics
进行质控,但是这个函数已经更新为perCellQCMetrics
,因此我们修改一下,修改后的结果跟原来的会有略微不同。
sceset_all <- perCellQCMetrics(sceset_all,exprs_values = "exprs",
detection_limit = min_thresh_log_tpm,flatten=FALSE,
subsets= list("regular" = which(qc$pool_H12 == 0), "pool" = which(qc$pool_H12 == 1)))
> sceset_all
class: SingleCellExperiment
dim: 21785 1536
metadata(0):
assays(4): counts tpm exprs expressed
rownames(21785): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(7): ensembl_gene_id hgnc_symbol ... gc gene_short_name
colnames(1536): PT089_P1_A01 PT089_P1_A02 ... PT039_P10_H11_S287 PT039_P10_H12_S288
colData names(50): Sample uniquely_mapped_percent ... number_batch depletion_batch
reducedDimNames(0):
altExpNames(0):
代码语言:javascript复制#我们单独看一看subsets的内容
subsets_A= list("regular" = which(qc$pool_H12 == 0), "pool" = which(qc$pool_H12 == 1))
View(subsets_A)
可以看到总共1536个samples,7个是pooled samples,1529个是single cells。这样子我们就能理解作者这句话的意思了。
但是接下来我们继续按照作者的代码跑下去,就开始出现问题了。
代码语言:javascript复制reads.drop <- isOutlier(as.numeric(sceset_all$total_reads), nmads = 4, type = "lower", log = TRUE)
hist(log10(sceset_all$total_reads), breaks = 100, main = "", col = "grey80", xlab = expression(log[10]~"library size"), ylab = "frequency", cex.lab = 1.4, cex.axis = 1.4)
abline(v = max(log10(sceset_all$total_reads[which(reads.drop == 1)])), col = "blue", lwd = 2, lty = 2)
> Error in log10(sceset_all$total_reads) : 数学函数中用了非数值参数
#我们来看一下sceset_all$total_reads
sceset_all$total_reads
> sceset_all$total_reads
NULL
也就是说perCellQCMetrics
之后的sceset_all不包含total_reads这个信息了。这个问题我想了很久才解决,后来发现原来我们不能使用perCellQCMetrics
,应该选择addPerCellQC
,这样就可以把QC的结果附加到SingleCellExperiment对象的colData中。我们改一下。
sceset_all <- addPerCellQC(sceset_all,exprs_values = "exprs",
detection_limit = min_thresh_log_tpm,flatten=FALSE,
subsets= list("regular" = which(qc$pool_H12 == 0),
"pool"=which(qc$pool_H12 == 1)))
接着我们需要去除低质量的细胞,文中以每个samples中位数的中位数绝对偏差MAD为阈值,如果细胞的基因表达值距离其中位数多于4xMAD,则认为该值是异常值。但是这种用法有个前提,需要确保细胞都是由高质量的单元组成。这个过滤的过程背后的基本原理是选择自适应和无偏的数据派生阈值。具体详见(Lun et al.,2016b)。由于数据是经过log的,所以设置log = TRUE,对数变换提高了对小数值的分辨率。isOutlier则可以根据MAD确定数值向量中哪些值是异常值。
代码语言:javascript复制#质控Toral reads(library size,这时候用的指标是sceset_all$total_reads
reads.drop <- isOutlier(as.numeric(sceset_all$total_reads), nmads = 4, type = "lower", log = TRUE)
hist(log10(sceset_all$total_reads), breaks = 100, main = "", col = "grey80", xlab = expression(log[10]~"library size"), ylab = "frequency", cex.lab = 1.4, cex.axis = 1.4)
abline(v = max(log10(sceset_all$total_reads[which(reads.drop == 1)])), col = "blue", lwd = 2, lty = 2)
代码语言:javascript复制# 质控total mRNA,这时候用的指标是sceset_all$sum
mRNA.drop <- isOutlier(as.numeric(sceset_all$sum), nmads = 4, type = "lower", log = TRUE)
hist(log10(sceset_all$sum), breaks = 100, main = "", col = "grey80", xlab = expression(log[10]~"log10 total mRNA"), ylab = "frequency", cex.lab = 1.4, cex.axis = 1.4)
abline(v = max(log10(sceset_all$sum[which(mRNA.drop == 1)])), col = "blue", lwd = 2, lty = 2)
代码语言:javascript复制#质控number of expressed genes,这时候用的指标是sceset_all$detected
feature.drop <- isOutlier(sceset_all$detected, nmads = 4, type = "lower", log = TRUE)
hist(log10(sceset_all$detected), breaks = 100, main = "", col = "grey80", xlab = expression(log[10]~"number of expressed genes"), ylab = "frequency", cex.lab = 1.4, cex.axis = 1.4)
abline(v = max(log10(sceset_all$detected[which(feature.drop == 1)])), col = "blue", lwd = 2, lty = 2)
代码语言:javascript复制#将经过QC之后的低质量异常的细胞剔除掉
keep.samples <- !(reads.drop | feature.drop | mRNA.drop)
keep.samples[which(is.na(keep.samples))] <- FALSE
sceset_all <- sceset_all[ ,keep.samples]
> sceset_all
class: SingleCellExperiment
dim: 21785 1326
metadata(0):
assays(4): counts tpm exprs expressed
rownames(21785): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(7): ensembl_gene_id hgnc_symbol ... gc gene_short_name
colnames(1326): PT089_P1_A01 PT089_P1_A02 ... PT039_P10_H11_S287
PT039_P10_H12_S288
colData names(50): Sample uniquely_mapped_percent ... number_batch depletion_batch
reducedDimNames(0):
altExpNames(0):
这样子我们就做出来了Supplementary Fig 22的3个图了。可以看到去除低质量的细胞后,最后剩下1326个细胞,跟文章结果一模一样。
2.去除低表达量的基因
首先定义低表达量的基因:
- 1. 如果某个基因在超过95%的总细胞中的表达量较低(log2(TPM 1) < 0.1) ,那么定义为低表达量的基因。
- 2. 由于样本之间的异质性,还需要过滤每个样本单独的低表达量基因,即在每个样品中,如果某个基因在超过95%细胞中的表达量较低(log2(TPM 1) < 0.1),那么定义为低表达量基因。
#接下来开始使用monocole包
#创建对象
pd <- new("AnnotatedDataFrame", data = as.data.frame(colData(sceset_all)))
featureNames(pd) <- rownames(pd)
fd <- new("AnnotatedDataFrame", data = as.data.frame(rowData(sceset_all)))
featureNames(fd) <- rowData(sceset_all)$gene_short_name
# 注意原代码 expressionFamily =gaussianff()已经失效,改为expressionFamily =uninormal()
HSMM <- newCellDataSet(assays(sceset_all)$exprs, featureData = fd, phenoData = pd,
lowerDetectionLimit = min_thresh_log_tpm, expressionFamily = uninormal())
HSMM <- detectGenes(HSMM, min_expr = min_thresh_log_tpm)
expr_genes <- row.names(subset(fData(HSMM), num_cells_expressed >= dim(HSMM)[2]*0.05))
再分别过滤每个样本的低表达量基因
代码语言:javascript复制HSMM126 <- HSMM[, row.names(subset(pData(HSMM), patient == "PT126"))]
HSMM126 <- detectGenes(HSMM126, min_expr = min_thresh_log_tpm)#detectGenes检测每个细胞中高于某阈值的基因
remove_genes_126 <- row.names(subset(fData(HSMM126), num_cells_expressed < dim(HSMM126)[2]*0.05)) #5%的细胞
HSMM126 <- HSMM126[setdiff(row.names(HSMM126), remove_genes_126), ]#setdiff求两个向量中不同的元素,即去除低表达量的细胞
HSMM39 <- HSMM[, row.names(subset(pData(HSMM), patient == "PT039"))]
HSMM39 <- detectGenes(HSMM39, min_expr = min_thresh_log_tpm)
remove_genes_39 <- row.names(subset(fData(HSMM39), num_cells_expressed < dim(HSMM39)[2]*0.05))
HSMM39 <- HSMM39[setdiff(row.names(HSMM39), remove_genes_39), ]
HSMM58 <- HSMM[, row.names(subset(pData(HSMM), patient == "PT058"))]
HSMM58 <- detectGenes(HSMM58, min_expr = min_thresh_log_tpm)
remove_genes_58 <- row.names(subset(fData(HSMM58), num_cells_expressed < dim(HSMM58)[2]*0.05))
HSMM58 <- HSMM58[setdiff(row.names(HSMM58), remove_genes_58), ]
HSMM81 <- HSMM[, row.names(subset(pData(HSMM), patient == "PT081"))]
HSMM81 <- detectGenes(HSMM81, min_expr = min_thresh_log_tpm)
remove_genes_81 <- row.names(subset(fData(HSMM81), num_cells_expressed < dim(HSMM81)[2]*0.05))
HSMM81 <- HSMM81[setdiff(row.names(HSMM81), remove_genes_81), ]
HSMM84 <- HSMM[, row.names(subset(pData(HSMM), patient == "PT084"))]
HSMM84 <- detectGenes(HSMM84, min_expr = min_thresh_log_tpm)
remove_genes_84 <- row.names(subset(fData(HSMM84), num_cells_expressed < dim(HSMM84)[2]*0.05))
HSMM84 <- HSMM84[setdiff(row.names(HSMM84), remove_genes_84), ]
HSMM89 <- HSMM[, row.names(subset(pData(HSMM), patient == "PT089"))]
HSMM89 <- detectGenes(HSMM89, min_expr = min_thresh_log_tpm)
remove_genes_89 <- row.names(subset(fData(HSMM89), num_cells_expressed < dim(HSMM89)[2]*0.05))
HSMM89 <- HSMM89[setdiff(row.names(HSMM89), remove_genes_89), ]
remove_genes_all <- intersect_all(remove_genes_126, remove_genes_39, remove_genes_58, remove_genes_81, remove_genes_84, remove_genes_89)
keep.genes <- setdiff(rownames(fData(HSMM)), remove_genes_all)
剩下13280个基因保留下来,与原文一样。
代码语言:javascript复制> length(keep.genes)
[1] 13280
HSMM <- HSMM[keep.genes, ]
我们下一期再看看比较难的标准化这部分。如果你基本的R代码能看懂,我相信也能大概看懂这些代码的,如果还觉得有困难,那么你可能需要这个数据挖掘(GEO,TCGA,单细胞)2021第4期
往期回顾
本是同根生,缘何不同命?
多个组织的成纤维细胞图谱
OSCA单细胞数据分析笔记8—Dimensionality reduction
单细胞数据Seurat包的tSNE三维可视化
如果你对单细胞转录组研究感兴趣,但又不知道如何入门,也许你可以关注一下下面的课程
- 生信爆款入门-2021第4期
- 数据挖掘(GEO,TCGA,单细胞)2021第4期
- 明码标价之共享96线程384G内存服务器