- 参考:
- Splat simulation parameters (bioconductor.org)[1]
前言
在[[02-用splatter模拟单细胞数据]]中,我们提过SplatParams 参数对象中的参数:
- 均值:Mean parameters are estimated by fitting a gamma distribution to the mean expression levels.
- Library size:Library size parameters are estimated by fitting a log-normal distribution to the library sizes.(个人理解为counts 计数的文库大小)
- 异常表达值:Expression outlier parameters are estimated by determining the number of outliers and fitting a log-normal distribution to their difference from the median.
- BCV parameters are estimated using the estimateDisp function from the edgeR package. (混杂变量)
- Dropout parameters are estimated by checking if dropout is present and fitting a logistic function to the relationship between mean expression and proportion of zeros.
这一节来详细介绍一下。
简要说明
Splat 模型过程如下:
不同的参数数值对应不同的模型估计。
- The first stage generates a mean expression level for each gene. These are originally chosen from a Gamma distribution. For some genes that are selected to be outliers with high expression a factor is generated from a log-normal distribution. These factors are then multiplied by the median gene mean to create new means for those genes.
- The next stage incorporates variation in the counts per cell. An expected library size (total counts) is chosen for each cell from a log-normal distribution. The library sizes are then used to scale the gene means for each cell, resulting in a range a counts per cell in the simulated dataset. The gene means are then further adjusted to enforce a relationship between the mean expression level and the variability.
- The final cell by gene matrix of gene means is then used to generate a count matrix using a Poisson distribution. The result is a synthetic dataset consisting of counts from a Gamma-Poisson (or negative-binomial) distribution. An additional optional step can be used to replicate a “dropout” effect. A probability of dropout is generated using a logistic function based on the underlying mean expression level. A Bernoulli distribution is then used to create a dropout matrix which sets some of the generated counts to zero.
另外在官方文档中:
代码语言:javascript复制The simulation involves the following steps:
Set up simulation object
Simulate library sizes
Simulate gene means
Simulate groups/paths
Simulate BCV adjusted cell means
Simulate true counts
Simulate dropout
Create final dataset
不过按照我的个人理解,就是首先生成各个基因的分布,接下来选择一些基因作为异常值。接着按照分组,批次等,对某些细胞的特定基因增加或减少一定的数值。
使用就行,暂时把它们当成黑箱好了。
参数设置
比如splat默认值创建的信息:
代码语言:javascript复制params <- newSplatParams()
params
#> A Params object of class SplatParams
#> Parameters can be (estimable) or [not estimable], 'Default' or 'NOT DEFAULT'
#> Secondary parameters are usually set during simulation
#>
#> Global:
#> (Genes) (Cells) [SEED]
#> 10000 100 704180
#>
#> 29 additional parameters
#>
#> Batches:
#> [Batches] [Batch Cells] [Location] [Scale] [Remove]
#> 1 100 0.1 0.1 FALSE
#>
#> Mean:
#> (Rate) (Shape)
#> 0.3 0.6
#>
#> Library size:
#> (Location) (Scale) (Norm)
#> 11 0.2 FALSE
#>
#> Exprs outliers:
#> (Probability) (Location) (Scale)
#> 0.05 4 0.5
#>
#> Groups:
#> [Groups] [Group Probs]
#> 1 1
#>
#> Diff expr:
#> [Probability] [Down Prob] [Location] [Scale]
#> 0.1 0.5 0.1 0.4
#>
#> BCV:
#> (Common Disp) (DoF)
#> 0.1 60
#>
#> Dropout:
#> [Type] (Midpoint) (Shape)
#> none 0 -1
#>
#> Paths:
#> [From] [Steps] [Skew] [Non-linear] [Sigma Factor]
#> 0 100 0.5 0.1 0.8
参数介绍
基本参数
nGenes
数据集包含的基因数;nCells
细胞数;seed
随机种子;
批次参数
batchCells
,对应各个batch 中的细胞数目;batch.rmEffect
布尔值,移除batch 效应;batch.facLoc
与batch.facScale
用来调控batch 大小,越大batch 效应越强,文档解释为:- fac.loc: Location parameter for the log-normal distribution.
- fac.scale: Scale factor for the log-normal distribution.
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params, batchCells = c(100, 100),
batch.facLoc = 0.05, batch.facScale = 0.05,
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulate(params, batchCells = c(100, 100),
batch.facLoc = 0.5, batch.facScale = 0.5,
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p1 <- plotPCA(sim1, colour_by = "Batch") ggtitle("Small batch effects")
p2 <- plotPCA(sim2, colour_by = "Batch") ggtitle("Big batch effects")
cowplot::plot_grid(p1,p2)
细胞与基因相关参数
mean.shape
与mean.rate
,作者这里说明这个参数最好使用真实数据中估计的结果,而非自定设定;- 类似batch 中,library 也有对应的:
lib.loc
will lead to more counts per cell and increasinglib.scale
will result in more variability in the counts per cell.
lib.norm
布尔值,T 将用正态分布替代log 正态分布;out.prob
异常表达值的基因,控制比例;
# Lots of outliers
sim2 <- splatSimulate(out.prob = 0.2, verbose = FALSE)
ggplot(as.data.frame(rowData(sim2)),
aes(x = log10(GeneMean), fill = OutlierFactor != 1))
geom_histogram(bins = 100)
ggtitle("Lots of outliers")
分组相关参数
group.prob
每个组别细胞的比例;de.prob
控制差异基因的比例:需要注意的是,如果需要控制de.prob
及不同组别的基因差异比例,需要使用函数splatSimulateGroups
:
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulateGroups(params, batchCells = 200,
group.prob = c(0.5, 0.5),
de.prob = 0.01,
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulateGroups(params, batchCells = 200,
group.prob = c(0.5, 0.5),
de.prob = 0.4,
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p1 <- plotPCA(sim1, colour_by = "Group") ggtitle("Small DE")
p2 <- plotPCA(sim2, colour_by = "Group") ggtitle("Big DE")
cowplot::plot_grid(p1,p2)
在差异基因比例较小时,不同的group 并不能很好地分开。
de.downProb
差异基因中,下调的基因比例;de.facLoc
差异基因的强度,推测de.facScale
也对应更大的差异基因间变化的程度(文档中并没有解释);
我们还可以设定多个group,以及对应每个group 独立的参数,比如多个差异基因比例,多个差异基因变化强度:
代码语言:javascript复制# Combination of everything
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim3 <- splatSimulateGroups(params,
group.prob = c(0.05, 0.2, 0.2, 0.2, 0.35),
de.prob = c(0.3, 0.1, 0.2, 0.01, 0.1),
de.downProb = c(0.1, 0.4, 0.9, 0.6, 0.5),
de.facLoc = c(0.6, 0.1, 0.1, 0.01, 0.2),
de.facScale = c(0.1, 0.4, 0.2, 0.5, 0.4),
verbose = FALSE)
sim3 <- logNormCounts(sim3)
sim3 <- runPCA(sim3)
plotPCA(sim3, colour_by = "Group")
labs(title = "Different DE factors",
caption = paste(
"Group 1 is small with many very up-regulated DE genes,",
"Group 2 has the default DE parameters,n",
"Group 3 has many down-regulated DE genes,",
"Group 4 has very few DE genes,",
"Group 5 is large with moderate DE factors")
)
混杂变量与dropout参数
bcv.common
数值越大,混杂变量影响就越大;bcv.df
BCV inverse chi-squared distribution 使用的自由度大小,会影响基因变化的程度受到表达均值的影响;
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulateGroups(params, batchCells = 200,
group.prob = c(0.5, 0.5),
de.prob = 0.4,
verbose = FALSE, bcv.common = 0.1,
bcv.df = 1)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulateGroups(params, batchCells = 200,
group.prob = c(0.5, 0.5),
de.prob = 0.4,
verbose = FALSE, bcv.common = 0.1,
bcv.df = 20)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p1 <- plotPCA(sim1, colour_by = "Group") ggtitle("Small bcv.df")
p2 <- plotPCA(sim2, colour_by = "Group") ggtitle("Big bcv.df")
cowplot::plot_grid(p1,p2)
从试验看,degree of freedom 越大,混杂变量的影响越小:
dropout :
dropouts, where the data only captures a small fraction of the transcriptome of each cell.
ps:个人理解是有些基因feature 可能在某些细胞类型中完全检测不到,或检测水平非常低。有篇文献,详细的讨论了相关的内容:PMC7054558 (Embracing the dropouts in single-cell RNA-seq analysis)
dropout.mid
与dropout.shape
控制某个细胞中zero 计数的比例:- The dropout.mid parameter control the point at which the probability is equal to 0.5
- and the dropout.shape controls how it changes with increasing expression. 这两个参数需要和下面的参数相关,比如group 模式,则mid 与shape 的向量长度就是group 的数目。
dropout.type
设置dropout 情节:- Setting it to "none" means no dropout,
- “experiment” is global dropout using the same set of parameters for every cell,
- "batch" uses the same parameters for every cell in the same batch,
- "group" uses the same parameters for every cell in the same group and
- "cell" uses a different set of parameters for every cell. 这个参数必须首先设置
dropout.mid
与dropout.shape
才可以被设置。
轨迹数据
参考:单细胞系列课程-10 Trajectory inference analysis of scRNA-seq data - 简书[2]
有时候,轨迹分析可以帮助我们发现细胞不同状态变化而产生的微小差异,比如细胞在分化过程中,某些基因可能被激活,而某些基因可能被沉默。这时候,这些细胞可能很难描述它们的差异,它们并不像pbmc 中的细胞那样有成熟的分化,和明显的基因与生物上的差异,因此轨迹分析可以帮助解决这个问题。
ps:对于某些外部微小的扰动(perturbation),不也是一样吗?
我们可能无法直接得到所有的group 都可以像右图那样:
再加上可能这个perturbation 是连续性的。
path
我们可以直接使用splatSimulate
中的method 参数进行简单设置:
sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
verbose = FALSE, batchCells = rep(200,2))
而如果是进一步细节设置,同样对应于分组特别的函数splatSimulateGroups
,轨迹数据生成也需要使用特别函数splatSimulatePaths
,path.from 参数可以创建一个线性(linear)或树状(branch)的模型:
params <- newSplatParams()
sim1 <- splatSimulatePaths(params, batchCells = 200,
group.prob = c(0.25, 0.25, 0.25, 0.25),
de.prob = 0.5, de.facLoc = 0.2,
path.from = c(0, 1, 2, 3),
seed = 1,
verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
p1 <- plotPCA(sim1, colour_by = "Group") ggtitle("Linear paths")
# Branching path
sim2 <- splatSimulatePaths(params, batchCells = 200,
group.prob = c(0.25, 0.25, 0.25, 0.25),
de.prob = 0.5, de.facLoc = 0.2,
path.from = c(0, 1, 1, 3),
seed = 1,
verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p2 <- plotPCA(sim2, colour_by = "Group") ggtitle("Branching path")
cowplot::plot_grid(p1,p2)
比如右图中的 2、3就分叉了,因为参数中设定了相同的数字。
step
如果这种group 是连续变化的,在splat 中定义为step:
Interpolation is then used to create a series of steps between the start and end points. This parameter controls the number of steps along a path and therefore how discrete or smooth it is.
并不是discrete 而是smooth 的变化。
数目越大,数据集连续变化的程度也就越大:
代码语言:javascript复制# Few steps
sim1 <- splatSimulatePaths(params.groups, path.nSteps = 3,
de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Step") ggtitle("Few steps")
代码语言:javascript复制# Lots of steps
sim2 <- splatSimulatePaths(params.groups, path.nSteps = 1000,
de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Step") ggtitle("Lots of steps")
通常来说,不同step 之间的细胞数目是均匀的,我们也可以人为改变这种分布情况。比如我们想要模拟一个情况:存在少量的干细胞以及丰富的完全分化的细胞。
Setting path.skew to 0 will mean that all cells come from the end point while higher values up to 1 will skew them towards the start point.
数值越大靠近起点,越小靠近终点。
path.nonlinearProb
控制基因表达为非线性情况的概率,因为可能某个基因在不同的step 下,并非是线性的。path.sigmaFac
:
Non-linear changes along a path are achieved by building a Brownian bridge between the two end points. A Brownian bridge is Brownian motion controlled in such a way that the end points are fixed. The path.sigmaFac parameter controls how extreme each step in the Brownian motion is and therefore how much the interpolation differs from a linear path. ps:个人觉得path.sigmaFac 是用来设置非线性变化的程度的参数。
参考资料
[1]
Splat simulation parameters (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splat_params.html
[2]
单细胞系列课程-10 Trajectory inference analysis of scRNA-seq data - 简书: https://www.jianshu.com/p/83c532234dc0