Bioinfo03-模拟单细胞数据SplatParams详细参数介绍

2022-05-19 11:25:30 浏览数 (1)

  • 参考:
    • Splat simulation parameters (bioconductor.org)[1]

前言

在[[02-用splatter模拟单细胞数据]]中,我们提过SplatParams 参数对象中的参数:

  • 均值:Mean parameters are estimated by fitting a gamma distribution to the mean expression levels.
  • Library size:Library size parameters are estimated by fitting a log-normal distribution to the library sizes.(个人理解为counts 计数的文库大小)
  • 异常表达值:Expression outlier parameters are estimated by determining the number of outliers and fitting a log-normal distribution to their difference from the median.
  • BCV parameters are estimated using the estimateDisp function from the edgeR package. (混杂变量)
  • Dropout parameters are estimated by checking if dropout is present and fitting a logistic function to the relationship between mean expression and proportion of zeros.

这一节来详细介绍一下。

简要说明

Splat 模型过程如下:

不同的参数数值对应不同的模型估计。

  • The first stage generates a mean expression level for each gene. These are originally chosen from a Gamma distribution. For some genes that are selected to be outliers with high expression a factor is generated from a log-normal distribution. These factors are then multiplied by the median gene mean to create new means for those genes.
  • The next stage incorporates variation in the counts per cell. An expected library size (total counts) is chosen for each cell from a log-normal distribution. The library sizes are then used to scale the gene means for each cell, resulting in a range a counts per cell in the simulated dataset. The gene means are then further adjusted to enforce a relationship between the mean expression level and the variability.
  • The final cell by gene matrix of gene means is then used to generate a count matrix using a Poisson distribution. The result is a synthetic dataset consisting of counts from a Gamma-Poisson (or negative-binomial) distribution. An additional optional step can be used to replicate a “dropout” effect. A probability of dropout is generated using a logistic function based on the underlying mean expression level. A Bernoulli distribution is then used to create a dropout matrix which sets some of the generated counts to zero.

另外在官方文档中:

代码语言:javascript复制
The simulation involves the following steps:
Set up simulation object
Simulate library sizes
Simulate gene means
Simulate groups/paths
Simulate BCV adjusted cell means
Simulate true counts
Simulate dropout
Create final dataset

不过按照我的个人理解,就是首先生成各个基因的分布,接下来选择一些基因作为异常值。接着按照分组,批次等,对某些细胞的特定基因增加或减少一定的数值。

使用就行,暂时把它们当成黑箱好了。

参数设置

比如splat默认值创建的信息:

代码语言:javascript复制
params <- newSplatParams()
params
#> A Params object of class SplatParams 
#> Parameters can be (estimable) or [not estimable], 'Default' or  'NOT DEFAULT' 
#> Secondary parameters are usually set during simulation
#> 
#> Global: 
#> (Genes)  (Cells)   [SEED] 
#>   10000      100   704180 
#> 
#> 29 additional parameters 
#> 
#> Batches: 
#>     [Batches]  [Batch Cells]     [Location]        [Scale]       [Remove] 
#>             1            100            0.1            0.1          FALSE 
#> 
#> Mean: 
#>  (Rate)  (Shape) 
#>     0.3      0.6 
#> 
#> Library size: 
#> (Location)     (Scale)      (Norm) 
#>         11         0.2       FALSE 
#> 
#> Exprs outliers: 
#> (Probability)     (Location)        (Scale) 
#>          0.05              4            0.5 
#> 
#> Groups: 
#>      [Groups]  [Group Probs] 
#>             1              1 
#> 
#> Diff expr: 
#> [Probability]    [Down Prob]     [Location]        [Scale] 
#>           0.1            0.5            0.1            0.4 
#> 
#> BCV: 
#> (Common Disp)          (DoF) 
#>           0.1             60 
#> 
#> Dropout: 
#>     [Type]  (Midpoint)     (Shape) 
#>       none           0          -1 
#> 
#> Paths: 
#>         [From]         [Steps]          [Skew]    [Non-linear]  [Sigma Factor] 
#>              0             100             0.5             0.1             0.8

参数介绍

基本参数

  • nGenes 数据集包含的基因数;
  • nCells 细胞数;
  • seed 随机种子;

批次参数

  • batchCells,对应各个batch 中的细胞数目;
  • batch.rmEffect 布尔值,移除batch 效应;
  • batch.facLocbatch.facScale 用来调控batch 大小,越大batch 效应越强,文档解释为:
    • fac.loc: Location parameter for the log-normal distribution.
    • fac.scale: Scale factor for the log-normal distribution.
代码语言:javascript复制
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulate(params, batchCells = c(100, 100),
                      batch.facLoc = 0.05, batch.facScale = 0.05,
                      verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulate(params, batchCells = c(100, 100),
                      batch.facLoc = 0.5, batch.facScale = 0.5,
                      verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)

p1 <- plotPCA(sim1, colour_by = "Batch")   ggtitle("Small batch effects")
p2 <- plotPCA(sim2, colour_by = "Batch")   ggtitle("Big batch effects")
cowplot::plot_grid(p1,p2)

细胞与基因相关参数

  • mean.shapemean.rate,作者这里说明这个参数最好使用真实数据中估计的结果,而非自定设定;
  • 类似batch 中,library 也有对应的:
    • lib.loc will lead to more counts per cell and increasing
    • lib.scale will result in more variability in the counts per cell.
  • lib.norm 布尔值,T 将用正态分布替代log 正态分布;
  • out.prob 异常表达值的基因,控制比例;
代码语言:javascript复制
# Lots of outliers
sim2 <- splatSimulate(out.prob = 0.2, verbose = FALSE)
ggplot(as.data.frame(rowData(sim2)),
       aes(x = log10(GeneMean), fill = OutlierFactor != 1))  
    geom_histogram(bins = 100)  
    ggtitle("Lots of outliers")

分组相关参数

  • group.prob 每个组别细胞的比例;
  • de.prob 控制差异基因的比例:需要注意的是,如果需要控制de.prob 及不同组别的基因差异比例,需要使用函数splatSimulateGroups
代码语言:javascript复制
set.seed(1)
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulateGroups(params, batchCells = 200,
                      group.prob = c(0.5, 0.5),
                      de.prob = 0.01,
                      verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulateGroups(params, batchCells = 200,
                      group.prob = c(0.5, 0.5),
                      de.prob = 0.4,
                      verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)

p1 <- plotPCA(sim1, colour_by = "Group")   ggtitle("Small DE")
p2 <- plotPCA(sim2, colour_by = "Group")   ggtitle("Big DE")
cowplot::plot_grid(p1,p2)

在差异基因比例较小时,不同的group 并不能很好地分开。

  • de.downProb 差异基因中,下调的基因比例;
  • de.facLoc 差异基因的强度,推测de.facScale 也对应更大的差异基因间变化的程度(文档中并没有解释);

我们还可以设定多个group,以及对应每个group 独立的参数,比如多个差异基因比例,多个差异基因变化强度:

代码语言:javascript复制
# Combination of everything
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim3 <- splatSimulateGroups(params,
                            group.prob = c(0.05, 0.2, 0.2, 0.2, 0.35),
                            de.prob = c(0.3, 0.1, 0.2, 0.01, 0.1),
                            de.downProb = c(0.1, 0.4, 0.9, 0.6, 0.5),
                            de.facLoc = c(0.6, 0.1, 0.1, 0.01, 0.2),
                            de.facScale = c(0.1, 0.4, 0.2, 0.5, 0.4),
                            verbose = FALSE)
sim3 <- logNormCounts(sim3)
sim3 <- runPCA(sim3)
plotPCA(sim3, colour_by = "Group")  
  labs(title = "Different DE factors",
       caption = paste(
         "Group 1 is small with many very up-regulated DE genes,",
         "Group 2 has the default DE parameters,n",
         "Group 3 has many down-regulated DE genes,",
         "Group 4 has very few DE genes,",
         "Group 5 is large with moderate DE factors")
  )

混杂变量与dropout参数

  • bcv.common 数值越大,混杂变量影响就越大;
  • bcv.df BCV inverse chi-squared distribution 使用的自由度大小,会影响基因变化的程度受到表达均值的影响;
代码语言:javascript复制
sce <- mockSCE(ncells = 200, ngenes = 2000, nspikes = 100)
params <- splatEstimate(sce)
sim1 <- splatSimulateGroups(params, batchCells = 200,
                      group.prob = c(0.5, 0.5),
                      de.prob = 0.4,
                      verbose = FALSE, bcv.common = 0.1,
                      bcv.df = 1)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
sim2 <- splatSimulateGroups(params, batchCells = 200,
                      group.prob = c(0.5, 0.5),
                      de.prob = 0.4,
                      verbose = FALSE, bcv.common = 0.1,
                      bcv.df = 20)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)

p1 <- plotPCA(sim1, colour_by = "Group")   ggtitle("Small bcv.df")
p2 <- plotPCA(sim2, colour_by = "Group")   ggtitle("Big bcv.df")
cowplot::plot_grid(p1,p2)

从试验看,degree of freedom 越大,混杂变量的影响越小:

dropout :

dropouts, where the data only captures a small fraction of the transcriptome of each cell.

ps:个人理解是有些基因feature 可能在某些细胞类型中完全检测不到,或检测水平非常低。有篇文献,详细的讨论了相关的内容:PMC7054558 (Embracing the dropouts in single-cell RNA-seq analysis)

  • dropout.middropout.shape控制某个细胞中zero 计数的比例:
    • The dropout.mid parameter control the point at which the probability is equal to 0.5
    • and the dropout.shape controls how it changes with increasing expression. 这两个参数需要和下面的参数相关,比如group 模式,则mid 与shape 的向量长度就是group 的数目。
  • dropout.type 设置dropout 情节:
    • Setting it to "none" means no dropout,
    • “experiment” is global dropout using the same set of parameters for every cell,
    • "batch" uses the same parameters for every cell in the same batch,
    • "group" uses the same parameters for every cell in the same group and
    • "cell" uses a different set of parameters for every cell. 这个参数必须首先设置dropout.middropout.shape 才可以被设置。

轨迹数据

参考:单细胞系列课程-10 Trajectory inference analysis of scRNA-seq data - 简书[2]

有时候,轨迹分析可以帮助我们发现细胞不同状态变化而产生的微小差异,比如细胞在分化过程中,某些基因可能被激活,而某些基因可能被沉默。这时候,这些细胞可能很难描述它们的差异,它们并不像pbmc 中的细胞那样有成熟的分化,和明显的基因与生物上的差异,因此轨迹分析可以帮助解决这个问题。

ps:对于某些外部微小的扰动(perturbation),不也是一样吗?

我们可能无法直接得到所有的group 都可以像右图那样:

再加上可能这个perturbation 是连续性的。

path

我们可以直接使用splatSimulate中的method 参数进行简单设置:

代码语言:javascript复制
sim.paths <- splatSimulate(de.prob = 0.2, nGenes = 1000, method = "paths",
                           verbose = FALSE, batchCells = rep(200,2))

而如果是进一步细节设置,同样对应于分组特别的函数splatSimulateGroups,轨迹数据生成也需要使用特别函数splatSimulatePaths,path.from 参数可以创建一个线性(linear)或树状(branch)的模型:

代码语言:javascript复制
params <- newSplatParams()
sim1 <- splatSimulatePaths(params, batchCells = 200,
                           group.prob = c(0.25, 0.25, 0.25, 0.25),
                           de.prob = 0.5, de.facLoc = 0.2,
                           path.from = c(0, 1, 2, 3),
                           seed = 1,
                           verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
p1 <- plotPCA(sim1, colour_by = "Group")   ggtitle("Linear paths")
# Branching path
sim2 <- splatSimulatePaths(params, batchCells = 200,
                           group.prob = c(0.25, 0.25, 0.25, 0.25),
                           de.prob = 0.5, de.facLoc = 0.2,
                           path.from = c(0, 1, 1, 3),
                           seed = 1,
                           verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
p2 <- plotPCA(sim2, colour_by = "Group")   ggtitle("Branching path")
cowplot::plot_grid(p1,p2)

比如右图中的 2、3就分叉了,因为参数中设定了相同的数字。

step

如果这种group 是连续变化的,在splat 中定义为step:

Interpolation is then used to create a series of steps between the start and end points. This parameter controls the number of steps along a path and therefore how discrete or smooth it is.

并不是discrete 而是smooth 的变化。

数目越大,数据集连续变化的程度也就越大:

代码语言:javascript复制
# Few steps
sim1 <- splatSimulatePaths(params.groups, path.nSteps = 3,
                           de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim1 <- logNormCounts(sim1)
sim1 <- runPCA(sim1)
plotPCA(sim1, colour_by = "Step")   ggtitle("Few steps")
代码语言:javascript复制
# Lots of steps
sim2 <- splatSimulatePaths(params.groups, path.nSteps = 1000,
                           de.prob = 0.5, de.facLoc = 0.2, verbose = FALSE)
sim2 <- logNormCounts(sim2)
sim2 <- runPCA(sim2)
plotPCA(sim2, colour_by = "Step")   ggtitle("Lots of steps")

通常来说,不同step 之间的细胞数目是均匀的,我们也可以人为改变这种分布情况。比如我们想要模拟一个情况:存在少量的干细胞以及丰富的完全分化的细胞。

Setting path.skew to 0 will mean that all cells come from the end point while higher values up to 1 will skew them towards the start point.

数值越大靠近起点,越小靠近终点。

path.nonlinearProb 控制基因表达为非线性情况的概率,因为可能某个基因在不同的step 下,并非是线性的。path.sigmaFac

Non-linear changes along a path are achieved by building a Brownian bridge between the two end points. A Brownian bridge is Brownian motion controlled in such a way that the end points are fixed. The path.sigmaFac parameter controls how extreme each step in the Brownian motion is and therefore how much the interpolation differs from a linear path. ps:个人觉得path.sigmaFac 是用来设置非线性变化的程度的参数。

参考资料

[1]

Splat simulation parameters (bioconductor.org): https://bioconductor.org/packages/devel/bioc/vignettes/splatter/inst/doc/splat_params.html

[2]

单细胞系列课程-10 Trajectory inference analysis of scRNA-seq data - 简书: https://www.jianshu.com/p/83c532234dc0

0 人点赞