bioinfo09-10x数据格式及其导入为seurat对象

2022-05-19 11:27:30 浏览数 (1)

  • 参考:
    • Single-cell RNA-seq: Quality Control Analysis | Introduction to single-cell RNA-seq (hbctraining.github.io)[1]

1-数据下载

单细胞数据可以通过多种方式下载:

  • GEO
    • This dataset is available on GEO (GSE96583[2]), however the available counts matrix lacked mitochondrial reads, so we downloaded the BAM files from the SRA (SRP102802[3]). These BAM files were converted back to FASTQ files, then run through Cell Ranger[4] to obtain the count data that we will be using.
  • 10X
    • The count data for this dataset is also freely available from 10X Genomics and is used in the Seurat tutorial[5]

这里直接下载官方提供的this link[6].

2-文件介绍

使用的数据来自:Kang et al, 2017[7].

除了包括表达矩阵之外,还有元数据信息:

  • 10X Genomics version 2 chemistry 建库;
  • Illumina NextSeq 500 测序;
  • 八个病人的PBMC 样本分成两组;
    • One aliquot of PBMCs was activated by 100 U/mL of recombinant IFN-β for 6 hours.
    • The second aliquot was left untreated.
    • After 6 hours, the eight samples for each condition were pooled together in two final pools (stimulated cells and control cells). We will be working with these two, pooled samples. (We did not demultiplex the samples because SNP genotype information was used to demultiplex in the paper and the barcodes/sample IDs were not readily available for this data. Generally, you would demultiplex and perform QC on each individual sample rather than pooling the samples.)
  • 控制与实验组细胞数分别为:12,138 and 12,167 cells were identified
  • Since the samples are PBMCs, we will expect immune cells, such as:
    • B cells
    • T cells
    • NK cells
    • monocytes
    • macrophages
    • possibly megakaryocytes

首先安装以下包:

代码语言:javascript复制
library(Seurat)
library(Matrix)

3-读取数据

首先是cellranger 给出的10x 的输出结果:

代码语言:javascript复制
> dir("./data/ctrl_raw_feature_bc_matrix/")
[1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"

介绍表达矩阵

  • barcodes.tsv
    • contains all cellular barcodes present for that sample. Barcodes are listed in the order of data presented in the matrix file (i.e. these are the column names).
  • features.tsv
    • 基因名与基因ID

  • matrix.mtx
    • 表达矩阵
    • mtx 是稀疏矩阵格式存储的矩阵数据
    • 详细参见:03b-关于10x输出表达矩阵mtx格式[8]

读取数据

两种常用的函数:

  • readMM(): This function is from the Matrix package and will convert our standard matrix into a sparse matrix. The features.tsv file and barcodes.tsv must first be individually loaded into R and then they can be combined. 详见these additional material[9]
  • Read10X(): This function is from the Seurat package and will use the Cell Ranger output directory as input, directly. With this method individual files do not need to be loaded in, instead the function will load and combine them into a sparse matrix. 这个项目中,我们会使用这个函数。

其中我们最关心的是raw_feature_bc_matrix,通常来说我们并不会使用cell ranger 的过滤方法,而是通过个性化的方案,进行过滤。

因此使用seruat 包的方法Read10X对其读取,我们直接指定cellranger输出的文件夹即可:

代码语言:javascript复制
# How to read in 10X data for a single sample (output is a sparse matrix)
ctrl_counts <- Read10X(data.dir = "data/ctrl_raw_feature_bc_matrix")

# Turn count matrix into a Seurat object (output is a Seurat object)
ctrl <- CreateSeuratObject(counts = ctrl_counts,
                           min.features = 100)

这里我们使用CreateSeuratObject 将矩阵转换为seurat 对象时,就已经对表达矩阵执行一次过滤了:

  • min.features 表示细胞最少被检测到的基因阈值,在该例中,检测基因不足100 的细胞就会被过滤掉了。

不难发现,seruat 对象包含的信息还是蛮多的:

这里我们首先探究一下meta.data:

代码语言:javascript复制
> head(ctrl@meta.data)
                    orig.ident nCount_RNA nFeature_RNA
AAACATACAATGCC-1 SeuratProject       2344          874
AAACATACATTTCC-1 SeuratProject       3125          896
AAACATACCAGAAA-1 SeuratProject       2578          725
AAACATACCAGCTA-1 SeuratProject       3261          979
AAACATACCATGCA-1 SeuratProject        746          362
AAACATACCTCGCT-1 SeuratProject       3519          866

三列含义为:

  • orig.ident: 样本标记,默认名SeuratProject
  • nCount_RNA: 每个细胞的UMI 检测数(每条mRNA的特定标记)
  • nFeature_RNA: 每个细胞的基因数

nFeature_RNA 数值必然是小于 nCount_RNA的,因为存在某个基因有多次被检测的情况。(生物学意义上考虑是,)

作者这里提到了一个批量读取的方法,其实也非常简单,利用assign 函数来批量命名对象:

代码语言:javascript复制
# Create a Seurat object for each sample
for (file in c("ctrl_raw_feature_bc_matrix", "stim_raw_feature_bc_matrix")){
        seurat_data <- Read10X(data.dir = paste0("data/", file))
        seurat_obj <- CreateSeuratObject(counts = seurat_data, 
                                         min.features = 100, 
                                         project = file)
        assign(file, seurat_obj)
}

这里我们就不重复读取了,直接把实验组数据stim_raw_feature_bc_matrix读取进来:

代码语言:javascript复制
stim_counts <- Read10X(data.dir = "data/ctrl_raw_feature_bc_matrix")

# Turn count matrix into a Seurat object (output is a Seurat object)
stim <- CreateSeuratObject(counts = ctrl_counts,
                           min.features = 100)

此外,我们还可以看一下这个数据框的列数:

代码语言:javascript复制
> dim(ctrl@meta.data)
[1] 15688     3
> dim(stim@meta.data)
[1] 15756     3

也就对应了两个样本中检测到的细胞数目(经过初步min.features过滤)。

其他10x分析流程输出读取方式

虽然通常cellranger 是10x 数据上游分析标配,但如果是使用其他分析流程,也有对应的R包:alevin output can be read into R using the tximeta[10] package, while kallisto-bustools output can be read using the BUSpaRse[11] package.

4-合并多个seurat对象

作者这里直接用merge 合并的,好家伙,还是个S3 对象:

代码语言:javascript复制
# Create a merged Seurat object
merged_seurat <- merge(x = ctrl_raw_feature_bc_matrix, 
                       y = stim_raw_feature_bc_matrix, 
                       add.cell.id = c("ctrl", "stim"))

ps:merge 的话,是不是就不好批量了呢?

再来看看合并后的meta.data:

代码语言:javascript复制
> head(merged_seurat@meta.data)
                         orig.ident nCount_RNA nFeature_RNA
ctrl_AAACATACAATGCC-1 SeuratProject       2344          874
ctrl_AAACATACATTTCC-1 SeuratProject       3125          896
ctrl_AAACATACCAGAAA-1 SeuratProject       2578          725
ctrl_AAACATACCAGCTA-1 SeuratProject       3261          979
ctrl_AAACATACCATGCA-1 SeuratProject        746          362
ctrl_AAACATACCTCGCT-1 SeuratProject       3519          866
> tail(merged_seurat@meta.data)
                         orig.ident nCount_RNA nFeature_RNA
stim_TTTGCATGCGACAT-1 SeuratProject        620          295
stim_TTTGCATGCTAAGC-1 SeuratProject       1641          545
stim_TTTGCATGGGACGA-1 SeuratProject       1233          518
stim_TTTGCATGGTGAGG-1 SeuratProject       1084          469
stim_TTTGCATGGTTTGG-1 SeuratProject        818          432
stim_TTTGCATGTCTTAC-1 SeuratProject       1104          438

果然数据框的cell barcode 加上了前缀。

此外,我们在读取数据的时候,就可以直接一次性将全部样本的矩阵读入:

代码语言:javascript复制
> dirA <- "data/ctrl_raw_feature_bc_matrix"
> dirB <- "data/stim_raw_feature_bc_matrix"
> tmp <- Read10X(c(dirA,dirB))

不过需要注意的是,默认下,read10X 仅仅会将barcode 中多个样本存在的cell barcode 加上前缀用以标记,因此并不建议这样合并:

代码语言:javascript复制
> head(merged_tmp@meta.data)
 orig.ident nCount_RNA nFeature_RNA
AAACATACAATGCC-1 SeuratProject       2344          874
AAACATACATTTCC-1 SeuratProject       3125          896
AAACATACCAGAAA-1 SeuratProject       2578          725
AAACATACCAGCTA-1 SeuratProject       3261          979
AAACATACCATGCA-1 SeuratProject        746          362
AAACATACCTCGCT-1 SeuratProject       3519          866 > tail(merged_tmp@meta.data)
 orig.ident nCount_RNA nFeature_RNA
2_TTTGCATGCGACAT-1 SeuratProject        620          295
2_TTTGCATGCTAAGC-1 SeuratProject       1641          545
2_TTTGCATGGGACGA-1 SeuratProject       1233          518
2_TTTGCATGGTGAGG-1 SeuratProject       1084          469
2_TTTGCATGGTTTGG-1 SeuratProject        818          432
2_TTTGCATGTCTTAC-1 SeuratProject       1104          438

5-关于seurat的数据结构

参考:【单细胞测序数据分析-1】认识Seurat对象数据结构/数据格式及操作 - 简书[12]作者:森尼啊

我个人还是比较喜欢seurat 这种结构的,各种槽,用起来也很直观。

参考资料

[1]

Single-cell RNA-seq: Quality Control Analysis | Introduction to single-cell RNA-seq (hbctraining.github.io): https://hbctraining.github.io/scRNA-seq_online/lessons/03_SC_quality_control-setup.html

[2]

GSE96583: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE96583

[3]

SRP102802: https://www-ncbi-nlm-nih-gov.ezp-prod1.hul.harvard.edu/sra?term=SRP102802

[4]

Cell Ranger: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger

[5]

Seurat tutorial: https://satijalab.org/seurat/v3.0/immune_alignment.html

[6]

this link: https://www.dropbox.com/s/we1gmyb9c8jej2u/single_cell_rnaseq.zip?dl=1

[7]

Kang et al, 2017: https://www.nature.com/articles/nbt.4042

[8]

03b-关于10x输出表达矩阵mtx格式: 03b-关于10x输出表达矩阵mtx格式.md

[9]

these additional material: https://hbctraining.github.io/scRNA-seq_online/lessons/readMM_loadData.html

[10]

tximeta: https://bioconductor.org/packages/3.14/tximeta

[11]

BUSpaRse: https://bioconductor.org/packages/3.14/BUSpaRse

[12]

【单细胞测序数据分析-1】认识Seurat对象数据结构/数据格式及操作 - 简书: https://www.jianshu.com/p/0c4bc6a932b2

0 人点赞