Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
当我们拿到单细胞转录组数据的时候,不管结果怎么样,翻来覆去发现其实就是一个基因和细胞的矩阵而已!一般基于这个矩阵,会先过滤细胞均一化,降维,聚类(最重要的一步,因为它可以发现细胞的异质性),聚类结果的差异分析(声称找到了每个亚群的marker基因),最后做一下数据的可视化。
且慢,用非监督的方式聚出来的类(一般叫做
为分的类数),既然是细胞异质性的体现,那么它到底是什么细胞呢,可以叫做什么,是CD4细胞呢还是microglia细胞?因为
是没有意义的,只有起了像CD4这样的名字,这类细胞才有故事可以讲。
在定义细胞类型之前,需要确定就哪种聚类结果来做,是图聚类的结果还是k-means某一类的结果。如何来确定?看看分群的tsne图是一个不错的参考。
那么呢,之前你们家大师推荐过几个数据库single cell marker 基因数据库是基于某亚群细胞的marker基因,根据已经知道marker基因的细胞类型,这些数据库其实就是一个细胞类型和marker基因大表。根据marker基因列表与分析出来的差异基因列表以及基因丰度的关系(往往是做一个热图)来推断某个cluster属于哪一种细胞类型。
另外还有一种定义细胞类型的方法是通过每个cluster与已知细胞类型的表达谱的相似性来定义细胞类型,已有的方法为SingleR、celaref。均是R包,其中celaref是2019年的新品,也是本文主要介绍的一种方法。
celaref 简介
走遍示例流程
安装celaref包的时候,就把示例数据也下载了,首先载入数据,然后观察一下示例数据是怎么组织的,以便我们以后组织自己的数据。在接触一个新包的时候,最好的学习方法就是用示例数据走一遍然后看看有哪些功能(函数以及函数的参数可以用),多用help或者?来查看细节。一个R包封装的越好也就要求我们花时间去了解他的细节。
library(celaref)# Paths to data files.counts_filepath.query <- system.file("extdata", "sim_query_counts.tab", package = "celaref") cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref") counts_filepath.ref <- system.file("extdata", "sim_ref_counts.tab", package = "celaref") cell_info_filepath.ref <- system.file("extdata", "sim_ref_cell_info.tab", package = "celaref")# Load datatoy_ref_se <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref) head(toy_ref_se) toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)
可见我载入了两个数据:一个是toy_ref_se,reference;一个是toy_query_se,query.都是SummarizedExperiment对象:
对数据进行过滤,help一下这个函数,看看都有哪些过滤条件。
# Filter datatoy_ref_se <- trim_small_groups_and_low_expression_genes(toy_ref_se) toy_query_se <- trim_small_groups_and_low_expression_genes(toy_query_se)
对参考和需要鉴定的表达谱分别做差异分析:
# Setup within-experiment differential expressionde_table.toy_ref <- contrast_each_group_to_the_rest(toy_ref_se, dataset_name="ref") de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se, dataset_name="query")
分别得到差异基因结果:
绘制一组小提琴图,显示每个查询组的“top”基因在参考数据库中的分布。
# Plotmake_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)
根据参考数据集的相似性,在查询数据集中构造一些合理的标签或组/集群。
# And get group labelsmake_ref_similarity_names(de_table.toy_query, de_table.toy_ref)
可以看到除了Group2 没有相应的similarity 类似的type之外,其余的都找到了相对应的细胞类型(基于检验的,不是生物学的)。这当然可以为我们定义细胞类型做一个统计上的参考。
准备数据
那么,如何应用我们自己的数据来做呢?官网也给出了数据准备的方法:
在准备数据时需要以下数据:
A.
Counts Matrix Number of gene tags per gene per cell.
B.
Cell information A sample-sheet table of cell-level information. Only two fields are essential: cell_sample: A unique cell identifier group: The cluster/group to which the cell has been assigned.
C.
Gene information Optional. A table of extra gene-level information. ID: A unique gene identifier
输入数据
01
tab键分割的ounts matrix
gene | Cell1 | cell2 | cell3 | cell4 | … | cell954 |
---|---|---|---|---|---|---|
GeneA | 0 | 1 | 0 | 1 | … | 0 |
GeneB | 0 | 3 | 0 | 2 | … | 2 |
GeneC | 1 | 40 | 1 | 0 | … | 0 |
.... |
02
tab键分割的细胞分群信息
CellId | Sample | Cluster |
---|---|---|
cell1 | Control | cluster1 |
cell2 | Control | cluster7 |
… | … | … |
cell954 | KO | cluster8 |
From 10X pipeline output
dataset_se <- load_dataset_10Xdata('~/path/to/data/10X_mydata', dataset_genome = "GRCh38", clustering_set = "kmeans_7_clusters")
拿一个10X的例子试试吧
Prepare 10X query dataset
先要把数据下载好,发现这应用的对象还是cellranger V2 pipeline的输出结果,注意为了和官网数据格式保持一致,我把
文件夹命名为
了,这个前后一致就可以了。
代码语言:javascript复制 --- 10X_pbmc4k| --- analysis
| | --- clustering
| | | --- graphclust| | | | --- clusters.csv| | | --- kmeans_10_clusters
| | | | --- clusters.csv
| | | --- kmeans_2_clusters
···| | --- diffexp| | | --- graphclust
| | | | --- differential_expression.csv
| | | --- kmeans_10_clusters
···| | --- pca| | | --- 10_components
| | | | --- components.csv
| | | | --- dispersion.csv
| | | | --- genes_selected.csv
| | | | --- projection.csv
| | | | --- variance.csv
| | --- tsne
| | | --- 2_components| | | | --- projection.csv| --- filtered_gene_bc_matrices
| | --- GRCh38
| | | --- barcodes.tsv| | | --- genes.tsv
| | | --- matrix.mtx
代码语言:javascript复制datasets_dir <- "D:\Users\Administrator\Desktop\RStudio\single_cell\celaref/celaref_extra_vignette_data/datasets"dir(datasets_dir)
dataset_se.10X_pbmc4k_k7 <- load_dataset_10Xdata(
dataset_path = file.path(datasets_dir,'10X_pbmc4k'),
dataset_genome = "GRCh38",
clustering_set = "kmeans_7_clusters",
id_to_use = "GeneSymbol")
dataset_se.10X_pbmc4k_k7 <- trim_small_groups_and_low_expression_genes(dataset_se.10X_pbmc4k_k7)
可见我选择的是kmeans_7_clusters 的聚类结果来进行细胞定义。处理过之后:
> dataset_se.10X_pbmc4k_k7class: SummarizedExperiment dim: 15407 4340 metadata(0): assays(1): ''rownames(15407): A1BG A1BG-AS1 ... ZZEF1 ZZZ3 rowData names(4): ensembl_ID GeneSymbol ID total_count colnames(4340): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1colData names(2): cell_sample group
下一步是对数据的转化和处理,也是one-to-others的DE分析,非常的慢和消耗内存,有可能跑断的。关键是这个函数做了是什么?其实就是把我们的矩阵文件整理成上面演示的
格式。
de_table.10X_pbmc4k_k7 <- contrast_each_group_to_the_rest(dataset_se.10X_pbmc4k_k7, dataset_name="10X_pbmc4k_k7", num_cores=7) Sorry, multithreading not supported on windows. Use linux, or set num_threads = 1 to suppress this message.Running single threaded # 这个sorry 可以忽略 ,在Linux下才能指定运行进程数。`fData` has no primerid. I'll make something up. `cData` has no wellKey. I'll make something up. Assuming data assay in position 1, with name et is log-transformed.
Done! Combining coefficients and standard errors Calculating log-fold changes Calculating likelihood ratio tests Refitting on reduced model...
Done!`fData` has no primerid. I'll make something up. `cData` has no wellKey. I'll make something up. Assuming data assay in position 1, with name et is log-transformed. Completed [=======>----------------------------------------------------------------------------------------------------------------] 7% with 0 failures
windows下居然跑了一上午
准备 reference数据集
来自响应的数据库,因为我们的是血细胞,选择的是haemosphere .
A suitable PBMC reference (a ‘HaemAtlas’) has been published by
. They purified populations of PBMC cell types and measured gene expression via microarray. The data used here was downloaded in a normalised table from the
website (Graaf et al. 2016).
下载页面
下载完之后由于是微阵列数据需要做一些特殊处理。
- Logged, normalised expression values. Any low expression or poor quality measurements should have already been removed.
- Sample information (see contrast_each_group_to_the_rest_for_norm_ma_with_limma for details)
library(readr) this_dataset_dir <- file.path(datasets_dir, 'haemosphere_datasets','watkins') norm_expression_file <- file.path(this_dataset_dir, "watkins_expression.txt") samples_file <- file.path(this_dataset_dir, "watkins_samples.txt") norm_expression_table.full <- read.table(norm_expression_file, sep="t", header=TRUE, quote="", comment.char="", row.names=1, check.names=FALSE) samples_table <- read_tsv(samples_file, col_types = cols()) samples_table$description <- make.names( samples_table$description) # Avoid group or extra_factor names starting with numbers, for microarrays> samples_table$description [1] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" [8] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X63.years.adult" "X63.years.adult"[15] "X63.years.adult" "X63.years.adult" "X63.years.adult" "X63.years.adult" "X53.years.adult" "X53.years.adult" "X53.years.adult"[22] "X53.years.adult" "X53.years.adult" "X53.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult"[29] "X60.years.adult" "X60.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult"[36] "X48.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult"[43] "NA." "NA." "NA." "NA." "NA." "NA." "NA." [50] "NA."
从样本表中可以看出,这个数据集包含了其他组织,但是作为PBMC的参考,我们只考虑外周血样本。与其他数据加载函数一样,要从分析中删除一个样本(或单元格),从样本表中删除它就ok了。
amples_table <- samples_table[samples_table$tissue == "Peripheral Blood",] > samples_table# A tibble: 42 x 7 sampleId celltype cell_lineage surface_markers tissue description notes <chr> <chr> <chr> <lgl> <chr> <chr> <lgl> 1 1674120023_A B lymphocyte B Cell Lineage NA Peripheral Blood X49.years.adult NA 2 1674120023_B granulocyte Neutrophil Lineage NA Peripheral Blood X49.years.adult NA 3 1674120023_C natural killer cell NK Cell Lineage NA Peripheral Blood X49.years.adult NA 4 1674120023_D Th lymphocyte T Cell Lineage NA Peripheral Blood X49.years.adult NA 5 1674120023_E Tc lymphocyte T Cell Lineage NA Peripheral Blood X49.years.adult NA 6 1674120023_F monocyte Macrophage Lineage NA Peripheral Blood X49.years.adult NA 7 1674120053_A B lymphocyte B Cell Lineage NA Peripheral Blood X49.years.adult NA 8 1674120053_B monocyte Macrophage Lineage NA Peripheral Blood X49.years.adult NA 9 1674120053_C granulocyte Neutrophil Lineage NA Peripheral Blood X49.years.adult NA 10 1674120053_D Tc lymphocyte T Cell Lineage NA Peripheral Blood X49.years.adult NA # ... with 32 more rows
通常情况下,最难的部分是格式化输入。应该使用与查询数据集相同的id,将微阵列表达式值作为规范化的、日志转换的数据。任何探头或样品级的过滤也应事先进行。在这种情况下,从网站获得的数据是正常的,但仍然需要匹配id。
library("tidyverse") library("illuminaHumanv2.db") probes_with_gene_symbol_and_with_data <- intersect(keys(illuminaHumanv2SYMBOL),rownames(norm_expression_table.full))# Get mappings - non NAprobe_to_symbol <- select(illuminaHumanv2.db, keys=rownames(norm_expression_table.full), columns=c("SYMBOL"), keytype="PROBEID") probe_to_symbol <- unique(probe_to_symbol[! is.na(probe_to_symbol$SYMBOL),])# no multimapping probesgenes_per_probe <- table(probe_to_symbol$PROBEID) # How many genes a probe is annotated against?multimap_probes <- names(genes_per_probe)[genes_per_probe > 1] probe_to_symbol <- probe_to_symbol[!probe_to_symbol$PROBEID %in% multimap_probes, ] convert_expression_table_ids<- function(expression_table, the_probes_table, old_id_name, new_id_name){ the_probes_table <- the_probes_table[,c(old_id_name, new_id_name)] colnames(the_probes_table) <- c("old_id", "new_id") # Before DE, just pick the top expresed probe to represent the gene # Not perfect, but this is a ranking-based analysis. # hybridisation issues aside, would expect higher epressed probes to be more relevant to Single cell data anyway. probe_expression_levels <- rowSums(expression_table) the_probes_table$avgexpr <- probe_expression_levels[as.character(the_probes_table$old_id)] the_genes_table <- the_probes_table %>% group_by(new_id) %>% top_n(1, avgexpr) expression_table <- expression_table[the_genes_table$old_id,] rownames(expression_table) <- the_genes_table$new_id return(expression_table) }# Just the most highly expressed probe foreach gene.norm_expression_table.genes <- convert_expression_table_ids(norm_expression_table.full, probe_to_symbol, old_id_name="PROBEID", new_id_name="SYMBOL")# Go...de_table.Watkins2009PBMCs <- contrast_each_group_to_the_rest_for_norm_ma_with_limma( norm_expression_table = norm_expression_table.genes, sample_sheet_table = samples_table, dataset_name = "Watkins2009PBMCs", extra_factor_name = 'description', sample_name = "sampleId", group_name = 'celltype') > de_table.Watkins2009PBMCs# A tibble: 113,376 x 21 ID logFC AveExpr t P.Value adj.P.Val B ci CI_upper CI_lower ci_inner ci_outer log2FC fdr group sig sig_up <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <lgl> <lgl> 1 JCHA~ 6.86 8.65 27.5 7.13e-26 2.70e-23 49.0 0.0952 6.96 6.77 6.77 6.96 6.86 2.70e-23 B ly~ TRUE TRUE 2 FCRLA 6.41 8.77 23.7 1.13e-23 3.29e-21 44.0 0.103 6.51 6.30 6.30 6.51 6.41 3.29e-21 B ly~ TRUE TRUE 3 VPRE~ 6.18 8.66 40.2 1.20e-31 8.70e-29 62.0 0.0587 6.23 6.12 6.12 6.23 6.18 8.70e-29 B ly~ TRUE TRUE 4 TCL1A 6.09 8.52 42.8 1.33e-32 1.14e-29 64.1 0.0544 6.14 6.04 6.04 6.14 6.09 1.14e-29 B ly~ TRUE TRUE 5 CD79A 6.09 9.08 22.7 5.17e-23 1.40e-20 42.4 0.102 6.20 5.99 5.99 6.20 6.09 1.40e-20 B ly~ TRUE TRUE 6 CD19 5.90 8.44 38.4 6.09e-31 4.11e-28 60.5 0.0587 5.96 5.85 5.85 5.96 5.90 4.11e-28 B ly~ TRUE TRUE 7 HLA-~ 5.61 8.35 47.9 2.30e-34 2.56e-31 68.0 0.0447 5.66 5.57 5.57 5.66 5.61 2.56e-31 B ly~ TRUE TRUE 8 OSBP~ 5.48 7.83 53.2 5.60e-36 8.82e-33 71.4 0.0394 5.52 5.45 5.45 5.52 5.48 8.82e-33 B ly~ TRUE TRUE 9 CNTN~ 5.38 7.95 49.3 8.12e-35 9.60e-32 68.9 0.0416 5.42 5.34 5.34 5.42 5.38 9.60e-32 B ly~ TRUE TRUE 10 COBL~ 5.26 7.70 56.4 6.77e-37 1.28e-33 73.3 0.0356 5.30 5.23 5.23 5.30 5.26 1.28e-33 B ly~ TRUE TRUE # ... with 113,366 more rows, and 4 more variables: gene_count <int>, rank <int>, rescaled_rank <dbl>, dataset <chr>
细胞类型定义(Compare 10X query PBMCs to to reference)
数据准备好就和之前的示例是一样的了。
make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs)
Hmm, there’s a few clusters where different the top genes are bunched near the top for a couple of different reference cell types.
Logging the plot will be more informative at the top end for this dataset.
make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs, log10trans = TRUE)
可以打标签啦。
label_table.pbmc4k_k7_vs_Watkins2009PBMCs <- make_ref_similarity_names(de_table.10X_pbmc4k_k7, de_table.Watkins2009PBMCs)label_table.pbmc4k_k7_vs_Watkins2009PBMCs
可以看出,七个群有六个找到了与之相似的群名称,并给出了pval.
make_ranking_violin_plot(de_table.test=de_table.Watkins2009PBMCs, de_table.ref=de_table.10X_pbmc4k_k7, log10trans = TRUE)
行啦,人pbmc细胞的定义工作就完成了。官网也提供了小鼠细胞类型识别的教程。这里就不再复述啦。
完成了细胞类型定义,单细胞的故事才刚刚开始。所以,这不是结束,甚至不是结束的开始,而可能是开始的结束。
本文转载自简书
© 著作权归作者所有