浸润性导管和小叶乳腺癌细胞的单细胞转录组异质性

2021-07-29 11:33:10 浏览数 (1)

考虑到咱们生信技能树粉丝对单细胞数据挖掘的需求,我开通了一个专栏《100个单细胞转录组数据降维聚类分群图表复现》,也亲自示范了几个,不过自己带娃,读博,时间精力有限,所以把剩余的90多个任务安排了学徒,实习生,学员。真的是太棒了,群策群力!

下面七月份学徒的投稿

今天介绍的文章是2021年1月发表在cancer research杂志 :《Single-Cell Transcriptomic Heterogeneity in Invasive Ductal and Lobular Breast Cancer Cells》,链接是 https://pubmed.ncbi.nlm.nih.gov/33148662/

这个数据的研究目标是:To quantify ITH between cell lines, referred to as ICH (Inter-Cellular Heterogeneity), and investigate differences between IDC and ILC,:

  • Invasive lobular breast carcinoma (ILC),
  • Invasive ductal cancer subtype (IDC).

数据链接是:https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE144320

只有一个10X样品样本:

代码语言:javascript复制
GSM4285803 All_CellLines

作者提供了该样本的三个文件,但并不是我们通常说的10x的3个文件:

GEO页面

这个单细胞文章仅仅是单个10X样品,但是测8个细胞系。

  • MCF7 (n=977)
  • T47D WT (n=509)
  • T47D KO (n=491)
  • MM134 (n=439)
  • SUM44 (n=314)
  • BCK4 (n=512)
  • MCF10A (n=491)
  • HEK293T(n=881).

作者使用六个公共数据集来进行建立参考数据集与自己的数据进行相关性分析,进行细胞注释:

代码语言:javascript复制
1,444 genes were present in all the three data sources (breast cell lines, HEK293T as reference  186 cell lines; 
and scRNA-seq (3,000 genes by 4,614 cells) as query cells);
Two external datasets with scRNA-seq data of cell lines were integrated with our data for CNA  inference - MCF7 and T47D cells from Hong et al.(29), and three MCF7 strains (WT-3,4,5)  from Ben-David et al;
A raw count matrix from Hong et al was directly downloaded  from GSE122743 while FASTQ files from Ben-David

本次要复现的文献图表如下所示:

要复现的文献图表

是单细胞转录组数据分析的标准降维聚类分群,并且进行生物学注释后的结果。可以参考前面的例子:人人都能学会的单细胞聚类分群注释 ,我们演示了第一层次的分群。

如果你对单细胞数据分析还没有基础认知,可以看基础10讲:

  • 01. 上游分析流程
  • 02.课题多少个样品,测序数据量如何
  • 03. 过滤不合格细胞和基因(数据质控很重要)
  • 04. 过滤线粒体核糖体基因
  • 05. 去除细胞效应和基因效应
  • 06.单细胞转录组数据的降维聚类分群
  • 07.单细胞转录组数据处理之细胞亚群注释
  • 08.把拿到的亚群进行更细致的分群
  • 09.单细胞转录组数据处理之细胞亚群比例比较

下面我们开始复现:

在运行环境下创建一个GSE144320_RAW文件夹,里面包含下面两个文件:

从GEO下载的文件

step1:读取表达矩阵文件,然后构建Seurat对象

构建Seurat对象后保存为 sce.Rds 文件,供后续使用

代码语言:javascript复制
suppressPackageStartupMessages(library(Seurat))
suppressPackageStartupMessages(library(ggplot2))
suppressPackageStartupMessages(library(clustree)) 
suppressPackageStartupMessages(library(cowplot))
suppressPackageStartupMessages(library(dplyr)) 
suppressPackageStartupMessages(library(data.table)) 

# 构建seurat对象
# 读取rawcounts矩阵
a <- textshape::column_to_rownames(fread("../GSE144320_RAW/GSM4285803_scRNA_RawCounts.csv.gz"),
                                   loc = 1)
sce <- CreateSeuratObject(counts = as.data.frame(t(a)))
# 计算线粒体基因比例
(mito_genes <- rownames(sce[grep("^MT-", rownames(sce))]))
sce <- PercentageFeatureSet(sce, "^MT-", col.name = "percent_Mito_rep")
sce@meta.data$percent_Mito_rep[1:6]
head(sce@meta.data)
#计算周期评分
sce = NormalizeData(sce)
sce <- CellCycleScoring(sce,
                        g2m.features = cc.genes$g2m.genes,
                        s.features = cc.genes$s.genes)
# 读取文章meta信息
meta <- fread("../GSE144320_RAW/GSM4285803_scRNA_metaInfo.csv.gz")
meta <- textshape::column_to_rownames(meta, 1)
#将作者提供表型信息添加到seurat对象中
sce <- AddMetaData(sce, metadata = meta)
# 保存数据
saveRDS(sce, file = "sce.Rds")
step2:降维聚类分群

仍然是面的例子:人人都能学会的单细胞聚类分群注释

代码语言:javascript复制

##### step2 降维聚类分群#####
#读入数据
sce <- readRDS("sce.Rds")
#数据预处理 

sce <- FindVariableFeatures(sce, nfeatures = 3000)
sce
# 默认后续使用 HVGs 进行 特征寻找
sce = ScaleData(sce 
                # features = rownames(sce),
                #vars.to.regress = c("nFeature_RNA", "percent_mito")
)  #加上vars.to.regress参数后运行so slow!!!
# 降维处理
sce <- RunPCA(sce, npcs = 30)
sce <- RunTSNE(sce, dims = 1:30)
sce <- RunUMAP(sce, dims = 1:30)

# 聚类分析
sce <- FindNeighbors(sce, dims = 1:30)
for (res in c(0.01, 0.05, 0.1, 0.2, 0.3, 0.5,0.8,1)) {
  sce <- FindClusters(sce, graph.name = "RNA_snn", resolution = res, algorithm = 1)
}
ls <- apply(sce@meta.data[,grep("RNA_snn_res",colnames(sce@meta.data))],2,table)
#聚类树可视化
colors <- c('#e41a1c','#377eb8','#4daf4a','#984ea3','#ff7f00','#ffff33','#a65628','#f781bf')
names(colors) <- unique(colnames(sce@meta.data)[grep("RNA_snn_res", colnames(sce@meta.data))]) 

p2_tree <- clustree(sce@meta.data, prefix = "RNA_snn_res.") 
p2_tree

不同参数的分群效果

文献里面提到了是8个不同细胞系,所以我们这里使用resolution = 0.2 进行后续分析即可。

代码语言:javascript复制
# 取resolution = 0.2 进行后续分析
colors <- c('#e41a1c','#377eb8','#4daf4a','#984ea3','#ff7f00','#ffff33','#a65628','#f781bf')
p <- DimPlot(sce, reduction = "umap", group.by = "RNA_snn_res.0.2", label = T, cols = colors) 
  scale_y_continuous(expand = c(0, 0))  
  scale_x_discrete(position = "bottom")  
  scale_fill_manual( breaks = rev(names(colors)),values = colors)  
  theme(
        plot.title = element_text(family = "serif", hjust = 0.5, vjust = 0, colour = "black"),
        # 坐标轴标题
        axis.title.x = element_text(family = "serif",face = "bold",colour = "chocolate4", hjust = 0.5, vjust = 0),
        axis.title.y = element_text(family = "serif",face = "bold",colour = "chocolate4", hjust = 0.5, vjust = 0.5),
        # # 坐标轴刻度
        # axis.text.x = element_text(family = "Yahei", vjust = 0.5, angle = 45),
        # axis.text.y = element_text(family = "Yahei" ),
        # axis.ticks = element_line(colour = "gray0", size = 0.9, linetype = "blank"),
        # 图例
        legend.title = element_text(hjust = 0.55,face = "bold", colour = "chocolate4",family = "serif"),
        legend.direction = "vertical",
        legend.text = element_text(family = "serif"),
        legend.key = element_rect(fill = "aliceblue"),
        legend.background = element_blank()
        )
p

第一次分群结果

这个时候每个细胞亚群仅仅是顺序标号而已,后面需要进行生物学注释。

step3:细胞亚群的生物学注释

根据之前的教程:混合到同一个10X样品里面的多个细胞系如何注释 可以获得与每个簇对应的细胞系,之后进行手动注释;

代码语言:javascript复制
# cluster0  MCF7
# cluster1  HEK293
# cluster2  T47D
# cluster3  BCK4 # 排除法
# cluster4  T47D
# cluster5  MCF10A
# cluster6  MM134
# cluster7  SUM44

celltype <- data.frame(ClusterID = 0:7, 
                       celltype = c("MCF7","HEK293T","T47D WT","BCK4","T47D KO","MCF10A","MM134","SUM44"))
sce@meta.data$celltype_rep = "NA"
for(i in 1:nrow(celltype)){
  sce@meta.data[which(sce@meta.data$RNA_snn_res.0.2 == celltype$ClusterID[i])
                ,'celltype_rep'] <- celltype$celltype[i]}
#umap展示
colors <- c('#e41a1c','#377eb8','#4daf4a','#984ea3','#ff7f00','#ffff33','#a65628','#f781bf')
names(colors) <- unique(sce@meta.data$celltype_rep) 

p_rep <- DimPlot(sce, reduction = "umap", group.by = "celltype_rep", label = T, cols = colors)   
  scale_y_continuous(expand = c(0, 0))  
  scale_x_discrete(position = "bottom")  
  scale_fill_manual( breaks = rev(names(colors)),values = colors)  
  theme(
    plot.title = element_text(family = "serif", hjust = 0.5, vjust = 0, colour = "black"),
    # 坐标轴标题
    axis.title.x = element_text(family = "serif",face = "bold",colour = "chocolate4", hjust = 0.5, vjust = 0),
    axis.title.y = element_text(family = "serif",face = "bold",colour = "chocolate4", hjust = 0.5, vjust = 0.5),
    # # 坐标轴刻度
    # axis.text.x = element_text(family = "Yahei", vjust = 0.5, angle = 45),
    # axis.text.y = element_text(family = "Yahei" ),
    # axis.ticks = element_line(colour = "gray0", size = 0.9, linetype = "blank"),
    # 图例
    legend.title = element_text(hjust = 0.55,face = "bold", colour = "chocolate4",family = "serif"),
    legend.direction = "vertical",
    legend.text = element_text(family = "serif"),
    legend.key = element_rect(fill = "aliceblue"),
    legend.background = element_blank()
  )
p_rep 

image-20210723205009035

注释,根据文献里面的基因,检查每个细胞亚群的marker基因:

代码语言:javascript复制
genes_in_paper <- c("BCAS3","PSMD6", "CSDE1", "RBP7" ,"TP53I3" ,"SERPINA6" ,
                    "GNG11" ,"CLEC2B" ,"AZGP1" ,"CCND1", "AAMDC" ,"RSF1", "CNN3", "ARHGAP29" ,"LAPTM4A" ,
                    "SCGB1D2" ,"SCGB2A2" ,"S100A1","KRT81", "KRT17", "KRT6A", "CKB","LDHB","RPL5")
colnames(sce@meta.data)
sce@meta.data$testclass <- factor(sce@meta.data$celltype_rep,
                             levels = rev(c("MCF7","T47D WT","T47D KO","MM134",
                                            "SUM44","BCK4","MCF10A","HEK293T")))
Idents(sce)=sce$testclass
str(Idents(sce))


library(ggplot2)
p <- DotPlot(sce, assay = "RNA", features = genes_in_paper, cols = c("#bababa", "#ca0020"),) 
  #scale_y_continuous(expand = c(0, 0))  
  scale_x_discrete(position = "bottom")  
  scale_fill_manual( breaks = rev(names(colors)),values = colors)  
  theme(plot.margin=margin(t = 0, b = 0, unit="pt"),
        plot.subtitle = element_text(family = "serif",  colour = "gray0"),
        #plot.background = element_rect(fill = "aliceblue"),
        plot.title = element_text(face = "bold", family = "serif", hjust = 0.5, vjust = 0, colour = "black"),
        #  背景板
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        panel.background = element_blank(),
        # 坐标轴标题
        axis.title.x = element_text(family = "serif",face = "bold",colour = "chocolate4", hjust = 0.5, vjust = 0),
        axis.title.y = element_text(family = "serif",face = "bold",colour = "chocolate4", hjust = 0.5, vjust = 0.5),
        # 坐标轴刻度
        axis.text.x = element_text(family = "serif", vjust = 0.5, angle = 90),
        axis.text.y = element_text(family = "serif" ),
        axis.ticks = element_line(colour = "gray0", size = 0.9, linetype = "blank"),
        # 图例
        legend.title = element_text(hjust = 0.55,face = "bold", colour = "chocolate4",family = "serif"),
        legend.direction = "vertical",
        legend.text = element_text(family = "serif"),
        legend.key = element_rect(fill = "aliceblue"),
        #legend.background = element_rect(fill = "aliceblue")
  )
p

每个细胞亚群的marker基因

这样就完成了 《Single-Cell Transcriptomic Heterogeneity in Invasive Ductal and Lobular Breast Cancer Cells》这个文献里面的降维聚类分群和生物学注释图表。

0 人点赞