不知道10x单细胞转录组样品和fastq文件的对应关系

2022-03-03 13:08:29 浏览数 (1)

我们在单细胞天地多次分享过cellranger流程的笔记,大家可以自行前往学习,如下:

  • 单细胞实战(一)数据下载
  • 单细胞实战(二) cell ranger使用前注意事项
  • 单细胞实战(三) Cell Ranger使用初探
  • 单细胞实战(四) Cell Ranger流程概览
  • 单细胞实战(五) 理解cellranger count的结果

因为这个流程其实是需要10X单细胞转录组的fastq文件,而且呢,命名是有规则的!

如果你的样品被分散到了多个library、flowcell,就会出现一个样品有84个fastq文件的情况,恰好我看到了一个文献里面的数据就是这样的情况。该研究于2021年3月发表在《Nature Communications》杂志的文章, 标题是:《Time-resolved single-cell analysis of Brca1 associated mammary tumourigenesis reveals aberrant differentiation of luminal progenitors》,链接是:https://www.nature.com/articles/s41467-021-21783-3

如下所示的一个样品,是SIGAA11, 足足有84个fastq文件,如果你仔细观察这84个fastq文件的名字,就会发现规律,如果以下划线为分隔符,那么

  • 第2列是S37到S40这4种情况
  • 第3列是L003到L009这7种情况
  • 第五列是R1,R1,I1这样的3种情况

总共就是 4x7x3=84个fastq文件。

当然了,并不是每个10X样品都有84个fastq文件哈。甚至绝大多数情况下,就3个文件,或者两个文件,都是可以跑我们前面分享过cellranger流程。

接下来我要介绍的一种特殊情况是,有44个fastq文件,但是却没办法对应到10x的样品:

代码语言:javascript复制
SRR15860129 week 11 MMTV-PyMT Week11-1
SRR15860128 week 11 MMTV-PyMT Week11-2
SRR15860127 week 11 MMTV-PyMT Week11-3
SRR15860126 week 11 MMTV-PyMT Week11-4
SRR15860125 week 17 MMTV-PyMT Week17-1
SRR15860124 week 17 MMTV-PyMT Week17-2
SRR15860123 week 17 MMTV-PyMT Week17-3
SRR15860122 week 17 MMTV-PyMT Week17-4
SRR15860120 week 17 MMTV-PyMT Week17-5
SRR15860119 week 17 MMTV-PyMT Week17-6
SRR15860118 week 17 MMTV-PyMT Week17-7
SRR15860117 week 17 MMTV-PyMT Week17-8
SRR15860155 week 7 MMTV-PyMT Week7-1
SRR15860154 week 7 MMTV-PyMT Week7-2
SRR15860143 week 7 MMTV-PyMT Week7-3
SRR15860132 week 7 MMTV-PyMT Week7-4
SRR15860121 week 9 MMTV-PyMT Week9-1
SRR15860150 week 9 MMTV-PyMT Week9-10
SRR15860149 week 9 MMTV-PyMT Week9-11
SRR15860148 week 9 MMTV-PyMT Week9-12
SRR15860147 week 9 MMTV-PyMT Week9-13
SRR15860146 week 9 MMTV-PyMT Week9-14
SRR15860145 week 9 MMTV-PyMT Week9-15
SRR15860144 week 9 MMTV-PyMT Week9-16
SRR15860142 week 9 MMTV-PyMT Week9-17
SRR15860141 week 9 MMTV-PyMT Week9-18
SRR15860140 week 9 MMTV-PyMT Week9-19
SRR15860116 week 9 MMTV-PyMT Week9-2
SRR15860139 week 9 MMTV-PyMT Week9-20
SRR15860138 week 9 MMTV-PyMT Week9-21
SRR15860137 week 9 MMTV-PyMT Week9-22
SRR15860136 week 9 MMTV-PyMT Week9-23
SRR15860135 week 9 MMTV-PyMT Week9-24
SRR15860134 week 9 MMTV-PyMT Week9-25
SRR15860133 week 9 MMTV-PyMT Week9-26
SRR15860131 week 9 MMTV-PyMT Week9-27
SRR15860130 week 9 MMTV-PyMT Week9-28
SRR15860115 week 9 MMTV-PyMT Week9-3
SRR15860114 week 9 MMTV-PyMT Week9-4
SRR15860113 week 9 MMTV-PyMT Week9-5
SRR15860112 week 9 MMTV-PyMT Week9-6
SRR15860153 week 9 MMTV-PyMT Week9-7
SRR15860152 week 9 MMTV-PyMT Week9-8
SRR15860151 week 9 MMTV-PyMT Week9-9

可以看到,如果是按照第二列,这44个fastq文件,应该是属于4个10x样品,所以跑四次cellranger流程即可,但是如果你这样改名,去运行cellranger流程,就会报错。

为了解决这个问题,我首先对这44个fastq文件各自独立跑cellranger流程,得到的结果批量读取,代码如下所示:

代码语言:javascript复制
rm(list=ls())
library(data.table)
dir='/home/data/jmzeng/scRNA/mice-4-stage/tmp/matrix' 
# 这个文件夹下面是44个fastq文件各自独立跑cellranger流程的矩阵文件夹 
samples=list.files( dir  )
samples 
sceList = lapply(samples,function(pro){ 
  # pro=samples[1]
  folder=file.path( dir ,pro) 
  print(pro)
  print(folder)
  print(list.files(folder))
  sce=CreateSeuratObject(counts = Read10X(folder),
                         project =  pro ,
                         min.cells = 5,
                         min.features = 300)
  
  return(sce)
})
names(sceList) 
 

如果这44个fastq文件属于不同的10x样品,它们独立走cellranger流程的时候的细胞barcodes理论上应该是会有交集,所以我做了一个统计。

代码语言:javascript复制

out = do.call(rbind,
        lapply(1:length(sceList), function(i){
          do.call(rbind,
                  lapply(1:length(sceList), function(j){
                    sample1 <- as.data.frame(sceList[[i]]@meta.data)
                    sample1_barcodes <- as.numeric(length(sample1$orig.ident))
                    sample2 <- as.data.frame(sceList[[j]]@meta.data)
                    sample2_barcodes <- as.numeric(length(sample2$orig.ident))
                    both <- intersect(rownames(sample1),rownames(sample2))
                    both_barcodes <- as.numeric(length(both))
                    rate <- as.numeric(both_barcodes/(sample1_barcodes sample2_barcodes))
                    c(sample1_name=as.character(unique(sample1$orig.ident)),
                      sample1_barcodes=sample1_barcodes,
                      sample2_name=as.character(unique(sample2$orig.ident)),
                      sample2_barcodes=sample2_barcodes,
                      rate=rate)
                  }))
        }))
save(out,file = 'out.Rdata') 

对这44个fastq文件的单细胞矩阵进行组合比较:

代码语言:javascript复制
> head(out)
     sample1_name  sample1_barcodes sample2_name  sample2_barcodes rate                  
[1,] "SRR15860112" "2197"           "SRR15860112" "2197"           "0.5"                 
[2,] "SRR15860112" "2197"           "SRR15860113" "1923"           "0.465776699029126"   
[3,] "SRR15860112" "2197"           "SRR15860114" "2746"           "0.00101153145862836" 
[4,] "SRR15860112" "2197"           "SRR15860115" "2768"           "0.00100704934541793" 
[5,] "SRR15860112" "2197"           "SRR15860116" "2811"           "0.000998402555910543"
[6,] "SRR15860112" "2197"           "SRR15860117" "5100"           "0.00315198026586268"

如果两个结果配对的细胞barcodes重合度很高,就说明是同一个样品。如下所示,可以看到SRR15860112和SRR15860113就是同一个样品。

代码语言:javascript复制
library(tidyr)
outs <- as.data.frame(out[,-c(2,4)] )
library(reshape2)
out_wide <- dcast(outs,sample1_name~sample2_name) 
name=out_wide[,1]
out_wide=out_wide[,-1]
out_wide <- apply(out_wide,2,as.numeric)
rownames(out_wide) <- name

library(pheatmap)

如下所示:

有一些样品的细胞barcodes重合度很高

可以看到我们的44个fastq文件,应该是属于7个样品,所以我输出如下所示的文件列表:

代码语言:javascript复制
head(tmp)
                    srr hc     age
SRR15860126 SRR15860126  5 week 11
SRR15860127 SRR15860127  5 week 11
SRR15860128 SRR15860128  5 week 11
SRR15860129 SRR15860129  5 week 11
SRR15860117 SRR15860117  3 week 17
SRR15860118 SRR15860118  3 week 17

然后去改名即可:

代码语言:javascript复制
1_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860146_1.fastq.gz
1_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860146_2.fastq.gz
1_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860147_1.fastq.gz
1_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860147_2.fastq.gz
1_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860148_1.fastq.gz
1_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860148_2.fastq.gz
1_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860149_1.fastq.gz
1_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860149_2.fastq.gz
1_S1_L005_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860150_1.fastq.gz
1_S1_L005_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860150_2.fastq.gz
1_S1_L006_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860151_1.fastq.gz
1_S1_L006_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860151_2.fastq.gz
1_S1_L007_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860152_1.fastq.gz
1_S1_L007_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860152_2.fastq.gz
1_S1_L008_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860153_1.fastq.gz
1_S1_L008_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860153_2.fastq.gz
1_S2_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860112_1.fastq.gz
1_S2_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860112_2.fastq.gz
1_S2_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860113_1.fastq.gz
1_S2_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860113_2.fastq.gz
1_S2_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860144_1.fastq.gz
1_S2_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860144_2.fastq.gz
1_S2_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860145_1.fastq.gz
1_S2_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860145_2.fastq.gz

2_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860114_1.fastq.gz
2_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860114_2.fastq.gz
2_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860115_1.fastq.gz
2_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860115_2.fastq.gz
2_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860116_1.fastq.gz
2_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860116_2.fastq.gz
2_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860121_1.fastq.gz
2_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860121_2.fastq.gz

3_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860117_1.fastq.gz
3_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860117_2.fastq.gz
3_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860118_1.fastq.gz
3_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860118_2.fastq.gz
3_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860119_1.fastq.gz
3_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860119_2.fastq.gz
3_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860120_1.fastq.gz
3_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860120_2.fastq.gz

4_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860122_1.fastq.gz
4_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860122_2.fastq.gz
4_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860123_1.fastq.gz
4_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860123_2.fastq.gz
4_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860124_1.fastq.gz
4_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860124_2.fastq.gz
4_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860125_1.fastq.gz
4_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860125_2.fastq.gz

5_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860126_1.fastq.gz
5_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860126_2.fastq.gz
5_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860127_1.fastq.gz
5_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860127_2.fastq.gz
5_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860128_1.fastq.gz
5_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860128_2.fastq.gz
5_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860129_1.fastq.gz
5_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860129_2.fastq.gz

6_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860130_1.fastq.gz
6_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860130_2.fastq.gz
6_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860131_1.fastq.gz
6_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860131_2.fastq.gz
6_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860133_1.fastq.gz
6_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860133_2.fastq.gz
6_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860134_1.fastq.gz
6_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860134_2.fastq.gz
6_S1_L005_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860135_1.fastq.gz
6_S1_L005_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860135_2.fastq.gz
6_S1_L006_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860136_1.fastq.gz
6_S1_L006_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860136_2.fastq.gz
6_S1_L007_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860137_1.fastq.gz
6_S1_L007_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860137_2.fastq.gz
6_S1_L008_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860138_1.fastq.gz
6_S1_L008_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860138_2.fastq.gz
6_S2_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860139_1.fastq.gz
6_S2_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860139_2.fastq.gz
6_S2_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860140_1.fastq.gz
6_S2_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860140_2.fastq.gz
6_S2_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860141_1.fastq.gz
6_S2_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860141_2.fastq.gz
6_S2_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860142_1.fastq.gz
6_S2_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860142_2.fastq.gz

7_S1_L001_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860132_1.fastq.gz
7_S1_L001_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860132_2.fastq.gz
7_S1_L002_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860143_1.fastq.gz
7_S1_L002_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860143_2.fastq.gz
7_S1_L003_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860154_1.fastq.gz
7_S1_L003_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860154_2.fastq.gz
7_S1_L004_R1_001.fastq.gz -> /home/PRJNA762594/SRR15860155_1.fastq.gz
7_S1_L004_R2_001.fastq.gz -> /home/PRJNA762594/SRR15860155_2.fastq.gz

然后,这样的文件,就可以跑cellranger流程流程啦。

最后得到的降维聚类分群和生物学命名,如下所示:

image-20220223150214864

如果是单细胞常规分析可以参考前面的例子:人人都能学会的单细胞聚类分群注释 ,我们演示了第一层次的分群。如果你对单细胞数据分析还没有基础认知,可以看基础10讲:

  • 01. 上游分析流程
  • 02.课题多少个样品,测序数据量如何
  • 03. 过滤不合格细胞和基因(数据质控很重要)
  • 04. 过滤线粒体核糖体基因
  • 05. 去除细胞效应和基因效应
  • 06.单细胞转录组数据的降维聚类分群
  • 07.单细胞转录组数据处理之细胞亚群注释
  • 08.把拿到的亚群进行更细致的分群
  • 09.单细胞转录组数据处理之细胞亚群比例比较

0 人点赞