除夕夜我们介绍了 一个10x单细胞转录组项目从fastq到细胞亚群,演示的10x单细胞转录组项目是人类数据,但是也有不少朋友表示他们的物种比较特殊,需要额外构建index而不能从10x的官网下载,我们也演示一下,争取可以帮助大家。
10x单细胞转录组数据分析所需要的参考数据文件主要是基因组的fasta文件和基因注释gtf文件,其官网有详细的例子:https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr
官网有两个例子:
- Reference build instructions: Rhesus macaque
- Reference build instructions: Norwegian rat
下面我们以大鼠和猪为例子来演示:
咦 rat ftp.ensembl.org 为关键词组合,可以进入 不同物种的官网,比如这个大鼠:https://asia.ensembl.org/Rattus_norvegicus/Info/Index
可以看到其基因组的fasta文件和基因注释gtf文件的不同下载地址:
- http://ftp.ensembl.org/pub/release-105/fasta/rattus_norvegicus/dna/
- http://ftp.ensembl.org/pub/release-105/gtf/rattus_norvegicus/
规律非常明显, 对猪来说也是如此,不同物种有不同的拉丁名而已。前单细胞转录组以10X公司为主流,我们也是在单细胞天地公众号详细介绍了cellranger全部使用细节及流程,大家可以自行前往学习,如下:
- 单细胞实战(一)数据下载
- 单细胞实战(二) cell ranger使用前注意事项
- 单细胞实战(三) Cell Ranger使用初探
- 单细胞实战(四) Cell Ranger流程概览
- 单细胞实战(五) 理解cellranger count的结果
但是这个两年前的系列笔记是基于V2,V3版本的cellranger,在2020的7月我看到了其更新到了V4,也里面写了一个总结,见:cellranger更新到4啦(全新使用教程)
如果是从头开始构建index,每个物种的两个文件(基因组的fasta文件和基因注释gtf文件)都需要下载,代码如下所示:
代码语言:javascript复制nohup wget http://ftp.ensembl.org/pub/release-105/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz &
nohup wget http://ftp.ensembl.org/pub/release-105/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.105.gtf.gz &
# 两个gz文件都需要 解压, 命令是 gunzip
如果物种是猪,下载基因组及注释文件的代码如下所示:
代码语言:javascript复制wget -c http://ftp.ensembl.org/pub/release-105/fasta/sus_scrofa/dna/Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
wget -c http://ftp.ensembl.org/pub/release-105/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.105.gtf.gz
wget -c http://ftp.ensembl.org/pub/release-105/gtf/sus_scrofa/Sus_scrofa.Sscrofa11.1.105.chr.gtf.gz
gunzip Sus_scrofa.Sscrofa11.1.dna.toplevel.fa.gz
gunzip Sus_scrofa.Sscrofa11.1.105.gtf.gz
gunzip Sus_scrofa.Sscrofa11.1.105.chr.gtf.gz
其它物种类似,都是去 ftp.ensembl.org 搜索各个物种对应的两个文件(基因组的fasta文件和基因注释gtf文件),下载并且解压即可。
首先过滤gtf文件
参考官网的的标准代码即可,如下所示:
代码语言:javascript复制cellranger=/home/data/bylin/cellranger-6.1.2/cellranger
$cellranger mkgtf
Rattus_norvegicus.mRatBN7.2.105.gtf Rattus_norvegicus.mRatBN7.2.105.filtered.gtf
--attribute=gene_biotype:protein_coding
--attribute=gene_biotype:lincRNA
--attribute=gene_biotype:antisense
--attribute=gene_biotype:IG_LV_gene
--attribute=gene_biotype:IG_V_gene
--attribute=gene_biotype:IG_V_pseudogene
--attribute=gene_biotype:IG_D_gene
--attribute=gene_biotype:IG_J_gene
--attribute=gene_biotype:IG_J_pseudogene
--attribute=gene_biotype:IG_C_gene
--attribute=gene_biotype:IG_C_pseudogene
--attribute=gene_biotype:TR_V_gene
--attribute=gene_biotype:TR_V_pseudogene
--attribute=gene_biotype:TR_D_gene
--attribute=gene_biotype:TR_J_gene
--attribute=gene_biotype:TR_J_pseudogene
--attribute=gene_biotype:TR_C_gene
可以看到这个过滤步骤前后其实gtf文件变化并不多。
然后构建参考基因组index
标准代码即可, 需要使用上一步过滤好的gtf文件,加上前面下载的参考基因组fasta文件哦,如下所示:
代码语言:javascript复制$cellranger mkref
--genome=mRatBN7
--fasta=Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa
--genes=Rattus_norvegicus.mRatBN7.2.105.filtered.gtf
--ref-version=1.0.0
如下所示的log日志:
代码语言:javascript复制['/home/data/bylin/cellranger-6.1.2/bin/rna/mkref', '--genome=mRatBN7', '--fasta=Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa', '--genes=Rattus_norvegicus.mRatBN7.2.105.filtered.gtf', '--ref-version=1.0.0']
Creating new reference folder at /home/data/x9/zhao/rat_ref/mRatBN7
...done
Writing genome FASTA file into reference folder...
...done
Indexing genome FASTA file...
...done
Writing genes GTF file into reference folder...
...done
Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
Feb 01 09:40:54 ..... started STAR run
Feb 01 09:40:54 ... starting to generate Genome files
Feb 01 09:42:08 ... starting to sort Suffix Array. This may take a long time...
Feb 01 09:42:14 ... sorting Suffix Array chunks and saving them to disk...
Feb 01 10:17:13 ... completed Suffix Array index
Feb 01 10:17:13 ..... processing annotations GTF
Feb 01 10:17:23 ..... inserting junctions into the genome indices
Feb 01 10:24:00 ... writing Genome to disk ...
Feb 01 10:24:03 ... writing Suffix Array to disk ...
Feb 01 10:24:10 ... writing SAindex to disk
Feb 01 10:24:12 ..... finished successfully
...done.
Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done
Computing hash of genes GTF file...
...done
...done
>>> Reference successfully created! <<<
You can now specify this reference on the command line:
cellranger --transcriptome=/home/data/x9/zhao/rat_ref/mRatBN7
是不是很简单啊, 下次我们跑cellranger流程的时候就可以直接使用 /home/data/x9/zhao/rat_ref/mRatBN7 这个index的路径地址哦。参考前面的 一个10x单细胞转录组项目从fastq到细胞亚群,
查看构建好的index
如下所示:
代码语言:javascript复制 tree -h /home/data/x9/zhao/rat_ref/mRatBN7
/home/data/x9/zhao/rat_ref/mRatBN7
├── [4.0K] fasta
│ ├── [2.5G] genome.fa
│ └── [6.5K] genome.fa.fai
├── [4.0K] genes
│ └── [ 19M] genes.gtf.gz
├── [ 456] reference.json
└── [4.0K] star
├── [1.1K] chrLength.txt
├── [3.6K] chrNameLength.txt
├── [2.5K] chrName.txt
├── [1.9K] chrStart.txt
├── [ 17M] exonGeTrInfo.tab
├── [7.5M] exonInfo.tab
├── [970K] geneInfo.tab
├── [2.5G] Genome
├── [ 815] genomeParameters.txt
├── [7.1G] SA
├── [1.5G] SAindex
├── [6.2M] sjdbInfo.txt
├── [6.1M] sjdbList.fromGTF.out.tab
├── [4.9M] sjdbList.out.tab
└── [3.0M] transcriptInfo.tab
3 directories, 19 files
大家也可以去看看自己的物种index文件构建情况哦!
如果是猪,全部的流程如下所示:
代码语言:javascript复制bin=/home/x10/pipeline/cellranger-6.0.0/bin/cellranger
$bin mkref
--genome=Sus_scrofa_genome
--fasta=Sus_scrofa.Sscrofa11.1.dna.toplevel.fa
--genes=Sus_scrofa.Sscrofa11.1.105.chr.gtf
输出结果logo日志如下所示:
代码语言:javascript复制['/home/data/x10/pipeline/cellranger-6.0.0/bin/rna/mkref', '--genome=Sus_scrofa_genome', '--fasta=Sus_scrofa.Sscrofa11.1.dna.toplevel.fa', '--genes=Sus_scrofa.Sscrofa11.1.105.chr.gtf']
Jan 23 23:09:57 ..... started STAR run
Jan 23 23:09:57 ... starting to generate Genome files
Jan 23 23:11:08 ... starting to sort Suffix Array. This may take a long time...
Jan 23 23:11:13 ... sorting Suffix Array chunks and saving them to disk...
Jan 23 23:36:05 ... loading chunks from disk, packing SA...
Jan 23 23:36:29 ... finished generating suffix array
Jan 23 23:36:29 ... generating Suffix Array index
Jan 23 23:38:43 ... completed Suffix Array index
Jan 23 23:38:43 ..... processing annotations GTF
Jan 23 23:38:55 ..... inserting junctions into the genome indices
Jan 23 23:45:22 ... writing Genome to disk ...
Jan 23 23:45:24 ... writing Suffix Array to disk ...
Jan 23 23:45:31 ... writing SAindex to disk
Jan 23 23:45:33 ..... finished successfully
Creating new reference folder at /home/data/x10/run/pig/database/Sus_scrofa_genome
...done
Writing genome FASTA file into reference folder...
...done
Indexing genome FASTA file...
...done
Writing genes GTF file into reference folder...
...done
Generating STAR genome index (may take over 8 core hours for a 3Gb genome)...
...done.
Writing genome metadata JSON file into reference folder...
Computing hash of genome FASTA file...
...done
Computing hash of genes GTF file...
...done
...done
>>> Reference successfully created! <<<
You can now specify this reference on the command line:
cellranger --transcriptome=/home/data/x10/run/pig/database/Sus_scrofa_genome ...
大同小异的,所以这个数据分析流程主要是取决于物种的两个文件(基因组的fasta文件和基因注释gtf文件),如果物种都没有比较好的注释,就麻烦了。