拟南芥有参转录组数据处理,基本流程是
- 1 转录组数据过滤
- 2 转录组数据与参考基因组比对
- 3 基于比对结果计算count值
- 4 基于count值进行差异表达分析
准备数据
参考基因组 拟南芥基因组 来源于 论文 The genetic and epigenetic landscape of the Arabidopsis centromeres。下载链接 https://github.com/schatzlab/Col-CEN/tree/main/v1.2
只保留染色体序列,我使用Heliex这个工具对蛋白编码基因进行了注释,获得蛋白编码基因注释的gff格式文件
转录组数据来源于论文 Rapid and Dynamic Alternative Splicing Impacts the Arabidopsis Cold Response Transcriptome
下载 时间点8 和时间点 14 的转录组数据
ERR1886187,ERR1886352,ERR1886283 timepoint8 ERR1886289,ERR1886358,ERR1886193 timepoint14
这里的数据值选取了总数据的百分之一
准备好三个输入数据
- 1 参考基因组的fasta文件
- 2 参考基因组的蛋白编码基因注释文件 gff格式
- 3 转录组测序数据 fastq 格式
安装软件
这里安装软件都用conda来安装,需要用到的软件有
- fastp
- hisat2
- samtools
- stringtie
- R语言 tidyverse包(整理数据)DEseq2 包 (差异表达分析)
数据处理
第一步:原始数据过滤
代码语言:javascript复制fastp -i reads/ERR1886187_subset_1.fastq.gz -I reads/ERR1886187_subset_2.fastq.gz -o 01.clean.reads/ERR1886187_1.fq.gz -O 01.clean.reads/ERR1886187_2.fq.gz -j 01.fastp.output/ERR1886187.json -h 01.fastp.output/ERR1886187.html -w 1
fastp -i reads/ERR1886193_subset_1.fastq.gz -I reads/ERR1886193_subset_2.fastq.gz -o 01.clean.reads/ERR1886193_1.fq.gz -O 01.clean.reads/ERR1886193_2.fq.gz -j 01.fastp.output/ERR1886193.json -h 01.fastp.output/ERR1886193.html -w 1
fastp -i reads/ERR1886283_subset_1.fastq.gz -I reads/ERR1886283_subset_2.fastq.gz -o 01.clean.reads/ERR1886283_1.fq.gz -O 01.clean.reads/ERR1886283_2.fq.gz -j 01.fastp.output/ERR1886283.json -h 01.fastp.output/ERR1886283.html -w 1
fastp -i reads/ERR1886289_subset_1.fastq.gz -I reads/ERR1886289_subset_2.fastq.gz -o 01.clean.reads/ERR1886289_1.fq.gz -O 01.clean.reads/ERR1886289_2.fq.gz -j 01.fastp.output/ERR1886289.json -h 01.fastp.output/ERR1886289.html -w 1
fastp -i reads/ERR1886352_subset_1.fastq.gz -I reads/ERR1886352_subset_2.fastq.gz -o 01.clean.reads/ERR1886352_1.fq.gz -O 01.clean.reads/ERR1886352_2.fq.gz -j 01.fastp.output/ERR1886352.json -h 01.fastp.output/ERR1886352.html -w 1
fastp -i reads/ERR1886358_subset_1.fastq.gz -I reads/ERR1886358_subset_2.fastq.gz -o 01.clean.reads/ERR1886358_1.fq.gz -O 01.clean.reads/ERR1886358_2.fq.gz -j 01.fastp.output/ERR1886358.json -h 01.fastp.output/ERR1886358.html -w 1
第二步:对参考基因组构建索引
代码语言:javascript复制hisat2-build fa/at.col0.chr.fna index/at
第三步:转录组数据与参考基因组进行比对
代码语言:javascript复制hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886187_1.fq.gz -2 01.clean.reads/ERR1886187_2.fq.gz -S 02.sam/ERR1886187.sam
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886193_1.fq.gz -2 01.clean.reads/ERR1886193_2.fq.gz -S 02.sam/ERR1886193.sam
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886283_1.fq.gz -2 01.clean.reads/ERR1886283_2.fq.gz -S 02.sam/ERR1886283.sam
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886289_1.fq.gz -2 01.clean.reads/ERR1886289_2.fq.gz -S 02.sam/ERR1886289.sam
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886352_1.fq.gz -2 01.clean.reads/ERR1886352_2.fq.gz -S 02.sam/ERR1886352.sam
hisat2 -p 1 --dta -x ref/index/at -1 01.clean.reads/ERR1886358_1.fq.gz -2 01.clean.reads/ERR1886358_2.fq.gz -S 02.sam/ERR1886358.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886187.sorted.bam 02.sam/ERR1886187.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886193.sorted.bam 02.sam/ERR1886193.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886283.sorted.bam 02.sam/ERR1886283.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886289.sorted.bam 02.sam/ERR1886289.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886352.sorted.bam 02.sam/ERR1886352.sam
samtools sort -@ 1 -O BAM -o 03.sorted.bam/ERR1886358.sorted.bam 02.sam/ERR1886358.sam
第四步:计算基因表达量
代码语言:javascript复制stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886187/ERR1886187.gtf -A 04.stringtie.output/ERR1886187/ERR1886187.gene_abund.tsv 03.sorted.bam/ERR1886187.sorted.bam
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886193/ERR1886193.gtf -A 04.stringtie.output/ERR1886193/ERR1886193.gene_abund.tsv 03.sorted.bam/ERR1886193.sorted.bam
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886283/ERR1886283.gtf -A 04.stringtie.output/ERR1886283/ERR1886283.gene_abund.tsv 03.sorted.bam/ERR1886283.sorted.bam
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886289/ERR1886289.gtf -A 04.stringtie.output/ERR1886289/ERR1886289.gene_abund.tsv 03.sorted.bam/ERR1886289.sorted.bam
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886352/ERR1886352.gtf -A 04.stringtie.output/ERR1886352/ERR1886352.gene_abund.tsv 03.sorted.bam/ERR1886352.sorted.bam
stringtie -p 1 -G ref/at.col0.chr.gff -e -B -o 04.stringtie.output/ERR1886358/ERR1886358.gtf -A 04.stringtie.output/ERR1886358/ERR1886358.gene_abund.tsv 03.sorted.bam/ERR1886358.sorted.bam
第五步:合并基因表达量
代码语言:javascript复制Rscript getCounts.R
getCounts.R 的内容
代码语言:javascript复制library(tidyverse)
myfun<-function(x){
return(read_tsv(x) %>%
select(`Gene ID`,`Coverage`) %>%
mutate(sampleName=str_extract(x,pattern = "ERR[0-9] ")))
}
list.files("04.stringtie.output",
pattern = "gene_abund.tsv",
recursive = TRUE,
full.names = TRUE) %>%
map(myfun) %>%
bind_rows() %>%
pivot_wider(names_from = "sampleName",values_from = "Coverage") %>%
write_tsv("counts.tsv")
第六步:差异表达分析
代码语言:javascript复制library(tidyverse)
library(DESeq2)
mycounts<-read_tsv("D:/Jupyter/practice/RNAseqAT/counts.tsv") %>%
column_to_rownames("Gene ID") %>%
round()
mycounts %>%
head()
mycounts %>% dim()
mycounts_1<-mycounts[rowSums(mycounts) != 0,]
dim(mycounts_1)
mymeta<-read_csv("D:/Jupyter/practice/RNAseqAT/mymeta.csv")
colnames(mycounts_1) == mymeta$id
mycounts_1 %>%
select(mymeta$id) -> mycounts_1
colnames(mycounts_1) == mymeta$id
dds <- DESeqDataSetFromMatrix(countData=mycounts_1,
colData=mymeta,
design=~dex)
dds <- DESeq(dds)
res <- results(dds)
res
res %>%
as.data.frame() %>%
mutate(group = case_when(
log2FoldChange >= 1 & padj <= 0.05 ~ "UP",
log2FoldChange <= -1 & padj <= 0.05 ~ "DOWN",
TRUE ~ "NOT_CHANGE"
)) %>%
pull(group) %>% table()
res %>%
as.data.frame() %>% head()
res %>%
as.data.frame() %>%
mutate(group = case_when(
log2FoldChange >= 1 & padj <= 0.05 ~ "UP",
log2FoldChange <= -1 & padj <= 0.05 ~ "DOWN",
TRUE ~ "NOT_CHANGE"
)) %>%
ggplot(aes(x=log2FoldChange,y=-log10(padj)))
geom_point(aes(fill=group),size=8,shape=21,
color="gray",
alpha=0.6)
scale_fill_manual(values = c("UP"="red",
"DOWN"="blue",
"NOT_CHANGE"="gray"))
theme_bw(base_size = 20)
theme(panel.grid = element_blank(),
legend.position = "top")