Table of Contents
1. Module 1 - Introduction to RNA sequencing
- Installation
- Reference Genomes
- Annotations
- Indexing
- RNA-seq Data
- Pre-Alignment QC
2. Module 2 - RNA-seq Alignment and Visualization
- Adapter Trim
- Alignment
- IGV
- Alignment Visualization
- Alignment QC
3. Module 3 - Expression and Differential Expression
- Expression
- Differential Expression
- DE Visualization
- Kallisto for Reference-Free Abundance Estimation
4. Module 4 - Isoform Discovery and Alternative Expression
- Reference Guided Transcript Assembly
- de novo Transcript Assembly
- Transcript Assembly Merge
- Differential Splicing
- Splicing Visualization
5. Module 5 - De novo transcript reconstruction
- De novo RNA-Seq Assembly and Analysis Using Trinity
6. Module 6 - Functional Annotation of Transcripts
- Functional Annotation of Assembled Transcripts Using Trinotate
1.2 Reference Genomes
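Obtain a reference genome from Ensembl, iGenomes, NCBI, or UCSC. In this example analysis we will use the human GRCh38 release of the genome from Ensembl. Furthermore, we will actually perform the analysis using only a single chromosome (chr22) and the ERCC spike-in sequences so that it runs faster...
Create the necessary working directory.
mkdir -p RNA_ref
The complete data can be found at ftp://ftp.ensembl.org/pub/release-86/fasta/homo_sapiens/dna/. You could use wget to download the Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz file and then decompress it.
cd RNA_ref
wget http://genomedata.org/rnaseq-tutorial/fasta/GRCh38/chr22_with_ERCC92.fa
ls
View the first 10 lines of this file. Why does it look like this?
head chr22_with_ERCC92.fa
How many lines and characters are in this file? How long is this chromosome (in bases and Mbp)?
wc chr22_with_ERCC92.fa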
848761 848764 51751056 chr22_with_ERCC92.fa
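The character count reported by `wc` includes the header lines and the newline at the end of every line, so it overstates the sequence length. The following is a minimal sketch, assuming an ordinary multi-line FASTA layout, of how per-sequence lengths could be computed with `awk`. The function name `fasta_lengths` and the tiny `demo.fa` file are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: print "name<TAB>length" for every sequence in a FASTA file.
fasta_lengths() {
  awk '
    /^>/ {                                   # a header line starts a new record
      if (name != "") print name "\t" len
      name = substr($1, 2); len = 0          # keep the ID up to the first space
      next
    }
    { len += length($0) }                    # sequence line: add its characters
    END { if (name != "") print name "\t" len }
  ' "$1"
}

# Tiny made-up FASTA used only to demonstrate the function.
demo_dir=$(mktemp -d)
printf '>22 demo chromosome\nACGTAC\nGTAC\n>ERCC-00002\nACGTN\n' > "$demo_dir/demo.fa"

lengths=$(fasta_lengths "$demo_dir/demo.fa")
echo "$lengths"
```

Run on the tutorial file as `fasta_lengths chr22_with_ERCC92.fa | head -n 1`, this should report the chr22 length in bases; dividing by 1,000,000 gives Mbp.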
View 10 lines from approximately the middle of this file. What is the significance of the uppercase and lowercase characters?
head -n 425000 chr22_with_ERCC92.fa | tail
ggaggctgaggcaggagaatcgcttgaacatgggaggtggaagttgcagtgagccgaaac
tgcgccattgcactatagcctgggcaacaagagtgaaagtctgtcttgaaaaaaaaaaaT
CAGATGTTCTATGTAAAAATGCTATCTAtgattgaagtataaaactttacctccctttat
gttcctttgccctccccactatttattattgtcttgattatatcttctatatgcattgag
aggtgttataacttttgtatcaatcaccaaatttaatttagaaaatataagaggagaaga
aaagtctattacatttactcatatttttgcttactgtgttctttcttccttcttgatgtt
ccagaatttcttttattgcttcttttctgcttagaaaactttatctttttctttcatctt
tcttttttcctcctcctcctcctcctcctttttttttttttttttttttttttttttaat
aaagagacagggtctcactctatcacccagactggagttcagtgatgcaatcatagctca
ttgcaaccttgaactcctgggctcaagtgatcctcccacctcagcctcctgagtagctgg
What is the count of each base in the entire reference genome file (skipping the header line of each sequence)?
cat chr22_with_ERCC92.fa | grep -v ">" | perl -ne 'chomp $_; $bases{$_}++ for split //; if (eof){print "$_ $bases{$_}\n" for sort keys %bases}'
A 4455938
C 4406493
G 4411768
N 10710000
T 4445994
Y 1
a 5950524
c 4772185
g 4853055
n 948691
t 5946575
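Keep in mind that the names of the reference sequences (chromosomes) must match those used in the annotation GTF file (described in the next section).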
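One way to check this requirement is to compare the sequence names in the FASTA headers with the first column of the GTF. The sketch below is only an illustration under stated assumptions: the helper name `names_missing_from_gtf` and both demo files are hypothetical, and the GTF is assumed to be tab-separated with optional `#` comment lines.

```shell
# Hypothetical helper: list FASTA sequence names that never appear in column 1 of a GTF.
names_missing_from_gtf() {
  fasta=$1
  gtf=$2
  tmp=$(mktemp -d)
  # FASTA names: text after ">" up to the first whitespace.
  grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmp/fa.names"
  # GTF names: first tab-separated column, ignoring "#" comment lines.
  grep -v '^#' "$gtf" | cut -f 1 | sort -u > "$tmp/gtf.names"
  # comm -23 keeps only the lines unique to the first (FASTA) list.
  comm -23 "$tmp/fa.names" "$tmp/gtf.names"
  rm -r "$tmp"
}

# Made-up demo files: the FASTA calls the chromosome "22", the GTF calls it "chr22".
demo_dir=$(mktemp -d)
printf '>22 demo\nACGT\n>ERCC-00002\nACGT\n' > "$demo_dir/demo.fa"
printf 'chr22\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "g1";\nERCC-00002\tdemo\texon\t1\t4\t.\t+\t.\tgene_id "g2";\n' > "$demo_dir/demo.gtf"

missing=$(names_missing_from_gtf "$demo_dir/demo.fa" "$demo_dir/demo.gtf")
echo "$missing"
```

An empty result means every FASTA name also occurs in the GTF. In the demo, `22` is reported because the two files use different naming styles for the same chromosome.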
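Exercise 2
How many bases on chromosome 22 correspond to repetitive elements? What percentage of the whole length is that? In this soft-masked (dna_sm) file, repeats are written in lowercase, so counting the lowercase a, c, g, and t characters gives the number of repeat bases.
cat chr22_with_ERCC92.fa | perl -ne 'if ($_ =~ />22/){$chr22=1}; if ($_ =~ />ERCC/){$chr22=0}; if ($chr22){print "$_";}' > chr22_only.fa
cat chr22_only.fa | grep -v ">" | perl -ne 'chomp $_; $r += $_ =~ tr/a/A/; $r += $_ =~ tr/c/C/; $r += $_ =~ tr/g/G/; $r += $_ =~ tr/t/T/; $l += length($_); if (eof){$p = sprintf("%.2f", ($r/$l)*100); print "\nrepeat bases = $r\ntotal bases = $l\npercent repeat bases = $p%\n\n"}'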
repeat bases = 21522339
total bases = 50818468
percent repeat bases = 42.35%
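The repeat fraction can be cross-checked without Perl. The following is a minimal sketch using `grep`, `tr`, `wc`, and `awk`; the function name `softmasked_percent` and the demo file are hypothetical. Like the command above, it counts only lowercase a, c, g, and t as repeat bases and counts every sequence character (including N) toward the total.

```shell
# Hypothetical helper: print "<masked bases> <total bases> <percent masked>".
softmasked_percent() {
  # Drop header lines and newlines so that only sequence characters remain.
  total=$(grep -v '^>' "$1" | tr -d '\n' | wc -c | tr -d ' ')
  # Delete everything except lowercase a, c, g, t and count what is left.
  masked=$(grep -v '^>' "$1" | tr -cd 'acgt' | wc -c | tr -d ' ')
  awk -v m="$masked" -v t="$total" 'BEGIN { printf "%d %d %.2f\n", m, t, 100 * m / t }'
}

# Made-up demo sequence: 16 characters, 6 of them lowercase a/c/g/t.
demo_dir=$(mktemp -d)
printf '>22 demo\nACGTacgt\nNNnnacGT\n' > "$demo_dir/demo.fa"

result=$(softmasked_percent "$demo_dir/demo.fa")
echo "$result"
```

Applied to `chr22_only.fa`, the three numbers should agree with the `repeat bases`, `total bases`, and `percent repeat bases` values printed by the Perl command.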
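How many EcoRI restriction sites occur in the chromosome 22 sequence? The EcoRI restriction enzyme recognition sequence is 5'-GAATTC-3'. Note that the first Perl command below prints the original line ($_) rather than the uppercased copy ($s), so only uppercase (not soft-masked) occurrences are counted.
cat chr22_only.fa | grep -v ">" | perl -ne 'chomp $_; $s = uc($_); print $_;' | perl -ne '$c += $_ =~ s/GAATTC/XXXXXX/g; if (eof){print "\nEcoRI site (GAATTC) count = $c\n\n";}'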
EcoRI site (GAATTC) count = 3935
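If sites inside soft-masked (lowercase) sequence should be counted as well, the sequence can be uppercased before matching. The sketch below is one possible way to do that and is not the tutorial's own method. The function name `count_motif` and the demo file are hypothetical, and `grep -o` is assumed to be available (it is in GNU and BSD grep but is not required by POSIX).

```shell
# Hypothetical helper: count non-overlapping, case-insensitive occurrences of a motif.
# $1 = FASTA file, $2 = motif written in uppercase
count_motif() {
  grep -v '^>' "$1" |            # drop header lines
    tr -d '\n' |                 # join the lines so sites spanning a line break are found
    tr 'acgtn' 'ACGTN' |         # uppercase soft-masked bases
    grep -o "$2" |               # print each match on its own line
    wc -l | tr -d ' '            # count the matches and strip any padding
}

# Made-up demo: one site spans a line break, another is lowercase.
demo_dir=$(mktemp -d)
printf '>22 demo\nccGAAT\nTCaagaattcGG\n' > "$demo_dir/demo.fa"

n=$(count_motif "$demo_dir/demo.fa" GAATTC)
echo "EcoRI site (GAATTC) count = $n"
```

On real data this count can be higher than the case-sensitive count whenever recognition sites fall within soft-masked regions.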