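Most eukaryotic genes are split genes: their coding sequences are usually interrupted by introns. The intron-exon boundaries and the sequences flanking them are short, conserved nucleotide motifs within the pre-mRNA.
The 5' splice site of an intron begins with GU and is called the donor site; the 3' splice site ends with AG and is called the acceptor site. A third element is the branch point, located inside the intron near its 3' end; it is usually an A and is followed by a polypyrimidine tract.
When analyzing genomic data, one often needs to predict how a gene's RNA is spliced, including alternative splicing, that is, the positions and number of its introns and exons. These predictions rest on the conserved GU-AG rule of RNA splicing; combined with supporting evidence such as ORFs and BLAST results, the rule makes it possible to predict the mature mRNA of an uncharacterized gene.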
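As a rough illustration of the signals described above, the following Python sketch checks whether a given intron sequence starts with GU and ends with AG, and measures how pyrimidine-rich the stretch just upstream of the acceptor is. The function name `describe_intron`, the `tract_window` parameter, and the example sequence are all made up for this illustration; it is not how any of the prediction tools score splice sites.

```python
def describe_intron(intron: str, tract_window: int = 10) -> dict:
    """Report the terminal dinucleotides of an intron and the pyrimidine
    content of the region immediately upstream of the 3' AG.

    `tract_window` is an arbitrary illustrative choice, not a biological constant.
    """
    # Work in DNA letters so that RNA (GU) and DNA (GT) input are treated alike.
    seq = intron.upper().replace("U", "T")
    donor = seq[:2]        # first two bases: the 5' splice site
    acceptor = seq[-2:]    # last two bases: the 3' splice site
    # Bases just before the acceptor dinucleotide, where the
    # polypyrimidine tract is expected.
    upstream = seq[-(tract_window + 2):-2]
    pyr = sum(base in "CT" for base in upstream)
    return {
        "donor": donor,
        "acceptor": acceptor,
        "follows_gt_ag_rule": donor == "GT" and acceptor == "AG",
        "pyrimidine_fraction": pyr / len(upstream) if upstream else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical intron: GU at the 5' end, a pyrimidine-rich stretch, AG at the 3' end.
    example = "GUAAGUAUCCUACUAACUUUUCCUUUCUCUCAG"
    print(describe_intron(example))
```

A real predictor would weigh many more positions around each site; this only makes the GU...AG pattern and the polypyrimidine tract concrete.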
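To see why the GU-AG rule needs supporting evidence, here is a minimal Python sketch that lists every interval of a genomic sequence that begins with GT and ends with AG. The function name `candidate_introns` and the length limits `min_len` and `max_len` are assumptions made for this example; real tools use statistical models rather than exhaustive pairing.

```python
import re
from typing import List, Tuple


def candidate_introns(genomic: str, min_len: int = 60,
                      max_len: int = 10000) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, 0-based with exclusive end, such that
    genomic[start:end] begins with GT and ends with AG.

    min_len and max_len are illustrative thresholds, not fixed biological limits.
    """
    seq = genomic.upper().replace("U", "T")
    # Lookahead patterns so that overlapping matches are all reported.
    donors = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptor_ends = [m.start() + 2 for m in re.finditer(r"(?=AG)", seq)]
    pairs = []
    for start in donors:
        for end in acceptor_ends:
            if min_len <= end - start <= max_len:
                pairs.append((start, end))
    return pairs


if __name__ == "__main__":
    # Toy sequence; min_len is lowered only so the short example yields hits.
    toy = "ATGGTAAGTCCCCAGGCC"
    for start, end in candidate_introns(toy, min_len=4):
        print(start, end, toy[start:end])
```

Even on short sequences the number of GT...AG pairs grows quickly, which is why the candidates have to be filtered with additional evidence such as an intact ORF or alignment to known transcripts.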
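Prediction tools
- For a genomic nucleotide sequence, splice sites and introns can be predicted directly with NetGene2 or Splice View.
- For an mRNA/cDNA sequence, the gene structure has to be inferred by aligning it to the corresponding genomic sequence with tools such as Splign, SIM4, BLAT, or BLAST.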
- The Human Splicing Finder (HSF)
Example: prediction with NCBI Splign
Reference manual
1 Identifying the exon composition of an mRNA with Splign
Alternatively:
image.png
- Navigate to the Online page using the menu at the top of the page.
- Type or copy/paste your input sequences into the cDNA and Genomic text areas. Sequences in each box can be specified as identifiers (accessions or GIs) or in FASTA format. Entering both FASTA data and identifiers in the same entry will generate an error. You can specify up to five cDNA sequences at a time, but only one genomic sequence.
- Check the "Reverse and complement the query" box if you want your cDNA to be aligned in antisense. For example, EST sequences are often not guaranteed to have a sense orientation.
- Check "Cross-species mode" if your cDNA and genomic sequences are from different species. Internally, the cross-species mode means less stringent BLAST hits.
- Upon job submission, results will appear in a few seconds or more, depending primarily on the lengths and the number of sequences being spligned. Since fetching large chromosomal sequences (like whole-length human chromosomes) and running BLAST on them can be time-consuming, consider specifying shorter genomic sequences such as contigs. Smaller chromosomal sequences (e.g. Drosophila chromosomes) are OK.
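For readers unsure what aligning a cDNA "in antisense" involves, the operation named by the "Reverse and complement the query" option can be sketched in Python as below. This only illustrates the sequence transformation; the helper name `reverse_complement` is invented here, and the sketch says nothing about how Splign implements the option internally.

```python
# Translation table for complementing DNA; U is treated like T, and N stays N.
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a nucleotide sequence as DNA letters."""
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    est = "AAGTCCTG"  # hypothetical EST fragment of unknown orientation
    print(est, "->", reverse_complement(est))
```

If an EST of unknown orientation does not align as entered, trying its reverse complement is the usual next step, which is what the checkbox does for you.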
The results look like this:
image.png
Interpreting the results
For details, see https://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi?textpage=documentation
- Plus (sense) and minus signs next to accessions indicate orientations in which the sequences were aligned. The remaining columns are explained below:
image.png
image.png