软件首页:http://asia.ensembl.org/info/docs/tools/vep/script/vep_tutorial.html
安装
软件为 Perl 脚本,依赖:
•gcc, g and make•Perl version 5.10 or above recommended (tested on 5.10, 5.14, 5.18, 5.22, 5.26)•Perl packages:•Archive::Zip•DBD::mysql•DBI
从 GitHub 下载即可使用:
git clone https://github.com/Ensembl/ensembl-vep.gitcd ensembl-vep
建议直接用 docker 安装(下面的步骤也以 docker 为例):
docker pull ensemblorg/ensembl-vep
创建文件夹以存放数据库:
# 在宿主机上创建目录mkdir $HOME/vep_data
# 确保 docker 容器有读写权限so the docker container can write in the directory (VEP output):chmod a rwx $HOME/vep_data
下载数据库
•下载 GRCh38 的注释数据和 fasta:
docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cf -s homo_sapiens -y GRCh38
•下载插件:
# 下载所有插件docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cfp -s homo_sapiens -y GRCh38 -g all# 只下载特定插件docker run -t -i -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep perl INSTALL.pl -a cfp -s homo_sapiens -y GRCh38 -g dbNSFP,CADD,G2P
注释数据保存于容器内 /opt/vep/.vep/homo_sapiens
文件夹下,插件保 存于容器内 /opt/vep/.vep/Plugins
文件夹下,分别对于宿主机中的 $HOME/vep_data/homo_sapiens
和 $HOME/vep_data/Plugins
。
所以除了直接用 perl INSTALL.pl
之外,我们也可以手动下载解压数据库和插件放到 $HOME/vep_data
:
cd $HOME/vep_datacurl -O ftp://ftp.ensembl.org/pub/release-99/variation/vep/homo_sapiens_vep_99_GRCh38.tar.gztar xzf homo_sapiens_vep_99_GRCh38.tar.gz
curl -O ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
数据库中包含的内容:
插件下载地址:https://github.com/Ensembl/VEP_plugins
运行 VEP
运行示例数据
创建容器:
docker run -it --rm -v $HOME/vep_data:/opt/vep/.vep ensemblorg/ensembl-vep /bin/bash
运行示例数据
./vep -i examples/homo_sapiens_GRCh38.vcf --cache --force_overwrite
默认输出为 VEP 格式的文本 variant_effect_output.txt
:
head variant_effect_output.txt
## ENSEMBL VARIANT EFFECT PREDICTOR v99.0## Output produced at 2017-03-21 14:51:27## Connected to homo_sapiens_core_99_38 on ensembldb.ensembl.org## Using cache in /homes/user/.vep/homo_sapiens/99_GRCh38## Using API version 99, DB version 99## polyphen version 2.2.2## sift version sift5.2.2## COSMIC version 78## ESP version 20141103## gencode version GENCODE 25## genebuild version 2014-07## HGMD-PUBLIC version 20162## regbuild version 16## assembly version GRCh38.p7## ClinVar version 201610## dbSNP version 147## Column descriptions:## Uploaded_variation : Identifier of uploaded variant## Location : Location of variant in standard coordinate format (chr:start or chr:start-end)## Allele : The variant allele used to calculate the consequence## Gene : Stable ID of affected gene## Feature : Stable ID of feature## Feature_type : Type of feature - Transcript, RegulatoryFeature or MotifFeature## Consequence : Consequence type## cDNA_position : Relative position of base pair in cDNA sequence## CDS_position : Relative position of base pair in coding sequence## Protein_position : Relative position of amino acid in protein## Amino_acids : Reference and variant amino acids## Codons : Reference and variant codon sequence## Existing_variation : Identifier(s) of co-located known variants## Extra column keys:## IMPACT : Subjective impact classification of consequence type## DISTANCE : Shortest distance from variant to transcript## STRAND : Strand of the feature (1/-1)## FLAGS : Transcript quality flags#Uploaded_variation Location Allele Gene Feature Feature_type Consequence ...rs7289170 22:17181903 G ENSG00000093072 ENST00000262607 Transcript synonymous_variant ...rs7289170 22:17181903 G ENSG00000093072 ENST00000330232 Transcript synonymous_variant ...
标准结果文本为 14 列,由 TAB 分割:
1.Uploaded variation :突变 ID2.Location :位置3.Allele4.Gene :基因 Ensembl ID5.Feature :Ensembl stable ID of feature6.Feature type :feature 类型,包括 Transcript,RegulatoryFeature,MotifFeature三种7.Consequence :突变造成的结果8.Position in cDNA :在 cDNA 序列中的相对位置9.Position in CDS :在 CDS 中的相对位置10.Position in protein :对应氨基酸在蛋白中的相对位置11.Amino acid change :氨基酸的改变12.Codon change :the alternative codons with the variant base in upper case13.Co-located variation :注释突变 rsid14.Extra :其他等等
运行时可加上 -vcf
参数,输出 VCF 格式; -o
指定输出文件名。
如何提高运行速度
1. 用 convert_cache.pl
脚本转换数据库文件,建立索引以大幅提高数据库检索速度:
# 转换所有数据库perl convert_cache.pl -species all -version all
2. 用 --fork
参数多线程运行 VEP:
./vep -i my_input.vcf --fork 4 --offline
3. 用 tmpfs
将数据库写入内存。4. 安装 Set::IntervalTree
Perl package,提高注释速度。5. 安装 Ensembl::XS
Perl package,它包含 VEP 中某些关键子程序的编译版本,运行速度可提高 5-10% 。6. 将输入文件按染色体进行排序。7. 对于非常大的 VCF 文件(如 WGS)建议将文件拆分运行,例如:
tabix -h variants.vcf.gz 12:1000000-20000000 | ./vep --cache --vcf
结果过滤
得到输出结果后,可用 filter_vep
进行结果过滤。
./vep -i in.vcf -o out.txt -cache -everything./filter_vep -i out.txt -o out_filtered.txt -filter "[filter_text]"
也可通过管道使用:
./vep -i in.vcf -o stdout -cache -check_existing | ./filter_vep -filter "not Existing_variation" -o out.txt
-filter
后的内容即是过滤条件,主要包括三个部分:Field
检索域(用 --list
参数可以查看结果中所有可用的检索域),Operator
操作符以及 Value
值。举几个例子:
# match entries where Feature (Transcript) is "ENST00000307301"--filter "Feature is ENST00000307301"
# match entries where Protein_position is less than 10--filter "Protein_position < 10"
# match entries where Consequence contains "stream" (this will match upstream and downstream)--filter "Consequence matches stream"
在一些情况下,你可能还想探究某些字段是否被注释到,如:
# match entries where the gene symbol is defined--filter "SYMBOL"
基于两个 Field
值的比较来筛选数据(需要在作为 Value
的 Field
前加上 #
):
# match entries where AFR_AF is greater than EUR_AF--filter "AFR_AF > #EUR_AF"
基于字符串过滤:
# filter for missense variants in CCDS transcripts where the variant falls in a protein domain--filter "Consequence is missense_variant and CCDS and DOMAINS"
# find variants where the allele frequency is greater than 10% in either AFR or EUR populations--filter "AFR_AF > 0.1 or EUR_AF > 0.1"
# filter out known variants--filter "not Existing_variation"
操作符
•is (相当于 = ,eq) : 完全相同
# get only transcript consequences--filter "Feature_type is Transcript"
•!= (相当于 ne) : 非
# filter out tolerated SIFT predictions--filter "SIFT != tolerated"
•match (相当于 matches , re , regex) : 可用正则表达式匹配文本
# match stop_gained, stop_lost and stop_retained--filter "Consequence match stop"
•< (相当于 lt) : 小于,注意这里缺失值不视作 0
# find SIFT scores less than 0.1--filter "SIFT < 0.1"
•> (相当于 gt) : 大于
# find variants not in the first exon--filter "Exon > 1"
•<= (相当于 lte) : 小于等于•>= (相当于 gte) : 大于等于•exists (相当于 ex , defined) : 存在•in : 基于列表或文件进行查找
# find variants in a list of gene names--filter "SYMBOL in BRCA1,BRCA2"
# filter using a file of MotifFeatures--filter "Feature in /data/files/motifs_list.txt"
用 VEP 注释 rsid
可用 --check_existing
参数根据 Ensembl 突变数据库(包括 dbSNP)注释突变名称。默认情况下,VEP 用基于归一化的等位基因来匹配数据库以识别与输入突变相匹配的已知突变。
VEP 等位基因匹配算法示意图,该算法会将一条带有多个 ALT 的 VCF 记录解析为三种不同的突变类型和坐标。