GATK4 最佳实践-生殖细胞突变的检测与识别

GATK4 对于体细胞突变和生殖细胞突变的检测分别给出了对应的pipeline:

Germline SNPs Indels
Somatic SNVs Indels

本篇主要关注生殖细胞突变的分析流程Germline SNPs Indels。示意图如下：

图中红色方框部分的从Analysis-Ready Bam 到，主要包括以下4步

HaplotyperCaller in GVCF mode
ImportGenomicsDB Consolidate GVCFs
GenotypeGVCFs
Filter Variants by Variabt Recalibration

官网给出了6套用于参考的pipeline

主要分析步骤都差不多，这里我选择第4个通用的流程，网址如下

https://github.com/gatk-workflows/gatk4-germline-snps-indels

1. HaplotyperCaller in GVCF mode

对于每个样本，采用HaplotyperCaller计算突变位点，命令如下

代码语言：javascript复制

gatk --java-options "-Xmx6G -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10" 
   HaplotypeCaller 
   -R ${ref_fasta} 
   -I ${input_bam} 
   -L ${interval_list} 
   -O ${output_filename} 
   -contamination 0 -ERC GVCF

ref_fasta代表参考基因组的fasta文件;input_bam代表预处理阶段产生的 bam文件；interval代表interval list文件，如果指定这个参数，只会输出指定区域的突变信息。对于全基因组测序，不需要这个参数，对于外显子和目的区域捕获测序, 则需要这个参数；output_filename代表每个样本输出的gvcf文件的名字。

2. ImportGenomicsDB Consolidate GVCFs

将所有样本的gvcf文件合并，产生一个总的gvcf文件，命令如下：

代码语言：javascript复制

gatk --java-options -Xmx2G  
   MergeVcfs 
   --INPUT ${sep=' --INPUT ' input_vcfs} 
   --OUTPUT ${output_filename}

3. GenotypeGVCFs

包括两个步骤，第一步，导入MergeVcfs合并好的gvcf文件，命令如下

代码语言：javascript复制

gatk --java-options "-Xmx4g -Xms4g" 
   GenomicsDBImport 
   --genomicsdb-workspace-path ${workspace_dir_name} 
   --batch-size ${batch_size} 
   -L ${interval} 
   --sample-name-map ${sample_name_map} 
   --reader-threads 5 
   -ip 500

workspace_dir_name代表输出目录;batch_size指定同时访问的最大文件数，默认值为0，表示同时访问所有样本的文件；interval代表interval list文件，如果指定这个参数，只会输出指定区域的突变信息。对于全基因组测序，不需要这个参数，对于外显子和目的区域捕获测序, 则需要这个参数；sampple_name_map是一个文件，记录了样本名称和每个样本对应的gvcf文件的路径。格式如下

sample1 sample1.vcf.gz sample2 sample2.vcf.gz sample3 sample3.vcf.gz

第二步，运行GenotypeGVCFs，命令如下

代码语言：javascript复制

gatk --java-options "-Xmx5g -Xms5g" 
   GenotypeGVCFs 
   -R ${ref_fasta} 
   -O ${output_vcf_filename} 
   -D ${dbsnp_vcf} 
   -G StandardAnnotation 
   --only-output-calls-starting-in-intervals 
   --use-new-qual-calculator 
   -V gendb://$WORKSPACE 
   -L ${interval}

需要注意-V 参数，指定的是GenomicsDBImport输出目录的路径。

4. Filter Variants by Variabt Recalibration

第一步，过滤vcf文件，条件为ExcessHet大于给定的阈值，命令如下：

代码语言：javascript复制

gatk --java-options "-Xmx3g -Xms3g" 
   VariantFiltration 
   --filter-expression "ExcessHet > ${excess_het_threshold}" 
   --filter-name ExcessHet 
   -O ${variant_filtered_vcf_filename} 
   -V ${vcf}

excess_het_threshold指定ExcessHet的阈值；variant_filtered_vcf_filename代表输出的vcf文件的名字；vcf代表GenotypeGVCFs 生成的vcf文件的名字。注意，不满足条件的记录也会出现在最终生成的vcf文件中, 只不过对应的Filter字段的信息不是PASS。

第二步，删除vcf文件中的基因型信息，命令如下

代码语言：javascript复制

gatk --java-options "-Xmx3g -Xms3g" 
   MakeSitesOnlyVcf 
   --INPUT ${variant_filtered_vcf_filename} 
   --OUTPUT ${sites_only_vcf_filename}

第三步，合并不同区间的vcf文件，并建立索引

代码语言：javascript复制

gatk --java-options "-Xmx6g -Xms6g" 
 GatherVcfsCloud 
 --ignore-safety-checks 
 --gather-type BLOCK 
 --input ${inputs.list} 
 --output ${output_vcf_name}
gatk --java-options "-Xmx6g -Xms6g" 
  IndexFeatureFile 
  --feature-file ${output_vcf_name}

output_vcf_name代表输出的vcf文件；inputs.list指定不同区间的vcf文件的路径，格式如下

cohortA_chr1.vcf.gz cohortA_chr2.vcf.gz

第四步，分别评估SNP和INDEL突变位点的质量, 命令如下

代码语言：javascript复制

gatk --java-options "-Xmx24g -Xms24g" 
   VariantRecalibrator 
   -V ${sites_only_variant_filtered_vcf} 
   -O ${recalibration_filename} 
   --tranches-file ${tranches_filename} 
   --trust-all-polymorphic 
   -tranche ${sep=' -tranche ' recalibration_tranche_values} 
   -an ${sep=' -an ' recalibration_annotation_values} 
   -mode INDEL 
   --max-gaussians 4 
   -resource mills,known=false,training=true,truth=true,prior=12:${mills_resource_vcf} 
   -resource axiomPoly,known=false,training=true,truth=false,prior=10:${axiomPoly_resource_vcf} 
   -resource dbsnp,known=true,training=false,truth=false,prior=2:${dbsnp_resource_vcf}
gatk --java-options "-Xmx100g -Xms100g" 
     VariantRecalibrator 
     -V ${sites_only_variant_filtered_vcf} 
     -O ${recalibration_filename} 
     --tranches-file ${tranches_filename} 
     --trust-all-polymorphic 
     -tranche ${sep=' -tranche ' recalibration_tranche_values} 
     -an ${sep=' -an ' recalibration_annotation_values} 
     -mode SNP 
     --sample-every-Nth-variant ${downsampleFactor} 
     --output-model ${model_report_filename} 
     --max-gaussians 6 
     -resource hapmap,known=false,training=true,truth=true,prior=15:${hapmap_resource_vcf} 
     -resource omni,known=false,training=true,truth=true,prior=12:${omni_resource_vcf} 
     -resource 1000G,known=false,training=true,truth=false,prior=10:${one_thousand_genomes_resource_vcf} 
     -resource dbsnp,known=true,training=false,truth=false,prior=7:${dbsnp_resource_vcf}

resource指定建模时参考的vcf文件，可以看到对于indel和snp, 参考的数据库不一样。这一步只是建模，会输出一个recalibration table文件，用于ApplyVQSR命令。第五步，运行VQSR, 命令如下

代码语言：javascript复制

gatk --java-options "-Xmx5g -Xms5g" 
   ApplyVQSR 
   -O tmp.indel.recalibrated.vcf 
   -V ${input_vcf} 
   --recal-file ${indels_recalibration} 
   --tranches-file ${indels_tranches} 
   --truth-sensitivity-filter-level ${indel_filter_level} 
   --create-output-variant-index true 
   -mode INDEL
gatk_path --java-options "-Xmx5g -Xms5g" 
   ApplyVQSR 
   -O ${recalibrated_vcf_filename} 
   -V tmp.indel.recalibrated.vcf 
   --recal-file ${snps_recalibration} 
   --tranches-file ${snps_tranches} 
   --truth-sensitivity-filter-level ${snp_filter_level} 
   --create-output-variant-index true 
   -mode SNP

input_vcf文件为GatherVcfsCloud生成的vcf文件，先校正INDEL位点，后校正SNP位点。

filter fs list output pipeline

0 人点赞