GATK4 官方针对不同的变异类型,给出了好几套用于参考的pipeline
。所有的pipeline
有一个共同点,就是数据预处理部分。数据预处理的目的,是将原始的fastq或者ubam 文件,经过一系列处理,得到用于变异识别的bam
文件,具体的示意图如下:
从示意图可以看出,预处理部分包含了3个主要步骤:
- map to reference
- Mark Duplicates
- Base(Quality Score) Recalibration
之前版本的最佳实践都是给出了具体的gatk的代码,但是GATK4 给出的是用wdl
这种workflow 语言编写的流程。对于预处理部分,对应的链接如下:
https://github.com/gatk-workflows/gatk4-data-processing
共给出了3套流程用于参考:
对于没有编程基础的人来说,理解一个wdl
脚本的逻辑是比较头疼的事情。我从3条流程中抽取出最关键的几个部分,整理如下:
1. map to reference
将原始序列和参考基因组进行比对, 官方推荐使用bwa
比对软件,主要针对
DNA测序的数据。原始数据的格式为ubam
。命令如下
java -jar picard.jar SamToFastq
INPUT=${input_bam}
FASTQ=/dev/stdout
INTERLEAVE=true
NON_PF=true |
bwa mem -K 100000000 -p -v 3 -t 16 -Y ${bash_ref_fasta} /dev/stdin - |
java -jar picard.jar MergeBamAlignment
VALIDATION_STRINGENCY=SILENT
EXPECTED_ORIENTATIONS=FR
ATTRIBUTES_TO_RETAIN=X0
ATTRIBUTES_TO_REMOVE=NM
ATTRIBUTES_TO_REMOVE=MD
ALIGNED_BAM=/dev/stdin
UNMAPPED_BAM=${input_bam}
OUTPUT=${output_bam_basename}.bam
REFERENCE_SEQUENCE=${ref_fasta}
PAIRED_RUN=true
SORT_ORDER="unsorted"
IS_BISULFITE_SEQUENCE=false
ALIGNED_READS_ONLY=false
CLIP_ADAPTERS=false
MAX_RECORDS_IN_RAM=2000000
ADD_MATE_CIGAR=true
MAX_INSERTIONS_OR_DELETIONS=-1
PRIMARY_ALIGNMENT_STRATEGY=MostDistant
PROGRAM_RECORD_ID="bwamem"
PROGRAM_GROUP_VERSION="${bwa_version}"
PROGRAM_GROUP_COMMAND_LINE="bwa mem -K 100000000 -p -v 3 -t 16 -Y ${bash_ref_fasta}"
PROGRAM_GROUP_NAME="bwamem"
UNMAPPED_READ_STRATEGY=COPY_TO_TAG
ALIGNER_PROPER_PAIR_FLAGS=true
UNMAP_CONTAMINANT_READS=true
ADD_PG_TAG_TO_READS=false
比对参考基因组部分共涉及到3个步骤,第一步picard
将ubam转换成fastq; 第二步bwa 比对参考基因组;第三步picard
将原始数据ubam和比对产生的aligned bam 合并,产生一个最终的bam文件。每个样本都对应生成一个这样的bam文件。
上面的代码中,有几个需要替换的变量。${input_bam}
代表输入的原始序列的ubam
格式的文件,${bash_ref_fasta}
代表参考基因组序列的bwa 索引,${ref_fasta}
代表参考基因组的fasta格式的序列;${output_bam_basename}
代表最终输出的bam文件的名字,${bwa_version}
代表bwa软件的版本号。
2. Mark Duplicates
标记bam文件中的重复序列,使用picard
的MarkDuplicates命令,代码如下:
java -jar picard.jar
MarkDuplicates
INPUT=${input_bams}
OUTPUT=${output_bam_basename}.bam
METRICS_FILE=${metrics_filename}
VALIDATION_STRINGENCY=SILENT
OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500
ASSUME_SORT_ORDER="queryname"
CREATE_MD5_FILE=true
${input_bams}
代表比对参考基因组之后生成的所有的bam文件,${output_bam_basename}
代表产生的bam文件的名字,${metrics_filename}
代表产生的metrics文件的名称。
标记完重复序列之后,需要对产生的bam文件进行排序,命令如下
代码语言:javascript复制java -jar picard.jar
SortSam
INPUT=${input_bam}
OUTPUT=/dev/stdout
SORT_ORDER="coordinate"
CREATE_INDEX=false
CREATE_MD5_FILE=false
MAX_RECORDS_IN_RAM=300000 |
java -jar picard.jar
SetNmAndUqTags
INPUT=/dev/stdin
OUTPUT=${output_bam_basename}.bam
CREATE_INDEX=true
CREATE_MD5_FILE=true
REFERENCE_SEQUENCE=${ref_fasta}
${input_bam}
代表生成标记重复序列生成的bam文件;{$output_bam_basename}
代表输出的排序之后的bam文件的名称,
3. Base(Quality Score) Recalibration
运行BQSR包括三步,第一步的命令如下:
代码语言:javascript复制gatk BaseRecalibrator
-R ${ref_fasta}
-I ${input_bam}
--use-original-qualities
-O ${recalibration_report_filename}
--known-sites ${dbSNP_vcf}
--known-sites ${sep=” —known-sites “ known_indels_sites_VCFs}
${ref_fasta}
代表参考基因组的fasta序列;${input_bam}
代表mark duplicates生成的排序好的bam文件;${recalibration_report_filename}
代表产生的report文件的名字;${dbSNP_vcf}
代表NCBI SNP数据库中物种对应的vcf文件;${known_indels_sites_VCFs}
代表其他数据库中已知的indel位点的vcf文件。
第二步的命令如下:
代码语言:javascript复制gatk GatherBQSRReports
-I ${sep=' -I ' input_bqsr_reports}
-O ${output_report_filename}
${input_bqsr_reports}
代表第一步生成的report文件,每个样本一个,多个样本就指定多次 , 比如 -I 1.bqsr.report -I 2.bqsr.report
; ${output_report_filename}
代表输出的report文件的名字。
第三步的命令如下:
代码语言:javascript复制gatk ApplyBQSR
-R ${ref_fasta}
-I ${input_bam}
-O ${output_bam_basename}.bam
-bqsr ${recalibration_report}
--static-quantized-quals 10 --static-quantized-quals 20 --static-quantized-quals 30
--add-output-sam-program-record
--create-output-bam-md5
--use-original-qualities
${ref_fasta}
代表参考基因组的fasta序列;${input_bam}
代表mark duplicates生成的排序好的bam文件;{$output_bam_basename}
代表输出的bam文件的名称; ${recalibration_report}
代表第二步生成的report文件。
当然你也可以直接运行别人的写好的wdl
文件,以下面的这个预处理流程为例
https://github.com/gatk-workflows/gatk4-data-processing
首先得到整个脚本所有的参数列表,命令如下
java -jar womtools.jar inputs processing-for-variant-discovery-gatk4.wdl > processing_inputs.json
这个流程内置了一个hg38的参数模板,内容如下
代码语言:javascript复制{
"##_COMMENT1": "SAMPLE NAME AND UNMAPPED BAMS",
"PreProcessingForVariantDiscovery_GATK4.sample_name": "NA12878",
"PreProcessingForVariantDiscovery_GATK4.ref_name": "hg38",
"PreProcessingForVariantDiscovery_GATK4.flowcell_unmapped_bams_list": "gs://gatk-test-data/wgs_ubam/NA12878_24RG/NA12878_24RG_small.txt",
"PreProcessingForVariantDiscovery_GATK4.unmapped_bam_suffix": ".bam", "##_COMMENT2": "REFERENCE FILES",
"PreProcessingForVariantDiscovery_GATK4.ref_dict": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.dict",
"PreProcessingForVariantDiscovery_GATK4.ref_fasta": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta",
"PreProcessingForVariantDiscovery_GATK4.ref_fasta_index": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.fai",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_alt": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.alt",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_sa": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.sa",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_amb": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.amb",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_bwt": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.bwt",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_ann": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.ann",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.ref_pac": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.fasta.64.pac", "##_COMMENT3": "KNOWN SITES RESOURCES",
"PreProcessingForVariantDiscovery_GATK4.dbSNP_vcf": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf",
"PreProcessingForVariantDiscovery_GATK4.dbSNP_vcf_index": "gs://broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf.idx",
"PreProcessingForVariantDiscovery_GATK4.known_indels_sites_VCFs": [
"gs://broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz",
"gs://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz"
],
"PreProcessingForVariantDiscovery_GATK4.known_indels_sites_indices": [
"gs://broad-references/hg38/v0/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi",
"gs://broad-references/hg38/v0/Homo_sapiens_assembly38.known_indels.vcf.gz.tbi"
],
"##_COMMENT4": "MISC PARAMETERS",
"PreProcessingForVariantDiscovery_GATK4.bwa_commandline": "bwa mem -K 100000000 -p -v 3 -t 16 -Y $bash_ref_fasta",
"PreProcessingForVariantDiscovery_GATK4.compression_level": 5,
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.num_cpu": "16",
"##_COMMENT5": "DOCKERS",
"PreProcessingForVariantDiscovery_GATK4.gotc_docker": "broadinstitute/genomes-in-the-cloud:2.3.0-1501082129",
"PreProcessingForVariantDiscovery_GATK4.gatk_docker": "broadinstitute/gatk:4.0.0.0",
"PreProcessingForVariantDiscovery_GATK4.picard_docker": "broadinstitute/genomes-in-the-cloud:2.3.0-1501082129",
"PreProcessingForVariantDiscovery_GATK4.python_docker": "python:2.7",
"##_COMMENT6": "PATHS",
"PreProcessingForVariantDiscovery_GATK4.gotc_path": "/usr/gitc/",
"PreProcessingForVariantDiscovery_GATK4.picard_path": "/usr/gitc/",
"PreProcessingForVariantDiscovery_GATK4.gatk_path": "/gatk/gatk",
"##_COMMENT7": "JAVA OPTIONS",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.MergeBamAlignment.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.MarkDuplicates.java_opt": "-Xms4000m",
"PreProcessingForVariantDiscovery_GATK4.SortAndFixTags.java_opt_sort": "-Xms4000m",
"PreProcessingForVariantDiscovery_GATK4.SortAndFixTags.java_opt_fix": "-Xms500m",
"PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator.java_opt": "-Xms4000m",
"PreProcessingForVariantDiscovery_GATK4.GatherBqsrReports.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.ApplyBQSR.java_opt": "-Xms3000m",
"PreProcessingForVariantDiscovery_GATK4.GatherBamFiles.java_opt": "-Xms2000m",
"##_COMMENT8": "MEMORY ALLOCATION",
"PreProcessingForVariantDiscovery_GATK4.GetBwaVersion.mem_size": "1 GB",
"PreProcessingForVariantDiscovery_GATK4.SamToFastqAndBwaMem.mem_size": "14 GB",
"PreProcessingForVariantDiscovery_GATK4.MergeBamAlignment.mem_size": "3500 MB",
"PreProcessingForVariantDiscovery_GATK4.MarkDuplicates.mem_size": "7 GB",
"PreProcessingForVariantDiscovery_GATK4.SortAndFixTags.mem_size": "5000 MB",
"PreProcessingForVariantDiscovery_GATK4.CreateSequenceGroupingTSV.mem_size": "2 GB",
"PreProcessingForVariantDiscovery_GATK4.BaseRecalibrator.mem_size": "6 GB",
"PreProcessingForVariantDiscovery_GATK4.GatherBqsrReports.mem_size": "3500 MB",
"PreProcessingForVariantDiscovery_GATK4.ApplyBQSR.mem_size": "3500 MB",
"PreProcessingForVariantDiscovery_GATK4.GatherBamFiles.mem_size": "3 GB",
"##_COMMENT9": "DISK SIZE ALLOCATION",
"PreProcessingForVariantDiscovery_GATK4.agg_small_disk": 200,
"PreProcessingForVariantDiscovery_GATK4.agg_medium_disk": 300,
"PreProcessingForVariantDiscovery_GATK4.agg_large_disk": 400,
"PreProcessingForVariantDiscovery_GATK4.flowcell_small_disk": 100,
"PreProcessingForVariantDiscovery_GATK4.flowcell_medium_disk": 200,
"##_COMMENT10": "PREEMPTIBLES",
"PreProcessingForVariantDiscovery_GATK4.preemptible_tries": 3,
"PreProcessingForVariantDiscovery_GATK4.agg_preemptible_tries": 3
}
大部分的参数还是很好配置的,但是要注意一点,DOCKERS
下配置的是软件对应的docker镜像,是需要安装的。
参数列表配置好之后,运行下面命令即可
java -jar Cromwell.jar run processing-for-variant-discovery-gatk4.wdl —inputs processing_inputs.json