GATK Germline_SNP_INDEL_2.0 分析遗传病(耳聋)

2022-12-13 23:13:44 浏览数 (1)

GATK Germline_SNP_INDEL_2.0 分析遗传病(耳聋)

一、本文是GATK Germline spns-indels Pipeline 分析遗传病(耳聋)的升级版,目的是提供开箱即用的分析流程,尽可能简化部署和迁移。

更新内容如下:

  • 人类参考基因组以及其他引用数据库文件版本由GRCh37(hg19)升级为GRCh38(hg38)
  • 数据注释软件annovar更换为Ensemble vep(108.2),Annovar需要商业授权,vep为apache2.0 licence,可以随意使用
  • Pipeline用到的软件由预先安装改为docker conda首次使用时安装,初次运行初始化环境下载必要文件,迁移更方便

二、 流程概览图如下

流程概览流程概览

环境搭建: 为了快速完成环境搭建,节省95%以上时间。

本文使用docker conda (mamba) 作为基础分析环境,镜像获取:docker/docker-compoes 的安装及镜像构建见《基于docker的生信基础环境镜像构建》,docker镜像基于ubuntu21.04构建,并安装有conda/mamba,ssh服务。并尝试初次运行时初始化安装所需软件下载所需文件(作为代价首次运行时间会较长,切需网络通畅),即实现自动初始化的分析流程。

备注:docker运行的操作系统,推荐为Linux,windows,macOS系统改下docker可能部分功能(网络)不能正常运行

代码语言:shell复制
# 拉取docker镜像
docker     pull     doujiangbaozi/sliverworkspace:latest

# 查看docker 镜像
docker     images
基础环境配置,docker-compose.yml 配置文件,可以根据需要自行修改调整
代码语言:yaml复制
version: "3"
services:
  SarsCov2:
    image: doujiangbaozi/sliverworkspace:latest
    container_name: Germline
    volumes:
      - /media/sliver/Data/data:/opt/data:rw                               #挂载原始数据,放SC2目录下
      - /media/sliver/Manufacture/GL2/envs:/root/mambaforge-pypy3/envs:rw  #挂载envs conda环境目录
      - /media/sliver/Manufacture/GL2/config:/opt/config:rw                #挂载config conda配置文件目录
      - /media/sliver/Manufacture/GL2/ref:/opt/ref:rw                      #挂载reference目录
      - /media/sliver/Manufacture/GL2/result:/opt/result:rw                #挂载中间文件和输出结果目录
    ports:
      - "9024:9024"                                                        #ssh连接端口可以按需修改
    environment:
      - TZ=Asia/Shanghai                                                   #设置时区
      - PS=20191124                                                        #修改默认ssh密码
      - PT=2024                                                            #修改默认ssh连接端口

基础环境运行

代码语言:shell复制
# docker-compose.yml 所在目录下运行
docker-compose up -d

# 或者 
docker-compose up -d -f /路径/docker-compose.yaml

# 查看docker是否正常运行,docker-compose.yaml目录下运行
docker-compose ps

# 或者
docker ps

docker 容器使用,类似于登录远程服务器

代码语言:shell复制
# 登录docker,使用的是ssh服务,可以本地或者远程部署使用
ssh root@192.168.6.6 -p9024

# 看到如下,显示如下提示即正常登录
(base) root@SliverWorkstation:~# 

三. 分析流程

  • 1. 变量设置:
代码语言:shell复制
#样本编号 
export sn=SRR9993255
#数据输入目录 
export data=/opt/data 
#数据输出、中间文件目录 
export result=/opt/result 
#conda安装的环境目录 
export envs=/root/mambaforge-pypy3/envs 
#vep cache 版本 
export vep_version=108 
#设置可用线程数 
export threads=8 
#与耳聋相关基因及突变文件 
export whitelist=/opt/ref/whitelist.csv

  • 2. 数据过滤:
代码语言:shell复制
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件 
if [ ! -d "${envs}/fastp" ]; then
   mamba env create -f /opt/config/fastp.yaml 
fi

source activate fastp  

mkdir -p ${result}/${sn}/clean 
mkdir -p ${result}/${sn}/qc  
#使用默认参数 
fastp 
 	-w ${threads} 
  	-i ${data}/Germline/${sn}_R1.fastq.gz 
   	-I ${data}/Germline/${sn}_R2.fastq.gz 
    -o ${result}/${sn}/clean/${sn}_fastp_R1.fastq.gz 
    -O ${result}/${sn}/clean/${sn}_fastp_R2.fastq.gz 
    -j ${result}/${sn}/qc/${sn}_fastp.json 
    -h ${result}/${sn}/qc/${sn}_fastp.html  
conda  deactivate

  • 3. 数据比对、排序、标记重复;质控
代码语言:shell复制
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件 
if [ ! -d "${envs}/align" ]; then   
    mamba env create -f /opt/config/align.yaml 
fi  

source activate align 
#参考基因组文件如果不存在,去broadinstitute下载 
if [ ! -f "/opt/ref/hg38/hg38.fasta" ]; then
    mkdir -p /opt/ref/hg38 	
    aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -d /opt/ref/hg38 -o hg38.fasta.gz 	
    cd /opt/ref/hg38 && gzip -d /opt/ref/hg38/hg38.fasta.gz 	
    #如果索引文件不存在,创建索引 	
    if  [ ! -f /opt/ref/hg38.fasta.amb ] ||
     	[ ! -f /opt/ref/hg38/hg38.fasta.ann ] ||
      	[ ! -f /opt/ref/hg38/hg38.fasta.bwt ] ||
       	[ ! -f /opt/ref/hg38/hg38.fasta.pac ] ||
        [ ! -f /opt/ref/hg38/hg38.fasta.sa ]; then
        bwa index /opt/ref/hg38/hg38.fasta 	
    fi 
fi 
#如果samtools索引不存在,创建 
if [ ! -f "/opt/ref/hg38/hg38.fasta.fai" ]; then
    samtools faidx /opt/ref/hg38/hg38.fasta 
fi  
cd       ${result} 
mkdir -p ${result}/${sn}/align 
#数据比对、排序 
bwa mem 
    -t ${threads} -M 
    -R "@RG\tID:${sn}\tLB:${sn}\tPL:Illumina\tPU:NextSeq550\tSM:${sn}" 
    /opt/ref/hg38/hg38.fasta  
    ${result}/${sn}/clean/${sn}_fastp_R1.fastq.gz 
    ${result}/${sn}/clean/${sn}_fastp_R2.fastq.gz 
    | sambamba view -S -f bam -l 0 /dev/stdin 
    | sambamba sort -t ${threads} -m 2G 
        --tmpdir=${result}/${sn} 
        -o ${result}/${sn}/align/${sn}_sorted.bam /dev/stdin  

ulimit -n 10240  
#标记重复 
sambamba  markdup 
    --tmpdir ${result}/${sn} 
    -t ${threads} 
    ${result}/${sn}/align/${sn}_sorted.bam 
    ${result}/${sn}/align/${sn}_marked.bam  
mv  ${result}/${sn}/align/${sn}_marked.bam.bai ${result}/${sn}/align/${sn}_marked.bai 
rm  -f ${result}/${sn}/align/${sn}_sorted.bam  
#数据文件没有附带bed文件,这里用bedtools反向计算一个 
bedtools genomecov -ibam ${result}/${sn}/align/${sn}_marked.bam -bg > ${result}/${sn}/align/${sn}_bedtools.bed 
bedtools merge     -i    ${result}/${sn}/align/${sn}_bedtools.bed   > ${result}/${sn}/align/${sn}_merged.bed  
samtools flagstat --threads ${threads} ${result}/${sn}/align/${sn}_marked.bam > ${result}/${sn}/qc/${sn}_marked.flagstat

conda  deactivate

if [ ! -f "${envs}/bamdst/bamdst" ]; then 	
    apt-get update && apt-get install -y git make gcc zlib1g-dev 	
    git clone https://github.com/shiquan/bamdst.git "${envs}/bamdst" 	
    cd ${envs}/bamdst 	
    make 	
    cd ${result} 
fi  
${envs}/bamdst/bamdst -p ${result}/${sn}/align/${sn}_merged.bed -o ${result}/${sn}/qc --cutoffdepth 20 ${sn}/align/${sn}_marked.bam

  • 4. Gatk 获取碱基质量校准table
代码语言:shell复制
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件 
if [ ! -f "${envs}/gatk/bin/gatk" ]; then 	
    mkdir -p ${envs}/gatk/bin 	
    #替代下载地址 	
    #https://github.com/broadinstitute/gatk/releases/download/4.3.0.0/gatk-4.3.0.0.zip 	
    aria2c https://download.yzuu.cf/broadinstitute/gatk/releases/download/4.3.0.0/gatk-4.3.0.0.zip -d ${envs}/gatk/bin -o gatk.zip 	
    apt-get install -y unzip 	
    cd ${envs}/gatk/bin  	
    unzip -o gatk.zip  	
    mv ${envs}/gatk/bin/gatk-4.3.0.0/* ${envs}/gatk/bin/ 	
    rm -rf ${envs}/gatk/bin/gatk-4.3.0.0 	
    #chmod  x ${envs}/bin/gatk 	
    cd ${result} 
fi

if [ ! -f "/opt/ref/hg38/dbsnp_146.hg38.vcf.gz" ]; then 	
    aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz -d /opt/ref/hg38 
fi 
if [ ! -f "/opt/ref/hg38/dbsnp_146.hg38.vcf.gz.tbi" ]; then 	
    aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/dbsnp_146.hg38.vcf.gz.tbi -d /opt/ref/hg38 
fi 
if [ ! -f "/opt/ref/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz" ]; then 	
    aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz -d /opt/ref/hg38 
fi 
if [ ! -f "/opt/ref/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi" ]; then 	
    aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz.tbi -d /opt/ref/hg38 
fi 
if [ ! -f "/opt/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz" ]; then 	
    aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz -d /opt/ref/hg38 
fi 
if [ ! -f "/opt/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi" ]; then 	
    aria2c ftp://gsapubftp-anonymous@ftp.broadinstitute.org/bundle/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz.tbi -d /opt/ref/hg38 
fi

source activate gatk

if [ ! -x "$(command -v python)" ]; then 	
    mamba env create -f ${envs}/gatk/bin/gatkcondaenv.yml 
fi  

if [ ! -x "$(command -v java)" ]; then 	
    mamba install -y openjdk=8.0.332 
fi  
 
if [ ! -f "/opt/ref/hg38/hg38.dict" ]; then 	
    gatk CreateSequenceDictionary -R /opt/ref/hg38/hg38.fasta -O /opt/ref/hg38/hg38.dict 
fi  

if [ ! -f "/opt/ref/hg38.exon.interval_list" ]; then 	
    gatk BedToIntervalList 
     	-I /opt/ref/hg38.exon.bed 
      	-SD /opt/ref/hg38/hg38.dict 
       	-O /opt/ref/hg38.exon.interval_list 
fi    

gatk BaseRecalibrator 
 	--known-sites /opt/ref/hg38/dbsnp_146.hg38.vcf.gz 
  	--known-sites /opt/ref/hg38/Mills_and_1000G_gold_standard.indels.hg38.vcf.gz 
   	--known-sites /opt/ref/hg38/1000G_phase1.snps.high_confidence.hg38.vcf.gz 
    -L /opt/ref/hg38.exon.interval_list 
    -R /opt/ref/hg38/hg38.fasta 
    -I ${result}/${sn}/align/${sn}_marked.bam 
    -O ${result}/${sn}/align/${sn}_recal.table  
conda  deactivate
  • 5. Gatk 应用碱基校准、获取insert size 统计信息
代码语言:shell复制
source activate gatk

gatk ApplyBQSR 
     --bqsr-recal-file ${result}/${sn}/align/${sn}_recal.table 
     -L /opt/ref/hg38.exon.interval_list 
     -R /opt/ref/hg38/hg38.fasta 
     -I ${result}/${sn}/align/${sn}_marked.bam 
     -O ${result}/${sn}/align/${sn}_bqsr.bam &  

gatk CollectInsertSizeMetrics 
     -I ${result}/${sn}/align/${sn}_marked.bam 
     -O ${result}/${sn}/qc/${sn}_insertsize_metrics.txt 
     -H ${result}/${sn}/qc/${sn}_insertsize_histogram.pdf &  
wait  
conda  deactivate

  • 6. Gatk HaplotypeCaller 获取snp/indel突变:
代码语言:shell复制
source activate gatk  

mkdir -p ${result}/${sn}/vcf  

gatk HaplotypeCaller 
     -R /opt/ref/hg38/hg38.fasta 
     -L /opt/ref/hg38.exon.interval_list 
     -I ${result}/${sn}/align/${sn}_bqsr.bam 
     -D /opt/ref/hg38/dbsnp_146.hg38.vcf.gz 
     -O ${result}/${sn}/vcf/${sn}.vcf

conda  deactivate

  • 7. 单个样本数据不够运行VQSR,直接执行硬过滤,过滤参考数值见#https://gatk.broadinstitute.org/hc/en-us/articles/360035532412?id=11097
代码语言:shell复制
source activate gatk
#https://gatk.broadinstitute.org/hc/en-us/articles/360035532412?id=11097
gatk VariantFiltration 
    -R /opt/ref/hg38/hg38.fasta 
    -V ${result}/${sn}/vcf/${sn}.vcf 
    -O ${result}/${sn}/vcf/${sn}_snp.vcf 
    --filter-name "SNP_DQ" 
    --filter-expression "DQ < 2.0" 
    --filter-name "SNP_MQ" 
    --filter-expression "MQ <40.0" 
    --filter-name "SNP_FS" 
    --filter-expression "FS > 60.0" 
    --filter-name "SNP_SOR" 
    --filter-expression "SOR > 3.0" 
    --filter-name "SNP_MQRankSum" 
    --filter-expression "MQRankSum < -12.5" 
    --filter-name "SNP_ReadPosRankSum" 
    --filter-expression "ReadPosRankSum < -8.0"

gatk VariantFiltration 
    -R /opt/ref/hg38/hg38.fasta 
    -V ${result}/${sn}/vcf/${sn}_snp.vcf 
    -O ${result}/${sn}/vcf/${sn}_filtered.vcf 
    --filter-name "INDEL_DQ" 
    --filter-expression "DQ < 2.0" 
    --filter-name "INDEL_FS" 
    --filter-expression "FS > 200.0" 
    --filter-name "INDEL_SOR" 
    --filter-expression "SOR > 10.0" 
    --filter-name "INDEL_ReadPosRankSum" 
    --filter-expression "ReadPosRankSum < -20.0" 
    --filter-name "INDEL_InbreedingCoeff" 
    --filter-expression "InbreedingCoeff < -0.8"
	
conda  deactivate
  • 8. 使用vep注释突变结果:
代码语言:shell复制
#conda检测环境是否存在,首次运行不存在创建该环境并安装软件
if [ ! -d "${envs}/vep" ]; then
	mamba env create -f /opt/config/vep.yaml
fi

mkdir -p /opt/result/${sn}/vcf

if [ ! -d "/opt/ref/vep-cache/homo_sapiens/${vep_version}_GRCh38" ]; then
	aria2c https://ftp.ensembl.org/pub/release-${vep_version}/variation/indexed_vep_cache/homo_sapiens_vep_${vep_version}_GRCh38.tar.gz -d /opt/ref/
	tar -zxvf /opt/ref/homo_sapiens_vep_${vep_version}_GRCh38.tar.gz -C /opt/ref/vep-cache/
fi

source activate vep

vep 
  -i /opt/result/${sn}/vcf/${sn}_filtered.vcf  
  -o /opt/result/${sn}/vcf/${sn}_filtered_vep.xls 
  --offline 
  --cache 
  --cache_version ${vep_version} 
  --everything 
  --dir_cache /opt/ref/vep-cache 
  --dir_plugins /opt/ref/vep-cache/Plugins 
  --species homo_sapiens 
  --assembly GRCh38 
  --hgvs 
  --fasta /opt/ref/hg38/hg38.fasta 
  --force_overwrite 
  --format vcf 
  --tab 
  --fork 8 
  --offline

conda  deactivate
  • 9. 编写脚本匹配whitelist基因,突变过滤后vcf文件,vep注释后的文件,得到最终结果
代码语言:shell复制
#需借用gatk环境中的python来运行 
source activate gatk

python ${envs}/GermlineVepAnnotationUtil.py 
    --whitelist=${whitelist} 
  	-v ${result}/${sn}/vcf/${sn}_filtered.vcf 
   	-a ${result}/${sn}/vcf/${sn}_filtered_vep.xls 
    -o ${result}/${sn}/${sn}.result.variants.xls  

conda  deactivate

  • 10. 编写脚本从fastp、bamdst、gatk CollectInsertSize 输出获取数据质控信息
代码语言:shell复制
source activate gatk  

python ${envs}/GermlineQcUtil.py 
 	--out=/opt/result/${sn}/${sn}.result.qc.xls 
  	--sample-fastp=/opt/result/${sn}/qc/${sn}_fastp.json  
   	--sample-bamdst=/opt/result/${sn}/qc/coverage.report 
    --sample-insertsize=/opt/result/${sn}/qc/${sn}_insertsize_metrics.txt  

conda  deactivate

  • 11. 结果确认:IGV bam文件和突变位置:
  • 12. 输出报告(报告模板见流程压缩包):

0 人点赞