新冠病毒分型和突变分析(SARS-CoV2_ARTIC_Illumina)

2023-02-03 16:45:59 浏览数 (1)

新冠病毒分型和突变分析(SARS-CoV2_ARTIC_Illumina)

一. 本文适用于使用Artic扩增子扩增,Illumina双端测序,用于分析新冠病毒突变及分型鉴定

二. 概览:按照惯例,先上一张概览图

概览图概览图

流程输入

SRR22216743_1.fastq.gz SRR22216743_2.fastq.gz 测试数据下载 NYC SARS-CoV-2 genome sequencing from human nasopharyngeal swabs** 1 ILLUMINA (Illumina HiSeq 2000) run: 28,721 spots, 5.8M bases, 1.8Mb downloads 使用NCBI官方工具sra-toolkit拆分成fastq.gz文件 fastq-dump SRR22216743 --split-3 --gzip 得到SRR22216743_1.fastq.gz SRR22216743_2.fastq.gz 参考文件,默认路径/opt/ref下 Artic-ncov2019 artic-ncov2019 primer&参考序列 GCF_009858895.2_ASM985889v3_genomic.gff GFF文件 分析流程文件(可一键导入sliverworkspace运行)及报告文件,conda环境文件下载,导入操作

运行环境

docker image based on ubuntu21.04 Conda Mamba(默认使用清华源) ssh

分析软件

- fastp=0.23.2 - fastqc=0.11.9 - multiqc=1.13 - bwa=0.7.17 - minimap2=2.24 - samtools=1.16.1 - ivar=1.3.1 - quast=5.2.0 - pangolin=4.1.3

输出结果

multiqc_report.html 测序数据trim前后质量数据 SRR22216743.samcov.tsv reads数量,覆盖度,测序深度,baseQ mapQ quast输出分析结果:alignment_viewer.html contig_size_viewer.html (拼接序列情况)report.html 按照序列一致性组装的新冠病毒序列 SRR22216743.consensus.fa Panglin 根据组装的序列分析得出病毒分型信息 lineage_report.csv samtools,ivar 根据primertrim.bam获的新冠病毒突变信息,过滤后得到 SRR22216743_variantsfiltered.tsv

环境搭建: 为了快速完成环境搭建,节省95%以上时间。

本文使用docker conda (mamba) 作为基础分析环境,镜像获取:docker/docker-compoes 的安装及镜像构建见《基于docker的生信基础环境镜像构建》,docker镜像基于ubuntu21.04构建,并安装有conda/mamba,ssh服务。并尝试初次运行时初始化安装所需软件下载所需文件(作为代价首次运行时间会较长,切需网络通畅),即实现自动初始化的分析流程。

备注:docker运行的操作系统,推荐为Linux,windows,macOS系统改下docker可能部分功能(网络)不能正常运行

代码语言:javascript复制
 # 拉取docker镜像
 docker     pull     doujiangbaozi/sliverworkspace:latest
 ​
 # 查看docker 镜像
 docker     images
基础环境配置,docker-compose.yml 配置文件,可以根据需要自行修改调整
代码语言:javascript复制
 version: "3"
 services:
   SarsCov2:
     image: doujiangbaozi/sliverworkspace:latest
     container_name: SarsCov2
     volumes:
       - /media/sliver/Data/data:/opt/data:rw                               #挂载原始数据,放SC2目录下
       - /media/sliver/Manufacture/SC2/envs:/root/mambaforge-pypy3/envs:rw  #挂载envs conda环境目录
       - /media/sliver/Manufacture/SC2/config:/opt/config:rw                #挂载config conda配置文件目录
       - /media/sliver/Manufacture/SC2/ref:/opt/ref:rw                      #挂载reference目录
       - /media/sliver/Manufacture/SC2/result:/opt/result:rw                #挂载中间文件和输出结果目录
     ports:
       - "9024:9024"                                                        #ssh连接端口可以按需修改
     environment:
       - TZ=Asia/Shanghai                                                   #设置时区
       - PS=20191124                                                        #修改默认ssh密码
       - PT=9024                                                            #修改默认ssh端口
 ​

基础环境运行

代码语言:javascript复制
 # docker-compose.yml 所在目录下运行
 docker-compose up -d
 ​
 # 或者 
 docker-compose up -d -f /路径/docker-compose.yaml
 ​
 # 查看docker是否正常运行,docker-compose.yaml目录下运行
 docker-compose ps
 ​
 # 或者
 docker ps

docker 容器使用,类似于登录远程服务器

代码语言:javascript复制
 # 登录docker,使用的是ssh服务,可以本地或者远程部署使用
 ssh root@192.168.6.6 -p9024
 ​
 # 看到如下,显示如下提示即正常登录
 (base) root@SliverWorkstation:~# 

三. 分析流程

1. 变量设置

代码语言:shell复制
 #样本编号
 export sn=SRR22216743
 #数据输入目录
 export data=/opt/data
 #数据输出、中间文件目录
 export result=/opt/result
 #conda安装的环境目录
 export envs=/root/mambaforge-pypy3/envs 
 #artic primer 版本V1,V2,V3,V4,V4.1
 export artic_primer_version=V4.1
 #设置可用线程数
 export threads=8

2. 分析前QC,看下数据质量

代码语言:shell复制
 #conda检测环境是否存在,首次运行不存在创建该环境并安装软件
 if [ ! -d "${envs}/qc" ]; then
   mamba env create -f /opt/config/qc.yaml
 fi
 ​
 source activate qc
 ​
 mkdir -p ${result}/${sn}/clean mkdir -p ${result}/${sn}/qc
 ​
 fastqc ${data}/SC2/${sn}_1.fastq.gz ${data}/SC2/${sn}_2.fastq.gz -o ${result}/${sn}/qc
 ​
 fastp -i ${data}/SC2/${sn}_1.fastq.gz -I ${data}/SC2/${sn}_2.fastq.gz 
   -o ${result}/${sn}/clean/${sn}_1_clean.fastq.gz -O ${result}/${sn}/clean/${sn}_2_clean.fastq.gz 
   -h ${result}/${sn}/qc/${sn}_fastp.html -j ${result}/${sn}/qc/${sn}_fastp.json 
 ​
 fastqc ${result}/${sn}/clean/${sn}_1_clean.fastq.gz ${result}/${sn}/clean/${sn}_2_clean.fastq.gz 
   -o ${result}/${sn}/qc
 ​
 #汇总一下之前结果,得到一个总体报告
 multiqc ${result}/${sn}/qc/ -f -o ${result}/${sn}/qc
 ​
 conda  activate

3. 比对到参考基因组上,得到bam文件并排序

代码语言:shell复制
 mkdir -p ${result}/${sn}/aligned
 ​
 if  [ ! -d "/opt/ref/artic-ncov2019" ]; then
     apt-get install -y git
     git clone https://github.com/artic-network/artic-ncov2019.git "/opt/ref/artic-ncov2019"
 fi
 ​
 #conda检测环境是否存在,首次运行不存在创建该环境并安装软件
 if [ ! -d "${envs}/SC2" ]; then
   mamba env create -f /opt/config/SC2.yaml
 fi
 ​
 source activate SC2
 ​
 #如果没有索引,创建索引
 if  [ ! -f /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta.amb ] ||
     [ ! -f /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta.ann ] ||
     [ ! -f /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta.bwt ] ||
     [ ! -f /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta.pac ] ||
     [ ! -f /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta.sa ]; then
     if [ ! -f /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta ]; then
         cp  -f /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/SARS-CoV-2.reference.fasta 
             /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta
     fi
     bwa index /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta
 fi
 ​
 bwa mem -t ${threads} 
   /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta 
   ${result}/${sn}/clean/${sn}_1_clean.fastq.gz ${result}/${sn}/clean/${sn}_2_clean.fastq.gz | 
   samtools sort | samtools view -F 4 -o ${result}/${sn}/aligned/${sn}.sorted.bam
 ​
 samtools index ${result}/${sn}/aligned/${sn}.sorted.bam
 mv  ${result}/${sn}/aligned/${sn}.sorted.bam.bai ${result}/${sn}/aligned/${sn}.sorted.bai
 
 conda  deactivate

4. 去除artic primer (primer trim)

代码语言:shell复制
 source activate SC2
 ​
 if  [ ! -f /opt/ref/artic-ncov2019/ARTIC-${artic_primer_version}.bed ]; then
   if [ ! -f /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.scheme.bed ]; then
     cp  -r /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/SARS-CoV-2.scheme.bed 
         /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.scheme.bed
   fi
   perl -ne 'my @x=split m/t/; print join("t",@x[0..3], 60, $x[3]=~m/LEFT/?" ":"-"),"n";' 
     < /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.scheme.bed  
     > /opt/ref/artic-ncov2019/ARTIC-${artic_primer_version}.bed
 fi
 ​
 ivar trim -e -i ${result}/${sn}/aligned/${sn}.sorted.bam 
   -b /opt/ref/artic-ncov2019/ARTIC-${artic_primer_version}.bed 
   -p ${result}/${sn}/aligned/${sn}.primertrim
   
 conda  deactivate

5. primer trim 排序

代码语言:shell复制
 source activate SC2
 ​
 samtools sort ${result}/${sn}/aligned/${sn}.primertrim.bam 
   -o ${result}/${sn}/aligned/${sn}.primertrim.sorted.bam
   
 conda  deactivate

6. 获取拼接后一致性序列

代码语言:shell复制
 source activate SC2
 ​
 samtools mpileup -A -d 1000 -B -Q 0 
   --reference /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta 
   ${result}/${sn}/aligned/${sn}.primertrim.sorted.bam | ivar consensus -p ${result}/${sn}/${sn}.consensus -n N -i ${sn}
 ​
 grep -v ">" ${result}/${sn}/${sn}.consensus.fa | grep -o -E "C|A|T|G" | wc -l
 ​
 conda  deactivate

7. 使用Pangolin获取序列分型信息

代码语言:shell复制
 #conda检测环境是否存在,首次运行不存在创建该环境并安装软件
 if [ ! -d "${envs}/pangolin" ]; then
   mamba env create -f /opt/config/pangolin.yaml
 fi
 ​
 source activate pangolin
 ​
 pangolin ${result}/${sn}/${sn}.consensus.fa --outdir ${result}/${sn} 
 ​
 conda  deactivate

8. 获取突变信息

代码语言:shell复制
 source activate SC2
 ​
 if [ ! -f "/opt/ref/GCF_009858895.2_ASM985889v3_genomic.gff" ]; then
   aria2c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gff.gz -d /opt/ref
   gzip -f  -d /opt/ref/GCF_009858895.2_ASM985889v3_genomic.gff.gz
 fi
 ​
 samtools mpileup -aa -A -d 0 -B -Q 0 
   --reference /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta 
   ${result}/${sn}/aligned/${sn}.primertrim.sorted.bam | 
   ivar variants -p ${result}/${sn}/${sn}_variants -q 30 -t 0.01 -m 0 
   -r /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta 
   -g /opt/ref/GCF_009858895.2_ASM985889v3_genomic.gff
 ​
 ivar filtervariants -p ${result}/${sn}/${sn}_variantsfiltered ${result}/${sn}/${sn}_variants.tsv
   
 conda  deactivate

9. 获取quast报告及bam覆盖度、测序深度、baseQ、mapQ等质控信息

代码语言:shell复制
 source activate SC2
 ​
 if [ ! -f "/opt/ref/GCF_009858895.2_ASM985889v3_genomic.gff" ]; then
   aria2c https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/009/858/895/GCF_009858895.2_ASM985889v3/GCF_009858895.2_ASM985889v3_genomic.gff.gz -d /opt/ref
   gzip -f  -d /opt/ref/GCF_009858895.2_ASM985889v3_genomic.gff.gz
 fi
 ​
 quast ${result}/${sn}/${sn}.consensus.fa 
   -r /opt/ref/artic-ncov2019/primer_schemes/nCoV-2019/${artic_primer_version}/nCoV-2019.reference.fasta 
   --features /opt/ref/GCF_009858895.2_ASM985889v3_genomic.gff 
   --ref-bam  ${result}/${sn}/aligned/${sn}.sorted.bam 
   --output-dir ${result}/${sn}/quast
 ​
 samtools coverage ${result}/${sn}/aligned/${sn}.sorted.bam -o ${result}/${sn}/aligned/${sn}.samcov.tsv
 ​
 conda  deactivate

10. 报告(可选)

11. 使用IGV Browser查看突变信息(可选)

四. 参考链接

https://github.com/artic-network/artic-ncov2019

https://github.com/CDCgov/SARS-CoV-2_Sequencing/tree/master/protocols/BFX-UT_ARTIC_Illumina

0 人点赞