使用snakemake编写生信分析流程

The Snakemake workflow management system is a tool to create reproducible and scalable data analyses. Workflows are described via a human readable, Python based language. They can be seamlessly scaled to server, cluster, grid and cloud environments, without the need to modify the workflow definition. Finally, Snakemake workflows can entail a description of required software, which will be automatically deployed to any execution environment.

通过官网的介绍，可知snakemake是一个python包，所以可以在snakemake脚本中使用任何python语法。下边是snakemake中的一些概念。

rule

脚本中的一步小的分析叫做rule，名字可以随便起，但是不能重名，也要符合python变量命名规范。

比如这一步使用fastp软件对fastq文件去接头，因为是单端测序，所以可以命名为fastp_se，但是这不是强制的，完全可以命名为abcd。

代码语言：Python复制

rule fastp_se:
    input:
        sample=get_fastq
    output:
        trimmed=temp("results/trimmed/{s}_{u}.fastq.gz"),
        html=temp("report/{s}_{u}.fastp.html"),
        json=temp("report/{s}_{u}.fastp.json"),
    log:
        "logs/fastp/{s}_{u}.log"
    threads: 16
    wrapper:
        config["warpper_mirror"] "bio/fastp"

运行上边的脚本后的日志文件

代码语言：Python复制

rule fastp_se:
    input: raw/GSM6001951_L3.fastq.gz
    output: results/trimmed/GSM6001951_L3.fastq.gz, report/GSM6001951_L3.fastp.html, report/GSM6001951_L3.fastp.json
    log: logs/fastp/GSM6001951_L3.log
    jobid: 30
    reason: Missing output files: report/GSM6001951_L3.fastp.json, results/trimmed/GSM6001951_L3.fastq.gz
    wildcards: s=GSM6001951, u=L3
    threads: 8
    resources: tmpdir=/tmp

temp

fastpse中的输出文件`trimmed=temp("results/trimmed/{s}{u}.fastq.gz")`，表示生成的fastq.gz输出的文件是临时文件，当所有rule用完这个文件后，就会被删除，这样做可以节约空间。

wildcard

snakemake使用正则表达式匹配文件名，比如下边的代码fastpse脚本中，我们使用{s}{u}去代替两个字符串，而且我们也可以对这两个字符串的内容进行限制。s只能是GSM6001951或GSM6001952，|就是正则表达式中或的意思；u只能是L1-L4，如果你的样本分成了多个fastq文件那么可以用u指定样本后边的lane等信息。

s和u，是我随便写的，你完全可以写成a和b

这一步也就相当于我们用了for循环对GSM6001951和GSM6001952两个样本8个文件执行fastp。

代码语言：Python复制

wildcard_constraints:
    s="|".join(["GSM6001951","GSM6001952"]),
    u="|".join(["L1","L2""L3""L4"])

所以fastp_se中的输入文件只能匹配到如下结果，这也刚好是我raw文件夹下的4个需要分析的文件。

代码语言：Python复制

GSM6001951_L1.fastq.gz
GSM6001951_L2.fastq.gz
GSM6001951_L3.fastq.gz
GSM6001951_L4.fastq.gz
GSM6001952_L1.fastq.gz
GSM6001952_L2.fastq.gz
GSM6001952_L3.fastq.gz
GSM6001952_L4.fastq.gz

在log日志中可以看wildcard匹配到的内容是否与自己所设计的一致

wrapper

wrapper是snakemake官方仓库中写好的分析代码，比如上边的fastp软件，我们不需要写fastp的命令行代码，只需要用下边的代码就可以。

代码语言：Python复制

wrapper:
    "v1.29.0/bio/fastp"

其实这一步相当于从github下载了作者写好的环境文件environment.yaml，conda会建一个虚拟环境，仅提供给fastp使用。

代码语言：YAML复制

channels:
  - conda-forge
  - bioconda
  - nodefaults
dependencies:
  - fastp =0.23.2

接下来下载从github下载了作者写好的wrapper.py文件，虽然很长，其实就是一个判断你输入内容，然后交给fastp去执行的python脚本，所以我们需要按照作者的要求提供输入和输出文件名字，以及适当的额外参数。

代码语言：Python复制

from snakemake.shell import shell
import re

extra = snakemake.params.get("extra", "")
adapters = snakemake.params.get("adapters", "")
log = snakemake.log_fmt_shell(stdout=True, stderr=True)

#######省略很多行#######

shell(
    "(fastp --thread {snakemake.threads} "
    "{extra} "
    "{adapters} "
    "{reads} "
    "{trimmed} "
    "{json} "
    "{html} ) {log}"
)

虽然这两个文本文件都很小，但是因为github不稳定，可能流程就会中断，因此我把github的snakemake-wrappers镜像到了中国的极狐jihulab.com，只需要在原来的wrapper前面写上我的仓库地址就OK了。

代码语言：Python复制

wrapper:
    "https://jihulab.com/BioQuest/snakemake-wrappers/raw/" "v1.29.0/bio/fastp"

reason

我第一写完流程跑的时候发现日志文件中写着reason: Missing output files，我以为是因为我的语法不标准或者错误，导致报错，但是后边的流程都执行了，这一步的输出文件也正常。

后来才知道，reason不是推测的意思，而是名词原因的意思，这一步为什么会执行，因为输出文件不在指定的位置，换言之，如果我们跑完fastp_se后中断了snakemake流程，下次在接着跑流程，是不会跑fastp_se这一步的，因为这一步运行后输出了正确的文件results/trimmed/GSM6001951_L3.fastq

代码语言：Python复制

reason: Missing output files: results/trimmed/GSM6001951_L3.fastq.gz

rule all

snakemake的rules的执行顺序是：如果rule1的输出是rule2的输入那么，他们是串联关系，如果没有这种输入和输出依赖关系，那么rules可以并联同时执行。

所以如果rule1的输出在之后的rule中没有用到，那么就应该写在rule all中，否则，rule1不会被执行。

代码语言：Python复制

rule all:
    input:
        genome_prefix "STAR",
        genome_prefix "genome.fa.fai",
        "results/star/star.csv.gz",
        "report/qc_plot.pdf",
        "report/align_multiqc.html",
        "report/fastp_multiqc.html",
        expand("results/DEA/DEA_res_{_}.csv.gz",_=get_contrast())

retries

因为有些文件很大，下载过程可能出错，所以可以用retries多尝试几次

代码语言：Python复制

rule get_genome:
  output:
    genome_prefix "genome.fa"
  log:
    "logs/genome/get_genome.log"
  retries: 50
  params:
    species=config["genome"].get("species"),
    datatype=config["genome"].get("datatype"),
    build=config["genome"].get("build"),
    release=config["genome"].get("release"),
  cache:
    "omit-software"
  wrapper:
    config["warpper_mirror"] "bio/reference/ensembl-sequence"

config

一般情况下需要把配置参数写在config/config.yaml文件中，在snakemake流程中，读入的config是一个嵌套字典，而且config是全局变量

代码语言：YAML复制

samples: config/samples.tsv

genome:
    dir: /home/victor/DataHub/Genomics
    datatype: dna
    species: homo_sapiens
    build: GRCh38
    release: "109"
    flavor: chr_patch_hapl_scaff

fastqs:
    pe: false # are they pair end?
    dir: raw
qc_plot:
    n_genes: 50
    ellipses: true
    ellipse_size: 0.8
dea:
    active: true
    dea_vs:
        - [hSETDB1_sg1,Control_sg2]
        - [Control_sg2,hSETDB1_sg1]
    logFC: 0.585
    pvalue: 0.05

warpper_mirror: https://jihulab.com/BioQuest/snakemake-wrappers/raw/v1.29.0/

snakemake读取config/config.yaml文件

代码语言：Python复制

configfile: "config/config.yaml"

env

创建smk环境，用于运行snakemake流程

.condarc文件设置

代码语言：YAML复制

channel_priority: strict
channels:
  - defaults
show_channel_urls: true
default_channels:
  - https://mirrors.sustech.edu.cn/anaconda/pkgs/main
  - https://mirrors.sustech.edu.cn/anaconda/pkgs/free
  - https://mirrors.sustech.edu.cn/anaconda/pkgs/r
  - https://mirrors.sustech.edu.cn/anaconda/pkgs/pro
  - https://mirrors.sustech.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.sustech.edu.cn/anaconda/cloud
  msys2: https://mirrors.sustech.edu.cn/anaconda/cloud
  bioconda: https://mirrors.sustech.edu.cn/anaconda/cloud
  menpo: https://mirrors.sustech.edu.cn/anaconda/cloud
  pytorch: https://mirrors.sustech.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.sustech.edu.cn/anaconda/cloud
  nvidia: https://mirrors.sustech.edu.cn/anaconda-extra/cloud

smk.yaml环境文件

代码语言：YAML复制

channels:
  - conda-forge
  - bioconda
dependencies:
  - python=3.10
  - ipython
  - graphviz
  - numpy
  - pandas
  - snakemake

创建虚拟环境smk

代码语言：shell复制

mamba env create --name smk --file smk.yaml

生物基因

0 人点赞