Snakemake learning notes 007: submitting jobs to a SLURM cluster
Main reference:
https://eriqande.github.io/eca-bioinf-handbook/snakemake-chap.html
This note covers filtering raw sequencing data with fastp.
Contents of the Snakefile:
input_folder = "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/samples/"
output_folder = "/home/myan/scratch/private/practice_data/RNAseq/20220511/"

# collect the SRR accessions and read numbers from the input file names
SRR, FRR = glob_wildcards(input_folder + "{srr}_chrX_{frr}.fastq.gz")

rule all:
    input:
        expand(output_folder + "outputfastq/{srr}_chrX_{frr}.fastq", srr=SRR, frr=FRR)

rule first:
    input:
        read01 = input_folder + "{srr}_chrX_1.fastq.gz",
        read02 = input_folder + "{srr}_chrX_2.fastq.gz"
    output:
        read01 = output_folder + "outputfastq/{srr}_chrX_1.fastq",
        read02 = output_folder + "outputfastq/{srr}_chrX_2.fastq",
        json = output_folder + "fastpreport/{srr}.json",
        html = output_folder + "fastpreport/{srr}.html"
    threads: 8
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} \
        --thread {threads} --html {output.html} --json {output.json}
        """
The command to run it:
snakemake --cluster 'sbatch --cpus-per-task={threads}' --jobs 12 -s snakemake_hpc.py
It finished in a flash.
I then tried a longer command that also writes the SLURM logs to specific folders:
snakemake --cluster 'sbatch --cpus-per-task={threads} -o slurm_outputs/{rule}_{wildcards}_%j.out -e logs_errors/{rule}/{rule}_{wildcards}_%j.err --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py
This command never worked for me. My guess is that the slurm_outputs/ and logs_errors/{rule}/ directories did not exist yet: sbatch does not create the directories passed to -o/-e, so the submitted jobs fail as soon as they try to open their log files.
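A sketch of the fix I would try (untested): create the log directories first, then rerun the same command. Here logs_errors/first uses the rule name first from the Snakefile above.

mkdir -p slurm_outputs logs_errors/first
snakemake --cluster 'sbatch --cpus-per-task={threads} -o slurm_outputs/{rule}_{wildcards}_%j.out -e logs_errors/{rule}/{rule}_{wildcards}_%j.err --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py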
The command below, which only adds the email notification, does work:
snakemake --cluster 'sbatch --cpus-per-task={threads} --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py
I did not run into memory limits with this test data, but I did once I switched to my real data.
(screenshot: the storage layout of the raw fastq files)
My files are organised in the hierarchy shown above: one sub-folder per project, with that project's fastq files inside. With the earlier wildcard approach, expand() would build the full cross product of the wildcards and ask for non-existent combinations such as PRJNA001/SRR0002_1.fastq.gz. So the question here is how to control which combinations expand() generates.
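A small illustration of the difference (inside a Snakefile, where expand() is available; the folder and sample names are made up): by default expand() takes the Cartesian product of its wildcard lists, while passing zip as the combinator pairs them element-wise, so only combinations that really exist are requested.

# illustration only: default expand() vs expand() with the zip combinator
EXPER_DEMO = ["PRJNA001", "PRJNA002"]
SRR_DEMO = ["SRR0001", "SRR0002"]

# product (default): 4 paths, including the non-existent PRJNA001/SRR0002_1.fastq.gz
print(expand("{exper}/{srr}_1.fastq.gz", exper=EXPER_DEMO, srr=SRR_DEMO))

# zip: 2 paths, pairing the two lists element-wise
# ['PRJNA001/SRR0001_1.fastq.gz', 'PRJNA002/SRR0002_1.fastq.gz']
print(expand("{exper}/{srr}_1.fastq.gz", zip, exper=EXPER_DEMO, srr=SRR_DEMO))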
The task itself is still the same, filtering raw sequencing data with fastp. Here is the Snakefile:
import os
import glob

raw_fastq_folder = "/mnt/sdc/xiaoming/MingYan/snakemake_20220513/00.raw.fastq/"
output_folder = "/mnt/sdc/xiaoming/MingYan/snakemake_20220513/"

# one sub-folder per experiment; collect the SRR accessions found in each of them
fq_list = {}
print(os.listdir(raw_fastq_folder))
experiment = os.listdir(raw_fastq_folder)
for i in experiment:
    fq_list[i] = [fq.split("_")[0] for fq in os.listdir(os.path.join(raw_fastq_folder, i))]
print(fq_list)

# flatten the dict into (experiment, srr) pairs for expand(..., zip, ...)
inputs = [(dir, file) for dir, files in fq_list.items() for file in files]

#glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq")

rule all:
    input:
        expand(output_folder + "01.fastp.report/" + "{exper}/{srr}.html", zip,
               exper=[row[0] for row in inputs], srr=[row[1] for row in inputs])

rule firstrule:
    input:
        read01 = raw_fastq_folder + "{exper}/{srr}_1.fastq.gz",
        read02 = raw_fastq_folder + "{exper}/{srr}_2.fastq.gz"
    output:
        read01 = output_folder + "01.fastp.filter/" + "{exper}/{srr}_clean_1.fastq.gz",
        read02 = output_folder + "01.fastp.filter/" + "{exper}/{srr}_clean_2.fastq.gz",
        html = output_folder + "01.fastp.report/" + "{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/" + "{exper}/{srr}.json"
    threads: 2
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} --json {output.json} --html {output.html} -w {threads}
        """
The main reference for this part is https://stackoverflow.com/questions/68648311/snakemake-6-0-5-input-a-list-of-folders-and-multiple-files-from-each-folder-to
The code that pairs each folder with its files still feels a bit long; I don't know whether there is a simpler way, though one option might be the commented-out glob_wildcards() line, as sketched below.
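A sketch of that idea (untested here): glob_wildcards() also matches files inside sub-folders and returns the matched values as parallel lists, so they can go straight into expand(..., zip, ...) once the duplicates from having a _1 and a _2 file per sample are removed.

# sketch only: derive (experiment, SRR) pairs directly with glob_wildcards()
EXPER, SRR, FRR = glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq.gz")

# each sample is matched twice (read 1 and read 2), so deduplicate the pairs
pairs = sorted(set(zip(EXPER, SRR)))

rule all:
    input:
        expand(output_folder + "01.fastp.report/{exper}/{srr}.html", zip,
               exper=[p[0] for p in pairs], srr=[p[1] for p in pairs])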
Some of the answers also use lambda functions for the input section, so I still need to read up on how lambda is used; a minimal sketch of that pattern follows.
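This is only an illustration of the lambda (input function) pattern, assuming the same folder layout and variables as above; it is not meant to be dropped into the Snakefile next to firstrule, since the outputs would clash. The lambda is called once per job with that job's wildcards and returns the matching input paths.

# illustration of an input function written as a lambda
rule fastp_with_lambda:
    input:
        read01 = lambda wildcards: os.path.join(raw_fastq_folder, wildcards.exper, wildcards.srr + "_1.fastq.gz"),
        read02 = lambda wildcards: os.path.join(raw_fastq_folder, wildcards.exper, wildcards.srr + "_2.fastq.gz")
    output:
        html = output_folder + "01.fastp.report/{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/{exper}/{srr}.json"
    shell:
        """
        fastp -i {input.read01} -I {input.read02} --html {output.html} --json {output.json}
        """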
After switching to my real dataset I ran into out-of-memory errors again, so I need to declare resources in the Snakefile.
How much memory does a job get by default? I still need to read the Snakemake documentation more carefully.
The code for my real data:
import os

raw_fastq_folder = "/mnt/shared/scratch/myan/private/pomeRTD/00.raw.fastq/"
output_folder = "/home/myan/scratch/private/pomeRTD/"

#Folder,SRR,FRR = glob_wildcards(raw_fastq_folder + "{folder}/{srr}_{frr}.fq.gz")
#print(Folder)
#experiment = os.listdir(raw_fastq_folder)

# one sub-folder per experiment; collect the SRR accessions found in each of them
list_fastq = {}
for experiment in os.listdir(raw_fastq_folder):
    list_fastq[experiment] = [x.split("_")[0] for x in os.listdir(os.path.join(raw_fastq_folder, experiment))]
print(list_fastq)

# flatten the dict into (experiment, srr) pairs for expand(..., zip, ...)
inputs = [(dir, file) for dir, files in list_fastq.items() for file in files]

#glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq")

rule all:
    input:
        expand(output_folder + "01.fastp.report/" + "{exper}/{srr}.html", zip,
               exper=[row[0] for row in inputs], srr=[row[1] for row in inputs])

rule runfastp:
    input:
        read01 = os.path.join(raw_fastq_folder, "{exper}", "{srr}_1.fq.gz"),
        read02 = os.path.join(raw_fastq_folder, "{exper}", "{srr}_2.fq.gz")
    output:
        read01 = output_folder + "01.fastp.filtered.reads/{exper}/{srr}_clean_1.fq.gz",
        read02 = output_folder + "01.fastp.filtered.reads/{exper}/{srr}_clean_2.fq.gz",
        html = output_folder + "01.fastp.report/{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/{exper}/{srr}.json"
    threads: 8
    resources:
        mem = 8000
    params:
        "-q 20 --cut_front --cut_tail -l 30"
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} \
        -w {threads} -h {output.html} -j {output.json} {params}
        """
The 8000 here is in MB; I don't yet know how to write it in GB (sbatch itself does accept a unit suffix such as --mem=8G, so there should be a way).
The command to run this Snakefile:
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
Run this way, a big pile of job log files (the slurm-<jobid>.out files that sbatch writes by default) ends up in the current directory. How do I send them to a dedicated folder instead?
(screenshot: the generated slurm job log files)
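What I would try (untested): the earlier -o/-e idea should cover this, as long as the target folder exists before the jobs are submitted. Here slurm_logs is just an example folder name:

mkdir -p slurm_logs
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} -o slurm_logs/%j.out -e slurm_logs/%j.err --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py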
Another question: on an HPC system managed by SLURM, work is normally submitted with sbatch scripts.sh. Can the command
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
itself be written into a .sh file and submitted with sbatch? Worth trying.
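A sketch of what such a submission script could look like (the file name, job name and time limit are made up, and this assumes the cluster allows sbatch to be called from inside a running job, which is worth checking first). The master job only coordinates, so one CPU is enough, but it has to stay alive for as long as the whole workflow runs.

#!/bin/bash
#SBATCH --job-name=snakemake_master
#SBATCH --cpus-per-task=1
#SBATCH --time=7-00:00:00

# submit with: sbatch run_snakemake.sh
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py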
If I am not on a computing cluster, is there a way to set the number of jobs? As far as I know, when Snakemake runs locally it is the core count that matters, as in the example below.
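A minimal local run (no --cluster), where --cores caps how many CPU cores the workflow may use at once:

snakemake --cores 8 -s pomeRTD_snakemake_v01.py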
There is still a lot of basic material I need to go through.