Snakemake learning notes 007: submitting jobs to a SLURM cluster
Main reference:
https://eriqande.github.io/eca-bioinf-handbook/snakemake-chap.html
This note covers filtering raw sequencing data with fastp.
Contents of the Snakefile:
input_folder = "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/samples/"
output_folder = "/home/myan/scratch/private/practice_data/RNAseq/20220511/"

# collect the SRR accessions and read numbers from the input file names
SRR, FRR = glob_wildcards(input_folder + "{srr}_chrX_{frr}.fastq.gz")

rule all:
    input:
        expand(output_folder + "outputfastq/{srr}_chrX_{frr}.fastq", srr=SRR, frr=FRR)

rule first:
    input:
        read01 = input_folder + "{srr}_chrX_1.fastq.gz",
        read02 = input_folder + "{srr}_chrX_2.fastq.gz"
    output:
        read01 = output_folder + "outputfastq/{srr}_chrX_1.fastq",
        read02 = output_folder + "outputfastq/{srr}_chrX_2.fastq",
        json = output_folder + "fastpreport/{srr}.json",
        html = output_folder + "fastpreport/{srr}.html"
    threads: 8
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} \
        --thread {threads} --html {output.html} --json {output.json}
        """
The command to run it:
snakemake --cluster 'sbatch --cpus-per-task={threads}' --jobs 12 -s snakemake_hpc.py
It finished in a flash.
I then tried a longer command that also writes the SLURM logs to specific folders:
snakemake --cluster 'sbatch --cpus-per-task={threads} -o slurm_outputs/{rule}_{wildcards}_%j.out -e logs_errors/{rule}/{rule}_{wildcards}_%j.err --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py
This command never worked for me. My guess is that the slurm_outputs/ and logs_errors/{rule}/ directories did not exist yet: sbatch does not create the directories passed to -o/-e, so the submitted jobs fail as soon as they try to open their log files.
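A sketch of the fix I would try (untested): create the log directories first, then rerun the same command. Here logs_errors/first uses the rule name first from the Snakefile above.

mkdir -p slurm_outputs logs_errors/first
snakemake --cluster 'sbatch --cpus-per-task={threads} -o slurm_outputs/{rule}_{wildcards}_%j.out -e logs_errors/{rule}/{rule}_{wildcards}_%j.err --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py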
The command below, which only adds the email notification, does work:
snakemake --cluster 'sbatch --cpus-per-task={threads} --mail-type=ALL --mail-user=mingyan24@126.com' --jobs 4 -s snakemake_hpc.py
I did not run into memory limits with this test data, but I did once I switched to my real data.
(screenshot: the storage layout of the raw fastq files)
My files are organised in the hierarchy shown above: one sub-folder per project, with that project's fastq files inside. With the earlier wildcard approach, expand() would build the full cross product of the wildcards and ask for non-existent combinations such as PRJNA001/SRR0002_1.fastq.gz. So the question here is how to control which combinations expand() generates.
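A small illustration of the difference (inside a Snakefile, where expand() is available; the folder and sample names are made up): by default expand() takes the Cartesian product of its wildcard lists, while passing zip as the combinator pairs them element-wise, so only combinations that really exist are requested.

# illustration only: default expand() vs expand() with the zip combinator
EXPER_DEMO = ["PRJNA001", "PRJNA002"]
SRR_DEMO = ["SRR0001", "SRR0002"]

# product (default): 4 paths, including the non-existent PRJNA001/SRR0002_1.fastq.gz
print(expand("{exper}/{srr}_1.fastq.gz", exper=EXPER_DEMO, srr=SRR_DEMO))

# zip: 2 paths, pairing the two lists element-wise
# ['PRJNA001/SRR0001_1.fastq.gz', 'PRJNA002/SRR0002_1.fastq.gz']
print(expand("{exper}/{srr}_1.fastq.gz", zip, exper=EXPER_DEMO, srr=SRR_DEMO))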
The task itself is still the same, filtering raw sequencing data with fastp. Here is the Snakefile:
import os
import glob

raw_fastq_folder = "/mnt/sdc/xiaoming/MingYan/snakemake_20220513/00.raw.fastq/"
output_folder = "/mnt/sdc/xiaoming/MingYan/snakemake_20220513/"

# one sub-folder per experiment; collect the SRR accessions found in each of them
fq_list = {}
print(os.listdir(raw_fastq_folder))
experiment = os.listdir(raw_fastq_folder)
for i in experiment:
    fq_list[i] = [fq.split("_")[0] for fq in os.listdir(os.path.join(raw_fastq_folder, i))]
print(fq_list)

# flatten the dict into (experiment, srr) pairs for expand(..., zip, ...)
inputs = [(dir, file) for dir, files in fq_list.items() for file in files]

#glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq")

rule all:
    input:
        expand(output_folder + "01.fastp.report/" + "{exper}/{srr}.html", zip,
               exper=[row[0] for row in inputs], srr=[row[1] for row in inputs])

rule firstrule:
    input:
        read01 = raw_fastq_folder + "{exper}/{srr}_1.fastq.gz",
        read02 = raw_fastq_folder + "{exper}/{srr}_2.fastq.gz"
    output:
        read01 = output_folder + "01.fastp.filter/" + "{exper}/{srr}_clean_1.fastq.gz",
        read02 = output_folder + "01.fastp.filter/" + "{exper}/{srr}_clean_2.fastq.gz",
        html = output_folder + "01.fastp.report/" + "{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/" + "{exper}/{srr}.json"
    threads: 2
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} --json {output.json} --html {output.html} -w {threads}
        """
The main reference for this part is https://stackoverflow.com/questions/68648311/snakemake-6-0-5-input-a-list-of-folders-and-multiple-files-from-each-folder-to
The code that pairs each folder with its files still feels a bit long; I don't know whether there is a simpler way, though one option might be the commented-out glob_wildcards() line, as sketched below.
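A sketch of that idea (untested here): glob_wildcards() also matches files inside sub-folders and returns the matched values as parallel lists, so they can go straight into expand(..., zip, ...) once the duplicates from having a _1 and a _2 file per sample are removed.

# sketch only: derive (experiment, SRR) pairs directly with glob_wildcards()
EXPER, SRR, FRR = glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq.gz")

# each sample is matched twice (read 1 and read 2), so deduplicate the pairs
pairs = sorted(set(zip(EXPER, SRR)))

rule all:
    input:
        expand(output_folder + "01.fastp.report/{exper}/{srr}.html", zip,
               exper=[p[0] for p in pairs], srr=[p[1] for p in pairs])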
Some of the answers also use lambda functions for the input section, so I still need to read up on how lambda is used; a minimal sketch of that pattern follows.
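This is only an illustration of the lambda (input function) pattern, assuming the same folder layout and variables as above; it is not meant to be dropped into the Snakefile next to firstrule, since the outputs would clash. The lambda is called once per job with that job's wildcards and returns the matching input paths.

# illustration of an input function written as a lambda
rule fastp_with_lambda:
    input:
        read01 = lambda wildcards: os.path.join(raw_fastq_folder, wildcards.exper, wildcards.srr + "_1.fastq.gz"),
        read02 = lambda wildcards: os.path.join(raw_fastq_folder, wildcards.exper, wildcards.srr + "_2.fastq.gz")
    output:
        html = output_folder + "01.fastp.report/{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/{exper}/{srr}.json"
    shell:
        """
        fastp -i {input.read01} -I {input.read02} --html {output.html} --json {output.json}
        """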
After switching to my real dataset I ran into out-of-memory errors again, so I need to declare resources in the Snakefile.
How much memory does a job get by default? I still need to read the Snakemake documentation more carefully.
The code for my real data:
import os

raw_fastq_folder = "/mnt/shared/scratch/myan/private/pomeRTD/00.raw.fastq/"
output_folder = "/home/myan/scratch/private/pomeRTD/"

#Folder,SRR,FRR = glob_wildcards(raw_fastq_folder + "{folder}/{srr}_{frr}.fq.gz")
#print(Folder)
#experiment = os.listdir(raw_fastq_folder)

# one sub-folder per experiment; collect the SRR accessions found in each of them
list_fastq = {}
for experiment in os.listdir(raw_fastq_folder):
    list_fastq[experiment] = [x.split("_")[0] for x in os.listdir(os.path.join(raw_fastq_folder, experiment))]
print(list_fastq)

# flatten the dict into (experiment, srr) pairs for expand(..., zip, ...)
inputs = [(dir, file) for dir, files in list_fastq.items() for file in files]

#glob_wildcards(raw_fastq_folder + "{exper}/{srr}_{frr}.fastq")

rule all:
    input:
        expand(output_folder + "01.fastp.report/" + "{exper}/{srr}.html", zip,
               exper=[row[0] for row in inputs], srr=[row[1] for row in inputs])

rule runfastp:
    input:
        read01 = os.path.join(raw_fastq_folder, "{exper}", "{srr}_1.fq.gz"),
        read02 = os.path.join(raw_fastq_folder, "{exper}", "{srr}_2.fq.gz")
    output:
        read01 = output_folder + "01.fastp.filtered.reads/{exper}/{srr}_clean_1.fq.gz",
        read02 = output_folder + "01.fastp.filtered.reads/{exper}/{srr}_clean_2.fq.gz",
        html = output_folder + "01.fastp.report/{exper}/{srr}.html",
        json = output_folder + "01.fastp.report/{exper}/{srr}.json"
    threads: 8
    resources:
        mem = 8000
    params:
        "-q 20 --cut_front --cut_tail -l 30"
    shell:
        """
        fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} \
        -w {threads} -h {output.html} -j {output.json} {params}
        """
The 8000 here is in MB; I don't yet know how to write it in GB (sbatch itself does accept a unit suffix such as --mem=8G, so there should be a way).
The command to run this Snakefile:
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
Run this way, a big pile of job log files (the slurm-<jobid>.out files that sbatch writes by default) ends up in the current directory. How do I send them to a dedicated folder instead?
(screenshot: the generated slurm job log files)
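What I would try (untested): the earlier -o/-e idea should cover this, as long as the target folder exists before the jobs are submitted. Here slurm_logs is just an example folder name:

mkdir -p slurm_logs
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} -o slurm_logs/%j.out -e slurm_logs/%j.err --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py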
Another question: on an HPC system managed by SLURM, work is normally submitted with sbatch scripts.sh. Can the command
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py
itself be written into a .sh file and submitted with sbatch? Worth trying.
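A sketch of what such a submission script could look like (the file name, job name and time limit are made up, and this assumes the cluster allows sbatch to be called from inside a running job, which is worth checking first). The master job only coordinates, so one CPU is enough, but it has to stay alive for as long as the whole workflow runs.

#!/bin/bash
#SBATCH --job-name=snakemake_master
#SBATCH --cpus-per-task=1
#SBATCH --time=7-00:00:00

# submit with: sbatch run_snakemake.sh
snakemake --cluster 'sbatch --cpus-per-task={threads} --mem={resources.mem} --mail-type=FAIL --mail-user=mingyan24@126.com' --jobs 8 -s pomeRTD_snakemake_v01.py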
If I am not on a computing cluster, is there a way to set the number of jobs? As far as I know, when Snakemake runs locally it is the core count that matters, as in the example below.
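A minimal local run (no --cluster), where --cores caps how many CPU cores the workflow may use at once:

snakemake --cores 8 -s pomeRTD_snakemake_v01.py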
There is still a lot of basic material I need to go through.