我的需求是:
我有10个基因组,然后又12个转录组数据,然后将这个12个基因组数据分别比对到这个10个基因组,每个基因组得到12个bam文件,然后将每个基因组的12个bam文件合并 ,最终得到10个合并的bam文件
前面比对的步骤没有遇到问题
代码语言:javascript复制SAMPLES, = glob_wildcards("../../rnaseq/01.clean.reads/{sample}_clean_1.fq")
GENOMES, = glob_wildcards("{genome}.fa.masked")
GENOMES = [genome.split("/")[0] for genome in GENOMES if len(genome.split("/")) == 3]
GENOMES.remove("D1")
print(GENOMES)
print(SAMPLES)
print("Total genome size: ",len(GENOMES))
print("Total sample size: ",len(SAMPLES))
rule all:
input:
expand("{genome}/{genome}/ref_index.1.ht2",genome=GENOMES),
expand("{genome}/{genome}/{sample}.sam",genome=GENOMES,sample=SAMPLES),
expand("{genome}/{genome}/{sample}.sorted.bam.bai",genome=GENOMES,sample=SAMPLES),
expand("{genome}/{genome}/merged.bam",genome=GENOMES),
expand("{genome}/{genome}/merged.gtf",genome=GENOMES)
rule hisat2_build:
input:
ref = "{genome}/{genome}/{genome}.fa.masked"
output:
"{genome}/{genome}/ref_index.1.ht2"
threads:
8
resources:
mem_mb = 48000
params:
"{genome}/{genome}/ref_index"
shell:
"""
hisat2-build {input.ref} {params} -p {threads}
"""
rule hisat2_align:
input:
index = rules.hisat2_build.output,
r1 = "../../rnaseq/01.clean.reads/{sample}_clean_1.fq",
r2 = "../../rnaseq/01.clean.reads/{sample}_clean_2.fq"
output:
sam = "{genome}/{genome}/{sample}.sam"
threads:
8
resources:
mem_mb = 64000
params:
ref_index = rules.hisat2_build.params
shell:
"""
hisat2 -p {threads} --dta -x {params.ref_index}
-1 {input.r1} -2 {input.r2} -S {output.sam}
"""
rule samtools_sort:
input:
sam = rules.hisat2_align.output.sam
output:
bam = "{genome}/{genome}/{sample}.sorted.bam"
threads:
8
resources:
mem_mb = 64000
shell:
"""
samtools sort -@ {threads} -O bam -o {output.bam} {input.sam}
"""
rule samtools_index:
input:
rules.samtools_sort.output.bam
output:
"{genome}/{genome}/{sample}.sorted.bam.bai"
threads:
8
resources:
mem_mb = 24000
shell:
"""
samtools index {input}
"""
到合并的步骤最开始的写法是
代码语言:javascript复制rule samtools_merge:
input:
bams = expand(rules.samtools_sort.output.bam,genome=GENOMES,sample=SAMPLES),
bai = expand(rules.samtools_index.output,genome=GENOMES,sample=SAMPLES)
output:
"{genome}/{genome}/merged.bam"
threads:
16
resources:
mem_mb = 64000
shell:
"""
samtools merge -@ {threads} {output} {input.bams}
"""
这样写的问题是合并的时候每个基因组对应的是120个bam文件
然后换成下面的写法
代码语言:javascript复制rule samtools_merge:
input:
bams = expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES),
bai = expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)
output:
"{genome}/{genome}/merged.bam"
threads:
16
resources:
mem_mb = 64000
shell:
"""
samtools merge -@ {threads} {output} {input.bams}
"""
这个报错提示
name 'wildcards' is not defined
然后又换成下面的写法
代码语言:javascript复制def getGenomes(wildcards):
return wildcards.genome
rule samtools_merge:
input:
bams = expand(rules.samtools_sort.output.bam,genome=getGenome,sample=SAMPLES),
bai = expand(rules.samtools_index.output,genome=getGenome,sample=SAMPLES)
output:
"{genome}/{genome}/merged.bam"
threads:
16
resources:
mem_mb = 64000
shell:
"""
samtools merge -@ {threads} {output} {input.bams}
"""
这个还是报错,报错内容忘记截图了,而且报错很诡异
然后以关键词 snakemake lambda wildcards expand 搜索,找到了一个链接
https://stackoverflow.com/questions/45508579/snakemake-wildcards-or-expand-command
这里面提到
'wildcards' is not "directly" defined in the input of a rule. You need to use a [function of 'wildcards'](http://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#functions-as-input-files) instead. I'm not sure I understand exactly what you want to do, but you may try something like that.
这里提供了一个写法
代码语言:javascript复制def condition2tumorsamples(wildcards):
return expand(
"mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
sample=config['conditions'][wildcards.condition]['tumor'])
def condition2normalsamples(wildcards):
return expand(
"mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
sample=config['conditions'][wildcards.condition]['normal'])
rule gatk_RealignerTargetCreator:
input:
tumor = condition2tumorsamples,
normal = condition2normalsamples,
output:
"mapped_reads/merged_samples/{condition}.realign.intervals"
# remainder of the rule here...
安装这个思路我写我自己的
代码语言:javascript复制def getGenome01(wildcards):
return expand(rules.samtools_sort.output.bam,genome=wildcards.genome,sample=SAMPLES)
def getGenome02(wildcards):
return expand(rules.samtools_index.output,genome=wildcards.genome,sample=SAMPLES)
rule samtools_merge:
input:
bams = getGenome01,
bai = getGenome02
output:
"{genome}/{genome}/merged.bam"
threads:
16
resources:
mem_mb = 64000
shell:
"""
samtools merge -@ {threads} {output} {input.bams}
"""
这个是可以正常运行的
推文记录的是自己的学习笔记,内容可能会存在错误,请大家批判着看,欢迎大家指出其中的错误
欢迎大家关注我的公众号
小明的数据分析笔记本
小明的数据分析笔记本 公众号 主要分享:1、R语言和python做数据分析和数据可视化的简单小例子;2、园艺植物相关转录组学、基因组学、群体遗传学文献阅读笔记;3、生物信息学入门学习资料及自己的学习笔记!