Snakemake: Data-dependent conditional execution of rules, IndexError
Posted: 2021-08-26 20:50:35
Question: When I execute the snakemake pipeline below, I get an error: IndexError: list index out of range. I believe this is because fastqc_pretrim is being executed for every entry in SAMPLES. However, not all samples pass the basecalling QC, so only some of the files need to be processed at this step. I am trying to use checkpoints to get this to work. Looking at the log, we can see that it is trying to run fastqc_pretrim for sample "FAQ20773_pass_barcode01_68fda206_1". However, if you look above that line in the log, FAQ20773_fail_barcode03_68fda206_0 is actually the only sample that has a passing .fastq.gz file. I am not sure why the correct sample is not being run.
Log:
snakemake --use-conda --jobs 1 -pr
['FAQ20773_fail_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_2', 'FAQ20773_fail_barcode03_68fda206_0', 'FAQ20773_fail_barcode02_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_1']
The flag 'directory' used in rule guppy_basecall_persample is only valid for outputs, not inputs.
Building DAG of jobs...
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode01_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_2
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode03_68fda206_0
['basecall/FAQ20773_fail_barcode03_68fda206_0/pass/fastq_runid_68fda20603fe08e9e2a4eef8718997203b603497_0_0.fastq.gz']
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_fail_barcode02_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_0
[]
Updating job all.
Updating job fastqc_pretrim.
basecall/FAQ20773_pass_barcode01_68fda206_1
[]
Updating job all.
Using shell: /usr/bin/bash
[Thu Aug 26 13:13:51 2021]
rule fastqc_pretrim:
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log
jobid: 19
reason: Missing output files: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
wildcards: sample=FAQ20773_pass_barcode01_68fda206_1
resources: tmpdir=/tmp
/home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py
Activating conda environment: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4
Traceback (most recent call last):
  File "/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py", line 41, in <module>
    shell(
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/shell.py", line 130, in __new__
    cmd = format(cmd, *args, stepout=2, **kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/utils.py", line 427, in format
    return fmt.format(_pattern, *args, **variables)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 161, in format
    return self.vformat(format_string, args, kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 165, in vformat
    result, _ = self._vformat(format_string, args, kwargs, used_args, 2)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 205, in _vformat
    obj, arg_used = self.get_field(field_name, args, kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 278, in get_field
    obj = obj[i]
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/io.py", line 1536, in __getitem__
    return super().__getitem__(key)
IndexError: list index out of range
[Thu Aug 26 13:13:52 2021]
Error in rule fastqc_pretrim:
jobid: 19
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log (check log file(s) for error message)
conda-env: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4
RuleException:
CalledProcessError in line 60 of /mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile:
Command 'source /home/hvasquezgross/miniconda3/bin/activate '/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4'; /home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py' returned non-zero exit status 1.
File "/mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile", line 60, in __rule_fastqc_pretrim
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/concurrent/futures/thread.py", line 52, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
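The traceback ends inside Python's string formatter (get_field, obj = obj[i]), which suggests that the wrapper's command template indexes into the job's input list. For the sample that failed, aggregate_input printed [], so that list is empty. The following is a minimal sketch of the same failure in plain Python. The template text is an assumption made for illustration; the actual template used by the fastqc wrapper is not shown in the post.

```python
# Hypothetical command template (assumption); it indexes the first input file.
template = "fastqc {input[0]}"


def render(inputs):
    """Format the template and report the IndexError raised by an empty list."""
    try:
        return template.format(input=inputs)
    except IndexError as err:
        return f"IndexError: {err}"


# One input file: the template can be filled in.
with_file = render(["basecall/s1/pass/run_0.fastq.gz"])
# No input files, as for the samples whose list printed as []: formatting fails.
without_file = render([])

print(with_file)
print(without_file)
```

If this reading is right, the error is raised for any sample whose input function returns an empty list, regardless of which sample actually has passing reads.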
Snakefile:
import glob

configfile: "config.yaml"

inputdirectory=config["directory"]
SAMPLES, = glob_wildcards(inputdirectory+"/{sample}.fast5", followlinks=True)
print(SAMPLES)

wildcard_constraints:
    sample="\w+\d+_\w+_\w+\d+_.+_\d"

##### target rules #####
rule all:
    input:
        expand('basecall/{sample}/sequencing_summary.txt', sample=SAMPLES),
        "qc/multiqc.html"

rule make_indvidual_samplefiles:
    input:
        inputdirectory+"/{sample}.fast5",
    output:
        "lists/{sample}.txt",
    shell:
        "basename {input} > {output}"

checkpoint guppy_basecall_persample:
    input:
        directory=directory(inputdirectory),
        samplelist="lists/{sample}.txt",
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    params:
        config["basealgo"]
    shell:
        "guppy_basecaller -i {input.directory} --input_file_list {input.samplelist} -s {output.directory} -c {params} --compress_fastq -x \"auto\" --gpu_runners_per_device 3 --num_callers 2 --chunks_per_runner 200"

def aggregate_input(wildcards):
    checkpoint_output = checkpoints.guppy_basecall_persample.get(**wildcards).output[1]
    print(checkpoint_output)
    exparr = expand("basecall/{sample}/pass/{runid}.fastq.gz", sample=wildcards.sample,
        runid=glob_wildcards(os.path.join(checkpoint_output, "pass/", "{runid}.fastq.gz")).runid)
    print(exparr)
    return exparr

rule fastqc_pretrim:
    input:
        aggregate_input
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: ""
    log:
        "logs/fastqc_pretrim/{sample}.log"
    threads: 1
    wrapper:
        "0.77.0/bio/fastqc"

rule multiqc:
    input:
        #expand("basecall/{sample}.fastq.gz", sample=SAMPLES)
        expand("qc/fastqc_pretrim/{sample}_fastqc.zip", sample=SAMPLES)
    output:
        "qc/multiqc.html"
    params:
        "" # Optional: extra parameters for multiqc.
    log:
        "logs/multiqc.log"
    wrapper:
        "0.77.0/bio/multiqc"
Answer 1: I think you are making things more complicated than necessary by using checkpoint and wrapper. This is, more or less, what I would do:
rule guppy_basecall_persample:
    input:
        ...
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    shell:
        r"""
        guppy ...
        """

rule fastqc_pretrim:
    input:
        # plain path here: the directory() flag is only valid for outputs
        directory="basecall/{sample}/",
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip"
    shell:
        r"""
        fastqc {input.directory}/pass/*.fastq.gz
        """
Comments:
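I think the problem with this approach is that not all samples pass the filtering stage. Not every sample ends up with a pass folder, so I need to conditionally process only the samples that passed. I was hoping to use checkpoints to re-evaluate the SAMPLE names and only process the ones that passed.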
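The conditional selection described in this comment could be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper name samples_with_pass_reads is hypothetical, and the directory layout (basecall/<sample>/pass/*.fastq.gz) is taken from the paths printed in the log above.

```python
import tempfile
from pathlib import Path


def samples_with_pass_reads(basecall_dir, samples):
    """Return the samples that have at least one pass/*.fastq.gz file.

    Hypothetical helper; the layout <basecall_dir>/<sample>/pass/ is assumed.
    """
    kept = []
    for sample in samples:
        pass_dir = Path(basecall_dir) / sample / "pass"
        # Samples without a pass folder, or with an empty one, are skipped.
        if pass_dir.is_dir() and any(pass_dir.glob("*.fastq.gz")):
            kept.append(sample)
    return kept


# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "sample_a" / "pass"
    good.mkdir(parents=True)
    (good / "run_0.fastq.gz").touch()
    (Path(tmp) / "sample_b").mkdir()  # basecalled, but no reads passed
    kept = samples_with_pass_reads(tmp, ["sample_a", "sample_b"])

print(kept)
```

In a Snakefile, a helper like this might be called from the input function of the multiqc rule, after checkpoints.guppy_basecall_persample.get(sample=...) has been called for each sample, so that fastqc_pretrim outputs are requested only for the samples it returns. Whether that fits the rest of the workflow has not been verified here.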