Snakemake: Data-dependent conditional execution of rules, IndexError

Posted: 2021-08-26 20:50:35

Question:

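When I execute the snakemake pipeline below, I get an error: IndexError: list index out of range. I believe this is because fastqc_pretrim is being executed for every entry in SAMPLES. However, not all samples pass the basecalling QC, so only some of the files need to be processed at this step. I am trying to use checkpoints to get this running. Looking at the log, we can see that it is trying to run fastqc_pretrim for the sample "FAQ20773_pass_barcode01_68fda206_1". However, if you look further up in the log, FAQ20773_fail_barcode03_68fda206_0 is actually the only sample that has a passing .fastq.gz file. I am not sure why the correct sample is not the one being run.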

Log:

snakemake --use-conda --jobs 1 -pr
['FAQ20773_fail_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_2', 'FAQ20773_fail_barcode03_68fda206_0', 'FAQ20773_fail_barcode02_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_1']
The flag 'directory' used in rule guppy_basecall_persample is only valid for outputs, not inputs.        
Building DAG of jobs...                                                                                                                                                                                  
Updating job fastqc_pretrim.                                                                                                                                                                           
basecall/FAQ20773_fail_barcode01_68fda206_0                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_pass_barcode01_68fda206_2                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_fail_barcode03_68fda206_0                                                                                                                                                               
['basecall/FAQ20773_fail_barcode03_68fda206_0/pass/fastq_runid_68fda20603fe08e9e2a4eef8718997203b603497_0_0.fastq.gz']                                                                                    
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_fail_barcode02_68fda206_0                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_pass_barcode01_68fda206_0                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_pass_barcode01_68fda206_1                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Using shell: /usr/bin/bash                   

[Thu Aug 26 13:13:51 2021]                                                                                                                                                                                
rule fastqc_pretrim:                                                                                                                                                                                          
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip                                                                        
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log                                                                                                                                           
jobid: 19                                                                                                                                                                                                 
reason: Missing output files: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip                                                                                                             
wildcards: sample=FAQ20773_pass_barcode01_68fda206_1                                                                                                                                                      
resources: tmpdir=/tmp                                                                                                                                                                                                                                                                                                                                                                                          
/home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py                                               
Activating conda environment: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4                                                                     
Traceback (most recent call last):
  File "/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py", line 41, in <module>
    shell(
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/shell.py", line 130, in __new__
    cmd = format(cmd, *args, stepout=2, **kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/utils.py", line 427, in format
    return fmt.format(_pattern, *args, **variables)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 161, in format
    return self.vformat(format_string, args, kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 165, in vformat
    result, _ = self._vformat(format_string, args, kwargs, used_args, 2)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 205, in _vformat
    obj, arg_used = self.get_field(field_name, args, kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 278, in get_field
    obj = obj[i]
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/io.py", line 1536, in __getitem__
    return super().__getitem__(key)
IndexError: list index out of range
[Thu Aug 26 13:13:52 2021]                                                                                                                                                                                
Error in rule fastqc_pretrim:                                                                                                                                                                                 
jobid: 19                                                                                                                                                                                                 
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip                                                                        
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log (check log file(s) for error message)                                                                                                     
conda-env: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4                                                                                                                                                                                                                                                                                              
RuleException:                                                                                                                                                                                            
CalledProcessError in line 60 of /mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile:                                                                                                         
Command 'source /home/hvasquezgross/miniconda3/bin/activate '/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4'; /home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py' returned non-zero exit status 1.                                                  
File "/mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile", line 60, in __rule_fastqc_pretrim                                                                                                 
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/concurrent/futures/thread.py", line 52, in run                                                                                        
Shutting down, this might take some time.                                                                                                                                                                 
Exiting because a job execution failed. Look above for error message  
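
The traceback above ends inside Python's string formatting (get_field, then obj[i]), which suggests that the wrapper's command template indexes into the job's input list while that list is empty (the log prints [] for this sample). The following is a minimal sketch of that failure mode outside Snakemake. The name command_template and the exact field {input[0]} are assumptions made for illustration, not the wrapper's actual source.

```python
# Minimal reproduction of the failure mode, independent of Snakemake.
# `command_template` is a hypothetical stand-in for a wrapper's shell command.
command_template = "fastqc {input[0]}"


def render(inputs):
    """Fill the template; str.format resolves `input[0]` by indexing the list."""
    return command_template.format(input=inputs)


# A sample with one passing file: formatting succeeds.
print(render(["basecall/sampleA/pass/reads.fastq.gz"]))

# A sample with no passing files: the list is empty, so index 0 does not exist.
try:
    render([])
except IndexError as err:
    print(f"IndexError: {err}")
```

If this reading is right, the job itself is scheduled correctly; it fails because it was requested for a sample whose input function returned no files.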

Snakefile:

import glob

configfile: "config.yaml"

inputdirectory = config["directory"]
SAMPLES, = glob_wildcards(inputdirectory + "/{sample}.fast5", followlinks=True)
print(SAMPLES)

wildcard_constraints:
    sample=r"\w+\d+_\w+_\w+\d+_.+_\d"

##### target rules #####
rule all:
    input:
        expand('basecall/{sample}/sequencing_summary.txt', sample=SAMPLES),
        "qc/multiqc.html"

rule make_indvidual_samplefiles:
    input:
        inputdirectory + "/{sample}.fast5",
    output:
        "lists/{sample}.txt",
    shell:
        "basename {input} > {output}"


checkpoint guppy_basecall_persample:
    input:
        # Snakemake warns about directory() on an input (see the log above)
        directory=directory(inputdirectory),
        samplelist="lists/{sample}.txt",
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    params:
        config["basealgo"]
    shell:
        "guppy_basecaller -i {input.directory} --input_file_list {input.samplelist} -s {output.directory} -c {params} --compress_fastq -x \"auto\" --gpu_runners_per_device 3 --num_callers 2 --chunks_per_runner 200"


def aggregate_input(wildcards):
    # output[1] is the per-sample basecall directory produced by the checkpoint
    checkpoint_output = checkpoints.guppy_basecall_persample.get(**wildcards).output[1]
    print(checkpoint_output)
    # The result is an empty list when the sample has no pass/*.fastq.gz files
    exparr = expand("basecall/{sample}/pass/{runid}.fastq.gz", sample=wildcards.sample,
        runid=glob_wildcards(os.path.join(checkpoint_output, "pass/", "{runid}.fastq.gz")).runid)
    print(exparr)
    return exparr

rule fastqc_pretrim:
    input:
        aggregate_input
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: ""
    log:
        "logs/fastqc_pretrim/{sample}.log"
    threads: 1
    wrapper:
        "0.77.0/bio/fastqc"

rule multiqc:
    input:
        #expand("basecall/{sample}.fastq.gz", sample=SAMPLES)
        expand("qc/fastqc_pretrim/{sample}_fastqc.zip", sample=SAMPLES)
    output:
        "qc/multiqc.html"
    params:
        ""  # Optional: extra parameters for multiqc.
    log:
        "logs/multiqc.log"
    wrapper:
        "0.77.0/bio/multiqc"

Comments:

Answer 1:

I think you are making things more complicated than necessary by using checkpoint and wrapper. This is what I would do, more or less:

rule guppy_basecall_persample:
    input:
        ...
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    shell:
        r"""
        guppy ...
        """

rule fastqc_pretrim:
    input:
        # plain path here: the directory() flag is only valid for outputs
        directory="basecall/{sample}/",
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip"
    shell:
        r"""
        fastqc {input.directory}/pass/*.fastq.gz
        """

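Comments:

I think the problem with this approach is that not all samples pass the filtering stage. Not every sample ends up with a pass folder, so I need to conditionally process only the samples that passed. I was hoping to use checkpoints to re-evaluate the SAMPLES names and process only the ones that passed.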
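
One possible way to act on this comment, offered as a sketch rather than a tested fix: once the basecalling checkpoints have finished, the aggregating step could request FastQC reports only for samples whose output directory actually contains passing reads. The helper below is plain Python. The name samples_with_pass_reads is hypothetical, and the basecall/<sample>/pass/*.fastq.gz layout is taken from the log in the question.

```python
from pathlib import Path


def samples_with_pass_reads(basecall_dir, samples):
    """Return the samples that have at least one pass/*.fastq.gz file.

    `basecall_dir` is the directory holding one sub-directory per sample
    (assumed layout: <basecall_dir>/<sample>/pass/<runid>.fastq.gz).
    """
    kept = []
    for sample in samples:
        pass_dir = Path(basecall_dir) / sample / "pass"
        # Samples that failed QC may have no pass directory at all.
        if pass_dir.is_dir() and any(pass_dir.glob("*.fastq.gz")):
            kept.append(sample)
    return kept
```

Inside the Snakefile, an input function for multiqc could call checkpoints.guppy_basecall_persample.get(sample=s) for each sample and then use a helper like this to build the list of qc/fastqc_pretrim/{sample}_fastqc.zip targets, so that fastqc_pretrim is never requested for a sample with an empty input list. That wiring is not part of the thread and has not been verified against this pipeline.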
