KeyError: 'answers' error when using BioASQ dataset using Huggingface Transformers

Posted 2023-03-29


Asked 2020-07-11 12:04:17

Question:

I am using run_squad.py https://github.com/huggingface/transformers/blob/master/examples/run_squad.py from Huggingface Transformers to fine-tune on the BioASQ question answering dataset.

I have converted the TensorFlow weights provided by the authors of BioBERT https://github.com/dmis-lab/bioasq-biobert to PyTorch, as described here: https://github.com/huggingface/transformers/issues/312.

I am also using the preprocessed BioASQ data https://github.com/dmis-lab/bioasq-biobert, which has been converted to SQuAD format. However, when I run the run_squad.py script with the following parameters:

 --model_type bert \
  --model_name_or_path /scratch/oe7/uk1594/BioBERT/BioBERT-PyTorch/BioBERTv1.1-SQuADv1.1-Factoid-PyTorch/ \
  --do_train \
  --do_eval \
  --save_steps 1000 \
  --train_file $data/BioASQ-train-factoid-6b.json \
  --predict_file $data/BioASQ-test-factoid-6b-1.json \
  --per_gpu_train_batch_size 12 \
  --learning_rate 3e-5 \
  --num_train_epochs 2.0 \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /scratch/oe7/uk1594/BioBERT/BioBERT-PyTorch/QA_output_squad/BioASQ-factoid-6b/BioASQ-factoid-6b-1-issue-23mar/


I get the below error:

03/23/2020 12:53:12 - INFO - transformers.modeling_utils -   loading weights file /scratch/oe7/uk1594/BioBERT/BioBERT-PyTorch/QA_output_squad/BioASQ-factoid-6b/BioASQ-factoid-6b-1-issue-23mar/pytorch_model.bin
03/23/2020 12:53:15 - INFO - __main__ -   Creating features from dataset file at .

  0%|          | 0/1 [00:00<?, ?it/s]
  0%|          | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_squad.py", line 856, in <module>
    main()
  File "run_squad.py", line 845, in main
    result = evaluate(args, model, tokenizer, prefix=global_step)
  File "run_squad.py", line 299, in evaluate
    dataset, examples, features = load_and_cache_examples(args, tokenizer, evaluate=True, output_examples=True)
  File "run_squad.py", line 475, in load_and_cache_examples
    examples = processor.get_dev_examples(args.data_dir, filename=args.predict_file)
  File "/scratch/oe7/uk1594/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 522, in get_dev_examples
    return self._create_examples(input_data, "dev")
  File "/scratch/oe7/uk1594/lib/python3.7/site-packages/transformers/data/processors/squad.py", line 549, in _create_examples
    answers = qa["answers"]
KeyError: 'answers'

Any help or guidance would be much appreciated.

The evaluation dataset looks like this:


{
  "version": "BioASQ6b",
  "data": [
    {
      "title": "BioASQ6b",
      "paragraphs": [
        {
          "context": "emMAW: computing minimal absent words in external memory. Motivation: The biological significance of minimal absent words has been investigated in genomes of organisms from all domains of life. For instance, three minimal absent words of the human genome were found in Ebola virus genomes",
          "qas": [
            {
              "question": "Which algorithm is available for computing minimal absent words using external memory?",
              "id": "5a6a3335b750ff4455000025_000"
            }
          ]
        }
      ]
    }
  ]
}
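
The `qas` entry in the sample above has a `question` and an `id` but no `answers` key, which is the key that `qa["answers"]` in the traceback fails on. A quick way to confirm this for a whole file is to list every question that lacks the key. The following is only a minimal sketch, assuming the SQuAD-style layout shown above; the function names are hypothetical and are not part of Transformers.

```python
import json


def questions_without_answers(dataset):
    """Return the ids of all qas entries that have no 'answers' key."""
    missing = []
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if "answers" not in qa:
                    missing.append(qa.get("id"))
    return missing


def questions_without_answers_in_file(path):
    """Load a SQuAD-style JSON file and run the same check on it."""
    with open(path, encoding="utf-8") as f:
        return questions_without_answers(json.load(f))
```

If the returned list is non-empty for the file passed as --predict_file, that file cannot be scored with --do_eval as it stands.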

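Comments:

Please show us the code you used to convert BioASQ to SQuAD format. The error message sounds like your conversion is incorrect.

@cronoik ... thanks for your reply. I have edited the question to include a sample of the evaluation dataset above.

When you pass --do_eval, the script tries to evaluate (i.e. compute F1), so every question needs an answers section. Your example is missing that section.

Answer 1: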

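The BioASQ evaluation files are test files that contain no answers and are meant for prediction only. For evaluation during training, you can use part of the training file.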
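
One way to hold out part of the training file is sketched below. It is an illustration only, assuming the training file follows the SQuAD layout shown in the question (with an `answers` list on every `qas` entry); the function names, the 10% default, and the paragraph-level split are choices made for this example, not something prescribed by run_squad.py or BioASQ.

```python
import json
import random


def split_squad(dataset, dev_fraction=0.1, seed=0):
    """Split a SQuAD-style dict into (train, dev) at the paragraph level."""
    rng = random.Random(seed)
    version = dataset.get("version", "")
    train = {"version": version, "data": []}
    dev = {"version": version, "data": []}
    for article in dataset["data"]:
        paragraphs = list(article["paragraphs"])
        rng.shuffle(paragraphs)
        if len(paragraphs) < 2:
            n_dev = 0  # too small to split; keep everything for training
        else:
            # Hold out at least one paragraph, but never all of them.
            n_dev = max(1, int(len(paragraphs) * dev_fraction))
            n_dev = min(n_dev, len(paragraphs) - 1)
        title = article.get("title", "")
        train["data"].append({"title": title, "paragraphs": paragraphs[n_dev:]})
        if n_dev:
            dev["data"].append({"title": title, "paragraphs": paragraphs[:n_dev]})
    return train, dev


def split_squad_file(src_path, train_path, dev_path, dev_fraction=0.1, seed=0):
    """Read src_path, split it, and write the two parts as JSON files."""
    with open(src_path, encoding="utf-8") as f:
        dataset = json.load(f)
    train, dev = split_squad(dataset, dev_fraction, seed)
    for path, part in ((train_path, train), (dev_path, dev)):
        with open(path, "w", encoding="utf-8") as f:
            json.dump(part, f)
```

The two output files could then be passed as --train_file and --predict_file. If the same question appears with several contexts (the `_000` suffix on the sample id suggests it may), a paragraph-level split can place one question in both files, so grouping paragraphs by question id before splitting would give a cleaner held-out set.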
