PyTorch error "RuntimeError: index out of range: Tried to access index 512 out of table with 511 rows"

Posted: 2020-10-17 06:07:51

Question:

I have sentences that I vectorize using the sentence_vector() method of the BiobertEmbedding Python module (https://pypi.org/project/biobert-embedding/). For some groups of sentences I have no problem, but for some others I get the following error message:

文件 "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", 第 133 行,在 sentence_vector 中 encoded_layers = self.eval_fwdprop_biobert(tokenized_text) 文件 "/home/nobunaga/.local/lib/python3.6/site-packages/biobert_embedding/embedding.py", 第 82 行,在 eval_fwdprop_biobert 编码层,_ = self.model(tokens_tensor,segments_tensors)文件 "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", 第 547 行,在 __call__ 中 结果 = self.forward(*input, **kwargs) 文件“/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py”, 第 730 行,向前 embedding_output = self.embeddings(input_ids, token_type_ids) 文件 "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", 第 547 行,在 __call__ 中 结果 = self.forward(*input, **kwargs) 文件“/home/nobunaga/.local/lib/python3.6/site-packages/pytorch_pretrained_bert/modeling.py”, 第 268 行,向前 position_embeddings = self.position_embeddings(position_ids) 文件 "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", 第 547 行,在 __call__ 中 结果 = self.forward(*input, **kwargs) 文件 "/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/modules/sparse.py", 第 114 行,向前 self.norm_type,self.scale_grad_by_freq,self.sparse)文件“/home/nobunaga/.local/lib/python3.6/site-packages/torch/nn/functional.py”, 第 1467 行,在嵌入中 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) RuntimeError: index out of range: 试图 从 511 行的表中访问索引 512。在 /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:237

I found that for some groups of sentences the problem is related to tags such as <tb>. But for others, even with the tags removed, the error message is still there. (Unfortunately, I cannot share the code for confidentiality reasons.)

Do you have any ideas about what the problem might be?

Thank you in advance.

EDIT: You are right, it is better with an example.

Example:

sentences = ["This is the first sentence.", "This is the second sentence.", "This is the third sentence."

biobert = BiobertEmbedding(model_path='./biobert_v1.1_pubmed_pytorch_model')

vectors = [biobert.sentence_vector(doc) for doc in sentences]

As far as I can tell, it is this last line of code that causes the error message.

Comments:

Please give us a minimal reproducible example so that we can reproduce the error.

Answer 1:

The problem is that the biobert-embedding module does not take care of the maximum sequence length of 512 (tokens, not words!). Here is the relevant source code. Have a look at the following example to force the error you received:

from biobert_embedding.embedding import BiobertEmbedding
#sentence has 385 words
sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

biobert = BiobertEmbedding()
print('sentence has {} tokens'.format(len(biobert.process_text(sentence))))
#works
biobert.sentence_vector(sentence)
print('longersentence has {} tokens'.format(len(biobert.process_text(longersentence))))
#didn't work
biobert.sentence_vector(longersentence)

Output:

sentence has 512 tokens
longersentence has 513 tokens
#your error message....
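If losing the tail of an over-long text is acceptable for your use case, the simplest workaround is to truncate to 512 tokens before the forward pass. Here is a minimal sketch, not part of the original answer, reusing the biobert and longersentence objects from the example above and the same last-layer averaging that sentence_vector() performs:

import torch

tokens = biobert.process_text(longersentence)[:512]   # hard cut at the 512-token limit
encoded_layers = biobert.eval_fwdprop_biobert(tokens)  # now fits the position table
vector = torch.mean(encoded_layers[11][0], dim=0)      # one 768-dimensional vector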

What you should do to keep the whole text, however, is implement a sliding window approach to process these texts:

import torch
from biobert_embedding.embedding import BiobertEmbedding

maxtokens = 512
startOffset = 0
docStride = 200

sentence = "The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control characters, or using value in the range from 128 to 255. Using values above 128 conflicts with using the 8th bit as a checksum, but the checksum usage gradually died out. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data. Text is considered plain-text regardless of its encoding. To properly understand or process it the recipient must know (or be able to figure out) what encoding was used; however, they need not know anything about the computer architecture that was used, or about the binary structures defined by whatever program (if any) created the data The near-ubiquity of ASCII was a great help, but failed to address international and linguistic concerns. The dollar-sign was not so useful in England, and the accented characters used in Spanish, French, German, and many other languages were entirely unavailable in ASCII (not to mention characters used in Greek, Russian, and most Eastern languages). Many individuals, companies, and countries defined extra characters as needed—often reassigning control"
longersentence = sentence + ' some'

sentences = [sentence, longersentence, 'small test sentence']
vectors = []
biobert = BiobertEmbedding()

#https://github.com/Overfitter/biobert_embedding/blob/b114e3456de76085a6cf881ff2de48ce868e6f4b/biobert_embedding/embedding.py#L127
def sentence_vector(tokenized_text, biobert):
    encoded_layers = biobert.eval_fwdprop_biobert(tokenized_text)

    # `encoded_layers` has shape [12 x 1 x seq_len x 768]
    # `token_vecs` is a tensor with shape [seq_len x 768]
    token_vecs = encoded_layers[11][0]

    # Calculate the average of all seq_len token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)
    return sentence_embedding


for doc in sentences:
    #tokenize your text
    docTokens = biobert.process_text(doc)
    
    while startOffset < len(docTokens):
        print(startOffset)
        length = min(len(docTokens) - startOffset, maxtokens)

        #now we calculate the sentence_vector for the document slice
        vectors.append(sentence_vector(
                        docTokens[startOffset:startOffset+length]
                        , biobert)
                      )
        #stop when the whole document is processed (document has less than 512
        #or the last document slice was processed)
        if startOffset + length == len(docTokens):
            break
        startOffset += min(length, docStride)
    startOffset = 0

P.S.: Your partial success with removing <tb> is plausible, because removing <tb> removes 4 tokens (the tag is split by the tokenizer into something like '<', 't', '##b', '>').
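To see this split, here is a small sketch (not from the original answer; it uses the Hugging Face transformers tokenizer with the generic BERT uncased vocabulary as a stand-in for the BioBERT one):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# the brackets are split off as punctuation and 'tb' gets wordpiece-split,
# so the tag alone accounts for several tokens
print(tokenizer.tokenize('<tb>'))  # e.g. ['<', 't', '##b', '>']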

Discussion:

Thank you very much, this is very helpful. If I understand the code you posted correctly, it vectorizes the first part of a too-long sentence and then the second part, so in the end we have two tensors of dimension 768 for a too-long sentence? Please tell me if I am wrong. I am asking because this could be a problem for my use case. In any case, thank you very much.

Yes, that is correct, but the sentence is not just cut in half. Each new slice re-uses tokens from the previous part for context (the window advances by docStride = 200 tokens in the example above, so consecutive slices overlap). If that is a problem for you, you may want to use a Longformer, which can handle 4096 tokens instead of 512.

Using a Longformer could be interesting. Could you give me an example of how to use it with the example above? When I look at the Longformer documentation, I really cannot figure out how to use it.

Sure, but please ask a new question for that. SO is meant to collect good questions and answers that are helpful not only to you but to others as well. In my opinion, mixing Longformer and BioBERT would not help other readers. Keep in mind that a good question contains your use case, sample data, expected output, and your research effort.
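For reference, here is a rough sketch of the Longformer route mentioned above (this is not from the original thread; it assumes the Hugging Face transformers library and the allenai/longformer-base-4096 checkpoint):

import torch
from transformers import LongformerModel, LongformerTokenizer

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerModel.from_pretrained('allenai/longformer-base-4096')
model.eval()

long_text = "..."  # any text up to 4096 tokens instead of 512

inputs = tokenizer(long_text, return_tensors='pt', truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)

# mean-pool the last hidden state into a single 768-dimensional document vector
doc_vector = outputs.last_hidden_state[0].mean(dim=0)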

Answer 2:

Since the original BERT has position embeddings of size 512 (indices 0-511), and BioBERT is derived from BERT, it is no surprise that you get an index error for 512. What is a bit strange, though, is that you are able to reach index 512 at all for some of the sentences you mention.
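As a quick check, you can inspect the size of the position-embedding table of the underlying model (a sketch; it assumes the biobert-embedding package exposes its pytorch_pretrained_bert model as the .model attribute, as the traceback above suggests):

from biobert_embedding.embedding import BiobertEmbedding

biobert = BiobertEmbedding()
# the table has exactly 512 rows (indices 0-511); any sequence longer than
# 512 tokens produces a position id of 512 and indexes past the table
print(biobert.model.embeddings.position_embeddings.num_embeddings)  # 512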

