ElasticSearch | TypeError: string indices must be integers

Published: 2022-01-15 00:31:34

Question:

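I am using this Notebook, with the Apply DocumentClassifier section changed as shown below.

Jupyter Lab, kernel: conda_mxnet_latest_p37

I understand the error to mean that I am passing a str where an int is expected. However, that should not be a problem, because the same code works with the other .pdf/.txt files from the original Notebook.

Code cell: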

doc_dir = "GRIs/"  # contains 2 .pdfs

with open('filt_gri.txt', 'r') as filehandle:
    tags = [current_place.rstrip() for current_place in filehandle.readlines()]


doc_classifier = TransformersDocumentClassifier(model_name_or_path="cross-encoder/nli-distilroberta-base",
                                                task="zero-shot-classification",
                                                labels=tags,
                                                batch_size=2)

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in docs_sliding_window]

# classify using gpu, batch_size makes sure we do not run out of memory
classified_docs = doc_classifier.predict(docs_to_classify)

# let's see how it looks: there should be a classification result in the meta entry containing labels and scores.
print(classified_docs[0].to_dict())

all_docs = convert_files_to_dicts(dir_path=doc_dir)

preprocessor_sliding_window = PreProcessor(split_overlap=3,
                                           split_length=10,
                                           split_respect_sentence_boundary=False,
                                           split_by='passage')

Error output:

INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-82b54cd162ff> in <module>
     14 
     15 # classify using gpu, batch_size makes sure we do not run out of memory
---> 16 classified_docs = doc_classifier.predict(docs_to_classify)
     17 
     18 # let's see how it looks: there should be a classification result in the meta entry containing labels and scores.

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/haystack/nodes/document_classifier/transformers.py in predict(self, documents)
    144         for prediction, doc in zip(predictions, documents):
    145             if self.task == 'zero-shot-classification':
--> 146                 prediction["label"] = prediction["labels"][0]
    147             doc.meta["classification"] = prediction
    148 

TypeError: string indices must be integers
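
For readers unfamiliar with the message: it is raised whenever a str is indexed with something other than an integer or slice, so in the traceback above `prediction` appears to be a string rather than the dict the loop expects. The snippet below is a minimal sketch that reproduces the message without Haystack; the values assigned to `prediction` are hypothetical stand-ins, not data from the post.

```python
# Minimal reproduction, independent of Haystack.
# `prediction` is a hypothetical stand-in for the value the loop received.
prediction = "labels"  # a str where a dict was expected

try:
    prediction["labels"]  # indexing a str with a str key
except TypeError as err:
    message = str(err)
    print(message)

# The same lookup works once the value really is a dict.
prediction = {"labels": ["a", "b"], "scores": [0.9, 0.1]}
top_label = prediction["labels"][0]
print(top_label)
```

The exact wording of the message varies slightly between Python versions, but it always begins with "string indices must be integers".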

Please let me know if there is anything else I should add or clarify.

Answer 1:

I replaced the variable docs_sliding_window with my_dsw.

my_dsw keeps only the entries whose content is <= 1000 characters long. This fits the shape of my data better.

# keep every entry (including the last one) whose content is at most 1000 characters
my_dsw = [d for d in docs_sliding_window if len(d['content']) <= 1000]

Then use it in the docs_to_classify line:

# convert to Document using a fieldmap for custom content fields the classification should run on
docs_to_classify = [Document.from_dict(d) for d in my_dsw]

Admittedly, I am not sure exactly how this relates to the error, but it does fit my data better, and I can now increase the batch size to batch_size=4.
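
One possible link to the error, which the post does not confirm, is that somewhere a single prediction dict was iterated as though it were a list of dicts. Iterating a dict yields its keys, which are strings, and indexing one of those keys with "labels" raises exactly this TypeError. Changing the number of documents could change whether that case is hit. The sketch below shows the effect in plain Python; `as_prediction_list` is a hypothetical helper written for illustration, not part of Haystack or Transformers, and the sample dict is made up.

```python
def as_prediction_list(raw):
    """Hypothetical helper: always return a list of prediction dicts."""
    if isinstance(raw, dict):
        # A lone dict is wrapped so callers can iterate over predictions safely.
        return [raw]
    return list(raw)


# Made-up prediction in the shape the traceback suggests (a "labels" list).
single = {"sequence": "some text", "labels": ["x", "y"], "scores": [0.8, 0.2]}

# Iterating the dict directly yields its keys, which are strings.
keys = [item for item in single]
print(keys)

# After normalising, each item is a dict and the lookup succeeds.
normalized = as_prediction_list(single)
top = [p["labels"][0] for p in normalized]
print(top)
```

Checking `type(prediction)` just before the failing line would show whether this is what actually happens in a given run.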

Comments:

Machine learning is just experimentation with models and data inputs.
