Find index of word in sentence with information from phrase

Posted: 2021-12-18 14:16:44

Question:

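I need the index of word in sentence. Sometimes the word is repeated, though. The information in phrase would help to disambiguate, as would the previous or next row of the word column.

Basically, I just need to identify the word in the utterance: for example, if word is "seaside", I want to know which "seaside" in the sentence it is. I have extra information from phrase that can help with the identification. The order in which the rows appear in the dataframe also helps.

What I have now looks like this: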

file_id | phrase | word | sentence | word_indices
A | I am | I | I am a happy bird. I sing every day. I eat worms. | [0, 5, 9]
B | the seaside is | the | she is by the seaside. The seaside is packed. | [3, 5]
B | the seaside is | seaside | she is by the seaside. The seaside is packed. | [4, 6]
B | the seaside is | is | she is by the seaside. The seaside is packed. | [1, 7]
C | nobody knows | nobody | nobody knows what is going on. She can find nobody | [0, 9]
C | find nobody | nobody | nobody knows what is going on. She can find nobody | [0, 9]
D | it is such a sunny day | sunny | it is such a sunny day ah I am so happy when it's sunny such a sunny day is the best | [4, 13, 16]

But what I want to get is the content of the target column:

file_id | phrase | word | sentence | word_indices | target
A | I am | I | I am a happy bird. I sing every day. I eat worms. | [0, 5, 9] | [0]
B | the seaside is | the | she is by the seaside. The seaside is packed. | [3, 5] | [5]
B | the seaside is | seaside | she is by the seaside. The seaside is packed. | [4, 6] | [6]
B | the seaside is | is | she is by the seaside. The seaside is packed. | [1, 7] | [7]
C | nobody knows | nobody | nobody knows what is going on. She can find nobody | [0, 9] | [0]
C | find nobody | nobody | nobody knows what is going on. She can find nobody | [0, 9] | [9]
D | it is such a sunny day | sunny | it is such a sunny day ah I am so happy when it's sunny such a sunny day is the best | [4, 13, 16] | [4]

I found a similar question here: Find index of words in matched text. Unfortunately that one is in Java, and I need an answer in Python.

Thanks a lot!

Comments:

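Can you give a more precise definition? I assume that if word is not unique in the sentence, the algorithm looks up phrase and returns the index of the word within the first occurrence of that phrase, right? What happens if phrase occurs more than once? What if word occurs several times but phrase does not occur at all?

Thanks for your comment. Yes, the questions you ask are my questions too. Basically, I just need to identify the word in the utterance: for example, if word is "seaside", I want to know which "seaside" in the sentence it is. I have extra information from phrase that can help with the identification. The order in which the rows appear in the dataframe also helps.

Answer 1: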

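I would split this into two steps: count the number of words in the sentence that come before the phrase, then find the position of the word within the phrase. Something like this: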

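def get_index_of_word_in_sentence(word, phrase, sentence):
    # Compare in lower case so that "the seaside is" also matches "The seaside is".
    sentence, phrase, word = sentence.lower(), phrase.lower(), word.lower()
    # Step 1: character offset of the phrase, then count the words before it.
    index1 = sentence.index(phrase)  # raises ValueError if the phrase is absent
    word_num1 = len(sentence[:index1].split())
    # Step 2: position of the word inside the phrase.
    word_num2 = phrase.split().index(word)
    return word_num1 + word_num2

df["target"] = df.apply(lambda x: get_index_of_word_in_sentence(x["word"], x["phrase"], x["sentence"]), axis=1)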
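The two-step idea can also be sketched on word tokens instead of characters, which sidesteps substring matches in the middle of a word and covers the edge cases raised in the comments (phrase occurring several times, or not at all). This is only a sketch under assumptions that are not in the original post: the helper names `tokenize` and `target_indices` are hypothetical, tokens are whitespace-separated, and matching ignores case and surrounding punctuation.

```python
import string


def tokenize(text):
    """Split on whitespace, strip surrounding punctuation, and lowercase each token."""
    return [tok.strip(string.punctuation).lower() for tok in text.split()]


def target_indices(word, phrase, sentence):
    """Return the sentence-level index of `word` for every occurrence of `phrase`.

    Assumption: an empty list signals that the phrase was not found.
    """
    sent = tokenize(sentence)
    phr = tokenize(phrase)
    wanted = word.strip(string.punctuation).lower()
    # Positions of the word inside the phrase (there may be more than one).
    offsets = [i for i, tok in enumerate(phr) if tok == wanted]
    hits = []
    # Slide a window of the phrase's length over the sentence tokens.
    for start in range(len(sent) - len(phr) + 1):
        if sent[start:start + len(phr)] == phr:
            hits.extend(start + off for off in offsets)
    return hits


sentence_b = "she is by the seaside. The seaside is packed."
print(target_indices("seaside", "the seaside is", sentence_b))

# With a DataFrame `df` shaped like the one in the question, the same call
# could be applied row by row:
# df["target"] = df.apply(
#     lambda r: target_indices(r["word"], r["phrase"], r["sentence"]), axis=1)
```

Returning a list keeps the result in the same shape as the target column in the question, and leaves it to the caller to decide what to do when a phrase matches more than once.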

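Comments:

Thanks for your answer! It looks like this computes character indices in the string? I get the following result: [0, 23, 27, 35, 0, 44, 13]. That doesn't seem to be what I need...

Ah yes, I misread the question. I have edited my answer.

It worked! Thank you!

@SuperDuperMario, can you tell me how you got the word indices?

@DanielWyatt How can I modify your function to get the word indices if I don't have them (basically, I need to fill in the word_indices column)? Thanks!!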
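For the last comment, one possible way to fill a word_indices column is sketched below. The thread does not say how the original column was built, so this is an assumption: the hypothetical helper `word_indices` splits on whitespace and compares tokens case-insensitively after stripping surrounding punctuation, a convention that happens to reproduce the values shown in the question.

```python
import string


def word_indices(word, sentence):
    """Return the positions of every whitespace-separated token equal to `word`.

    Assumption: comparison ignores case and leading/trailing punctuation,
    so "seaside." and "The" count as matches for "seaside" and "the".
    """
    wanted = word.strip(string.punctuation).lower()
    return [
        i
        for i, tok in enumerate(sentence.split())
        if tok.strip(string.punctuation).lower() == wanted
    ]


print(word_indices("the", "she is by the seaside. The seaside is packed."))

# With a DataFrame `df` that has "word" and "sentence" columns:
# df["word_indices"] = df.apply(
#     lambda r: word_indices(r["word"], r["sentence"]), axis=1)
```

Because punctuation is stripped only at the ends of a token, "it's" is not counted as an occurrence of "it".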
