Flag strings based on previous values in pandas

Posted 2023-03-31

【Title】: flag strings based on previous values in pandas 【Posted】: 2021-08-23 02:31:38 【Question】:

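I want to label sentences stored in a pandas DataFrame. As you can see in the example, some sentences are split across several rows (these are subtitles from an srt file that I eventually want to translate into other languages, but first I need each sentence in a single cell). The end of a sentence is marked by a period. I want to create a column like sentence_number in which each sentence gets a number (it doesn't have to be a string; a plain number is fine too).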

import pandas as pd

values=[
        ['This is an example of subtitle.','sentence_1'],
        ['I want to group by sentences, which','sentence_2'],
        ['the end is determined by a period.','sentence_2'],
        ['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
        ['should have sentence_2. and this','sentence_3'],
        ['last row should have sentence_3.','sentence_4']
        ]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
# Flag rows that contain a period anywhere (raw string avoids an invalid-escape warning)
df['presence_of_period']=df.subtitle.str.contains(r'\.')
df

output:

    subtitle                                         sentence_number    presence_of_period
0   This is an example of subtitle.                  sentence_1         True
1   I want to group by sentences, which              sentence_2         False
2   the end is determined by a period.               sentence_2         True
3   row 0 should have sentece_1, rows 1 and 2        sentence_3         False
4   should have sentence_2. and this                 sentence_3         True
5   last row should have sentence_3.                 sentence_4         True

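How can I create the sentence_number column, given that it has to read the previous cells of the subtitle column? I was thinking of a window function or shift(), but I don't know how to make it work. I added a column that shows whether a cell contains a period, which marks the end of a sentence. Also, if possible, I'd like to move "and this" from row 4 to the beginning of row 5, since it starts a new sentence (not sure whether that needs a separate question).

Any ideas?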

【Answer 1】:

To handle the periods, here is one option you could use.

import pandas as pd

values=[
        ['This is an example of subtitle.','sentence_1'],
        ['I want to group by sentences, which','sentence_2'],
        ['the end is determined by a period.','sentence_2'],
        ['row 0 should have sentece_1, rows 1 and 2 ','sentence_3'],
        ['should have sentence_2.','sentence_2'],
        ['and this last row should have sentence_3.','sentence_3']
        ]
df=pd.DataFrame(values,columns=['subtitle','sentence_number'])
# Number of periods in each row (raw string avoids an invalid-escape warning)
df['presence_of_period']=df.subtitle.str.count(r'\.')
# 1 if the row ends with a period, 0 otherwise
df['end'] = df.subtitle.str.endswith('.').astype(int)
# Periods seen so far, minus the one that closes the current row, plus 1
df['sentence_#'] = 'sentence_' + (1 + df['presence_of_period'].cumsum() - df['end']).astype(str)
# Drop the helper columns
df.drop(['presence_of_period','end'],axis=1, inplace=True)
print (df[['subtitle','sentence_#']])

The output is as follows:

                                     subtitle  sentence_#
0             This is an example of subtitle.  sentence_1
1         I want to group by sentences, which  sentence_2
2          the end is determined by a period.  sentence_2
3  row 0 should have sentece_1, rows 1 and 2   sentence_3
4                     should have sentence_2.  sentence_3
5   and this last row should have sentence_3.  sentence_4
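For comparison, the shift() route mentioned in the question can be sketched roughly as follows. This is only a sketch under one assumption: every sentence ends at the end of a row, as in the sample data used by this answer. The column name sentence_id and the final groupby step are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "subtitle": [
        "This is an example of subtitle.",
        "I want to group by sentences, which",
        "the end is determined by a period.",
        "row 0 should have sentece_1, rows 1 and 2 ",
        "should have sentence_2.",
        "and this last row should have sentence_3.",
    ]
})

# True where the row finishes a sentence (trailing whitespace ignored).
ends = df["subtitle"].str.rstrip().str.endswith(".")

# A row starts a new sentence when the previous row ended one.
# The first row has no predecessor, so it is filled with True.
starts = ends.shift(fill_value=True)

# The running count of sentence starts is the sentence number.
df["sentence_id"] = starts.cumsum()

# Join the fragments of each sentence into a single cell.
sentences = df.groupby("sentence_id")["subtitle"].agg(
    lambda parts: " ".join(p.strip() for p in parts)
)
print(sentences)
```

Grouping on the numeric id puts each full sentence in one cell, which is what the translation step needs.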

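If you need to move part of a sentence to the next row, I need more details.

What do you want to do when a single row contains more than two sentences? For example, 'This is first sentence. This second. This is'.

In that case, should the first sentence be split into one row, the second into another row, and the third joined to the next row of data?

Once I understand that, we can use df.explode() to solve it.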
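One way the explode() idea could look is sketched below. It assumes the split rule is "after a period that is followed by whitespace", and the second row ('the third one.') is made-up data added only to complete the dangling fragment. The names pieces, out, and merged are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({
    "subtitle": [
        "This is first sentence. This second. This is",
        "the third one.",  # hypothetical continuation row
    ]
})

# Split each row after a period that is followed by whitespace.
pieces = df["subtitle"].map(lambda s: re.split(r"(?<=\.)\s+", s.strip()))

# explode() gives every piece its own row.
out = pieces.explode().reset_index(drop=True).to_frame("subtitle")

# Number the pieces: a new sentence starts after a piece ending in a period.
ends = out["subtitle"].str.endswith(".")
out["sentence_id"] = ends.shift(fill_value=True).cumsum()

# Join pieces that belong to the same sentence.
merged = out.groupby("sentence_id")["subtitle"].agg(" ".join)
print(merged)
```

After the split, the trailing fragment "This is" shares an id with the row that follows it, so the groupby joins them into one sentence.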

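【Comments】:

Wow, this is great. Thank you. I hadn't thought of the scenario with 2 periods. I think in that case I'd want each sentence in its own row: "This is the first sentence", "the second", "the third". I guess I should also be careful with things like "12.5%", where the period could trigger an end of sentence when it isn't one.

@chulo, I was going to ask the same thing next. How would you tell a sentence-ending period apart from a floating-point number or something like www.google.com? That will have multiple periods. Should the period be at the end of a word? If so, we may have to use the regex \.\b. There is a lot to figure out before we finalize the solution, which is why I haven't locked one in. A lot still needs to be fleshed out.

I think I'd be comfortable with this rule: if there is a space after the period, it should be treated as the end of a sentence; otherwise it should be ignored, since it is part of a string. I know you will sometimes run into the "This is sentence 1.This is sentence 2" scenario, but I think that is rare in my dataset.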
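The rule the asker settles on (a period ends a sentence only when a space follows it) could be expressed as a regex lookahead, roughly as below. Treating end of line as a sentence end too is an added assumption, and the sample strings and the SENTENCE_END name are made up for illustration.

```python
import pandas as pd

subtitles = pd.Series([
    "Prices rose 12.5% this year. See",
    "www.google.com for details.",
    "This is sentence 1.This is sentence 2.",
])

# A period counts as a sentence end only when whitespace or end of line follows.
SENTENCE_END = r"\.(?=\s|$)"

naive_counts = subtitles.str.count(r"\.")         # counts every period
rule_counts = subtitles.str.count(SENTENCE_END)   # counts only sentence ends

print(pd.DataFrame({"subtitle": subtitles,
                    "naive": naive_counts,
                    "rule": rule_counts}))
```

With this pattern the periods inside "12.5%" and "www.google.com" are skipped. The last row shows the rare case the asker accepts: without a space after "1." the two sentences are counted as one.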
