String replacement in a column of dataset not working

Posted: 2022-01-11 08:55:51

Question:

Consider the dataset given below:

print((test_small_testval.features))
'premise': Value(dtype='string', id=None), 
 'hypothesis': Value(dtype='string', id=None), 
 'label': ClassLabel(num_classes=3, 
                    names=['entailment', 'neutral', 'contradiction'], 
                    names_file=None, id=None)
        
print(test_small_testval['hypothesis'][0:10])
        
['The owner threw the toy', 
 'The dog walked across the fallen log.', 
 'Woman eating pizza', 'The stove has nothing on it.', 
 'A girl is jumping off a bridge down into a river in a bungie cord.', 
 'The people are looking at a poster of Ronaldo', 
 'A man runs through a fountain.', 
 'The man is trying to get food for his family, as they are too poor to eat food from the supermarket.', 
 'The woman is asleep.', 'A room full of people is doing poses.']

When I use the following to do string replacement in the hypothesis column of the dataset, nothing happens. I am not sure why.

for i in range(len(test_small_testval)):
    print(test_small_testval['hypothesis'][i])
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('she','them')
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('he','them')
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('her','them')
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('him','them')
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('cat','animal')
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('dog','animal')
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('woman','them')
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('girl','them')
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('guitar','instrument')
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('field','outdoors')
    print('>>>>after>>>')
    print(test_small_testval['hypothesis'][i])

The data does not change at all. Can anyone explain why?

What I see:

The owner threw the toy
>>>>after>>>
The owner threw the toy
The dog walked across the fallen log.
>>>>after>>>
The dog walked across the fallen log.
Woman eating pizza
>>>>after>>>
Woman eating pizza
The stove has nothing on it.
>>>>after>>>
The stove has nothing on it.
A girl is jumping off a bridge down into a river in a bungie cord.
>>>>after>>>
A girl is jumping off a bridge down into a river in a bungie cord.

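Update: I can do the replacement by saving the results into an entirely new list, but then it also replaces substrings. Is there a quick way to replace only whole words and not substring occurrences?

Regex approach: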

import re

for i in range(len(test_small_testval)):
    #print(i)
    test_small_testval['hypothesis'][i] = re.sub(r'\bshe\b', r'them', test_small_testval['hypothesis'][i])
    test_small_testval['hypothesis'][i] = re.sub(r'\bhe\b', r'them', test_small_testval['hypothesis'][i])
    test_small_testval['hypothesis'][i] = re.sub(r'\bher\b', r'them', test_small_testval['hypothesis'][i])
    test_small_testval['hypothesis'][i] = re.sub(r'\bhim\b', r'them', test_small_testval['hypothesis'][i])
    print(test_small_testval['hypothesis'][i])

No change in the output.


Comments:

What does type(test_small_testval['hypothesis']) show?
Replace r'\she\b' with r'\bshe\b', and likewise for the others; you did not type it correctly.
@JonClements

Answer 1:

Your old string is only temporarily replaced by the new string you supply. You need to store the result. This will do the job:

for i in range(len(test_small_testval)):
    print(i)
    test_small_testval['hypothesis'][i] = test_small_testval['hypothesis'][i].replace('she','them'
                                                                                    ).replace('he','them'
                                                                                    ).replace('her','them'
                                                                                    ).replace('him','them')

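Update 1: I tested your example (using a list of strings) and it works. Perhaps the problem is in how you access your array. I do not work with numpy arrays, so you will have to look into that yourself.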

test_small_testval = ['The owner threw the toy', 
                        'The dog walked across the fallen log.', 
                        'Woman eating pizza', 'The stove has nothing on it.', 
                        'A girl is jumping off a bridge down into a river in a bungie cord.', 
                        'The people are looking at a poster of Ronaldo', 
                        'A man runs through a fountain.', 
                        'The man is trying to get food for his family, as they are too poor to eat food from the supermarket.', 
                        'The woman is asleep.', 'A room full of people is doing poses.']

for i in range(len(test_small_testval)):
    print(i)
    test_small_testval[i] = test_small_testval[i].replace('she','them'
                                                            ).replace('he','them'
                                                            ).replace('her','them'
                                                            ).replace('him','them')

print(test_small_testval)
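
One plausible reason the original loop has no effect while the plain-list version above works: the `features` output in the question looks like a Hugging Face `datasets.Dataset` (an assumption, since the thread never names the library), and an object of that kind builds a fresh Python list every time a column is read. The sketch below imitates this with a hypothetical `FakeDataset` class. It is not the real library, only a stand-in to show the mechanism.

```python
class FakeDataset:
    """Hypothetical stand-in: every column access returns a new list (a copy)."""

    def __init__(self, columns):
        self._columns = {name: list(values) for name, values in columns.items()}

    def __getitem__(self, name):
        # A copy is returned, so callers never touch the internal storage.
        return list(self._columns[name])


ds = FakeDataset({"hypothesis": ["The dog walked across the fallen log."]})

# This assignment changes a temporary copy that is thrown away immediately.
ds["hypothesis"][0] = ds["hypothesis"][0].replace("dog", "animal")
unchanged = ds["hypothesis"][0]

# Reading the column once and editing that list keeps the result.
hypotheses = ds["hypothesis"]
hypotheses[0] = hypotheses[0].replace("dog", "animal")

print(unchanged)      # The dog walked across the fallen log.
print(hypotheses[0])  # The animal walked across the fallen log.
```

This would match the asker's observation that the change only shows up when the result is stored under a new list name. If the object really is a Hugging Face dataset, its `map` method is the usual way to write changed values back, though the thread itself does not confirm that.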

Another problem you will face after fixing this is choosing the right arguments for replace.

For example, if I simply replace 'he' with 'them' in 'The owner threw the toy', you get 'Tthem owner threw tthem toy', which I believe is not what you want.
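
The substring problem described above can be reproduced in a few lines. This is a small illustration using the first sentence from the question plus one made-up input ('she left') to show what chaining does.

```python
sentence = "The owner threw the toy"

# str.replace matches substrings, so the "he" inside "The" and "the" is hit too.
damaged = sentence.replace("he", "them")
print(damaged)  # Tthem owner threw tthem toy

# Chaining makes it worse: "she" becomes "them", whose "he" is then replaced again.
chained = "she left".replace("she", "them").replace("he", "them")
print(chained)  # tthemm left
```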

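A quick fix would be:

Considering that you are only replacing a few words (she, he, her and him), one trick is to search for ' she ' (padded with spaces) instead of 'she', and similarly for all of them. This should handle almost all the cases.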
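
A minimal sketch of the space-padding trick, assuming the replacement is padded the same way so the spacing is preserved. The sample sentences are made up for illustration. The second case shows where the trick falls short.

```python
sentence = "The owner said he saw her"

# Padding the search term with spaces skips the "he" inside "The" and "her".
padded = sentence.replace(" he ", " them ")
print(padded)  # The owner said them saw her

# Limitation: a word at the end of a sentence or next to punctuation
# does not have a space on both sides, so it is missed.
missed = "It was her.".replace(" her ", " them ")
print(missed)  # It was her.
```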

Update 2: Try this regex solution and see if it works (\b denotes a word boundary):

import re

for i in range(len(test_small_testval)):
    print(i)
    test_small_testval[i] = re.sub(r'\bshe\b', r'them', test_small_testval[i])
    test_small_testval[i] = re.sub(r'\bhe\b', r'them', test_small_testval[i])
    test_small_testval[i] = re.sub(r'\bher\b', r'them', test_small_testval[i])
    test_small_testval[i] = re.sub(r'\bhim\b', r'them', test_small_testval[i])

print(test_small_testval)
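
The four `re.sub` calls above can also be folded into one pass. The following is a sketch rather than the answerer's method: the names `REPLACEMENTS`, `PATTERN` and `replace_whole_words` are made up for illustration, and the word table is copied from the question. Doing everything in a single pass also avoids a replacement being matched again by a later rule.

```python
import re

# Word-to-replacement table taken from the question.
REPLACEMENTS = {
    "she": "them", "he": "them", "her": "them", "him": "them",
    "cat": "animal", "dog": "animal", "woman": "them", "girl": "them",
    "guitar": "instrument", "field": "outdoors",
}

# One alternation wrapped in \b...\b, so only whole words match.
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, REPLACEMENTS)) + r")\b")


def replace_whole_words(text):
    """Replace each listed whole word in a single pass over the text."""
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(1)], text)


hypotheses = [
    "The owner threw the toy",
    "The dog walked across the fallen log.",
    "A girl is jumping off a bridge down into a river in a bungie cord.",
]
# Build a new list rather than assigning back into a column lookup.
cleaned = [replace_whole_words(h) for h in hypotheses]
print(cleaned)
```

The match is case-sensitive, so 'Woman' at the start of a sentence is left alone. Handling that would need a case-insensitive pattern and a lower-cased lookup, which is left out here.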

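Comments:

I did this, but there is still no change.
See my updated question with the output I am seeing.
I used a new list name and I see the changes, but it even replaces substrings. How can we replace only whole words?
I have added a quick fix to the answer; let me know if it works.
It does not work unless I choose a new name for the list on the LHS. Also, I think only regex may be able to do the replacement the way I want.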
