Comparing two columns and filtering columns with neighboring classes
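[Title]: Comparing two columns and filtering columns with neighboring classes
[Posted]: 2020-06-17 22:08:49
[Question]: The classes here run from "eight" to "twenty", and the numbers are written out as words. After the classifier predicted the classes, I built a table of the rows where the prediction does not equal the actual value. Now I want a table of only the rows where the classifier missed by one neighboring class. For example, from the table above I only want these rows: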
       predictions     actual
8013       fifteen    sixteen
5146       sixteen  seventeen
5691     seventeen    sixteen
13255      sixteen    fifteen
13921     thirteen   fourteen
13077     fourteen    fifteen
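[Answer 1]: You can convert the number words in the strings to int with the code from: Is there a way to convert number words to Integers?
Alternatively, if your range is limited, you can do it manually with two dictionaries, for example:
# each class maps to the class just before / just after it; add one entry per class
prev_dict = {'sixteen': 'fifteen', 'seventeen': 'sixteen'}
next_dict = {'sixteen': 'seventeen'}
Then:
predict[(predict['predictions'] == predict['actual'].map(prev_dict)) | (predict['predictions'] == predict['actual'].map(next_dict))]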
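A minimal runnable sketch of the dictionary approach, assuming the classes run from "eight" to "twenty" as the question states. The list `classes` and the small sample frame are illustrative, not the asker's real data. Both dictionaries are derived from one ordered list so they stay consistent with each other.

```python
import pandas as pd

# Ordered class labels; assumed to cover "eight" through "twenty".
classes = ["eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
           "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty"]

# Map every label to its predecessor / successor in that order.
prev_dict = dict(zip(classes[1:], classes[:-1]))
next_dict = dict(zip(classes[:-1], classes[1:]))

# Illustrative sample data (hypothetical, not the asker's full table).
predict = pd.DataFrame(
    {"predictions": ["fifteen", "sixteen", "nineteen", "fourteen"],
     "actual": ["twenty", "seventeen", "fourteen", "fifteen"]},
    index=[8013, 5146, 13921, 13077],
)

# Series.map looks up each element; a label with no neighbour becomes NaN,
# and NaN never compares equal, so that side of the test is simply False.
is_prev = predict["predictions"] == predict["actual"].map(prev_dict)
is_next = predict["predictions"] == predict["actual"].map(next_dict)
neighbours = predict[is_prev | is_next]
print(neighbours)
```

Indexing a dictionary with a whole column, as in `prev_dict[predict['actual']]`, raises a `TypeError` because a Series is not hashable; `Series.map` applies the lookup element by element instead.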
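[Comments]:
Great! This certainly answers my question in one way. But if I convert the words to integers, what code would get me from there to the result? Also, when I run that code I get the error "'Series' objects are mutable, thus they cannot be hashed". Why is that? Thanks in advance.
I had it wrong; it only works with the two dictionaries. The error occurred because the dictionary did not recognize the key, so it tried to use the Series as a key. I edited my answer, but jezrael's answer is better.
[Answer 2]: Use boolean indexing: convert both columns to numbers, then keep the rows where predictions equals actual plus 1 or actual minus 1. The two conditions are chained with | (bitwise OR), and Series.eq tests the values for equality: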
print(df)
       predictions     actual
8013       fifteen     twenty
5146       sixteen  seventeen
5691     seventeen    sixteen
13255      sixteen    fifteen
13921     nineteen   fourteen
13077     fourteen    fifteen
#https://***.com/a/493788/2901002
def text2int(textnum, numwords={}):
    # The mutable default is deliberate: the lookup table is built once and reused.
    if not numwords:
        units = [
            "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
            "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
            "sixteen", "seventeen", "eighteen", "nineteen",
        ]
        tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]
        scales = ["hundred", "thousand", "million", "billion", "trillion"]
        # Each word maps to a (scale, increment) pair.
        numwords["and"] = (1, 0)
        for idx, word in enumerate(units):
            numwords[word] = (1, idx)
        for idx, word in enumerate(tens):
            numwords[word] = (1, idx * 10)
        for idx, word in enumerate(scales):
            numwords[word] = (10 ** (idx * 3 or 2), 0)
    current = result = 0
    for word in textnum.split():
        if word not in numwords:
            raise Exception("Illegal word: " + word)
        scale, increment = numwords[word]
        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0
    return result + current
p = df['predictions'].apply(text2int)
a = df['actual'].apply(text2int)
df1 = df[p.eq(a+1) | p.eq(a-1)]
Or, equivalently:
df1 = df[(p == a+1) | (p == a-1)]
print(df1)
       predictions     actual
5146       sixteen  seventeen
5691     seventeen    sixteen
13255      sixteen    fifteen
13077     fourteen    fifteen
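When the labels are known to span a small range, the same filter can be sketched without text2int. This version assumes the labels run from "eight" to "twenty"; `words` and `word_to_int` are illustrative names, and the absolute difference is just a compact way of writing the plus-1 / minus-1 test used above.

```python
import pandas as pd

# Assumed label range: "eight" .. "twenty", mapped to 8 .. 20.
words = ["eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
         "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty"]
word_to_int = {word: number for number, word in enumerate(words, start=8)}

# Same sample data as the printed df in this answer.
df = pd.DataFrame(
    {"predictions": ["fifteen", "sixteen", "seventeen", "sixteen", "nineteen", "fourteen"],
     "actual": ["twenty", "seventeen", "sixteen", "fifteen", "fourteen", "fifteen"]},
    index=[8013, 5146, 5691, 13255, 13921, 13077],
)

# Convert both columns to integers with an element-wise dictionary lookup.
p = df["predictions"].map(word_to_int)
a = df["actual"].map(word_to_int)

# Neighbouring classes differ by exactly one in either direction.
df1 = df[(p - a).abs().eq(1)]
print(df1)
```

A label outside the assumed range maps to NaN, so such a row is silently excluded rather than raising an error as text2int would; that trade-off may or may not be what you want.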