Comparing two columns and filtering columns with neighboring classes

【Title】: Comparing two columns and filtering columns with neighboring classes 【Posted】: 2020-06-17 22:08:49 【Question】:

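So the classes here range from "eight" to "twenty". The numbers are written out as words. When the classifier predicts the classes, I get a table of the rows where the prediction does not equal the actual value. Now I want a table of only the rows where the classifier missed by exactly one neighboring class. For example, from the table above I only want the rows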

    predictions   actual
8013  fifteen     sixteen
5146  sixteen     seventeen
5691  seventeen   sixteen
13255 sixteen     fifteen
13921 thirteen    fourteen
13077 fourteen    fifteen

【Comments on the question】:

【Answer 1】:

You can convert the number words to int with the code from: Is there a way to convert number words to Integers?

Alternatively, if your range is limited, you can do it manually with two dictionaries, like

prev_dict = {'sixteen': 'fifteen', 'seventeen': 'sixteen'}  # and so on for the other classes
next_dict = {'sixteen': 'seventeen'}  # and so on for the other classes

Then:

predict[(predict['predictions'] == predict['actual'].map(prev_dict)) | (predict['predictions'] == predict['actual'].map(next_dict))]
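As a sketch of the dictionary approach over the full range, the two lookup tables can be generated from an ordered list of class names instead of being typed by hand. The list `classes`, the sample rows, and the names `is_prev`, `is_next`, and `neighbors` are assumptions made for illustration; the column names follow the question's table.

```python
import pandas as pd

# Assumed: the classes run from "eight" to "twenty", in this order.
classes = ["eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
           "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty"]

# Map each class to its lower and upper neighbor.
prev_dict = dict(zip(classes[1:], classes[:-1]))   # 'nine' -> 'eight', ...
next_dict = dict(zip(classes[:-1], classes[1:]))   # 'eight' -> 'nine', ...

# Illustrative rows in the shape of the question's table.
predict = pd.DataFrame(
    {"predictions": ["fifteen", "sixteen", "nineteen", "fourteen"],
     "actual": ["twenty", "seventeen", "fourteen", "fifteen"]},
    index=[8013, 5146, 13921, 13077],
)

# Keep rows where the prediction is the class just below or just above actual.
is_prev = predict["predictions"] == predict["actual"].map(prev_dict)
is_next = predict["predictions"] == predict["actual"].map(next_dict)
neighbors = predict[is_prev | is_next]
print(neighbors)
```

Classes at the ends of the range have no neighbor on one side, so `map` yields a missing value there and the comparison is simply False for that side.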

【Comments】:

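Great! This certainly answers my question in one way. But if I convert the words to integers, could I get the code for how to implement this, and how do I get the result from that point?

Also, when I run that code, I get the error "'Series' objects are mutable, thus they cannot be hashed". Why is that? Thanks in advance.

I did it wrong; it only works with the two dictionaries. The error occurred because the dictionary did not recognize the key, so it tried to add the Series as a key. I edited my answer, but jezrael's answer is better.

【Answer 2】: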

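Convert both columns to numbers, then filter with boolean indexing: compare the predictions against the actual column plus 1 and minus 1, chain the two conditions with | (bitwise OR), and use Series.eq to test for equality: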

print (df)
      predictions     actual
8013      fifteen     twenty
5146      sixteen  seventeen
5691    seventeen    sixteen
13255     sixteen    fifteen
13921    nineteen   fourteen
13077    fourteen    fifteen

#https://***.com/a/493788/2901002
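# numwords is a mutable default on purpose: the lookup table is built once and reused across calls
def text2int(textnum, numwords={}):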
    if not numwords:
      units = [
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen",
      ]

      tens = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"]

      scales = ["hundred", "thousand", "million", "billion", "trillion"]

      numwords["and"] = (1, 0)
      for idx, word in enumerate(units):    numwords[word] = (1, idx)
      for idx, word in enumerate(tens):     numwords[word] = (1, idx * 10)
      for idx, word in enumerate(scales):   numwords[word] = (10 ** (idx * 3 or 2), 0)

    current = result = 0
    for word in textnum.split():
        if word not in numwords:
          raise Exception("Illegal word: " + word)

        scale, increment = numwords[word]
        current = current * scale + increment
        if scale > 100:
            result += current
            current = 0

    return result + current

p = df['predictions'].apply(text2int) 
a = df['actual'].apply(text2int) 

df1 = df[p.eq(a+1) | p.eq(a-1)]

Or:

df1 = df[(p == a+1) | (p == a-1)]

print (df1)
      predictions     actual
5146      sixteen  seventeen
5691    seventeen    sixteen
13255     sixteen    fifteen
13077    fourteen    fifteen
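When the classes are known in advance, the general-purpose text2int parser is not strictly needed. A minimal sketch, assuming the classes are exactly the words "eight" through "twenty": map each word to its numeric value through a small dictionary and keep the rows whose values differ by exactly one. The names `classes` and `rank` and the rebuilt sample frame are illustrative and not part of the original answer.

```python
import pandas as pd

# Assumed: the full set of classes, in ascending order.
classes = ["eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
           "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty"]

# Word -> integer value (eight -> 8, ..., twenty -> 20).
rank = {word: value for value, word in enumerate(classes, start=8)}

# The sample frame printed in the answer above.
df = pd.DataFrame(
    {"predictions": ["fifteen", "sixteen", "seventeen", "sixteen", "nineteen", "fourteen"],
     "actual": ["twenty", "seventeen", "sixteen", "fifteen", "fourteen", "fifteen"]},
    index=[8013, 5146, 5691, 13255, 13921, 13077],
)

p = df["predictions"].map(rank)
a = df["actual"].map(rank)

# A neighboring class means the two values differ by exactly 1.
df1 = df[(p - a).abs().eq(1)]
print(df1)
```

A word that is missing from `rank` maps to a missing value, so such a row is filtered out rather than raising an error; whether that is desirable depends on the data.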

【Comments】:
