Evaluating list similarities


Posted: 2021-12-27 15:22:34

Question:

I have a dataframe with several columns of item recommendations, where each element is a list (in reality every list has 10 elements, but that doesn't matter here):

user_id     actual            predicted            popular             random
u1          [a,b,c]           [a,b,d]              [c,e,f]             [d,e,f]
u2          [a,b,d]           [a,b,c]              [c,e,f]             [a,b,c]
u3          [c,e,f]           [a,c,e]              [c,e,f]             [a,c,f]
u4          [c,e,f]           [a,e,f]              [c,e,f]             [a,d,f]
u5          [b,e,f]           [a,b,e]              [c,e,f]             [a,c,e]  

While I have some separate statistics for predicted, I want to measure how close the actual lists are to the predicted, popular, and random lists. popular is always the same three items.

I was thinking of computing a percentage for each case and then taking the average:

 user_id                predicted            popular             random
 u1                     0.66                 0.33                0
 u2                     0.66                 0                   0.33
 u3                     0.66                 1                   0.66
 u4                     0.66                 1                   0.33
 u5                     0.66                 0.33                0.33
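For instance, u1's predicted score follows from plain set arithmetic; a sketch with that row's values hard-coded for illustration:

```python
actual = {"a", "b", "c"}     # u1's actual items
predicted = {"a", "b", "d"}  # u1's predicted items

# fraction of actual items that were also predicted: |intersection| / |actual|
score = len(actual & predicted) / len(actual)  # 2/3, shown as 0.66 in the table
```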

Normally, I would do it like this:

setA = set(listA)
setB = set(listB)

overlap = setA & setB   # items present in both lists
universe = setA | setB  # items present in either list (unused below)

result = float(len(overlap)) / len(setA) * 100

But how can I do this for a large dataframe?

EDIT: I tried the suggested answer and got the following error:

KeyError                                  Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2894             try:
-> 2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 0

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
<ipython-input-50-9dd44429ffdc> in <module>
      2 for i in range(df.shape[0]):
      3     for col in ["predicted", "popular", "random"]:
----> 4         df.loc[i, f"{col}_pct"] = (len(test.loc[i, "actual"] & test.loc[i, col]) / len(test.loc[i, "actual"]) * 100)

~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
    871                     # AttributeError for IntervalTree get_value
    872                     pass
--> 873             return self._getitem_tuple(key)
    874         else:
    875             # we by definition only have the 0th axis

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
   1042     def _getitem_tuple(self, tup: Tuple):
   1043         try:
-> 1044             return self._getitem_lowerdim(tup)
   1045         except IndexingError:
   1046             pass

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
    784                 # We don't need to check for tuples here because those are
    785                 #  caught by the _is_nested_tuple_indexer check above.
--> 786                 section = self._getitem_axis(key, axis=i)
    787 
    788                 # We should never have a scalar section here, because

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
   1108         # fall thru to straight lookup
   1109         self._validate_key(key, axis)
-> 1110         return self._get_label(key, axis=axis)
   1111 
   1112     def _get_slice_axis(self, slice_obj: slice, axis: int):

~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_label(self, label, axis)
   1057     def _get_label(self, label, axis: int):
   1058         # GH#5667 this will fail if the label is not present in the axis.
-> 1059         return self.obj.xs(label, axis=axis)
   1060 
   1061     def _handle_lowerdim_multi_index_axis0(self, tup: Tuple):

~\anaconda3\lib\site-packages\pandas\core\generic.py in xs(self, key, axis, level, drop_level)
   3489             loc, new_index = self.index.get_loc_level(key, drop_level=drop_level)
   3490         else:
-> 3491             loc = self.index.get_loc(key)
   3492 
   3493             if isinstance(loc, np.ndarray):

~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2895                 return self._engine.get_loc(casted_key)
   2896             except KeyError as err:
-> 2897                 raise KeyError(key) from err
   2898 
   2899         if tolerance is not None:

KeyError: 0


Answer 1:

So, given the following dataframe:

import pandas as pd

df = pd.DataFrame(
    {
        "user_id": {0: "u1", 1: "u2", 2: "u3", 3: "u4", 4: "u5"},
        "actual": {
            0: ["a", "b", "c"],
            1: ["a", "b", "d"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["b", "e", "f"],
        },
        "predicted": {
            0: ["a", "b", "d"],
            1: ["a", "b", "c"],
            2: ["a", "c", "e"],
            3: ["a", "e", "f"],
            4: ["a", "b", "e"],
        },
        "popular": {
            0: ["c", "e", "f"],
            1: ["c", "e", "f"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["c", "e", "f"],
        },
        "random": {
            0: ["d", "e", "f"],
            1: ["a", "b", "c"],
            2: ["a", "c", "f"],
            3: ["a", "d", "f"],
            4: ["a", "c", "e"],
        },
    }
)

You can try this:

# Convert lists into sets
df = df.applymap(lambda x: set(x) if isinstance(x, list) else x)

# Iterate to create new columns with percentages
for i in range(df.shape[0]):
    for col in ["predicted", "popular", "random"]:
        df.loc[i, f"{col}_pct"] = (
            len(df.loc[i, "actual"] & df.loc[i, col]) / len(df.loc[i, "actual"]) * 100
        )

# Cleanup
df = df[["user_id", "predicted_pct", "popular_pct", "random_pct"]]

Which gives the expected result:

print(df)
# Outputs
  user_id  predicted_pct  popular_pct  random_pct
0      u1      66.666667    33.333333    0.000000
1      u2      66.666667     0.000000   66.666667
2      u3      66.666667   100.000000   66.666667
3      u4      66.666667   100.000000   33.333333
4      u5      66.666667    66.666667   33.333333

Comments:

- You could use .applymap instead of the for loop. In that case the lambda part isn't needed.
- You're right. In the meantime I had another idea to get rid of the column names and updated my answer. Thanks for your comments.
- This makes perfect sense to me, thank you, but I get an error on the last line of the second code block; I'll post it in my original post (KeyError: 0). I'm afraid it's because my original dataframe doesn't have a default index.
- In that case, you can reset the dataframe's index: pandas.pydata.org/pandas-docs/stable/reference/api/…
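As the comments suggest, the row-wise loop can also be replaced with one `apply` per column, which sidesteps the positional `.loc[i, ...]` lookups that raise `KeyError: 0` on a non-default index. A minimal sketch (the helper name `overlap_pct` is mine, not from the answer, and the frame is abbreviated to two rows):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "actual": [["a", "b", "c"], ["a", "b", "d"]],
    "predicted": [["a", "b", "d"], ["a", "b", "c"]],
})

def overlap_pct(row, col):
    # percentage of "actual" items that also appear in the given column
    actual = set(row["actual"])
    return len(actual & set(row[col])) / len(actual) * 100

for col in ["predicted"]:
    df[f"{col}_pct"] = df.apply(overlap_pct, axis=1, col=col)
```

Alternatively, calling `df = df.reset_index(drop=True)` before the original loop restores the 0..n-1 index that the `df.loc[i, ...]` lookups expect.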
