Evaluating list similarities
Posted: 2021-12-27 15:22:34

Question: I have a dataframe with columns of recommendations for various items, where each element is a list (in practice every list has 10 elements, but that is not important here):
user_id actual predicted popular random
u1 [a,b,c] [a,b,d] [c,e,f] [d,e,f]
u2 [a,b,d] [a,b,c] [c,e,f] [a,b,c]
u3 [c,e,f] [a,c,e] [c,e,f] [a,c,f]
u4 [c,e,f] [a,e,f] [c,e,f] [a,d,f]
u5 [b,e,f] [a,b,e] [c,e,f] [a,c,e]
While I have some separate statistics on predicted, here I want to compare how close the predicted, popular and random lists each are to the actual list. popular is always the same three items.
I was thinking of computing the overlap percentage for each case and then taking the average:
user_id predicted popular random
u1 0.66 0.33 0
u2 0.66 0 0.33
u3 0.66 1 0.66
u4 0.66 1 0.33
u5 0.66 0.33 0.33
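For example, for u1 the actual list [a, b, c] shares two of its three items with the predicted list [a, b, d] (2/3 ≈ 0.66), one item with popular [c, e, f] (0.33), and none with random [d, e, f] (0).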
Normally, for a single pair of lists, I would do something like this:
setA = set(listA)
setB = set(listB)
overlap = setA & setB    # items that appear in both lists
universe = setA | setB   # all distinct items (not used in the result below)
result = float(len(overlap)) / len(setA) * 100   # share of setA that also appears in setB
But how can I do this efficiently for a whole, potentially large dataframe?
Edit: I tried the suggested answer and got the following error:
KeyError Traceback (most recent call last)
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2894 try:
-> 2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 0
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
<ipython-input-50-9dd44429ffdc> in <module>
2 for i in range(df.shape[0]):
3 for col in ["predicted", "popular", "random"]:
----> 4 df.loc[i, f"col_pct"] = (len(test.loc[i, "actual"] & test.loc[i, col]) / len(test.loc[i, "actual"]) * 100)
~\anaconda3\lib\site-packages\pandas\core\indexing.py in __getitem__(self, key)
871 # AttributeError for IntervalTree get_value
872 pass
--> 873 return self._getitem_tuple(key)
874 else:
875 # we by definition only have the 0th axis
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_tuple(self, tup)
1042 def _getitem_tuple(self, tup: Tuple):
1043 try:
-> 1044 return self._getitem_lowerdim(tup)
1045 except IndexingError:
1046 pass
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_lowerdim(self, tup)
784 # We don't need to check for tuples here because those are
785 # caught by the _is_nested_tuple_indexer check above.
--> 786 section = self._getitem_axis(key, axis=i)
787
788 # We should never have a scalar section here, because
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _getitem_axis(self, key, axis)
1108 # fall thru to straight lookup
1109 self._validate_key(key, axis)
-> 1110 return self._get_label(key, axis=axis)
1111
1112 def _get_slice_axis(self, slice_obj: slice, axis: int):
~\anaconda3\lib\site-packages\pandas\core\indexing.py in _get_label(self, label, axis)
1057 def _get_label(self, label, axis: int):
1058 # GH#5667 this will fail if the label is not present in the axis.
-> 1059 return self.obj.xs(label, axis=axis)
1060
1061 def _handle_lowerdim_multi_index_axis0(self, tup: Tuple):
~\anaconda3\lib\site-packages\pandas\core\generic.py in xs(self, key, axis, level, drop_level)
3489 loc, new_index = self.index.get_loc_level(key, drop_level=drop_level)
3490 else:
-> 3491 loc = self.index.get_loc(key)
3492
3493 if isinstance(loc, np.ndarray):
~\anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
2895 return self._engine.get_loc(casted_key)
2896 except KeyError as err:
-> 2897 raise KeyError(key) from err
2898
2899 if tolerance is not None:
KeyError: 0
Answer 1: So, given the following dataframe:
import pandas as pd

df = pd.DataFrame(
    {
        "user_id": {0: "u1", 1: "u2", 2: "u3", 3: "u4", 4: "u5"},
        "actual": {
            0: ["a", "b", "c"],
            1: ["a", "b", "d"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["b", "e", "f"],
        },
        "predicted": {
            0: ["a", "b", "d"],
            1: ["a", "b", "c"],
            2: ["a", "c", "e"],
            3: ["a", "e", "f"],
            4: ["a", "b", "e"],
        },
        "popular": {
            0: ["c", "e", "f"],
            1: ["c", "e", "f"],
            2: ["c", "e", "f"],
            3: ["c", "e", "f"],
            4: ["c", "e", "f"],
        },
        "random": {
            0: ["d", "e", "f"],
            1: ["a", "b", "c"],
            2: ["a", "c", "f"],
            3: ["a", "d", "f"],
            4: ["a", "c", "e"],
        },
    }
)
you can try this:
# Convert lists into sets
df = df.applymap(lambda x: set(x) if isinstance(x, list) else x)

# Iterate to create new columns with percentages
for i in range(df.shape[0]):
    for col in ["predicted", "popular", "random"]:
        df.loc[i, f"{col}_pct"] = (
            len(df.loc[i, "actual"] & df.loc[i, col]) / len(df.loc[i, "actual"]) * 100
        )

# Cleanup: keep only the user id and the percentage columns
df = df[["user_id", "predicted_pct", "popular_pct", "random_pct"]]
This is the expected result:
print(df)
# Outputs
user_id predicted_pct popular_pct random_pct
0 u1 66.666667 33.333333 0.000000
1 u2 66.666667 0.000000 66.666667
2 u3 66.666667 100.000000 66.666667
3 u4 66.666667 100.000000 33.333333
4 u5 66.666667 66.666667 33.333333
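As a side note, the explicit loop can also be avoided with a row-wise apply (the comments below suggest something similar). The following is only a rough sketch of that idea, assuming it is run on the original df with the list columns (i.e. before the cleanup step); overlap_pct is a helper name made up for this sketch:

def overlap_pct(row, col):
    # share of the "actual" items that also appear in the given column's list
    return len(set(row["actual"]) & set(row[col])) / len(row["actual"]) * 100

for col in ["predicted", "popular", "random"]:
    df[f"{col}_pct"] = df.apply(overlap_pct, axis=1, col=col)

df = df[["user_id", "predicted_pct", "popular_pct", "random_pct"]]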
Comments:

You can use .applymap instead of the for loop. The lambda part is not needed in this case.

You are right. In the meantime I had another idea for getting rid of the column names and updated my answer. Thanks for the comments.

This makes perfect sense to me, thank you, but I get an error on the last line of the second code block, which I will post in my original question (KeyError: 0). I am afraid it is because my original dataframe does not have a default index.

In that case you can reset the dataframe's index: pandas.pydata.org/pandas-docs/stable/reference/api/…
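A minimal sketch of that last suggestion (assuming the dataframe is named df, as above; the point is just to restore a default 0..n-1 integer index so that the .loc[i, ...] lookups work):

df = df.reset_index(drop=True)  # discard the old index and rebuild a default integer index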
你可以使用 .applymap 代替 for 循环 这种情况下不需要 lambda 部分 你是对的。与此同时,我有另一个想法来摆脱列名并更新我的答案。谢谢你们的cmets。 这对我来说非常有意义,谢谢,但是我在第二个代码块的最后一行出现错误,我将在我的原始帖子中发布它(KeyError:0)。我担心这是因为我的原始数据框中没有索引。 在这种情况下,您可以重置数据框的索引:pandas.pydata.org/pandas-docs/stable/reference/api/…以上是关于评估列表相似性的主要内容,如果未能解决你的问题,请参考以下文章