Pandas: how to test that top-n-dataframe really results from original dataframe

Posted 2023-03-31

[Original title]: Pandas: how to test that top-n-dataframe really results from original dataframe [Posted]: 2016-02-22 19:15:28 [Question]:

I have a DataFrame, foo:

       A   B   C   D   E
    0  50  46  18  65  55
    1  48  56  98  71  96
    2  99  48  36  79  70
    3  15  24  25  67  34
    4  77  67  98  22  78

And another DataFrame, bar, which contains the 2 largest values of each row of foo. All other values have been replaced with zeros to create sparsity:

        A  B   C   D   E
    0   0  0   0  65  55
    1   0  0  98   0  96
    2  99  0   0  79   0
    3   0  0   0  67  34
    4   0  0  98   0  78
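
For reference, a frame like bar can be derived from foo by ranking each row. This is only a sketch of one possible construction: the question does not say how bar was actually produced, and the names `ranks` and `built` are made up for illustration.

```python
import pandas as pd

foo = pd.DataFrame(
    [[50, 46, 18, 65, 55],
     [48, 56, 98, 71, 96],
     [99, 48, 36, 79, 70],
     [15, 24, 25, 67, 34],
     [77, 67, 98, 22, 78]],
    columns=list("ABCDE"),
)

# Rank each row in descending order: the largest value gets rank 1.
ranks = foo.rank(axis=1, method="first", ascending=False)

# Keep the cells ranked 1 or 2 and replace everything else with 0.
built = foo.where(ranks <= 2, 0)
print(built)
```

With `method="first"`, ties are broken by column order, so each row keeps exactly two values.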

How can I test that every row in bar really contains the required values?

One more thing: the solution should work for large DataFrames, i.e. 20000 x 20000.

[Answer 1]:

Obviously you could do this with a loop and an efficient sort, but perhaps a better way is:

    import numpy as np

    n = foo.shape[0]  # number of rows

    # Test 1:
    # bar matches the original data in exactly two positions per row
    diff = foo - bar
    test1 = ((diff == 0).sum(axis=1) == 2).sum() == n

    # Test 2:
    # bar has 3 zeros in each row (5 columns minus the 2 kept values)
    test2 = ((bar == 0).sum(axis=1) == 3).sum() == n

    # Test 3:
    # the 2 numbers that bar keeps are the row maxima
    bar2 = bar.replace(0, np.nan)  # ignore the zeroed cells when taking the min
    # the max of the dropped values is smaller than the min of the kept values
    # (use <= instead of < if a row may contain ties)
    row_ok = diff.max(axis=1) < bar2.min(axis=1)
    test3 = row_ok.sum() == n

I think this covers all the cases, but I haven't tested all of it...
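
The three checks can also be folded into one function that works for any n. The following is a minimal sketch, not the answer author's code: `is_top_n` is a hypothetical helper name, it assumes every value in foo is strictly positive (so a 0 in bar can only mean "dropped"), and it uses `>=` rather than the strict comparison so that tied values are accepted.

```python
import numpy as np
import pandas as pd


def is_top_n(foo: pd.DataFrame, bar: pd.DataFrame, n: int = 2) -> bool:
    """Hypothetical helper: True if each row of bar keeps exactly the n
    largest values of the matching row of foo and zeros elsewhere.

    Assumes all values in foo are strictly positive.
    """
    if foo.shape != bar.shape:
        return False
    f = foo.to_numpy()
    b = bar.to_numpy()
    kept = b != 0  # cells that bar did not zero out

    # Check 1: exactly n non-zero cells per row.
    if not (kept.sum(axis=1) == n).all():
        return False
    # Check 2: every kept cell equals the original value.
    if not (b[kept] == f[kept]).all():
        return False
    # Check 3: the smallest kept value is at least the largest dropped value.
    smallest_kept = np.where(kept, f, np.inf).min(axis=1)
    largest_dropped = np.where(kept, -np.inf, f).max(axis=1)
    return bool((smallest_kept >= largest_dropped).all())


foo = pd.DataFrame(
    [[50, 46, 18, 65, 55],
     [48, 56, 98, 71, 96],
     [99, 48, 36, 79, 70],
     [15, 24, 25, 67, 34],
     [77, 67, 98, 22, 78]],
    columns=list("ABCDE"),
)
bar = pd.DataFrame(
    [[0, 0, 0, 65, 55],
     [0, 0, 98, 0, 96],
     [99, 0, 0, 79, 0],
     [0, 0, 0, 67, 34],
     [0, 0, 98, 0, 78]],
    columns=list("ABCDE"),
)

# A deliberately wrong frame: row 0 keeps 50 instead of 55.
bad = bar.copy()
bad.loc[0, "A"] = 50
bad.loc[0, "E"] = 0

print(is_top_n(foo, bar))  # True
print(is_top_n(foo, bad))  # False
```

The function allocates a few temporary arrays the size of the input, so for a 20000 x 20000 frame it may be worth calling it on blocks of rows rather than on the whole frame at once.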
