Locating values in a (very large) pandas DataFrame and storing them in dictionaries

Posted 2023-04-18

【Original title】locating values in (a very large) pandas dataframe and store to dictionaries 【Asked】2018-09-18 18:49:44 【Question】:

I have a very large pandas DataFrame. It looks like this:

>> df
    "a_1"   "a_2"   "b_1"  "c_2"  ...
"d_1" nan   0.2   nan  nan
"d_2" 0.1   nan   nan   1
"e_1" nan   1     nan  0.2
"e_2" nan   0.05  0.1  0.7
"f_2" 0.2   0.5   0.3  0.9
...

I am now trying to go through a list of tuples, each holding a row-name prefix and a column-name prefix:

t = [("d", "a"), ("d", "c") ...]

For example, for i = ("d", "a"), I want the values at rows d_1 and d_2 and columns a_1 and a_2. I use the following code to locate them:

s = df.loc[["d_1", "d_2" ], ["a_1", "a_2"]]

# print(s)
#       "a_1"  "a_2"
# "d_1"  nan    0.2
# "d_2"  0.1    nan

# convert to list and sort the values
s = s.unstack().reset_index()
s.columns = ["A","B", "Score"]
scores = s.sort_values(by="Score", ascending=False).reset_index(drop=True)

# pick the index (rank) I want and save the non-NaN data to the dictionary
# (presumably this runs inside a loop over t, hence the `continue`)
rank = 1
try:
    s = scores.loc[rank, :]
except Exception:
    continue

if str(s.Score) != "nan":
    d[(s.A, s.B)] = s.Score  # output dictionary
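
For readers who want to run the snippet above, here is a minimal self-contained sketch of the same loop. The sample frame copies the rows shown earlier; the function name `pick_ranked` and the rule that a label belongs to a key when the text before its underscore matches are assumptions made for illustration, since the question does not say how the label lists are built.

```python
import numpy as np
import pandas as pd

# Sample data copied from the frame shown in the question.
df = pd.DataFrame(
    [[np.nan, 0.2, np.nan, np.nan],
     [0.1, np.nan, np.nan, 1.0],
     [np.nan, 1.0, np.nan, 0.2],
     [np.nan, 0.05, 0.1, 0.7],
     [0.2, 0.5, 0.3, 0.9]],
    index=["d_1", "d_2", "e_1", "e_2", "f_2"],
    columns=["a_1", "a_2", "b_1", "c_2"],
)


def pick_ranked(df, pairs, rank=1):
    """Hypothetical wrapper around the question's loop body."""
    out = {}
    for row_key, col_key in pairs:
        # Assumption: a label matches a key when its prefix before "_" equals the key.
        rows = [r for r in df.index if r.split("_")[0] == row_key]
        cols = [c for c in df.columns if c.split("_")[0] == col_key]
        if not rows or not cols:
            continue
        # Same steps as the question: flatten, name the columns, sort descending.
        s = df.loc[rows, cols].unstack().reset_index()
        s.columns = ["A", "B", "Score"]  # A = column label, B = row label
        scores = s.sort_values(by="Score", ascending=False).reset_index(drop=True)
        if rank >= len(scores):
            continue  # fewer entries than the requested rank
        hit = scores.loc[rank]
        if pd.notna(hit.Score):
            out[(hit.A, hit.B)] = hit.Score
    return out


d = pick_ranked(df, [("d", "a"), ("d", "c")], rank=1)
print(d)
```

With the sample data, only ("d", "a") has a second non-NaN value, so the dictionary should hold a single entry keyed by ("a_1", "d_2") with score 0.1. Sorting puts NaN last by default, which is why checking the picked score for NaN is enough.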

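The code above works, but it takes too long: with len(t) = 28350, I need to test more than 150 parameter sets, and one iteration (one parameter set) takes 3.5 minutes on a cluster.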

Is there a better way to solve this? Thanks in advance!

【Answer 1】:

How about something like this:

d = {}
for row, col in t:
    vals = df.loc[df.index.str.startswith(row),
                  df.columns.str.startswith(col)].stack().dropna()
    if len(vals):
        d[vals.idxmax()] = vals.max()
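
Note that the loop above stores the maximum for each pair, whereas the question picks rank = 1 (the second-largest value). If the per-pair slicing is still too slow, one possible variation is to reshape the frame once and rank inside groups. The following is only a sketch under assumptions: the name `top_by_prefix` is hypothetical, labels are assumed to have the form `<key>_<n>`, keys are (row label, column label) as `idxmax` returns them above, and it has not been benchmarked against the real data.

```python
import numpy as np
import pandas as pd

# Sample data copied from the frame shown in the question.
df = pd.DataFrame(
    [[np.nan, 0.2, np.nan, np.nan],
     [0.1, np.nan, np.nan, 1.0],
     [np.nan, 1.0, np.nan, 0.2],
     [np.nan, 0.05, 0.1, 0.7],
     [0.2, 0.5, 0.3, 0.9]],
    index=["d_1", "d_2", "e_1", "e_2", "f_2"],
    columns=["a_1", "a_2", "b_1", "c_2"],
)


def top_by_prefix(df, pairs, rank=0):
    """Hypothetical helper: rank-th largest non-NaN value per (row key, column key)."""
    # Reshape once to long form: one line per non-NaN cell.
    tidy = df.stack().dropna().rename("Score").reset_index()
    tidy.columns = ["row", "col", "Score"]
    # Assumption: the key is the part of the label before the underscore.
    tidy["row_key"] = tidy["row"].str.split("_").str[0]
    tidy["col_key"] = tidy["col"].str.split("_").str[0]
    # Sort descending, then number the cells within each key pair (0 = largest).
    tidy = tidy.sort_values("Score", ascending=False, kind="stable")
    tidy["rank"] = tidy.groupby(["row_key", "col_key"]).cumcount()
    picked = tidy[tidy["rank"] == rank]
    wanted = set(pairs)
    return {
        (r.row, r.col): r.Score
        for r in picked.itertuples(index=False)
        if (r.row_key, r.col_key) in wanted
    }


t = [("d", "a"), ("d", "c")]
best = top_by_prefix(df, t, rank=0)    # largest value per pair, like the loop above
second = top_by_prefix(df, t, rank=1)  # second-largest, like rank = 1 in the question
```

On the sample data, `best` should contain ("d_1", "a_2") with 0.2 and ("d_2", "c_2") with 1.0, and `second` should contain only ("d_2", "a_1") with 0.1. Ties are broken by the original cell order because the sort is stable.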
