Optimized way of performing operations on a dictionary of dataframes w.r.t. an aggregated dataframe

Posted 2023-03-12

Title: Optimized way of performing operations on a dictionary of dataframes w.r.t. an aggregated dataframe
Asked: 2020-10-01 20:41:58
Question:

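I am trying to compute, for every dataframe in a dictionary, the ratio of certain columns to the corresponding columns of `aggregated_data`.

Here `data` is a dictionary whose keys are the level names and whose values are the data for each level (as dataframes).

For example:

1) This is what the data looks like (just an example):

data = {'State': State_data, 'District': District_data}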
>>> State_data
         Time  level value  97E03K  90KFTO  FXRDW9  1I4OX9  N6HO97
0  2017-04-01  State    NY      15       7       8      19      17
1  2017-05-01  State    NY      11       8       9      16      11
2  2017-06-01  State    NY      17      16       6      12      17
3  2017-04-01  State   WDC       6      17      19       8      20
4  2017-05-01  State   WDC      19       9      20      11      17
5  2017-06-01  State   WDC      10      11       6      20      11
>>> District_data
         Time     level     value  97E03K  90KFTO  FXRDW9  1I4OX9  N6HO97
0  2017-04-01  District  Downtown       2       1       5       3       5
1  2017-05-01  District  Downtown       4       3       2       4       3
2  2017-06-01  District  Downtown       4       3       4       1       3
3  2017-04-01  District   Central       3       4       3       5       5
4  2017-05-01  District   Central       4       3       5       4       3
5  2017-06-01  District   Central       4       3       5       5       3

2) This is what the aggregated data looks like:

         Time       level       value  97E03K  90KFTO  FXRDW9  1I4OX9  N6HO97
0  2017-04-01  Aggregated  Aggregated      27      21      23      30      21
1  2017-05-01  Aggregated  Aggregated      27      29      26      22      30
2  2017-06-01  Aggregated  Aggregated      27      30      30      25      25
3  2017-04-01  Aggregated  Aggregated      22      27      30      22      25
4  2017-05-01  Aggregated  Aggregated      22      21      24      22      29
5  2017-06-01  Aggregated  Aggregated      25      27      23      22      24

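3) I have to iterate over each level and compute the ratio of that level to the aggregated data for the matching rows, according to this dictionary:

columns_to_work = {'97E03K': '97E03K', '90KFTO': '97E03K', 'FXRDW9': '97E03K', '1I4OX9': '1I4OX9', 'N6HO97': '97E03K'}

Here, for each key, I take the column named by its value, divide it by the same column of the aggregated data on the same date, and name the result key + '_rank'.

For example, for the key 90KFTO, the current level's 97E03K column must be divided by the aggregated 97E03K column at the same point in time, and the ratio is stored under the name 90KFTO_rank.

I do this for every level and append each result to a list, which I finally concatenate to get one flat dataframe containing the '_rank' columns for all input levels.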
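As a quick check of the rule, here is the arithmetic for the first State row (NY, 2017-04-01), using values copied from the sample tables above. The variable names `level_row`, `agg_row` and `ranks` are illustrative only, and the mapping is shortened to two keys.

```python
# Values copied from the sample tables: NY on 2017-04-01 and aggregated row 0.
level_row = {'97E03K': 15, '1I4OX9': 19}
agg_row = {'97E03K': 27, '1I4OX9': 30}

# Shortened mapping: output key -> source column it is computed from.
columns_to_work = {'90KFTO': '97E03K', '1I4OX9': '1I4OX9'}

# For each key, divide the level's source column by the aggregated source column.
ranks = {f'{key}_rank': level_row[src] / agg_row[src]
         for key, src in columns_to_work.items()}
print(ranks)
```

The two ratios are 15/27 and 19/30, roughly 0.5556 and 0.6333, which agree with the first row of the expected output.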

4) The final output looks like this (ratios to the aggregated data):

        Time     level     value  97E03K_rank  90KFTO_rank  FXRDW9_rank  1I4OX9_rank  N6HO97_rank
0   2017-04-01     State        NY     0.555556     0.555556     0.555556     0.633333     0.555556
1   2017-05-01     State        NY     0.407407     0.407407     0.407407     0.727273     0.407407
2   2017-06-01     State        NY     0.629630     0.629630     0.629630     0.480000     0.629630
3   2017-04-01     State       WDC     0.272727     0.272727     0.272727     0.363636     0.272727
4   2017-05-01     State       WDC     0.863636     0.863636     0.863636     0.500000     0.863636
5   2017-06-01     State       WDC     0.400000     0.400000     0.400000     0.909091     0.400000
6   2017-04-01  District  Downtown     0.074074     0.074074     0.074074     0.100000     0.074074
7   2017-05-01  District  Downtown     0.148148     0.148148     0.148148     0.181818     0.148148
8   2017-06-01  District  Downtown     0.148148     0.148148     0.148148     0.040000     0.148148
9   2017-04-01  District   Central     0.136364     0.136364     0.136364     0.227273     0.136364
10  2017-05-01  District   Central     0.181818     0.181818     0.181818     0.181818     0.181818
11  2017-06-01  District   Central     0.160000     0.160000     0.160000     0.227273     0.160000

This is the approach that needs optimizing:

import pandas as pd

samp_data = []
level = {}
lev = {}

# tim is not defined in the question; it is assumed to hold the name of the
# time column (e.g. 'Time').
for l, da in data.items():  # l is the level name, da is its dataframe
    level[l] = da.copy()  # working copy
    lev[l] = level[l][[tim, 'level', 'value']].copy()  # identifier columns of the result

    for c, d in columns_to_work.items():
        # join() matches level[l][tim] against the index of aggregated_data
        level[l] = level[l].join(aggregated_data[[d]], on=tim, rsuffix='_rank1')
        level[l].rename(columns={d + '_rank1': c + '_rank'}, inplace=True)

        # Ratio of the level's column d to the aggregated column d
        level[l][c + '_rank'] = level[l][d] / level[l][c + '_rank']
        lev[l] = pd.concat([lev[l], level[l][c + '_rank']], axis=1, sort=False)

    samp_data.append(lev[l])

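Explanation of the code, in case the logic is unclear:

The outer loop iterates over all the levels in the dictionary, and the inner loop iterates over the column names. Note that `columns_to_work` is a dictionary whose keys and values are both columns of my dataframe.

I have to compute the ratio of column d to the aggregated data for the current level, and rename the resulting column to c + "_rank".

The code above works for small datasets but fails when I try to scale it to larger ones. I am looking for an optimized way to achieve the same result. Any comments/suggestions would be greatly appreciated :)

P.S. I tried using aggregated_data as a dictionary of lists to improve performance. The problem is that some time points present in aggregated_data may be missing from the level data, so the positional mapping gets out of order.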
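One way to sidestep the ordering problem described in the P.S. is to pair rows by key rather than by position. The sketch below uses hypothetical toy frames (`level_df`, `agg_df`) and assumes the aggregate has exactly one row per `Time`, which is not the case in the sample table above; under that assumption a left merge looks up each aggregate by date.

```python
import pandas as pd

# Hypothetical toy data: the aggregate has time points (2017-05-01, 2017-07-01)
# that the level data lacks, and exactly one row per 'Time' (an assumption).
level_df = pd.DataFrame({
    'Time': ['2017-04-01', '2017-06-01'],
    '97E03K': [15, 17],
})
agg_df = pd.DataFrame({
    'Time': ['2017-04-01', '2017-05-01', '2017-06-01', '2017-07-01'],
    '97E03K': [27, 27, 27, 30],
})

# A left merge keeps every level row and finds its aggregate by key, so
# extra time points in the aggregate cannot shift the pairing.
# validate='many_to_one' raises if 'Time' is not unique in agg_df.
merged = level_df.merge(agg_df, on='Time', how='left',
                        suffixes=('', '_agg'), validate='many_to_one')
merged['97E03K_rank'] = merged['97E03K'] / merged['97E03K_agg']
print(merged)
```

If a level row had a `Time` with no aggregate, the left merge would leave `NaN` in the `_agg` column instead of silently pairing it with the wrong row.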

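Comments:

Can you add sample data (input and expected output)?

Please check the updated question.

Thanks for updating the question. I don't understand the logic used to create aggregated_data. Can you add that code?

Sure. aggregated_data is already available; I did not create it.

tim == 'Time'?

Answer 1: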

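This should work:

Step 1: Concatenate the state and district data

df = pd.concat([State_data, District_data])

Step 2: Join the state and district data to the aggregated data (using the index, because the same `Time` appears in several different rows)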

df = pd.merge(
    left=df,
    left_index=True, 
    right=aggregated_data.drop(columns=['level', 'value', 'Time']), 
    right_index=True,
    suffixes=['', '_agg']
)

Step 3: Iterate over `columns_to_work`

for k, v in columns_to_work.items():
    # k names the output column; v is the source column in both frames
    df[f'{k}_rank'] = df[v] / df[f'{v}_agg']
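If `columns_to_work` has many entries, the per-key loop can also be written as a single array division. This is a minimal sketch on a hypothetical two-row frame, not part of the original answer; whether it is measurably faster depends on the data, since each iteration of the loop above is already vectorized over the rows.

```python
import pandas as pd

# Hypothetical merged frame: level columns plus their '_agg' counterparts.
df = pd.DataFrame({
    '97E03K': [15, 11], '1I4OX9': [19, 16],
    '97E03K_agg': [27, 27], '1I4OX9_agg': [30, 22],
})
columns_to_work = {'97E03K': '97E03K', '90KFTO': '97E03K', '1I4OX9': '1I4OX9'}

keys = list(columns_to_work)
sources = list(columns_to_work.values())

# Select numerators and denominators in mapping order. .to_numpy() drops the
# column labels, so the division is positional rather than label-aligned.
num = df[sources].to_numpy()
den = df[[f'{s}_agg' for s in sources]].to_numpy()

ranks = pd.DataFrame(num / den, index=df.index,
                     columns=[f'{k}_rank' for k in keys])
df = pd.concat([df, ranks], axis=1)
print(df.filter(like='_rank'))
```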

Step 4: Sort `df` and drop the unnecessary columns

df = df[['Time', 'level', 'value', '97E03K_rank', '90KFTO_rank', 'FXRDW9_rank', '1I4OX9_rank', 'N6HO97_rank']].sort_values('level', ascending=False)
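To make the four steps easy to try, the sketch below rebuilds the sample tables from the question and runs the steps end to end. The `make_frame` helper, the stable sort and the `reset_index` call are additions for illustration and are not part of the original answer.

```python
import pandas as pd

cols = ['97E03K', '90KFTO', 'FXRDW9', '1I4OX9', 'N6HO97']
times = ['2017-04-01', '2017-05-01', '2017-06-01'] * 2

def make_frame(level, values, rows):
    """Build one sample frame (illustrative helper, not from the question)."""
    frame = pd.DataFrame(rows, columns=cols)
    frame.insert(0, 'value', values)
    frame.insert(0, 'level', level)
    frame.insert(0, 'Time', times)
    return frame

State_data = make_frame('State', ['NY'] * 3 + ['WDC'] * 3, [
    [15, 7, 8, 19, 17], [11, 8, 9, 16, 11], [17, 16, 6, 12, 17],
    [6, 17, 19, 8, 20], [19, 9, 20, 11, 17], [10, 11, 6, 20, 11]])
District_data = make_frame('District', ['Downtown'] * 3 + ['Central'] * 3, [
    [2, 1, 5, 3, 5], [4, 3, 2, 4, 3], [4, 3, 4, 1, 3],
    [3, 4, 3, 5, 5], [4, 3, 5, 4, 3], [4, 3, 5, 5, 3]])
aggregated_data = make_frame('Aggregated', ['Aggregated'] * 6, [
    [27, 21, 23, 30, 21], [27, 29, 26, 22, 30], [27, 30, 30, 25, 25],
    [22, 27, 30, 22, 25], [22, 21, 24, 22, 29], [25, 27, 23, 22, 24]])

columns_to_work = {'97E03K': '97E03K', '90KFTO': '97E03K', 'FXRDW9': '97E03K',
                   '1I4OX9': '1I4OX9', 'N6HO97': '97E03K'}

# Step 1: stack the levels; each frame keeps its own 0..5 index labels.
df = pd.concat([State_data, District_data])

# Step 2: index-on-index merge, so row i of every level meets aggregated row i.
df = pd.merge(df, aggregated_data.drop(columns=['level', 'value', 'Time']),
              left_index=True, right_index=True, suffixes=('', '_agg'))

# Step 3: one vectorized division per output column.
for k, v in columns_to_work.items():
    df[f'{k}_rank'] = df[v] / df[f'{v}_agg']

# Step 4: keep the identifier and rank columns, State rows first.
rank_cols = [f'{k}_rank' for k in columns_to_work]
result = (df[['Time', 'level', 'value'] + rank_cols]
          .sort_values('level', ascending=False, kind='stable')
          .reset_index(drop=True))
print(result.round(3))
```

Because the merge is on the index, it relies on every level frame having its rows in the same order as `aggregated_data`; that holds for the sample data but is worth verifying on real inputs.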

Final result:

       Time     level     value  97E03K_rank  90KFTO_rank  FXRDW9_rank  1I4OX9_rank  N6HO97_rank
 2017-04-01     State        NY        0.556        0.556        0.556        0.633        0.556
 2017-05-01     State        NY        0.407        0.407        0.407        0.727        0.407
 2017-06-01     State        NY        0.630        0.630        0.630        0.480        0.630
 2017-04-01     State       WDC        0.273        0.273        0.273        0.364        0.273
 2017-05-01     State       WDC        0.864        0.864        0.864        0.500        0.864
 2017-06-01     State       WDC        0.400        0.400        0.400        0.909        0.400
 2017-04-01  District  Downtown        0.074        0.074        0.074        0.100        0.074
 2017-05-01  District  Downtown        0.148        0.148        0.148        0.182        0.148
 2017-06-01  District  Downtown        0.148        0.148        0.148        0.040        0.148
 2017-04-01  District   Central        0.136        0.136        0.136        0.227        0.136
 2017-05-01  District   Central        0.182        0.182        0.182        0.182        0.182
 2017-06-01  District   Central        0.160        0.160        0.160        0.227        0.160

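Discussion:

I forgot to update the question. I was actually already doing steps 1 and 2 in my solution. Your step 3 improved my code's performance considerably. Thanks a lot for the reply, mate!!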
