Python: reallocate weights with a cap

Posted 2023-02-16

【Title】Python: reallocate weights with a cap 【Asked】2022-01-11 21:11:41 【Question】:

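How can I reallocate the weights of a normalised DataFrame subject to a cap?

For example, if I have the following row of data:

0.1 0.3 0.5 0.1

I don't want any weight greater than 0.4. How can I clip the 0.5 weight and reallocate the excess so that each entry is maximised? I would then get:

0.1 0.4 0.4 0.1

So 0.5 is clipped to 0.4, and the remaining 0.1 is added to 0.3 to give 0.4. Note that in both cases the entries sum to 1 (normalised).

Can this be done in a Pythonic way, i.e. without loops?

Ideally, I would like to apply it to a DataFrame like this: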

df = pd.DataFrame({'a': [5003, 54.06, 53.654, 55.2], 'b': [np.nan, 54.1121, 53.98, 55.12], 'c': [np.nan, 2, 53.322, 54.99],
                   'd': [np.nan, 53.1, 53.212, 55.002], 'e': [np.nan, 53, 53.2, 55.021], 'f': [np.nan, 53.11, 53.120, 55.3]})
N = 5 # 1/np.sqrt(N) = 0.447214
df = df.div(df.sum(axis=1), axis=0)
df:
        a           b            c          d           e           f
    0   1.000000    NaN          NaN        NaN         NaN         NaN
    1   0.200681    0.200875    0.007424    0.197118    0.196747    0.197155
    2   0.167413    0.168431    0.166378    0.166034    0.165997    0.165747
    3   0.166952    0.166711    0.166317    0.166354    0.166411    0.167255

Thanks.

【Comments】:

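I don't know how to answer this, but I'd like to understand the problem parameters better. In your example, why is the 0.1 assigned to the second entry? How should that be decided in the more general case of a larger array?

It should be allocated in descending order, so that you maximise each entry. As another example, if we have this row: 0.01 0.5 0.45 0.04, then 0.5 is clipped to 0.4, 0.45 is clipped to 0.4, and the remaining 0.15 is allocated as follows: all 0.15 goes to 0.04 (since it is the next largest number), and we get: 0.01 0.4 0.4 0.19

【Answer 1】: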

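This works when I run it, but if you find a case where it breaks I would definitely like to know. The general idea is to melt the data into a long-form DataFrame, so that groupby operations can be used instead of explicit loops.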
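
To make the long-form idea concrete on its own, here is a minimal sketch on a small made-up frame (the `weights` data and the names `long_weights` and `excess` are illustrative assumptions, not part of the answer). It only shows the melt step and a per-row groupby sum of the weight above the cap; it does not redistribute anything.

```python
import pandas as pd

# Hypothetical row-normalised weights, two rows of four entries each.
weights = pd.DataFrame({"a": [0.1, 0.6], "b": [0.3, 0.3], "c": [0.5, 0.1], "d": [0.1, 0.0]})
cap = 0.4

# Long form: one value per (row label, column name) pair.
long_weights = (
    weights.reset_index()
    .melt(id_vars=["index"])
    .set_index(["index", "variable"])["value"]
)

# Weight above the cap, summed per original row; rows with nothing above the cap get 0.
excess = (long_weights - cap).clip(lower=0).groupby(level="index").sum()
print(excess)
```

Clipping the difference at zero before summing means every row label appears in `excess`, including rows where no weight exceeds the cap.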

import pandas as pd
import numpy as np

#Df from your example
df = pd.DataFrame({'a': [5003, 54.06, 53.654, 55.2], 'b': [np.nan, 54.1121, 53.98, 55.12], 'c': [np.nan, 2, 53.322, 54.99],
                   'd': [np.nan, 53.1, 53.212, 55.002], 'e': [np.nan, 53, 53.2, 55.021], 'f': [np.nan, 53.11, 53.120, 55.3]})

df = df.div(df.sum(axis=1), axis=0).fillna(0) #assume the nulls should be zeros so we can add to them

nrows,ncols = df.shape
min_cap = 1/ncols #note that the cap has to be at the very least larger than this value for rows to sum to 1

cap = 0.2 #just using 0.2 as an example

#convert to long form to allow for groupbys
long_df = df.reset_index().melt(id_vars=['index']).set_index(['index','variable'])['value']

#calculate excess per row and cap the overfilled entries
excess = long_df[long_df.ge(cap)].sub(cap).groupby('index').sum()
long_df[long_df.ge(cap)] = cap

#fill underfilled entries that can be completely filled
fill_space = cap-long_df
cumsum_fill = fill_space.sort_values().groupby('index').cumsum()
full_fill = excess.ge(cumsum_fill)
long_df[full_fill] = cap

#add remaining fill to largest elements of each row
final_excess = excess-cumsum_fill[full_fill].groupby('index').max()
ind_last_excess = long_df[long_df.lt(cap)].groupby('index').idxmax()
long_df[ind_last_excess] += final_excess

#pivot back to the same df shape as original
res_df = long_df.reset_index().pivot_table(values='value',index='index',columns='variable').fillna(0)
print(res_df)

Output:

variable         a         b         c         d         e         f
index                                                               
0         0.200000  0.200000  0.000000  0.200000  0.200000  0.200000
1         0.200000  0.200000  0.007424  0.197118  0.196747  0.198711
2         0.167413  0.000000  0.166378  0.166034  0.165997  0.165747
3         0.166952  0.166711  0.166317  0.166354  0.166411  0.000000
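
Note that in the output above, rows 2 and 3 each contain a zero although none of their weights exceeds the cap of 0.2, so those rows no longer sum to 1. As an alternative, the following is a minimal row-wise sketch in NumPy that follows the descending-order rule described in the question comments and leaves rows with no excess untouched. It is not the answer's method; the function name `cap_and_redistribute` is hypothetical, and treating NaN as 0 is the same assumption the answer makes.

```python
import numpy as np
import pandas as pd


def cap_and_redistribute(weights: pd.DataFrame, cap: float) -> pd.DataFrame:
    """Clip each row at `cap` and hand the excess to the largest below-cap entries first."""
    w = weights.fillna(0.0).to_numpy(dtype=float)  # assumption: NaN means zero weight
    clipped = np.minimum(w, cap)
    excess = (w - clipped).sum(axis=1, keepdims=True)  # per-row weight removed by the cap

    # Sort every row in descending order and remember how to undo the sort.
    order = np.argsort(-clipped, axis=1, kind="stable")
    sorted_w = np.take_along_axis(clipped, order, axis=1)

    # Room left under the cap at each position, and room already offered to larger entries.
    room = cap - sorted_w
    room_before = np.cumsum(room, axis=1) - room

    # Each entry receives whatever excess is still left, limited to its own room.
    added = np.clip(excess - room_before, 0.0, room)

    result = np.empty_like(w)
    np.put_along_axis(result, order, sorted_w + added, axis=1)
    # If cap * number_of_columns < 1, some excess cannot be placed and the row sums to < 1.
    return pd.DataFrame(result, index=weights.index, columns=weights.columns)


# The two rows from the question and its comments, plus a row that needs no change.
example = pd.DataFrame([[0.1, 0.3, 0.5, 0.1],
                        [0.01, 0.5, 0.45, 0.04],
                        [0.25, 0.25, 0.25, 0.25]])
capped = cap_and_redistribute(example, cap=0.4)
print(capped.round(6))
```

Ties are broken by column position here (the stable sort keeps the leftmost of equal entries first), so equal entries are not guaranteed to receive equal shares.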

【Comments】:

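How can I do this for a DataFrame rather than an array? I tried editing the existing code, but it didn't work. Thanks!

Can you edit your question and post what the head of your df looks like?

I've posted a new answer.

It works well. But in the first row, why is the entry for column c zero? Since the entries for b, c, d, e and f are the same (NaN), I would expect the weight to be distributed equally among all of them. Is there an explanation for this, and is there any way to make the clipped weight be split equally among entries that are equal?

Sorry, I don't know offhand how to add new logic for ties. I think the approach would involve a new groupby that groups the ties together and iterates over them. Good luck!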
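
One possible way to handle the tie question raised above is sketched below for a single row. It follows the suggestion of grouping tied entries and iterating over the groups, so it does use a loop over the distinct values. The function name `redistribute_with_ties` is hypothetical, ties are detected by exact floating-point equality, and NaN is treated as 0; all of these are assumptions rather than anything confirmed in the thread.

```python
import numpy as np


def redistribute_with_ties(row, cap):
    """Clip one row at `cap`; tied below-cap entries share the excess equally."""
    w = np.nan_to_num(np.asarray(row, dtype=float), nan=0.0)  # assumption: NaN -> 0
    clipped = np.minimum(w, cap)
    remaining = float((w - clipped).sum())  # excess removed by the cap
    result = clipped.copy()

    # Visit the distinct below-cap values from largest to smallest.
    for value in np.unique(clipped[clipped < cap])[::-1]:
        if remaining <= 0:
            break
        tied = clipped == value                # every entry sharing this value
        room = (cap - value) * tied.sum()      # total room the tied group can absorb
        share = min(remaining, room)
        result[tied] += share / tied.sum()     # equal split within the group
        remaining -= share
    return result


# First row of the question's normalised frame: one weight of 1, the rest NaN.
first_row = redistribute_with_ties([1.0, np.nan, np.nan, np.nan, np.nan, np.nan], cap=0.2)
print(first_row)
```

With this rule the five tied entries in that row each receive 0.16 instead of four of them being filled to the cap and one left at zero. Because the function works on one row at a time, applying it to a whole DataFrame would involve iterating over rows, for example with `DataFrame.apply` and `axis=1`.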
