Get the max value across multiple columns and return specific values

Posted 2023-03-11

【Title】: Get Max value comparing multiple columns and return specific values
【Posted】: 2020-01-22 03:07:36
【Question】:

I have a dataframe like this:

Sequence    Duration1   Value1  Duration2   Value2  Duration3   Value3
1001        145         10      125         53      458         33
1002        475         20      175         54      652         45
1003        685         57      687         87      254         88
1004        125         54      175         96      786         96
1005        475         21      467         32      526         32
1006        325         68      301         54      529         41
1007        125         97      325         85      872         78
1008        129         15      429         41      981         82
1009        547         47      577         52      543         83
1010        666         65      722         63      257         87

I want to find the maximum Duration across (Duration1, Duration2, Duration3) and return the corresponding Value and Sequence.

My desired output:

Sequence,Duration3,Value3
1008,    981,      82
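
For readers who want to run the answers that follow, the table above can be rebuilt as a DataFrame. This is only a minimal sketch; the variable name `df` is chosen because the answers assume it.

```python
import pandas as pd

# Data copied from the table in the question.
df = pd.DataFrame({
    'Sequence':  [1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009, 1010],
    'Duration1': [145, 475, 685, 125, 475, 325, 125, 129, 547, 666],
    'Value1':    [10, 20, 57, 54, 21, 68, 97, 15, 47, 65],
    'Duration2': [125, 175, 687, 175, 467, 301, 325, 429, 577, 722],
    'Value2':    [53, 54, 87, 96, 32, 54, 85, 41, 52, 63],
    'Duration3': [458, 652, 254, 786, 526, 529, 872, 981, 543, 257],
    'Value3':    [33, 45, 88, 96, 32, 41, 78, 82, 83, 87],
})

print(df.head())
```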

【Comments】:

【Answer 1】:

Try this very short code, based mainly on NumPy:

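import numpy as np

vv = df.iloc[:, 1::2].values  # the Duration columns (every second column, starting at 1)
iRow, iCol = np.unravel_index(vv.argmax(), vv.shape)  # row/column of the max within vv
iCol = iCol * 2 + 1  # map the column back to its position in df
result = df.iloc[iRow, [0, iCol, iCol + 1]]  # Sequence, DurationN, ValueN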

The result is a Series:

Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64

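If you want to "reshape" it into a one-row DataFrame (the index values become the column names, the actual values become the row), you can run something like: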

pd.DataFrame([result.values], columns=result.index)
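
The `unravel_index` step above relies on the Duration columns sitting at positions 1, 3, 5. A possible variant, sketched here on a three-row subset of the question's data, selects the columns by name instead. The pairing rule `DurationN` to `ValueN` is an assumption about the naming scheme, and the names `dur_cols`, `dur_name` and `val_name` are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Three rows taken from the question's table.
df = pd.DataFrame({
    'Sequence':  [1003, 1008, 1010],
    'Duration1': [685, 129, 666],
    'Value1':    [57, 15, 65],
    'Duration2': [687, 429, 722],
    'Value2':    [87, 41, 63],
    'Duration3': [254, 981, 257],
    'Value3':    [88, 82, 87],
})

# Pick the duration columns by name rather than by position.
dur_cols = [c for c in df.columns if c.startswith('Duration')]
vv = df[dur_cols].to_numpy()

# argmax() gives a position in the flattened array;
# unravel_index turns it back into (row, column) coordinates.
i_row, i_col = np.unravel_index(vv.argmax(), vv.shape)

dur_name = dur_cols[i_col]
val_name = dur_name.replace('Duration', 'Value')  # assumed naming convention
result = df.loc[df.index[i_row], ['Sequence', dur_name, val_name]]
print(result)
```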

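【Comments】:

This depends heavily on the column order, though (I suppose using .reindex at the start would make it safe).

Sir, I really like your answer. I have a similar problem and need your help. ***.com/questions/58102325/…

【Answer 2】:

With wide data it is easier to reshape first with wide_to_long. This creates 2 columns, ['Duration', 'Value'], and the MultiIndex tells us which number each row came from. It does not depend on any particular column ordering.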

import pandas as pd

df = pd.wide_to_long(df, i='Sequence', j='num', stubnames=['Duration', 'Value'])
df.loc[[df.Duration.idxmax()]]

              Duration  Value
Sequence num                 
1008     3         981     82
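
If the output has to look like the one requested in the question (`Sequence, Duration3, Value3`), the long result can be turned back into a one-row frame. The following is a rough sketch on a three-row subset of the question's data; the names `long_df`, `seq`, `num` and `row` are introduced here for illustration only.

```python
import pandas as pd

# Three rows taken from the question's table.
df = pd.DataFrame({
    'Sequence':  [1003, 1008, 1010],
    'Duration1': [685, 129, 666],
    'Value1':    [57, 15, 65],
    'Duration2': [687, 429, 722],
    'Value2':    [87, 41, 63],
    'Duration3': [254, 981, 257],
    'Value3':    [88, 82, 87],
})

# Reshape: one row per (Sequence, num) pair, columns Duration and Value.
long_df = pd.wide_to_long(df, stubnames=['Duration', 'Value'],
                          i='Sequence', j='num')

# idxmax on a MultiIndex returns the (Sequence, num) tuple of the max.
seq, num = long_df['Duration'].idxmax()
row = long_df.loc[(seq, num)]

# Rebuild the wide-style column names from the numeric suffix.
result = pd.DataFrame({
    'Sequence': [seq],
    f'Duration{num}': [row['Duration']],
    f'Value{num}': [row['Value']],
})
print(result)
```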

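【Comments】:

【Answer 3】:

Without `numpy` magic:

First off, others have offered some really good solutions to this problem. The data is the data provided in the question, as df.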

# find the max value in the Duration columns
max_value = max(df.filter(like='Dur', axis=1).max().tolist())

# get a Boolean match of the dataframe for max_value
df_max = df[df == max_value]

# get the row index
max_index = df_max.dropna(how='all').index[0]

# get the column name
max_col = df_max.dropna(axis=1, how='all').columns[0]

# get column index
max_col_index = df.columns.get_loc(max_col)

# final
df.iloc[max_index, [0, max_col_index, max_col_index + 1]]

Output:

Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64

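Update

Last night, actually at 4 a.m., I passed over a better solution because I was too tired. I used max_value = max(df.filter(like='Dur', axis=1).max().tolist()), which returns the maximum value in the Duration columns, instead of max_col_name = df.filter(like='Dur', axis=1).max().idxmax(), which returns the name of the column where the maximum occurs. I did that because my groggy brain told me I was returning the max of the column names rather than the max value in the columns. For example: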

test = ['Duration5', 'Duration2', 'Duration3']
print(max(test))
>>> 'Duration5'

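This is why being overtired is a poor condition for problem solving. With sleep and coffee comes a more efficient solution, similar to the others in its use of idxmax.

New and improved solution: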

# column name with max duration value
max_col_name = df.filter(like='Dur', axis=1).max().idxmax()

# index of max_col_name
max_col_idx = df.columns.get_loc(max_col_name)

# row index of max value in max_col_name
max_row_idx = df[max_col_name].idxmax()

# output with .iloc
df.iloc[max_row_idx, [0, max_col_idx, max_col_idx + 1]]

Output:

Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64

Methods used:

pandas.DataFrame.max, pandas.DataFrame.filter, pandas.DataFrame.idxmax, pandas.Index.get_loc, pandas.DataFrame.iloc
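
The improved solution takes the column maxima over the whole frame and then scans the winning column a second time. One possible variant, which is not taken from the answer above, finds the row first and then only inspects that single row. It is sketched on a three-row subset of the question's data and assumes the `DurationN` / `ValueN` naming; `durations`, `seq`, `dur_name` and `val_name` are illustrative names.

```python
import pandas as pd

# Three rows taken from the question's table.
df = pd.DataFrame({
    'Sequence':  [1003, 1008, 1010],
    'Duration1': [685, 129, 666],
    'Value1':    [57, 15, 65],
    'Duration2': [687, 429, 722],
    'Value2':    [87, 41, 63],
    'Duration3': [254, 981, 257],
    'Value3':    [88, 82, 87],
})

durations = df.set_index('Sequence').filter(like='Duration')

# Row-wise maxima, then the Sequence label of the largest one.
seq = durations.max(axis=1).idxmax()

# Within that single row, the column that holds the maximum.
dur_name = durations.loc[seq].idxmax()
val_name = dur_name.replace('Duration', 'Value')  # assumed naming convention

result = df.loc[df['Sequence'] == seq, ['Sequence', dur_name, val_name]]
print(result)
```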

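【Comments】:

It should be noted that this performs a lot of redundant computation on the dataframe.

@MateenUlhaq I think this party is more about seeing how many ways the problem can be solved. It's not the most elegant solution, but I'm pleased to have learned something from my effort and from the other answers. Also, those are some great photos on your profile.

【Answer 4】:

You can get the index of a column's maximum value with: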

>>> idx = df['Duration3'].idxmax()
>>> idx
7

And the relevant columns only with:

>>> df_cols = df[['Sequence', 'Duration3', 'Value3']]
>>> df_cols.loc[idx]
Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64

So, just wrap all of that into a nice function:

def get_max(df, i):
    idx = df[f'Duration{i}'].idxmax()
    df_cols = df[['Sequence', f'Duration{i}', f'Value{i}']]
    return df_cols.loc[idx]

Then loop over 1..3:

>>> max_rows = [get_max(df, i) for i in range(1, 4)]
>>> print('\n\n'.join(map(str, max_rows)))
Sequence     1003
Duration1     685
Value1         57
Name: 2, dtype: int64

Sequence     1010
Duration2     722
Value2         63
Name: 9, dtype: int64

Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64

If you want to reduce these 3 to a single max row, you can do the following:

>>> pairs = enumerate(max_rows, 1)
>>> by_duration = lambda x: x[1][f'Duration{x[0]}']
>>> i, max_row = max(pairs, key=by_duration)
>>> max_row
Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64

【Comments】:

Sir, is it possible to filter only the max Duration, with the result as "Sequence,Duration3,Value3" "1008, 981, 82"?

【Answer 5】:

Here is another way,

m=df.set_index('Sequence') #set Sequence as index
n=m.filter(like='Duration') #gets all columns with the name Duration
s=n.idxmax()[n.eq(n.values.max()).any()]
#output Duration3    1008
d = dict(zip(m.columns[::2],m.columns[1::2])) #create a mapper dict
#{'Duration1': 'Value1', 'Duration2': 'Value2', 'Duration3': 'Value3'}
final=m.loc[s.values,s.index.union(s.index.map(d))].reset_index()

   Sequence  Duration3  Value3
0      1008        981      82
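
The expression that builds `s` packs several steps into one line. As a rough illustration on a three-row subset of the question's data, it can be split up as follows; the intermediate names `per_col_argmax`, `overall_max` and `has_overall` are introduced here and do not appear in the answer.

```python
import pandas as pd

# Three rows taken from the question's table.
df = pd.DataFrame({
    'Sequence':  [1003, 1008, 1010],
    'Duration1': [685, 129, 666],
    'Value1':    [57, 15, 65],
    'Duration2': [687, 429, 722],
    'Value2':    [87, 41, 63],
    'Duration3': [254, 981, 257],
    'Value3':    [88, 82, 87],
})

m = df.set_index('Sequence')
n = m.filter(like='Duration')

# For each Duration column, the Sequence at which that column peaks.
per_col_argmax = n.idxmax()

# The single largest duration anywhere in the frame.
overall_max = n.values.max()

# True only for the column(s) that contain that overall maximum.
has_overall = n.eq(overall_max).any()

# Keep only the winning column: its name is the index, its Sequence the value.
s = per_col_argmax[has_overall]
print(s)
```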

【Comments】:

【Answer 6】:

If I understood the question correctly, given the following dataframe:

df = pd.DataFrame(data={'Seq': [1, 2, 3], 'Dur1': [2, 7, 3], 'Val1': ['x', 'y', 'z'], 'Dur2': [3, 5, 1], 'Val2': ['a', 'b', 'c']})
    Seq  Dur1 Val1  Dur2 Val2
0    1     2    x     3    a
1    2     7    y     5    b
2    3     3    z     1    c

These 5 lines of code solve your problem:

dur_col = [col_name for col_name in df.columns if col_name.startswith('Dur')] # ['Dur1', 'Dur2'] 
max_dur_name = df.loc[:, dur_col].max().idxmax()
val_name = "Val" + str([int(s) for s in max_dur_name if s.isdigit()][0])

filter_col = ['Seq', max_dur_name, val_name]

df_res = df[filter_col].sort_values(max_dur_name, ascending=False).head(1)

You will get:

   Seq  Dur1 Val1 
1    2     7    y

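Code explanation:

I automatically collect the columns that start with 'Dur', then find the name of the column with the longest duration: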

dur_col = [col_name for col_name in df.columns if col_name.startswith('Dur')] # ['Dur1', 'Dur2'] 
max_dur_name = df.loc[:, dur_col].max().idxmax()
val_name = "Val" + str([int(s) for s in max_dur_name if s.isdigit()][0])

Select the columns I am interested in:

filter_col = ['Seq', max_dur_name, val_name]

I filter down to those columns, sort by max_dur_name in descending order, and take the first row as the result:

df_res = df[filter_col].sort_values(max_dur_name, ascending=False).head(1)

# output:
   Seq  Dur1 Val1 
1    2     7    y

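【Comments】:

Sir, my requirement is that if Dur1 has the max value, the output should contain only "Seq", "Dur1", "Val1". If Dur2 has the max value, the output should be "Seq", "Dur2", "Val2".

【Answer 7】:

Somewhat similar to @Massifox's answer, but I think it is different enough to be worth adding.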

mvc = df[[name for name in df.columns if 'Duration' in name]].max().idxmax()
mvidx = df[mvc].idxmax()
valuecol = 'Value' + mvc[-1]
df.loc[mvidx, ['Sequence', mvc, valuecol]]

mvc

'Duration3'

mvidx

7

valuecol

'Value3'

Finally, with loc I select the desired output, which is:

Sequence     1008
Duration3     981
Value3         82
Name: 7, dtype: int64
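
Taking the last character with `mvc[-1]` works for the single-digit suffixes in the question, but it would pick only the final digit of a name such as `Duration12`. A small sketch of a more general mapping follows; the helper name `value_column_for` is hypothetical, and the `DurationN` / `ValueN` pairing is an assumption about the naming scheme.

```python
import re

def value_column_for(duration_col):
    """Map 'DurationN' to 'ValueN' for any number of digits in N."""
    match = re.fullmatch(r'Duration(\d+)', duration_col)
    if match is None:
        raise ValueError(f'not a Duration column: {duration_col!r}')
    return 'Value' + match.group(1)

print(value_column_for('Duration3'))
print(value_column_for('Duration12'))
```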

【Comments】:

【Answer 8】:

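# column names, assumed to match the dataframe in the question
seq = 'Sequence'
dur1, dur2, dur3 = 'Duration1', 'Duration2', 'Duration3'
val1, val2, val3 = 'Value1', 'Value2', 'Value3'

if len(df[df[dur1] >= df[dur2].max()]) == 0:
    # no Duration1 value reaches the Duration2 maximum
    if len(df[df[dur2] >= df[dur3].max()]) == 0:
        print(df.loc[df[dur3].idxmax(), [seq, dur3, val3]])
    else:
        print(df.loc[df[dur2].idxmax(), [seq, dur2, val2]])
else:
    if len(df[df[dur1] >= df[dur3].max()]) == 0:
        print(df.loc[df[dur3].idxmax(), [seq, dur3, val3]])
    else:
        print(df.loc[df[dur1].idxmax(), [seq, dur1, val1]])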

【Comments】:
