Getting TypeError: reduction operation 'argmax' not allowed for this dtype when trying to use idxmax()

【Posted】: 2018-07-21 01:21:27 【Question】:

I keep getting this error when using the idxmax() function in Pandas.

Traceback (most recent call last):
  File "/Users/username/College/year-4/fyp-credit-card-fraud/code/main.py", line 20, in <module>
    best_c_param = classify.print_kfold_scores(X_training_undersampled, y_training_undersampled)
  File "/Users/username/College/year-4/fyp-credit-card-fraud/code/Classification.py", line 39, in print_kfold_scores
    best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter']
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 1369, in idxmax
    i = nanops.nanargmax(_values_from_object(self), skipna=skipna)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/nanops.py", line 74, in _f
    raise TypeError(msg.format(name=f.__name__.replace('nan', '')))
TypeError: reduction operation 'argmax' not allowed for this dtype

The Pandas version I am using is 0.22.0.

main.py

import ExploratoryDataAnalysis as eda
import Preprocessing as processor
import Classification as classify
import pandas as pd


data_path = '/Users/username/college/year-4/fyp-credit-card-fraud/data/'

if __name__ == '__main__':
    df = pd.read_csv(data_path + 'creditcard.csv')
    # eda.init(df)
    # eda.check_null_values()
    # eda.view_data()
    # eda.check_target_classes()
    df = processor.noramlize(df)

    X_training, X_testing, y_training, y_testing, X_training_undersampled, X_testing_undersampled, \
    y_training_undersampled, y_testing_undersampled = processor.resample(df)

    best_c_param = classify.print_kfold_scores(X_training_undersampled, y_training_undersampled)

Classification.py

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, auc, \
    roc_auc_score, roc_curve, recall_score, classification_report
import pandas as pd
import numpy as np


def print_kfold_scores(X_training, y_training):
    print('\nKFold\n')

    fold = KFold(len(y_training), 5, shuffle=False)

    c_param_range = [0.01, 0.1, 1, 10, 100]

    results = pd.DataFrame(index=range(len(c_param_range), 2), columns=['C_parameter', 'Mean recall score'])
    results['C_parameter'] = c_param_range

    j = 0
    for c_param in c_param_range:
        print('-------------------------------------------')
        print('C parameter: ', c_param)
        print('\n-------------------------------------------')

        recall_accs = []
        for iteration, indices in enumerate(fold, start=1):
            lr = LogisticRegression(C=c_param, penalty='l1')
            lr.fit(X_training.iloc[indices[0], :], y_training.iloc[indices[0], :].values.ravel())

            y_prediction_undersampled = lr.predict(X_training.iloc[indices[1], :].values)
            recall_acc = recall_score(y_training.iloc[indices[1], :].values, y_prediction_undersampled)
            recall_accs.append(recall_acc)
            print('Iteration ', iteration, ': recall score = ', recall_acc)

        results.ix[j, 'Mean recall score'] = np.mean(recall_accs)
        j += 1
        print('\nMean recall score ', np.mean(recall_accs))
        print('\n')

    best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter'] # Error occurs on this line

    print('*****************************************************************')
    print('Best model to choose from cross validation is with C parameter = ', best_c_param)
    print('*****************************************************************')

    return best_c_param

The line causing the problem is this one:

best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter']

The output of the program is as follows:

/Library/Frameworks/Python.framework/Versions/3.6/bin/python3.6 /Users/username/College/year-4/fyp-credit-card-fraud/code/main.py
/Users/username/Library/Python/3.6/lib/python/site-packages/sklearn/cross_validation.py:41: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
Dataset Ratios

Percentage of genuine transactions:  0.5
Percentage of fraudulent transactions 0.5
Total number of transactions in resampled data:  984


Whole Dataset Split

Number of transactions in training dataset:  199364
Number of transactions in testing dataset:  85443
Total number of transactions in dataset:  284807


Undersampled Dataset Split

Number of transactions in training dataset 688
Number of transactions in testing dataset:  296
Total number of transactions in dataset:  984

KFold

-------------------------------------------
C parameter:  0.01

-------------------------------------------
Iteration  1 : recall score =  0.931506849315
Iteration  2 : recall score =  0.917808219178
Iteration  3 : recall score =  1.0
Iteration  4 : recall score =  0.959459459459
Iteration  5 : recall score =  0.954545454545

Mean recall score  0.9526639965


-------------------------------------------
C parameter:  0.1

-------------------------------------------
Iteration  1 : recall score =  0.849315068493
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.915254237288
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.909090909091

Mean recall score  0.89652397189


-------------------------------------------
C parameter:  1

-------------------------------------------
Iteration  1 : recall score =  0.86301369863
Iteration  2 : recall score =  0.86301369863
Iteration  3 : recall score =  0.983050847458
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.924242424242

Mean recall score  0.915853322981


-------------------------------------------
C parameter:  10

-------------------------------------------
Iteration  1 : recall score =  0.849315068493
Iteration  2 : recall score =  0.876712328767
Iteration  3 : recall score =  0.983050847458
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.939393939394

Mean recall score  0.918883626012


-------------------------------------------
C parameter:  100

-------------------------------------------
Iteration  1 : recall score =  0.86301369863
Iteration  2 : recall score =  0.876712328767
Iteration  3 : recall score =  0.983050847458
Iteration  4 : recall score =  0.945945945946
Iteration  5 : recall score =  0.924242424242

Mean recall score  0.918593049009


Traceback (most recent call last):
  File "/Users/username/College/year-4/fyp-credit-card-fraud/code/main.py", line 20, in <module>
    best_c_param = classify.print_kfold_scores(X_training_undersampled, y_training_undersampled)
  File "/Users/username/College/year-4/fyp-credit-card-fraud/code/Classification.py", line 39, in print_kfold_scores
    best_c_param = results.loc[results['Mean recall score'].idxmax()]['C_parameter']
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/series.py", line 1369, in idxmax
    i = nanops.nanargmax(_values_from_object(self), skipna=skipna)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/core/nanops.py", line 74, in _f
    raise TypeError(msg.format(name=f.__name__.replace('nan', '')))
TypeError: reduction operation 'argmax' not allowed for this dtype

Process finished with exit code 1

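【Comments】:

Two comments for those who end up here: 1. The argmax method is deprecated; use idxmax instead. 2. Before doing anything else, check your type with print(df['columnName'].dtype) and make sure it is numeric (i.e. integer, float, ...). If it returns only object, use df['columnName'].astype(float).

【Answer 1】:

By default, the cell values here have a non-numeric type: a DataFrame created from only an index and column names, as in the question, gets object-dtype columns. argmin(), idxmin(), argmax() and other similar functions require numeric dtypes.

The simplest solution is to use pd.to_numeric() to convert your series (or column) to a numeric type. An example for a data frame df with a column 'a':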

df['a'] = pd.to_numeric(df['a'])
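
To make the conversion concrete, here is a minimal sketch. The frame and its values are hypothetical stand-ins for the question's results table, not the asker's data. Whether idxmax() actually raises on an object column depends on the pandas version (the traceback in the question comes from 0.22.0), so the sketch only shows the dtype before and after the conversion.

```python
import pandas as pd

# Hypothetical stand-in for the results table: the scores are floats,
# but the column is stored with the generic object dtype.
df = pd.DataFrame({'a': pd.Series([0.95, 0.89, 0.91], dtype=object)})
dtype_before = df['a'].dtype        # object

# Convert the column to a numeric dtype.
df['a'] = pd.to_numeric(df['a'])
dtype_after = df['a'].dtype         # float64

# idxmax() returns the index label of the largest value.
best_row = df['a'].idxmax()
print(dtype_before, dtype_after, best_row)
```

Checking the dtype first, as the comment on the question suggests, tells you whether the conversion is needed at all.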

A more complete answer on type conversion in pandas was linked here in the original post.

Hope that helps :)

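【Discussion】:

【Answer 2】:

If NaNs are present (we can see this from the stack trace), then although you think you are working with a numeric data frame, you most likely have mixed types, specifically strings among the numbers. Let me give you 3 code examples: the first 2 work, and the last one does not, which is most likely your case.

This represents all-numeric data; it will work with idxmax: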

import pandas as pd

the_dict = {}
the_dict['a'] = [0.1, 0.2, 0.5]
the_dict['b'] = [0.3, 0.4, 0.6]
the_dict['c'] = [0.25, 0.3, 0.9]
the_dict['d'] = [0.2, 0.1, 0.4]
the_df = pd.DataFrame(the_dict)

This represents numeric data containing a NaN; it will also work with idxmax:

import numpy as np

the_dict = {}
the_dict['a'] = [0.1, 0.2, 0.5]
the_dict['b'] = [0.3, 0.4, 0.6]
the_dict['c'] = [0.25, 0.3, 0.9]
the_dict['d'] = [0.2, 0.1, np.nan]  # a real float NaN keeps the column numeric
the_df = pd.DataFrame(the_dict)

This could be the exact problem the OP reported; in any case, if the types turn out to be mixed in any way, we get the error the OP reported:

the_dict = {}
the_dict['a'] = [0.1, 0.2, 0.5]
the_dict['b'] = [0.3, 0.4, 0.6]
the_dict['c'] = [0.25, 0.3, 0.9]
the_dict['d'] = [0.2, 0.1, 'NaN']  # the string 'NaN' makes column 'd' object dtype
the_df = pd.DataFrame(the_dict)
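
One way to confirm the mixed-type suspicion is to count the Python types stored in the column and then coerce it. The sketch below is only an illustration: the series `col` is a hypothetical copy of column 'd' from the third example, and the variable names are not from the original post.

```python
import pandas as pd

# Hypothetical column mirroring 'd' in the third example:
# two floats and the string 'NaN'.
col = pd.Series([0.2, 0.1, 'NaN'])

# Count the Python type names actually stored in the column.
type_counts = col.map(lambda v: type(v).__name__).value_counts()

# With errors='coerce', entries that are not valid numbers end up as
# real float NaN values, so the result has a numeric dtype.
cleaned = pd.to_numeric(col, errors='coerce')

# idxmax() skips NaN by default and returns the label of the largest value.
best_label = cleaned.idxmax()
print(type_counts)
print(cleaned.dtype, best_label)
```

If the type counts show anything other than numeric types, that column is the likely source of the error.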

【Discussion】:

【Answer 3】:

In short, try this:

best_c = results_table.loc[results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']

instead of

best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']
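
As an alternative to casting afterwards, the table can be built so that the column is numeric from the start. This is a sketch of a different approach from the one in this answer, under the assumption that the mean scores are collected before the DataFrame is created; the score values are copied from the program output in the question.

```python
import pandas as pd

c_param_range = [0.01, 0.1, 1, 10, 100]
# Mean recall scores copied from the program output in the question.
mean_recalls = [0.9526639965, 0.89652397189, 0.915853322981,
                0.918883626012, 0.918593049009]

# Collect one record per C value, then build the frame in one step so
# that pandas infers a float dtype for both columns.
rows = [{'C_parameter': c, 'Mean recall score': m}
        for c, m in zip(c_param_range, mean_recalls)]
results = pd.DataFrame(rows)

# Select the C value on the row with the highest mean recall.
best_c_param = results.loc[results['Mean recall score'].idxmax(), 'C_parameter']
print(results.dtypes)
print(best_c_param)
```

Because no empty object-dtype frame is ever created, no astype(float) call is needed before idxmax().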

【Discussion】:

【Answer 4】:
#best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']

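We should replace this line of code.

Main problems:

1) The type of 'Mean recall score' is object, so 'idxmax()' cannot be used on its values.
2) You should change 'Mean recall score' from 'object' to 'float'.
3) You can use apply(pd.to_numeric, errors='coerce', axis=0) to do this.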

best_c = results_table  # an alias, not a copy: results_table itself is modified below
best_c.dtypes.eq(object)  # shows which columns have object dtype
new = best_c.columns[best_c.dtypes.eq(object)]  # names of the object-dtype columns
best_c[new] = best_c[new].apply(pd.to_numeric, errors='coerce', axis=0)  # convert them to numeric; unparseable values become NaN
best_c  # inspect the converted table
best_c = results_table.loc[results_table['Mean recall score'].idxmax()]['C_parameter']  # C value on the row with the highest mean recall
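
The snippet above converts results_table in place, because best_c = results_table binds a second name to the same object. If the original table should stay untouched, the same idea can be wrapped in a small helper that returns a converted copy. This is a sketch: coerce_object_columns is a hypothetical name, and the sample frame is made up for illustration.

```python
import pandas as pd


def coerce_object_columns(frame):
    """Return a copy of frame with its object-dtype columns converted to numeric.

    Hypothetical helper; values that cannot be parsed become NaN.
    """
    out = frame.copy()
    object_cols = out.columns[out.dtypes.eq(object)]
    for col in object_cols:
        out[col] = pd.to_numeric(out[col], errors='coerce')
    return out


# Made-up table whose score column is stored as object dtype.
results_table = pd.DataFrame({
    'C_parameter': [0.01, 0.1, 1.0],
    'Mean recall score': pd.Series([0.95, 0.89, 0.91], dtype=object),
})

converted = coerce_object_columns(results_table)
best_c = converted.loc[converted['Mean recall score'].idxmax(), 'C_parameter']
print(converted.dtypes)
print(best_c)
```

Converting column by column keeps the helper independent of how a given pandas version handles assignment to several columns at once.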

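【Discussion】:

Welcome to SO! Could you elaborate a little? Code-only answers may be considered low quality and deleted for that reason.

TypeError: reduction operation 'argmax' not allowed for this dtype. Problems: 1) The type of 'Mean recall score' is object, so 'idxmax()' cannot be used on its values. 2) You should change 'Mean recall score' from 'object' to 'float'. 3) You can use apply(pd.to_numeric, errors='coerce', axis=0) to do this.

Please edit the question with all the relevant information, thanks!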
