Calculate precision, recall, and accuracy between the rows in a pandas DataFrame with scikit-learn

Posted 2023-03-12


【Title】: Calculate precision, recall, accuracy between the rows in pandas Dataframe with scikit learn 【Asked】: 2021-10-13 19:24:04 【Question】:

I have multiple pandas DataFrames, as follows:

import pandas as pd

data1 = {'1': [4], '2': [2], '3': [6]}
original = pd.DataFrame(data1)

data2 = {'1': [3], '2': [5], '5': [5]}
predect1 = pd.DataFrame(data2)

data3 = {'1': [2], '3': [4], '5': [5], '6': [2]}
predect2 = pd.DataFrame(data3)

data4 = {'1': [4], '2': [2], '3': [6]}
predect3 = pd.DataFrame(data4)

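How can I calculate the precision, accuracy, and recall of predect1, predect2, and predect3 (each separately) against the original DataFrame?

Note: a prediction may have some extra columns compared with the original DataFrame, so I need to take the number of available columns into account and handle the extra ones. Is there a way to find the accuracy and calculate precision and recall?

Column names: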

Index(['1', '2', '3'], dtype='object')
Index(['1', '2', '5'], dtype='object')
Index(['1', '3', '5', '6'], dtype='object')
Index(['1', '2', '3'], dtype='object')
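
One possible way to separate the comparable columns from the extra ones is sketched below, using the question's `original` and `predect2` frames. The question does not say how extra or missing columns should be scored, so this sketch only identifies them; the variable names `shared`, `extra`, and `missing` are illustrative.

```python
import pandas as pd

original = pd.DataFrame({'1': [4], '2': [2], '3': [6]})
predect2 = pd.DataFrame({'1': [2], '3': [4], '5': [5], '6': [2]})

# Columns present in both frames: only these can be compared value by value
shared = original.columns.intersection(predect2.columns)
# Columns that appear only in the prediction ("extra")
extra = predect2.columns.difference(original.columns)
# Columns of the original that the prediction lacks ("missing")
missing = original.columns.difference(predect2.columns)

print(list(shared), list(extra), list(missing))
```

Metrics can then be computed on `shared` only, while the sizes of `extra` and `missing` can be reported alongside them.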

【Comments】:

Could you list the column names of the predicted and actual frames, or the column names mentioned in the dictionaries?

Index(['1', '2', '3'], dtype='object') Index(['1', '2', '5'], dtype='object') Index(['1', '3', '5', '6'], dtype='object') Index(['1', '2', '3'], dtype='object')

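【Answer 1】:

Here is my answer. First, I created some datasets of my own, because the data you shared (one row per frame) cannot be used to build a working example. Also, since each column contains multiple classes, I used the 'macro' average for both precision and recall.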

import numpy as np
import pandas as pd
from sklearn.metrics import precision_score, recall_score, accuracy_score

data1 = {'1': [0, 1, 2, 3, 1], '2': [0, 1, 2, 3, 1], '3': [0, 1, 2, 3, 1]}
original = pd.DataFrame(data1)

data2 = {'1': [0, 1, 2, 3, 0], '2': [1, 1, 2, 2, 1], '5': [0, 1, 2, 3, 3]}
predect1 = pd.DataFrame(data2)

# Define a function that returns all three metrics for each shared column
def get_all(pred, orig):
    # Columns present in both frames; extra columns in pred are ignored
    check_col = sorted(set(orig.columns) & set(pred.columns))

    # Lists to return, one entry per shared column
    precisions = []
    recalls = []
    accuracies = []

    # Iterate over each shared column
    for col in check_col:
        y_true, y_pred = orig[col].values, pred[col].values
        # Restrict the macro average to labels that were actually predicted
        labels = np.unique(y_pred)
        precisions.append(precision_score(y_true, y_pred, average='macro', labels=labels))
        recalls.append(recall_score(y_true, y_pred, average='macro', labels=labels))
        accuracies.append(accuracy_score(y_true, y_pred))

    # Return the per-column values
    return precisions, recalls, accuracies

# Finally, run the function
precisions, recalls, accuracies = get_all(predect1, original)
print(precisions, recalls, accuracies)
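
To illustrate what the 'macro' average does, here is a small sketch using the values of column '1' from the answer's data. It computes the per-class precision with `average=None` and checks that the macro score is simply the unweighted mean of those per-class values. The variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_score

y_true = [0, 1, 2, 3, 1]  # column '1' of original
y_pred = [0, 1, 2, 3, 0]  # column '1' of predect1

# One precision value per class, in sorted label order (0, 1, 2, 3)
per_class = precision_score(y_true, y_pred, average=None)
# Macro average: unweighted mean of the per-class values
macro = precision_score(y_true, y_pred, average='macro')

print(per_class, macro)
```

Because every class counts equally, a rare class that is predicted badly lowers the macro score as much as a frequent one would.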

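You can take the average of these values, and so on, as needed. Also, for simplicity, I ran the function on only one prediction DataFrame.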
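
As a rough sketch of that averaging step, the per-column scores can be reduced to one mean per metric and collected for several prediction frames. The helper `mean_scores` and the second frame `predect3` (a copy of the original) are assumptions made for illustration; the answer itself only defines `predect1`. Passing `zero_division=0` is also a choice made here, not part of the answer.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

original = pd.DataFrame({'1': [0, 1, 2, 3, 1], '2': [0, 1, 2, 3, 1], '3': [0, 1, 2, 3, 1]})
predect1 = pd.DataFrame({'1': [0, 1, 2, 3, 0], '2': [1, 1, 2, 2, 1], '5': [0, 1, 2, 3, 3]})
# Hypothetical second prediction: identical to the original
predect3 = original.copy()

def mean_scores(pred, orig):
    """Average each metric over the columns shared by pred and orig."""
    shared = orig.columns.intersection(pred.columns)
    rows = []
    for col in shared:
        y_true, y_pred = orig[col], pred[col]
        labels = np.unique(y_pred)  # only labels that were predicted
        rows.append({
            'precision': precision_score(y_true, y_pred, average='macro',
                                         labels=labels, zero_division=0),
            'recall': recall_score(y_true, y_pred, average='macro',
                                   labels=labels, zero_division=0),
            'accuracy': accuracy_score(y_true, y_pred),
        })
    # One row per shared column, then the mean of each metric
    return pd.DataFrame(rows, index=shared).mean()

predictions = {'predect1': predect1, 'predect3': predect3}
# One row per prediction frame, one column per metric
summary = pd.DataFrame({name: mean_scores(pred, original)
                        for name, pred in predictions.items()}).T
print(summary)
```

Averaging over columns hides per-column differences, so keeping the per-column table may be preferable when the columns measure different things.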

【Discussion】:

I got a warning: `UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use zero_division parameter to control this behavior.`

I have edited the answer. For more details on this warning, see this
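
The `zero_division` parameter named in the warning can be set explicitly. The sketch below uses made-up labels in which class 1 is never predicted, so its precision is 0/0; with `zero_division=0` that class is scored as 0 and no `UndefinedMetricWarning` is raised. The data and variable names are illustrative, and the parameter is assumed to be available in the installed scikit-learn version.

```python
import warnings
from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

y_true = [0, 1, 1]
y_pred = [0, 0, 0]  # label 1 is never predicted

with warnings.catch_warnings():
    # Turn the warning into an exception to show that it is not raised
    warnings.simplefilter("error", category=UndefinedMetricWarning)
    macro_precision = precision_score(y_true, y_pred, average='macro',
                                      zero_division=0)

# Class 0 has precision 1/3, class 1 is set to 0, so the macro mean is 1/6
print(macro_precision)
```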
