Calculate difference or compare two dictionaries - Groundtruth and clustering

Posted 2023-03-12

【Title】: Calculate difference or compare two dictionaries - Groundtruth and clustering 【Posted】: 2020-10-20 11:25:32 【Question】:

I have two dictionaries, h and c. Here 1, 2, 3 are folder names, and IMG_0001... are the image files contained in each of those folders.

This is my ground truth:

h = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0003.png', 'IMG_0004.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0051.png', 'IMG_0052.png', 'IMG_0053.png', 'IMG_0054.png']}

This is my clustering output:

c = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0053.png', 'IMG_0054.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0003.png', 'IMG_0004.png', 'IMG_0051.png', 'IMG_0052.png']}

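Now I need to check and compare the two dictionaries and generate an accuracy_score for each folder. How can I write this in Python? There is a cluster evaluation metric, the Adjusted Rand Index (ARI), but I don't know how to use it here to compare the ground truth and clustering dictionaries. Thank you for your help. I am a Python beginner.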

import os, pprint
pp = pprint.PrettyPrinter()
h = {}  # ground truth: folder name -> list of file names
for subdir, dirs, files in os.walk(r"folder_paths"):    
    for file in files:
        key, value = os.path.basename(subdir), file  #Get basefolder name & file name
        h.setdefault(key, []).append(value)          #Form DICT
pp.pprint(h)


#####################################

import os, pprint
pp = pprint.PrettyPrinter()
c = {}  # clustering output: folder name -> list of file names
for subdir, dirs, files in os.walk(r"folder_paths"):    
    for file in files:
        key, value = os.path.basename(subdir), file  #Get basefolder name & file name
        c.setdefault(key, []).append(value)          #Form DICT
pp.pprint(c)

#####################################


# diff = {}
# #value = set(h.values()).intersection(set(c.values()))
# value = {k: second_dict[k] for k in set(second_dict) - set(first_dict)}
# print(value)

print("Changes in Ground Truth and Clustering")
import dictdiffer
for diff in list(dictdiffer.diff(h, c)):         
    print(diff)
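
If installing dictdiffer is not an option, the same per-folder comparison can be approximated with plain sets. The following is only a sketch: the helper name folder_changes is made up for illustration, and it assumes that a cluster key in c refers to the same folder as the identical key in h.

```python
h = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0003.png', 'IMG_0004.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0051.png', 'IMG_0052.png', 'IMG_0053.png', 'IMG_0054.png']}

c = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0053.png', 'IMG_0054.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0003.png', 'IMG_0004.png', 'IMG_0051.png', 'IMG_0052.png']}


def folder_changes(truth, clusters):
    """Hypothetical helper: list files missing from / extra in each folder."""
    changes = {}
    for folder in sorted(set(truth) | set(clusters)):
        expected = set(truth.get(folder, []))
        actual = set(clusters.get(folder, []))
        if expected != actual:
            changes[folder] = {
                "missing": sorted(expected - actual),  # in h but not in c
                "extra": sorted(actual - expected),    # in c but not in h
            }
    return changes


changes = folder_changes(h, c)
for folder, change in changes.items():
    print(folder, change)
# 1 {'missing': ['IMG_0003.png', 'IMG_0004.png'], 'extra': ['IMG_0053.png', 'IMG_0054.png']}
# 3 {'missing': ['IMG_0053.png', 'IMG_0054.png'], 'extra': ['IMG_0003.png', 'IMG_0004.png']}
```

Folders whose contents match exactly (here folder 2) are left out of the result.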

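【Comments】:

Hi, can you describe what you are trying to achieve? That is, what output do you want from the comparison code? (Assume I don't know what you mean by "ground truth" or "clustering" in this context.)

【Answer 1】: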

from sklearn.metrics import accuracy_score

h = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0003.png', 'IMG_0004.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0051.png', 'IMG_0052.png', 'IMG_0053.png', 'IMG_0054.png']}

c = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0053.png', 'IMG_0054.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0003.png', 'IMG_0004.png', 'IMG_0051.png', 'IMG_0052.png']}

# Flatten the ground truth into one list; this fixes the image order used for both label lists.
# The outputs shown in the comments assume insertion-ordered dicts (Python 3.7+).
images = []
for key, value in h.items():
    images.extend(value)
print(images)  # ['IMG_0001.png', 'IMG_0002.png', 'IMG_0003.png', 'IMG_0004.png', 'IMG_0020.png', ..., 'IMG_0054.png']

# Map each image to its ground-truth folder.
reverse_h = {}
for key, value in h.items():
    for img in value:
        reverse_h[img] = key
print(reverse_h)  # {'IMG_0001.png': '1', 'IMG_0002.png': '1', 'IMG_0003.png': '1', ..., 'IMG_0054.png': '3'}

y_true = [reverse_h[img] for img in images]
print(y_true)  # ['1', '1', '1', '1', '2', '2', '2', '2', '3', '3', '3', '3']

# Map each image to the folder assigned by the clustering.
reverse_c = {}
for key, value in c.items():
    for img in value:
        reverse_c[img] = key

print(reverse_c)  # {'IMG_0001.png': '1', 'IMG_0002.png': '1', 'IMG_0053.png': '1', ..., 'IMG_0052.png': '3'}

y_pred = [reverse_c[img] for img in images]
print(y_pred)  # ['1', '1', '3', '3', '2', '2', '2', '2', '3', '3', '1', '1']

# Overall accuracy: 8 of the 12 images keep their ground-truth folder.
score = accuracy_score(y_true, y_pred)
print(score)  # 0.6666666666666666
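
The score above is a single number for all images, while the question asks for one score per folder. A minimal sketch of that, using only plain Python: per_folder_accuracy is a made-up helper name, and it assumes (as the code above does) that cluster key '1' in c is meant to be the same folder as '1' in h.

```python
h = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0003.png', 'IMG_0004.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0051.png', 'IMG_0052.png', 'IMG_0053.png', 'IMG_0054.png']}

c = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0053.png', 'IMG_0054.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0003.png', 'IMG_0004.png', 'IMG_0051.png', 'IMG_0052.png']}


def per_folder_accuracy(truth, clusters):
    """Hypothetical helper: fraction of each folder's images that the
    clustering placed under the same key."""
    scores = {}
    for folder, files in truth.items():
        predicted = set(clusters.get(folder, []))
        hits = sum(1 for name in files if name in predicted)
        # An empty ground-truth folder gets 0.0 to avoid dividing by zero.
        scores[folder] = hits / len(files) if files else 0.0
    return scores


scores = per_folder_accuracy(h, c)
print(scores)  # {'1': 0.5, '2': 1.0, '3': 0.5}
```

Keep in mind that clustering algorithms usually assign arbitrary cluster ids, so this only makes sense when the keys of c have already been matched to the folder names in h.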

【Comments】:

How should I create these y_true and y_pred lists so that I can use the metric from sklearn?
y_true = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3]
y_pred = [1, 1, 3, 3, 2, 2, 2, 2, 3, 3, 1, 1]
metrics.adjusted_rand_score(y_true, y_pred)
In the example above, 2 images from the first folder and 2 from the third folder changed places.
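
One way to build both lists directly from the dictionaries is sketched below. The helper names labels_from_dicts and adjusted_rand_index are made up for this example. The ARI is computed by pair counting in plain Python so the snippet runs without scikit-learn; passing the same two lists to sklearn.metrics.adjusted_rand_score(y_true, y_pred) should give the same value. The sketch assumes every image in h also appears somewhere in c.

```python
from collections import Counter
from math import comb

h = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0003.png', 'IMG_0004.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0051.png', 'IMG_0052.png', 'IMG_0053.png', 'IMG_0054.png']}

c = {'1': ['IMG_0001.png', 'IMG_0002.png', 'IMG_0053.png', 'IMG_0054.png'],
     '2': ['IMG_0020.png', 'IMG_0021.png', 'IMG_0022.png', 'IMG_0023.png'],
     '3': ['IMG_0003.png', 'IMG_0004.png', 'IMG_0051.png', 'IMG_0052.png']}


def labels_from_dicts(truth, clusters):
    """Hypothetical helper: return (y_true, y_pred) in the same image order."""
    cluster_of = {img: key for key, imgs in clusters.items() for img in imgs}
    y_true, y_pred = [], []
    for key, imgs in truth.items():
        for img in imgs:
            y_true.append(key)
            y_pred.append(cluster_of[img])  # KeyError if img is missing from clusters
    return y_true, y_pred


def adjusted_rand_index(y_true, y_pred):
    """Pair-counting form of the Adjusted Rand Index."""
    total_pairs = comb(len(y_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as trivially identical
    same_both = sum(comb(n, 2) for n in Counter(zip(y_true, y_pred)).values())
    same_true = sum(comb(n, 2) for n in Counter(y_true).values())
    same_pred = sum(comb(n, 2) for n in Counter(y_pred).values())
    expected = same_true * same_pred / total_pairs
    maximum = (same_true + same_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings are a single cluster
    return (same_both - expected) / (maximum - expected)


y_true, y_pred = labels_from_dicts(h, c)
print(y_true)  # ['1', '1', '1', '1', '2', '2', '2', '2', '3', '3', '3', '3']
print(y_pred)  # ['1', '1', '3', '3', '2', '2', '2', '2', '3', '3', '1', '1']

ari = adjusted_rand_index(y_true, y_pred)
print(round(ari, 4))  # 0.3889
```

Unlike accuracy, the ARI does not care what the clusters are called: renaming the keys of c leaves the value unchanged, which suits clustering output where the ids are arbitrary.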
