Python list find elements with close value
Posted: 2016-11-23 04:41:00

Question: First off, this is not a duplicate post; my question is different from the ones I have searched for on this site, but if you do find one that is already answered, feel free to link it.
Explanation:

If you think about how you yourself would spot that 10 in A and 2.10 in B below are the first elements that are not close, that is what I am trying to do programmatically. A hard-coded threshold is not an option. Of course we need a threshold here, but the function should work the threshold out from the values it is given; for A it might be around 1.1, and for B around 0.01. How? Well, "it just makes sense", right? We look at the values and figure it out. That is what I mean by a "dynamic threshold", if your answer involves one.
A = [1.1, 1.02, 2.3, 10, 10.01, 10.1, 12, 16, 18, 18]
B = [1.01, 1.02, 1.001, 1.03, 2.10, 2.94, 3.01, 8.99]
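For illustration, one way to make the threshold "dynamic" is to compare each gap against the spread of the values accepted so far. This is only a sketch of the idea, not a definitive answer; the function name and the factor of 3 are my own choices:

```python
def split_at_first_outlier_gap(values, factor=3.0):
    """Keep accepting sorted values until a gap dwarfs the spread seen so far."""
    vals = sorted(values)
    accepted = vals[:2]  # need two points before a spread is defined
    for v in vals[2:]:
        spread = accepted[-1] - accepted[0]
        gap = v - accepted[-1]
        if spread > 0 and gap > factor * spread:
            break
        accepted.append(v)
    return accepted

B = [1.01, 1.02, 1.001, 1.03, 2.10, 2.94, 3.01, 8.99]
print(split_at_first_outlier_gap(B))  # -> [1.001, 1.01, 1.02, 1.03]
```

On B this matches the intuition that 2.10 is the first element that is not close. On A, however, it cuts at 2.3 (a gap of 1.2 against a spread of 0.08), earlier than the 10 suggested above, which illustrates exactly why a single data-derived rule is hard to pin down.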
The Python problem:

I have a 2D list in Python as below. To narrow it down to the closely spaced items, working from the top (the list is already sorted, as you can see), we can easily spot that the first five entries are much closer to one another than the fifth is to the sixth.
subSetScore = [
['F', 'H', 0.12346022214809049],
['C', 'E', 0.24674283702138702],
['C', 'G', 0.24675055907681284],
['E', 'G', 0.3467125665641178],
['B', 'D', 0.4720531092083966],
['A', 'H', 0.9157739970594413],
['A', 'C', 0.9173801845880128],
['A', 'G', 0.9174496830868454],
['A', 'B', 0.918924595673178],
['A', 'F', 0.9403919097569715],
['A', 'E', 0.9419672090638398],
['A', 'D', 0.9436390340635308],
['B', 'H', 1.3237456293166292],
['D', 'H', 1.3237456293166292],
['D', 'F', 1.3238460160371646],
['B', 'C', 1.3253518168452008],
['D', 'E', 1.325421315344033],
['D', 'G', 1.325421315344033],
['B', 'F', 1.349344243053239],
['B', 'E', 1.350919542360107],
['B', 'G', 1.350919542360107],
['C', 'H', 1.7160260449485403],
['E', 'H', 1.7238716532611786],
['G', 'H', 1.7238716532611786],
['E', 'F', 1.7239720399817142],
['C', 'F', 1.7416246586851503],
['C', 'D', 1.769389308968704],
['F', 'G', 2.1501908892101267]
]
Expected result:
closest = [
['F', 'H', 0.12346022214809049],
['C', 'E', 0.24674283702138702],
['C', 'G', 0.24675055907681284],
['E', 'G', 0.3467125665641178],
['B', 'D', 0.4720531092083966]
]
Contrary to the other questions I have seen here, where a 1D or 2D list and an arbitrary value, say 0.9536795380033108, are given and the function must find that 0.9436390340635308 is the closest value in the list, most of those solutions use the absolute difference, which does not seem to apply here.

One approach that seems partially reliable is to compute the consecutive differences, as follows.
consecutiveDifferences = []
for index, item in enumerate(subSetScore):
    if index == 0:
        continue
    consecutiveDifferences.append([index, subSetScore[index][2] - subSetScore[index - 1][2]])
This gives me:
consecutiveDifferences = [
[1, 0.12328261487329653],
[2, 7.722055425818386e-06],
[3, 0.09996200748730497],
[4, 0.1253405426442788],
[5, 0.4437208878510447],
[6, 0.0016061875285715566],
[7, 6.949849883253201e-05],
[8, 0.0014749125863325885],
[9, 0.021467314083793543],
[10, 0.001575299306868283],
[11, 0.001671824999690985],
[12, 0.3801065952530984],
[13, 0.0],
[14, 0.00010038672053536146],
[15, 0.001505800808036195],
[16, 6.949849883230996e-05],
[17, 0.0],
[18, 0.0239229277092059],
[19, 0.001575299306868061],
[20, 0.0],
[21, 0.36510650258843325],
[22, 0.007845608312638364],
[23, 0.0],
[24, 0.00010038672053558351],
[25, 0.01765261870343604],
[26, 0.027764650283553793],
[27, 0.38080158024142263]
]
Now, the index of the first difference that is greater than the difference at index 0 is my cutoff index, like so:
cutoff = -1
for index, item in enumerate(consecutiveDifferences):
    if index == 0:
        continue
    if consecutiveDifferences[index][1] > consecutiveDifferences[0][1]:
        cutoff = index
        break
cutoff = cutoff + 1
closest = subSetScore[:cutoff + 1]
My resulting list (closest) is as follows:
closest = [
['F', 'H', 0.12346022214809049],
['C', 'E', 0.24674283702138702],
['C', 'G', 0.24675055907681284],
['E', 'G', 0.3467125665641178],
['B', 'D', 0.4720531092083966]
]
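The same logic can be packaged as a function (my own wrapper, equivalent to the two loops above: it stops at the first consecutive gap that exceeds the very first gap):

```python
def find_closest(scores):
    # scores is assumed sorted ascending by the numeric third column
    diffs = [scores[i][2] - scores[i - 1][2] for i in range(1, len(scores))]
    for i in range(1, len(diffs)):
        # the first gap larger than the initial gap marks the cutoff
        if diffs[i] > diffs[0]:
            return scores[:i + 2]
    return scores

head = [
    ['F', 'H', 0.12346022214809049],
    ['C', 'E', 0.24674283702138702],
    ['C', 'G', 0.24675055907681284],
    ['E', 'G', 0.3467125665641178],
    ['B', 'D', 0.4720531092083966],
    ['A', 'H', 0.9157739970594413],
]
print(len(find_closest(head)))  # -> 5
```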
But this logic is clearly flawed; it does not work for the following scenario:
subSetScore = [
['A', 'C', 0.143827143333704],
['A', 'G', 0.1438310043614169],
['D', 'F', 0.15684652878164498],
['B', 'H', 0.1568851390587741],
['A', 'H', 0.44111469414482873],
['A', 'F', 0.44121508086536443],
['A', 'E', 0.4441224347331875],
['A', 'B', 0.4465394380814708],
['A', 'D', 0.4465394380814708],
['D', 'H', 0.7595452327118624],
['B', 'F', 0.7596456194323981],
['B', 'E', 0.7625529733002212],
['D', 'E', 0.7625529733002212],
['B', 'C', 0.7635645625610041],
['B', 'G', 0.763661088253827],
['D', 'G', 0.763661088253827],
['B', 'D', 0.7649699766485044],
['C', 'G', 0.7891593152699012],
['G', 'H', 1.0785858136575361],
['C', 'H', 1.0909217972002916],
['C', 'F', 1.0910221839208274],
['C', 'E', 1.0939295377886504],
['C', 'D', 1.0963465411369335],
['E', 'H', 1.3717343427604187],
['E', 'F', 1.3718347294809543],
['E', 'G', 1.3758501983023834],
['F', 'H', 2.0468234552800326],
['F', 'G', 2.050939310821997]
]
Here the cutoff works out to 2, so closest becomes:
closest = [
['A', 'C', 0.143827143333704],
['A', 'G', 0.1438310043614169],
['D', 'F', 0.15684652878164498]
]
But this is the expected result:
closest = [
['A', 'C', 0.143827143333704],
['A', 'G', 0.1438310043614169],
['D', 'F', 0.15684652878164498],
['B', 'H', 0.1568851390587741]
]
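The failure mode is easy to see numerically: in this dataset the very first gap is tiny (about 3.9e-6), so almost any later gap exceeds it. The cutoff fires at the ~0.013 gap between ('A', 'G') and ('D', 'F') instead of at the real break of ~0.28 before ('A', 'H'). A quick check:

```python
head = [
    ['A', 'C', 0.143827143333704],
    ['A', 'G', 0.1438310043614169],
    ['D', 'F', 0.15684652878164498],
    ['B', 'H', 0.1568851390587741],
    ['A', 'H', 0.44111469414482873],
]
gaps = [head[i][2] - head[i - 1][2] for i in range(1, len(head))]
# gaps[0] ~ 3.9e-6, gaps[1] ~ 0.013, gaps[3] ~ 0.284:
# comparing every gap against gaps[0] cuts far too early
```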
More datasets:
subSetScore1 = [
['A', 'C', 0.22406316023573888],
['A', 'G', 0.22407088229116476],
['D', 'F', 0.30378179942424355],
['B', 'H', 0.3127393837182006],
['A', 'F', 0.4947366470217576],
['A', 'H', 0.49582931786451195],
['A', 'E', 0.5249800770970015],
['A', 'B', 0.6132933639744492],
['A', 'D', 0.6164207964219085],
['D', 'H', 0.8856811470650012],
['B', 'F', 0.8870402288199465],
['D', 'E', 0.916716087821392],
['B', 'E', 0.929515394689697],
['B', 'C', 1.0224773589334915],
['D', 'G', 1.0252457158036496],
['B', 'G', 1.0815974152736079],
['B', 'D', 1.116948985013035],
['G', 'H', 1.1663971669323054],
['C', 'F', 1.1671269011700458],
['C', 'G', 1.202339473911808],
['C', 'H', 1.28446739439317],
['C', 'E', 1.4222597514115916],
['E', 'F', 1.537160075120155],
['E', 'H', 1.5428705351075527],
['C', 'D', 1.6198555666753154],
['E', 'G', 1.964274682777963],
['F', 'H', 2.3095586690883034],
['F', 'G', 2.6867154391687365]
]
subSetScore2 = [
['A', 'H', 0.22812496138972285],
['A', 'C', 0.23015200093900193],
['A', 'B', 0.2321751794605681],
['A', 'G', 0.23302074452969593],
['A', 'D', 0.23360762074205865],
['A', 'F', 0.24534900601702558],
['A', 'E', 0.24730268603975933],
['B', 'F', 0.24968107911091342],
['B', 'E', 0.2516347591336472],
['B', 'H', 0.2535228016852614],
['B', 'C', 0.25554984123454044],
['C', 'F', 0.2766387746024686],
['G', 'H', 0.2767739105724205],
['D', 'F', 0.2855654706747223],
['D', 'E', 0.28751915069745604],
['D', 'G', 0.30469686299220383],
['D', 'H', 0.30884360675587186],
['E', 'F', 0.31103280946909323],
['E', 'H', 0.33070474566638247],
['B', 'G', 0.7301435066780336],
['B', 'D', 0.7473019138342167],
['C', 'E', 0.749630113545103],
['C', 'H', 0.7515104340412913],
['F', 'H', 0.8092791306818884],
['E', 'G', 0.8506307374871814],
['C', 'G', 1.2281311390340637],
['C', 'D', 1.2454208211324858],
['F', 'G', 1.3292051225026873]
]
subSetScore3 = [
['A', 'F', 0.06947533266614773],
['B', 'F', 0.06947533266614773],
['C', 'F', 0.06947533266614773],
['D', 'F', 0.06947533266614773],
['E', 'F', 0.06947533266614773],
['A', 'H', 0.07006993093393628],
['B', 'H', 0.07006993093393628],
['D', 'H', 0.07006993093393628],
['E', 'H', 0.07006993093393628],
['G', 'H', 0.07006993093393628],
['A', 'E', 0.09015499709650715],
['B', 'E', 0.09015499709650715],
['D', 'E', 0.09015499709650715],
['A', 'C', 0.10039444259115113],
['A', 'G', 0.10039444259115113],
['B', 'C', 0.10039444259115113],
['D', 'G', 0.10039444259115113],
['A', 'D', 0.1104369756724366],
['A', 'B', 0.11063388808579513],
['B', 'G', 2.6511978452376543],
['B', 'D', 2.6612403783189396],
['C', 'H', 2.670889086573508],
['C', 'E', 2.690974152736078],
['C', 'G', 5.252017000877225],
['E', 'G', 5.252017000877225],
['C', 'D', 5.262059533958511],
['F', 'H', 5.322704696245228],
['F', 'G', 10.504651766188518]
]
How should I fix this, without using any libraries (other than NumPy and SciPy)?

Please note: I am on Python 2.7, and any library that ships with Python (e.g. itertools, operator, math, and so on) is fine to use.

Update: It turns out I can use SciPy, but I am not sure what the impact of not knowing the number of clusters will be; I think 2 may be enough, but I am by no means a clustering expert, so feel free to make suggestions. Much appreciated!
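Since SciPy is now an option, here is a minimal sketch of 1D clustering with scipy.cluster.vq.kmeans2 on the warm-up list A from above. The initial centroids 1.0 and 10.0 are my own choice, passed explicitly via minit='matrix' so the run is repeatable; plain k-means initialization is randomized:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

A = np.array([1.1, 1.02, 2.3, 10.0, 10.01, 10.1, 12.0, 16.0, 18.0, 18.0])

# two explicit starting centroids -> deterministic result
init = np.array([[1.0], [10.0]])
centroids, labels = kmeans2(A.reshape(-1, 1), init, minit='matrix')
# labels -> [0 0 0 1 1 1 1 1 1 1]: the first three values cluster
# together, i.e. 10 is the first element not close to the initial group
```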
Comments on the question:

- How do you determine whether two elements are close? Is it just the numbers? Have you tried k-means clustering?
- I suggest you look at clustering algorithms and how to implement them with NumPy. Clustering algorithms basically group items that are similar to each other according to criteria you specify. There are many clustering algorithms; k-means is one of the most popular. (Note: if you are willing to use the scipy library, you will find k-means already implemented as a function you only have to call. It saves you the trouble of reinventing the wheel.)
- Even k-means may not be enough, since the problem seems quite fuzzy. I don't want to overthink this, but taking percentage differences and trying to model their distribution to spot outliers might help identify the breaks automatically.
- Great suggestion! I think this may only be achievable with Pycluster.
- Vivek, another great suggestion. But since this is dynamic data, an X% percentage-difference threshold may make sense for one problem and not for another. I can't seem to rule them all with the same stick.

Answer 1:

Please take a look at this code:
t1 = [0.12,0.24,0.24,0.34,0.47,0.91,0.91,0.91,0.91,0.94,0.94,0.94,1.32,1.32,1.32,1.32,1.32,1.32,1.34,1.35,1.35,1.71,1.72,1.72,1.72,1.74,1.76,2.15]
t2 = [0.22,0.22,0.30,0.31,0.49,0.49,0.52,0.61,0.61,0.88,0.88,0.91,0.92,1.02,1.02,1.08,1.11,1.16,1.16,1.20,1.28,1.42,1.53,1.54,1.61,1.96,2.30,2.68]
t3 = [0.22,0.23,0.23,0.23,0.23,0.24,0.24,0.24,0.25,0.25,0.25,0.27,0.27,0.28,0.28,0.30,0.30,0.31,0.33,0.73,0.74,0.74,0.75,0.80,0.85,1.22,1.24,1.32]
t4 = [0.06,0.06,0.06,0.06,0.06,0.07,0.07,0.07,0.07,0.07,0.09,0.09,0.09,0.10,0.10,0.10,0.10,0.11,0.11,2.65,2.66,2.67,2.69,5.25,5.25,5.26,5.32,0.50]
ts = [t1, t2, t3, t4]
threshold = 0.5 # 0.3 worked a bit better
for t in ts:
    abs_differences = [abs(t[idx] - t[idx + 1]) for idx in range(len(t) - 1)]
    # remove all elements after cut off index
    cutOffIndex = [p > max(abs_differences) * threshold for p in abs_differences].index(True)
    # Print index + values.
    print zip(t, [p > max(abs_differences) * threshold for p in abs_differences])
    # Print only indices.
    # print [p > max(abs_differences) * threshold for p in abs_differences]
This lets you determine the indices at which the signal level changes. You can tune the sensitivity with threshold, which is expressed as a fraction of the largest signal change.
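To actually trim a list at the detected cut-off (as in x = x[0:cutOffIndex+1] mentioned in the discussion), a small helper along the same lines; same threshold idea, the names are mine:

```python
def trim_by_threshold(t, threshold=0.5):
    # flag each consecutive jump that exceeds `threshold` x the largest jump
    abs_differences = [abs(t[i] - t[i + 1]) for i in range(len(t) - 1)]
    flags = [d > max(abs_differences) * threshold for d in abs_differences]
    cut_off_index = flags.index(True)
    return t[:cut_off_index + 1]

t1 = [0.12, 0.24, 0.24, 0.34, 0.47, 0.91, 0.91, 0.94]
print(trim_by_threshold(t1))  # -> [0.12, 0.24, 0.24, 0.34, 0.47]
```

Note that the largest jump always flags itself, so flags.index(True) is guaranteed to find a cut-off whenever the list is not constant.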
Comments:

- As mentioned in the comments on the OP, a hard-coded threshold does not make sense. Sorry.
- Then how do you determine what is close enough, as you describe above? If you can tell me, I will gladly improve my suggestion :-) Do you know the number of clusters?
- I don't see why this doesn't help you. Assuming you don't know the number of clusters, this can still help if you tune the threshold to a value that fits your needs. The threshold is relative to the maximum percentage change of the signal, so it should generally carry over to other signals...
- The reason is that the 2D list above is just one dataset; I have to do the same for datasets I have not seen yet, so how would I know in advance how to tune the threshold as you suggest? Believe me, I am already using something very similar wherever possible.
- I appreciate it! Yes, only the first occurrence, and then I update the list as x = x[0:cutOffIndex+1].

Answer 2:

Special thanks to Ohumeronen for the great help, but in the end I tried a different heuristic in pursuit of a threshold-free solution. In the comparison below, if the same two letters appear in the first and second positions at the same index in both lists, those entries are considered related. This strategy is not bullet-proof; I did see one failure, but on closer investigation the culprit turned out to be bad data. I have had some success so far, and more testing will give me a better picture.
matches = []
for index in range(len(subSetIntersectScore)):
    if subSetIntersectScore[index][0:2] == subSetUnionScore[index][0:2] or (index + 1 < len(subSetIntersectScore) and subSetIntersectScore[index][0:2] == subSetUnionScore[index + 1][0:2]):
        matches.append(subSetIntersectScore[index][0:2])
    elif index > 0 and subSetIntersectScore[index][0:2] == subSetUnionScore[index - 1][0:2]:
        matches.append(subSetIntersectScore[index][0:2])
    else:
        break
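A quick demonstration with made-up inputs (the question never shows subSetIntersectScore and subSetUnionScore, so the two lists below are hypothetical; only the letter pairs matter to the comparison):

```python
def related_pairs(subSetIntersectScore, subSetUnionScore):
    # pairs count as related when the same two letters sit at the same
    # index in both lists, allowing a tolerance of one position either way
    matches = []
    for index in range(len(subSetIntersectScore)):
        pair = subSetIntersectScore[index][0:2]
        if pair == subSetUnionScore[index][0:2] or (
                index + 1 < len(subSetIntersectScore)
                and pair == subSetUnionScore[index + 1][0:2]):
            matches.append(pair)
        elif index > 0 and pair == subSetUnionScore[index - 1][0:2]:
            matches.append(pair)
        else:
            break
    return matches

intersect = [['F', 'H', 0.1], ['C', 'E', 0.2], ['C', 'G', 0.3], ['A', 'B', 0.9]]
union     = [['F', 'H', 0.4], ['C', 'G', 0.5], ['C', 'E', 0.6], ['X', 'Y', 1.0]]
print(related_pairs(intersect, union))  # -> [['F', 'H'], ['C', 'E'], ['C', 'G']]
```

The loop stops at ['A', 'B'] because that pair does not appear at index 3 or either neighboring index of the second list.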
Positive results:

Matches: [(F, H), (C, E), (C, G), (E, G), (B, D)]
Matches: [(A, C), (A, G), (D, F), (B, H)]

Negative results:

Answer 3:

Here is some code for you, based on https://codereview.stackexchange.com/questions/80050/k-means-clustering-algorithm-in-python:
import numpy as np

# kmeans clustering algorithm
# data = set of data points
# k = number of clusters
# c = initial list of centroids (if provided)
#
def kmeans(data, k, c):
    centroids = []
    centroids = randomize_centroids(data, centroids, k)
    old_centroids = [[] for i in range(k)]
    iterations = 0
    while not (has_converged(centroids, old_centroids, iterations)):
        iterations += 1
        clusters = [[] for i in range(k)]
        # assign data points to clusters
        clusters = euclidean_dist(data, centroids, clusters)
        # recalculate centroids
        index = 0
        for cluster in clusters:
            old_centroids[index] = centroids[index]
            centroids[index] = np.mean(cluster, axis=0).tolist()
            index += 1
    print("The total number of data instances is: " + str(len(data)))
    print("The total number of iterations necessary is: " + str(iterations))
    print("The means of each cluster are: " + str(centroids))
    print("The clusters are as follows:")
    for cluster in clusters:
        print("Cluster with a size of " + str(len(cluster)) + " starts here:")
        print(np.array(cluster).tolist())
        print("Cluster ends here.")
    return

# Calculates euclidean distance between
# a data point and all the available cluster
# centroids.
def euclidean_dist(data, centroids, clusters):
    for instance in data:
        # Find which centroid is the closest
        # to the given data point.
        mu_index = min([(i[0], np.linalg.norm(instance - centroids[i[0]]))
                        for i in enumerate(centroids)], key=lambda t: t[1])[0]
        try:
            clusters[mu_index].append(instance)
        except KeyError:
            clusters[mu_index] = [instance]
    # If any cluster is empty then assign one point
    # from data set randomly so as to not have empty
    # clusters and 0 means.
    for cluster in clusters:
        if not cluster:
            cluster.append(data[np.random.randint(0, len(data))])
    return clusters

# randomize initial centroids
def randomize_centroids(data, centroids, k):
    for cluster in range(0, k):
        centroids.append(data[np.random.randint(0, len(data))])
    return centroids

# check if clusters have converged
def has_converged(centroids, old_centroids, iterations):
    MAX_ITERATIONS = 1000
    if iterations > MAX_ITERATIONS:
        return True
    return old_centroids == centroids

###############################################################################
# STARTING COMPUTATION #
###############################################################################

A = [1.1, 1.02, 2.3, 10, 10.01, 10.1, 12, 16, 18, 18]
B = [1.01, 1.02, 1.001, 1.03, 2.10, 2.94, 3.01, 8.99]

T = [A, B]
k = 3
for t in T:
    cent = np.random.permutation(t)[0:3]
    print kmeans(t, k, cent)
    print
You have to decide on a value k, the number of chunks your data will be split into. The code above splits each of the two arrays A and B you provided into 3 chunks. You have to make a choice: either fix the number of chunks, or fix a threshold.

You should also be aware that k-means is a randomized algorithm that does not always (but often does) produce the best result. It is therefore best to run it several times and average the results.

Here is my favorite introduction to k-means clustering, by Sebastian Thrun :-)

https://www.youtube.com/watch?v=zaKjh2N8jN4&index=15&list=PL34DBDAC077F8F90D

Does this help you? It should allow you to develop your own version of k-means that fits your needs. Can you set a fixed value for k? You haven't answered that question yet.

Edit: Based on Kmeans without knowing the number of clusters?, I might also come up with a solution with a dynamic value for k, in case this one is not good enough.
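If k really is unknown, one common trick (only a sketch, here using scipy.cluster.vq.kmeans) is the "elbow" method: run k-means for increasing k and watch where the average distortion stops dropping sharply; that k is a reasonable cluster count:

```python
import numpy as np
from scipy.cluster.vq import kmeans

B = np.array([1.01, 1.02, 1.001, 1.03, 2.10, 2.94, 3.01, 8.99]).reshape(-1, 1)

np.random.seed(0)  # k-means initialization is randomized
distortions = []
for k in range(1, 5):
    # distortion = mean distance of observations to their nearest centroid
    _, distortion = kmeans(B, k)
    distortions.append(distortion)
# distortion shrinks as k grows; the "elbow" where the drop levels
# off suggests a cluster count for this data
```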
Comments:

- I wish I could upvote you more than once: one for the great help (yes, my current problem is actually part of solving an AI problem), and Sebastian Thrun is awesome! He was also behind the "Stanley" car in the DARPA Grand Challenge.
- Update: it turns out I really can use SciPy! Updated the OP.