在Python中将数字分组为范围[重复]
Posted
技术标签:
【中文标题】在Python中将数字分组为范围[重复]【英文标题】:Group numbers into ranges in Python [duplicate] 【发布时间】:2017-07-13 22:12:22 【问题描述】:我想自动对不同区域的数字进行分组。根据图中(样本数据集),我们可以看到它们是三个数字所在的区域,即 [0,100]、[650,750]、[1220, 1300] .我只想指出这些地区。可以有任意数量的这样的区域。我们需要自动找到号码。这些区域的数量和这些区域的范围。两个区域之间的距离将非常大。 有什么办法可以在 Python 中做到这一点?
Sample data = [69, 8, 30, 45, 89, 61, 80, 45, 9, 18, 19, 11, 1255, 1299, 1296, 1293, 1287, 1250, 1265, 1291, 1281, 1250, 1286, 1286, 1251, 1287, 1266, 1288, 1254, 1260, 1260, 1254, 1267, 1299, 1273, 1250, 1300, 1250, 1279, 1255, 1293, 1292, 1278, 1277, 1252, 1299, 1278, 1258, 1268, 1274, 1285, 1258, 1279, 1270, 1278, 1286, 1278, 1253, 1267, 1300, 1295, 1298, 1285, 1288, 1274, 1272, 1252, 1256, 1283, 1289, 1251, 1258, 1253, 1257, 1297, 1269, 1292, 1253, 1273, 1281, 1251, 1280, 1253, 1274, 1275, 1287, 1296, 1298, 1296, 1291, 1284, 1261, 1267, 1290, 1273, 1281, 1263, 1270, 1264, 1269, 1278, 1284, 67, 8, 40, 59, 97, 64, 45, 72, 45, 90, 94, 7, 33, 58, 97, 97, 1252, 1297, 1265, 1278, 1272, 1252, 1258, 1261, 1287, 1260, 1260, 1258, 1280, 1263, 1256, 1296, 1269, 1270, 1296, 1282, 696, 678, 665, 700, 700, 691, 689, 688, 650, 663, 662, 698, 655, 660, 662, 684, 690, 657, 653, 663, 670, 691, 687, 675, 694, 670, 676, 659, 661, 664, 664, 689, 683, 675, 687, 691, 676, 659, 689, 657, 659, 656, 654, 679, 669, 687, 666, 662, 691, 1260, 1276, 1252, 1295, 1257, 1277, 1281, 1257, 1295, 1269, 1265, 1290, 1266, 1269, 1286, 1254, 1260, 1265, 1290, 1294, 1286, 1279, 1254, 1256, 1276, 1285, 1282, 1251, 1282, 1261, 1253, 56, 74, 85, 94, 18, 83, 38, 80, 8, 4, 78, 43, 7, 79, 68, 78, 1275, 1250, 1268, 1297, 1284, 1255, 1294, 1262, 1250, 1252, 680, 693, 677, 676, 670, 653, 670, 661, 658, 695, 665, 671, 656, 686, 662, 691, 675, 658, 671, 650, 667, 653, 652, 686, 667, 682, 694, 654, 689, 682, 667, 658, 651, 652, 692, 652, 655, 651, 650, 698, 655, 650, 679, 672, 697, 696, 696, 683, 1277, 1264, 1274, 1260, 1285, 1285, 1283, 1259, 1260, 1288, 1281, 1284, 1281, 1257, 1285, 1295, 1273, 1264, 1283, 1284, 1300, 1299, 1257, 1297, 1254, 1257, 1270, 1257, 1295, 34, 5, 73, 42, 27, 36, 91, 85, 19, 50, 34, 21, 73, 38, 18, 73]
【问题讨论】:
在这种情况下你想要什么输出?你能提供一个数据样本(不是图片)吗? 如果可能,我希望将其作为输出:[[0,100], [650,750], [1220, 1300]]。这些是所有数据所在的范围。基本上有一个很大的数据列表,其中很少有。位于 0,100 范围内,然后有很大的差距,然后很少。介于 650 到 750 之间,并且在 1220 到 1300 之间存在较大差距之后,类似的数据很少。 【参考方案1】:我按照@schwobaseggl 的建议引用了Unsupervised clustering with unknown number of clusters ,并根据我的需要稍微更改了代码。 这是新代码:
import numpy
import scipy.cluster.hierarchy as hcluster
temp_data = [31,68,74,46,47,83,29,11,9,52,1272,1288,1297,1285,1294,1251,1265,1257,1280,1265,1292,1297,1271,1273,1253,1273,1291,1251,1295,1298,1264,1281,1294,1280,1250,1279,1298,1290,1294,1299,1266,1260,1298,1292,1280,1259,1266,1276,1253,1252,1271,1280,1284,1266,1254,1259,1291,1268,1253,1298,1288,1271,1298,1300,1274,1294,1263,1298,1270,1254,1266,1269,1283,1285,1286,1276,1257,1266,1272,1298,1261,1251,1272,1260,1291,1269,1260,1294,1287,1256,1253,1284,1269,1287,1292,1269,1272,1275,1250,1289,56,35,19,80,47,22,92,8,10,24,87,76,60,63,64,0,1295,1268,1280,1281,1277,1300,1278,1273,1250,1296,1266,1269,1282,1281,1272,1260,1292,1272,1253,1255,1299,1269,1268,1294,1250,1299,1292,1254,1281,1289,1259,1290,1271,1280,1272,1300,1258,1290,1289,1300,1299,1261,1300,1276,1290,1299,1280,1267,1283,1282,1269,1260,1285,1252,1250,1263,1297,1300,1292,1266,1260,1263,1292,1296,1289,1297,1251,1261,1250,1294,1278,1284,1291,1281,1269,1261,1257,1267,1265,1288,1291,1257,1296,1251,1260,1272,1294,1285,1269,1283,1297,1287,1253,1292,1299,1295,1286,1288,1283,1290,20,73,81,6,49,88,96,61,49,94,57,16,61,16,17,19,1280,1257,1259,1277,1257,1262,1263,1280,1292,1250,1287,1272,1258,1253,1285,1285,1257,1291,1273,1260,1267,1250,1280,1281,1263,1269,1292,1250,1282,1263,1274,1288,1296,1266,1291,1271,1273,1281,1261,1289,1269,1287,1296,1283,1280,1298,1259,1270,1259,1289,1269,1284,1295,1297,1256,1300,1281,1296,1284,1288,1285,1296,1277,1251,1279,1295,1281,1264,1280,1263,69,8,30,45,89,61,80,45,9,18,19,11,1255,1299,1296,1293,1287,1250,1265,1291,1281,1250,1286,1286,1251,1287,1266,1288,1254,1260,1260,1254,1267,1299,1273,1250,1300,1250,1279,1255,1293,1292,1278,1277,1252,1299,1278,1258,1268,1274,1285,1258,1279,1270,1278,1286,1278,1253,1267,1300,1295,1298,1285,1288,1274,1272,1252,1256,1283,1289,1251,1258,1253,1257,1297,1269,1292,1253,1273,1281,1251,1280,1253,1274,1275,1287,1296,1298,1296,1291,1284,1261,1267,1290,1273,1281,1263,1270,1264,1269,1278,1284,67,8,40,59,97,64,45,72,45,90,94,7,33,58,97,97,1252,1297,1265,1278,1272,1252,1258,1261,1287,1260,1260,1258,1280,1263,1256,1296,1269,1270,1296,1282,696,678,665,700,700,691,689,688,650,663,662,698,655,660,662,684,690,657,653,663,670,691,687,675,694,670,676,659,661,664,664,689,683,675,687,691,676,659,689,657,659,656,654,679,669,687,666,662,691,1260,1276,1252,1295,1257,1277,1281,1257,1295,1269,1265,1290,1266,1269,1286,1254,1260,1265,1290,1294,1286,1279,1254,1256,1276,1285,1282,1251,1282,1261,1253,56,74,85,94,18,83,38,80,8,4,78,43,7,79,68,78,1275,1250,1268,1297,1284,1255,1294,1262,1250,1252,680,693,677,676,670,653,670,661,658,695,665,671,656,686,662,691,675,658,671,650,667,653,652,686,667,682,694,654,689,682,667,658,651,652,692,652,655,651,650,698,655,650,679,672,697,696,696,683,1277,1264,1274,1260,1285,1285,1283,1259,1260,1288,1281,1284,1281,1257,1285,1295,1273,1264,1283,1284,1300,1299,1257,1297,1254,1257,1270,1257,1295,34,5,73,42,27,36,91,85,19,50,34,21,73,38,18,73]
ndata = [[td, td] for td in temp_data]
data = numpy.array(ndata)
# clustering
thresh = (11.0/100.0) * (max(temp_data) - min(temp_data)) #Threshold 11% of the total range of data
clusters = hcluster.fclusterdata(data, thresh, criterion="distance")
total_clusters = max(clusters)
clustered_index = []
for i in range(total_clusters):
clustered_index.append([])
for i in range(len(clusters)):
clustered_index[clusters[i] - 1].append(i)
clustered_range = []
for x in clustered_index:
clustered_index_x = [temp_data[y] for y in x]
clustered_range.append((min(clustered_index_x) , max(clustered_index_x)))
print clustered_range
我已选择阈值 (thres) 作为数据总范围的 11%
所以这个数据集的输出是:
[(0, 97), (1250, 1300), (650, 700)]
【讨论】:
【参考方案2】:您可能正在寻找k-means clustering。
from typing import Tuple, Iterable, Sequence, List, Dict, DefaultDict
from random import sample
from math import fsum, sqrt
from collections import defaultdict
from functools import partial
Point = Tuple[int, ...]
Centroid = Point
def mean(data: Iterable[float]) -> float:
'Accurate arithmetic mean'
data = list(data)
return fsum(data) / len(data)
def dist(p: Point, q: Point, sqrt=sqrt, fsum=fsum, zip=zip) -> float:
'Euclidean distance'
return sqrt(fsum((x - y) ** 2.0 for x, y in zip(p, q)))
def assign_data(centroids: Sequence[Centroid], data: Iterable[Point]) -> Dict[Centroid, Sequence[Point]]:
'Assign data the closest centroid'
d = defaultdict(list) # type: DefaultDict[Point, List[Point]]
for p in data:
centroid = min(centroids, key=partial(dist, p)) # type: Point
d[centroid].append(p)
return dict(d)
def compute_centroids(groups: Iterable[Sequence[Point]]) -> List[Centroid]:
'Compute the centroid of each group'
return [tuple(map(mean, zip(*pts))) for pts in groups]
def k_means(data: Iterable[Point], k:int=2, iterations:int=10) -> List[Point]:
'Return k-centroids for the data'
data = list(data)
centroids = sample(data, k)
for i in range(iterations):
labeled = assign_data(centroids, data)
centroids = compute_centroids(labeled.values())
return centroids
这里适用于您的问题:
data = [69, 8, 30, 45, 89, 61, 80, 45, 9, 18, 19, 11, 1255, 1299,
1296, 1293, 1287, 1250, 1265, 1291, 1281, 1250, 1286,
1286, 1251, 1287, 1266, 1288, 1254, 1260, 1260, 1254,
1267, 1299, 1273, 1250, 1300, 1250, 1279, 1255, 1293,
1292, 1278, 1277, 1252, 1299, 1278, 1258, 1268, 1274,
1285, 1258, 1279, 1270, 1278, 1286, 1278, 1253, 1267,
1300, 1295, 1298, 1285, 1288, 1274, 1272, 1252, 1256,
1283, 1289, 1251, 1258, 1253, 1257, 1297, 1269, 1292,
1253, 1273, 1281, 1251, 1280, 1253, 1274, 1275, 1287,
1296, 1298, 1296, 1291, 1284, 1261, 1267, 1290, 1273,
1281, 1263, 1270, 1264, 1269, 1278, 1284, 67, 8, 40, 59,
97, 64, 45, 72, 45, 90, 94, 7, 33, 58, 97, 97, 1252, 1297,
1265, 1278, 1272, 1252, 1258, 1261, 1287, 1260, 1260,
1258, 1280, 1263, 1256, 1296, 1269, 1270, 1296, 1282, 696,
678, 665, 700, 700, 691, 689, 688, 650, 663, 662, 698,
655, 660, 662, 684, 690, 657, 653, 663, 670, 691, 687,
675, 694, 670, 676, 659, 661, 664, 664, 689, 683, 675,
687, 691, 676, 659, 689, 657, 659, 656, 654, 679, 669,
687, 666, 662, 691, 1260, 1276, 1252, 1295, 1257, 1277,
1281, 1257, 1295, 1269, 1265, 1290, 1266, 1269, 1286,
1254, 1260, 1265, 1290, 1294, 1286, 1279, 1254, 1256,
1276, 1285, 1282, 1251, 1282, 1261, 1253, 56, 74, 85, 94,
18, 83, 38, 80, 8, 4, 78, 43, 7, 79, 68, 78, 1275, 1250,
1268, 1297, 1284, 1255, 1294, 1262, 1250, 1252, 680, 693,
677, 676, 670, 653, 670, 661, 658, 695, 665, 671, 656,
686, 662, 691, 675, 658, 671, 650, 667, 653, 652, 686,
667, 682, 694, 654, 689, 682, 667, 658, 651, 652, 692,
652, 655, 651, 650, 698, 655, 650, 679, 672, 697, 696,
696, 683, 1277, 1264, 1274, 1260, 1285, 1285, 1283, 1259,
1260, 1288, 1281, 1284, 1281, 1257, 1285, 1295, 1273,
1264, 1283, 1284, 1300, 1299, 1257, 1297, 1254, 1257,
1270, 1257, 1295, 34, 5, 73, 42, 27, 36, 91, 85, 19, 50,
34, 21, 73, 38, 18, 73]
points = [(x,) for x in data]
centroids = k_means(points, k=3, iterations=100)
clusters = assign_data(centroids, points).values()
for cluster in clusters:
print(f'min(cluster)[0] to max(cluster)[0]')
这个输出:
4 to 97
650 to 700
1250 to 1300
【讨论】:
@schwobaseggl 具有未知数量的集群的无监督聚类处理 3 维数据。我只想要我的数字存在的范围。 但在 K-Means 聚类中,K 的数量是预定义的。但在我的数据集中,我不知道我将拥有多少个集群, 那么你的问题定义不明确。没有约束,有一些简单的解决方案(所有点都在一个集群中,或者每个点都是它自己的大小为 1 的集群)。 此外,k-means 代码适用于任何维度。该示例给出了三个维度,但它只适用于一个维度。【参考方案3】:一种可能性是使用 sklearns 高斯混合模型来查找集群。由于只有一个维度,也许这有点矫枉过正,但无论如何它应该可以工作。
下面是一个例子:
import numpy as np
from sklearn import mixture
# Generate some example data
centers = [50, 700, 1250] # Approximate
y = []
for i in range(20):
y.append(np.random.randn(100) * 50 + centers[np.random.randint(len(centers))])
y = np.c_[np.concatenate(y)]
N = 3 # Clusters
gmix = mixture.GaussianMixture(n_components=3, covariance_type='full')
gmix.fit(y) # Now it thinks it is trained
for i in range(N):
center = gmix.means_[i][0]
std = np.sqrt(gmix.covariances_[i][0])
print "Range %d: %.0f to %.0f" % (i + 1, center - std, center + std)
它应该输出类似(使用 IPython 2.7):
Range 1: 1199 to 1298
Range 2: 649 to 752
Range 3: 4 to 102
这当然取决于它找到的集群,这可能并不总是一个好的解决方案。
【讨论】:
但您已经预定义了 N = 3 # 个集群。但我们一开始不会有任何这样的信息。我们需要自动找出数据集中有多少这样的区域。 好吧,在我给出答案之后你修改了你的问题,所以我很难知道你也想找到集群的数量。但例如,您可以使用贝叶斯信息标准 (BIC) 分数来确定这一点。以上是关于在Python中将数字分组为范围[重复]的主要内容,如果未能解决你的问题,请参考以下文章